
How SEOgent Works — Architecture & Workflow

SEOgent is an automated SEO auditing platform. You give it a URL, it crawls the site, analyzes every page against 30+ checks, and returns structured results that both humans and AI agents can act on.

This doc covers what happens under the hood — from initial crawl to final report — so you understand exactly what SEOgent is checking and how to use the results.


The Scan Lifecycle

Every scan moves through four stages:

Pending → Crawling → Analyzing → Completed

  1. Pending — The scan is queued. SEOgent runs site-level checks (robots.txt, sitemap) and discovers which URLs to crawl.
  2. Crawling — The crawler fetches pages, collects HTML, records status codes, and optionally validates links and measures performance.
  3. Analyzing — Each crawled page is run through the SEO analysis engine, which scores it against 30+ checks.
  4. Completed — All pages are analyzed. Site-wide issues (like duplicate titles) are detected, and results are ready.

If credits run out mid-scan, you'll still get partial results for every page analyzed up to that point.
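From client code, the lifecycle is just something to poll. A minimal sketch, assuming a callable that returns the current stage (in practice it would hit the SEOgent status endpoint; the status strings here are assumptions):

```python
import time

def wait_for_scan(fetch_status, poll_interval=0.0, max_polls=100):
    """Poll until the scan reports 'completed' (or give up).

    fetch_status is any callable returning one of the four stages.
    Even if polling stops early, partial results remain available
    for every page analyzed up to that point."""
    status = "pending"
    for _ in range(max_polls):
        status = fetch_status()
        if status == "completed":
            break
        time.sleep(poll_interval)
    return status
```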


URL Discovery — How SEOgent Finds Pages

SEOgent supports two crawl modes that determine how it builds the list of URLs to scan.

Discover Mode (default)

Discover mode automatically finds pages on your site. The process:

  1. SEOgent checks your robots.txt for a Sitemap: directive
  2. If a sitemap is found, it parses the XML (including nested sitemap indexes) and extracts URLs
  3. If no sitemap exists, the homepage is used as the seed
  4. The crawler then follows internal links to discover additional pages beyond the sitemap
  5. Discovery continues until the page limit is reached or no new URLs are found

This is the recommended mode for most sites. Set a max_pages limit to control scope — useful for large sites where you want to audit a representative sample rather than every page.
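Step 1 of the discovery process — reading the Sitemap: directive out of robots.txt — can be sketched in a few lines (an illustration, not SEOgent's actual parser):

```python
def find_sitemaps(robots_txt: str) -> list:
    """Extract Sitemap: directives from a robots.txt body.

    Returns an empty list when none are declared, in which case
    the crawler falls back to the homepage as its seed URL."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own
        # "https:" is preserved in the value.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```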

Manual Mode

Manual mode skips discovery entirely. You provide an explicit list of URLs and SEOgent crawls only those pages. This is useful for:

  • Re-scanning specific URLs after a fix to confirm the issues are resolved
  • Auditing a known subset of a large site without spending credits on discovery
  • Checking pages that aren't linked internally or listed in the sitemap

Site-Level Checks

Before crawling begins, SEOgent runs checks against your site's configuration. These aren't tied to any single page — they apply to the entire domain.

robots.txt Analysis

SEOgent fetches and parses your robots.txt to check:

| Check | What It Looks For |
| --- | --- |
| robots.txt exists | The file is accessible at /robots.txt |
| No blanket block | No User-agent: * with Disallow: / blocking all crawlers |
| Search engines allowed | Googlebot and Bingbot aren't explicitly blocked |
| Sitemap directive | robots.txt references a sitemap |
| Reasonable crawl delay | crawl-delay isn't set above 10 seconds |
| AI bot access | GPTBot, ClaudeBot, ChatGPT-User, Google-Extended, and PerplexityBot aren't blocked |

The AI bot access check is particularly relevant for GEO (Generative Engine Optimization). If your robots.txt blocks AI crawlers, your content won't appear in AI-generated answers regardless of how well-optimized it is.
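You can reproduce the spirit of this check yourself with Python's standard-library robots.txt parser (a sketch; SEOgent's own check may differ):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "ChatGPT-User", "Google-Extended", "PerplexityBot"]

def blocked_ai_bots(robots_txt: str, url: str = "https://example.com/") -> list:
    """Return the AI crawlers that this robots.txt blocks from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]
```

A robots.txt that disallows GPTBot while leaving other agents open would report only GPTBot as blocked.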

Sitemap Validation

SEOgent verifies that your sitemap is accessible. It checks the URL from robots.txt first, then probes /sitemap.xml and /sitemap_index.xml as fallbacks.

Duplicate Detection (after analysis completes)

Once all pages are analyzed, SEOgent scans for:

  • Duplicate page titles — multiple URLs sharing the same <title>
  • Duplicate meta descriptions — multiple URLs sharing the same description

These are grouped by the duplicated value, showing which URLs share it. Duplicate titles and descriptions dilute your search presence and confuse both users and search engines about which page to rank.
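The grouping itself is a straightforward inversion of the page → value mapping. A minimal sketch using titles (descriptions work the same way):

```python
from collections import defaultdict

def find_duplicates(pages: dict) -> dict:
    """Group crawled pages by title and keep only values shared
    by two or more URLs. `pages` maps URL -> title text."""
    groups = defaultdict(list)
    for url, title in pages.items():
        groups[title].append(url)
    return {title: urls for title, urls in groups.items() if len(urls) > 1}
```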


Page-Level SEO Checks

Every crawled page is analyzed against checks organized into categories. Each check has a weight (how much it affects the overall score) and a severity (error vs. warning).

Metadata

| Check | What It Looks For | Weight |
| --- | --- | --- |
| Title tag exists | Page has a <title> tag | 10 |
| Title length | Between 30–60 characters | 5 |
| Meta description exists | Page has <meta name="description"> | 10 |
| Meta description length | Between 70–160 characters | 5 |
| Canonical URL | <link rel="canonical"> is present | 5 |
| Viewport meta tag | Mobile-responsive viewport is set | 5 |
| Language attribute | <html lang="..."> is set | 3 |
| Open Graph tags | og:title, og:description, og:image present | 3 |

Content Quality

| Check | What It Looks For | Weight |
| --- | --- | --- |
| H1 heading | Page has an H1 | 10 |
| Single H1 | No more than one H1 on the page | 5 |
| Heading hierarchy | No skipped levels (e.g., H1 → H3 without H2) | 4 |
| Content length | At least 300 words of body text | 8 |
| Internal links | Page links to other pages on the same domain | 5 |
| Image alt attributes | All images have descriptive alt text | 8 |
| Image lazy loading | Images use loading="lazy" (checked when 2+ images) | 3 |
| Orphan anchors | No <a> tags without href attributes | 3 |
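The heading-hierarchy check can be sketched as a single pass over the heading levels in document order (an illustration of the rule, not SEOgent's implementation; it also treats a page that opens with an H2 as a skip):

```python
def heading_hierarchy_ok(levels: list) -> bool:
    """Pass when no heading jumps more than one level deeper than
    the previous one, e.g. [1, 2, 3, 2] is fine but [1, 3] skips H2.
    `levels` is the sequence of heading levels as they appear."""
    previous = 0
    for level in levels:
        if level > previous + 1:
            return False
        previous = level
    return True
```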

Technical Configuration

| Check | What It Looks For | Weight |
| --- | --- | --- |
| HTTPS | Page is served over a secure connection | 5 |
| HTML5 DOCTYPE | <!DOCTYPE html> is present | 2 |
| Character encoding | <meta charset> or Content-Type header | 2 |
| No noindex | Page isn't accidentally blocked from indexing | 8 |
| No redirect chains | Fewer than 2 redirects to reach the page | 7 |
| HTTP status code | Page returns 2xx, not 4xx or 5xx | 10 |
| SEO-friendly URL | Short path, hyphens not underscores, few query params | 3 |
| Hreflang tags | International language targeting tags are present | 2 |

Structured Data (JSON-LD)

| Check | What It Looks For | Weight |
| --- | --- | --- |
| JSON-LD exists | At least one <script type="application/ld+json"> block | 5 |
| Valid JSON | All JSON-LD blocks contain parseable JSON | 5 |
| Has @type | Every block defines a schema type | 3 |
| Rich-result eligible | Uses types Google supports (Article, Product, FAQPage, etc.) | 2 |
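The first three checks can be approximated in a few lines. This sketch uses a regex to pull out script blocks, which a real implementation would replace with an HTML parser (and the exact failure semantics here are assumptions):

```python
import json
import re

def check_jsonld(html: str) -> dict:
    """Extract JSON-LD blocks, then confirm each parses as JSON
    and declares an @type. Returns one boolean per check."""
    blocks = re.findall(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.S
    )
    results = {"exists": bool(blocks), "valid_json": True, "has_type": True}
    for raw in blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            # Unparseable JSON also means we can't confirm @type.
            results["valid_json"] = results["has_type"] = False
            continue
        if "@type" not in data:
            results["has_type"] = False
    return results
```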

Answer Engine Optimization (AEO)

These checks evaluate how well your content is structured for AI-generated answers and featured snippets. They only run on pages with 300+ words of content.

| Check | What It Looks For | Weight |
| --- | --- | --- |
| Answer blocks | FAQ accordions, question headings, or definition lists | 5 |
| FAQ/HowTo schema | FAQPage, HowTo, or Question/Answer schema types in JSON-LD | 5 |
| Data tables | Structured comparison tables with headers and multiple rows | 3 |

Answer blocks matter because search engines and AI models extract direct answers from well-structured content. A page with a clear <h2>What is X?</h2> followed by a concise paragraph is far more likely to be quoted in an AI answer or featured snippet than a wall of unstructured text.

Dead Links & Broken Images (optional)

When link checking is enabled, SEOgent verifies every link and image on each page:

| Check | What It Looks For | Weight |
| --- | --- | --- |
| No dead links | Internal and external links return valid responses | 8 |
| No broken images | Images return 2xx status codes (not 4xx/5xx) | 7 |

The crawler makes HEAD requests to external URLs and cross-references internal links against all crawled pages. Results distinguish between internal dead links (pages on your site that 404) and external dead links (third-party URLs that no longer work).
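The internal/external split described above can be sketched as follows — internal links are cross-referenced against the crawled page set, while external links are queued for HEAD requests (a simplification: an internal link outside the crawl set isn't necessarily dead if the crawl was limited):

```python
from urllib.parse import urlparse

def classify_links(links: list, site_domain: str, crawled_paths: set):
    """Split a page's links into internal links not found among the
    crawled pages, and external links still needing a HEAD request."""
    internal_dead, external = [], []
    for link in links:
        parsed = urlparse(link)
        if parsed.netloc in ("", site_domain):
            # Relative or same-domain: check against crawled pages.
            if parsed.path not in crawled_paths:
                internal_dead.append(link)
        else:
            external.append(link)
    return internal_dead, external
```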


Scoring

Each check carries a weight between 2 and 10. The page score is calculated as:

score = (sum of passed check weights / total check weights) × 100

A page that passes all checks scores 100. High-weight checks like "has title" (10) and "has H1" (10) have more impact than low-weight checks like "has hreflang" (2).

Checks that fail as warnings (like title length being slightly off) still reduce the score, but they're presented separately from hard errors (like a missing title entirely) so you can prioritize fixes.
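The formula translates directly into code. A minimal sketch, where each check is a (passed, weight) pair:

```python
def page_score(checks: list) -> int:
    """score = passed weight / total weight, scaled to 100."""
    total = sum(weight for _, weight in checks)
    passed = sum(weight for ok, weight in checks if ok)
    return round(passed / total * 100) if total else 0
```

A page passing a 10-weight and a 5-weight check but failing another 5-weight check scores 15/20 × 100 = 75.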


Performance Scanning (optional)

When performance scanning is enabled, SEOgent measures Core Web Vitals and Lighthouse metrics for your pages.

What It Measures

| Metric | What It Is |
| --- | --- |
| LCP (Largest Contentful Paint) | How long until the main content is visible |
| FCP (First Contentful Paint) | Time to first rendered content |
| CLS (Cumulative Layout Shift) | Visual stability — how much the page shifts during load |
| INP (Interaction to Next Paint) | Responsiveness to user input |
| TTFB (Time to First Byte) | Server response time |
| TBT (Total Blocking Time) | How long the main thread is blocked |
| SI (Speed Index) | How quickly visible content fills the viewport |

Intelligent Page Grouping

Running Lighthouse on every page of a large site would be wasteful. Most pages built from the same template perform similarly — /products/blue-widget and /products/red-widget likely share the same layout, CSS, and JavaScript.

SEOgent uses pattern matching to group URLs by their template structure, then tests one representative page per group. Here's how it works:

| URL | Detected Pattern | Group |
| --- | --- | --- |
| / | / | Homepage |
| /products/blue-widget | /products/{slug} | Product page |
| /products/red-widget | /products/{slug} | Product page (same group) |
| /blog/2025/01/my-post | /blog/{year}/{month}/{slug} | Blog post |
| /blog/2025/03/other-post | /blog/{year}/{month}/{slug} | Blog post (same group) |
| /users/42 | /users/{id} | User profile |
| /users/187 | /users/{id} | User profile (same group) |

The pattern matcher recognizes:

  • Numeric IDs (/users/42 → /users/{id})
  • Date segments (/blog/2025/01/… → /blog/{year}/{month}/…)
  • Slug-like segments (/products/blue-widget → /products/{slug})

The result: a site with 500 pages might only need 8–12 performance tests. The homepage is always included, and groups are sorted alphabetically for consistency.
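The normalization step can be sketched with a few heuristics. SEOgent's exact rules aren't published, so the regexes below are assumptions chosen to reproduce the table above:

```python
import re

def url_pattern(path: str) -> str:
    """Collapse a URL path into a template pattern: 4-digit years
    become {year}, a short number following a year becomes {month},
    other numbers {id}, and trailing hyphenated segments {slug}."""
    parts = path.strip("/").split("/")
    out = []
    for i, seg in enumerate(parts):
        if re.fullmatch(r"(19|20)\d{2}", seg):
            out.append("{year}")
        elif seg.isdigit():
            # A 1-2 digit number right after a year reads as a month.
            is_month = out and out[-1] == "{year}" and len(seg) <= 2
            out.append("{month}" if is_month else "{id}")
        elif i == len(parts) - 1 and i > 0 and "-" in seg:
            out.append("{slug}")
        else:
            out.append(seg)
    return "/" + "/".join(out)
```

Pages whose paths normalize to the same pattern land in the same group, and only one representative per group gets a performance test.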


How Agents and Humans Use the Results

SEOgent's structured output is designed to be actionable for both human developers and AI coding agents.

For AI Agents (CLI & API)

The SEOgent CLI and REST API return machine-readable JSON that agents can parse and act on directly. A typical agent workflow:

  1. Scan — The agent starts a scan via seogent scan https://example.com or POST /api/scans
  2. Poll — It checks progress until the scan completes
  3. Triage — Results come back with every check categorized and scored. The agent filters to failed checks and warnings, sorted by weight
  4. Fix — The agent reads the relevant source files and applies fixes. For example:
    • Missing title? Add a <title> tag to the layout
    • Images without alt text? Add descriptive alt attributes
    • No JSON-LD? Generate a schema block based on page content
    • Broken links? Remove or update the href
  5. Verify — The agent runs a manual-mode scan on just the fixed URLs to confirm the issues are resolved

Every check result includes a key, category, weight, and status — so agents can programmatically decide what to fix first and how to verify the fix worked.
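The triage step (3) reduces to a filter and a sort. A sketch, assuming status values of "passed", "error", and "warning" (the actual field values come from the API):

```python
def triage(results: list) -> list:
    """Order failed checks for an agent to fix: errors before
    warnings, heavier weights first. Each result is a dict with
    at least key, status, and weight fields."""
    failed = [r for r in results if r["status"] != "passed"]
    return sorted(failed, key=lambda r: (r["status"] != "error", -r["weight"]))
```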

For Humans (Dashboard & Reports)

The web dashboard presents the same results grouped by the check categories described above — metadata, content quality, technical configuration, structured data, AEO, and (when enabled) dead links and performance — with errors and warnings separated so you can prioritize the highest-weight fixes first.
Addressing SEO, AEO, and GEO

SEOgent's checks map to three overlapping optimization strategies:

SEO (Search Engine Optimization) — The core checks. Titles, descriptions, headings, internal links, canonical URLs, structured data, and technical health. These directly affect how search engines crawl, index, and rank your pages.

AEO (Answer Engine Optimization) — The answer block and schema checks. FAQ accordions, question headings, definition lists, data tables, and FAQ/HowTo schema give AI models and featured-snippet algorithms structured content to extract answers from. A page that clearly answers "What is X?" with a concise paragraph under an H2 is far more likely to be cited.

GEO (Generative Engine Optimization) — The AI bot access check and structured data checks. If your robots.txt blocks ClaudeBot or GPTBot, your content is invisible to AI training and retrieval systems. JSON-LD structured data helps AI models understand what your page is about, not just what text it contains.

Webhook Notifications

For automated workflows, you can provide a webhook URL when starting a scan. SEOgent sends a POST request when the scan completes, including the scan ID, domain, average score, and a summary of results. This lets you trigger downstream actions — Slack notifications, CI pipeline steps, or agent-based fix workflows — without polling.
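A consumer of that webhook only needs to parse the JSON body and branch on the summary. A sketch — the field names (scan_id, domain, average_score) and the score threshold are assumptions, so check the API reference for the exact payload schema:

```python
import json

def handle_webhook(body: bytes) -> str:
    """Decide a follow-up action from a scan-completed payload.
    The 80-point threshold is an arbitrary example."""
    payload = json.loads(body)
    if payload.get("average_score", 100) < 80:
        return f"scan {payload['scan_id']}: {payload['domain']} needs fixes"
    return f"scan {payload['scan_id']}: {payload['domain']} looks healthy"
```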


Credit System

Scans are priced per page. Standard HTML analysis is the base cost, with optional add-ons for link checking and performance scanning. Performance scans use intelligent page grouping, so a 500-page site might only incur performance costs for 8–12 representative pages. See Pricing for current rates.