
How SEOgent Works — Architecture & Workflow

SEOgent is an automated SEO auditing platform. You give it a URL, it crawls the site, analyzes every page against 30+ checks, and returns structured results that both humans and AI agents can act on.

This doc covers what happens under the hood — from initial crawl to final report — so you understand exactly what SEOgent is checking and how to use the results.


The Scan Lifecycle

Every scan moves through four stages:

Pending → Crawling → Analyzing → Completed

  1. Pending — The scan is queued. SEOgent runs site-level checks (robots.txt, sitemap) and discovers which URLs to crawl.
  2. Crawling — The crawler fetches pages, collects HTML, records status codes, and optionally validates links and measures performance.
  3. Analyzing — Each crawled page is run through the SEO analysis engine, which scores it against 30+ checks.
  4. Completed — All pages are analyzed. Site-wide issues (like duplicate titles) are detected, and results are ready.

If credits run out mid-scan, you'll still get partial results for every page analyzed up to that point.
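From client code, the lifecycle is just something to poll. A minimal sketch, assuming a callable that returns the current stage (in practice it would hit the SEOgent status endpoint; the status strings here are assumptions):

```python
import time

def wait_for_scan(fetch_status, poll_interval=0.0, max_polls=100):
    """Poll until the scan reports 'completed' (or give up).

    fetch_status is any callable returning one of the four stages.
    Even if polling stops early, partial results remain available
    for every page analyzed up to that point."""
    status = "pending"
    for _ in range(max_polls):
        status = fetch_status()
        if status == "completed":
            break
        time.sleep(poll_interval)
    return status
```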


URL Discovery — How SEOgent Finds Pages

SEOgent supports two crawl modes that determine how it builds the list of URLs to scan.

Discover Mode (default)

Discover mode automatically finds pages on your site. The process:

  1. SEOgent checks your robots.txt for a Sitemap: directive
  2. If a sitemap is found, it parses the XML (including nested sitemap indexes) and extracts URLs
  3. If no sitemap exists, the homepage is used as the seed
  4. The crawler then follows internal links to discover additional pages beyond the sitemap
  5. Discovery continues until the page limit is reached or no new URLs are found

This is the recommended mode for most sites. Set a max_pages limit to control scope — useful for large sites where you want to audit a representative sample rather than every page.
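Step 1 of the discovery process — reading the Sitemap: directive out of robots.txt — can be sketched in a few lines (an illustration, not SEOgent's actual parser):

```python
def find_sitemaps(robots_txt: str) -> list:
    """Extract Sitemap: directives from a robots.txt body.

    Returns an empty list when none are declared, in which case
    the crawler falls back to the homepage as its seed URL."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own
        # "https:" is preserved in the value.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```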

Manual Mode

Manual mode skips discovery entirely. You provide an explicit list of URLs and SEOgent crawls only those pages. This is useful for:

  • Re-scanning specific URLs after a fix to confirm the issues are resolved
  • Auditing a known subset of a large site without spending credits on discovery
  • Checking pages that aren't linked internally or listed in the sitemap

Site-Level Checks

Before crawling begins, SEOgent runs checks against your site's configuration. These aren't tied to any single page — they apply to the entire domain.

robots.txt Analysis

SEOgent fetches and parses your robots.txt to check:

| Check | What It Looks For |
| --- | --- |
| robots.txt exists | The file is accessible at /robots.txt |
| No blanket block | No User-agent: * with Disallow: / blocking all crawlers |
| Search engines allowed | Googlebot and Bingbot aren't explicitly blocked |
| Sitemap directive | robots.txt references a sitemap |
| Reasonable crawl delay | crawl-delay isn't set above 10 seconds |
| AI bot access | GPTBot, ClaudeBot, ChatGPT-User, Google-Extended, and PerplexityBot aren't blocked |

The AI bot access check is particularly relevant for GEO (Generative Engine Optimization). If your robots.txt blocks AI crawlers, your content won't appear in AI-generated answers regardless of how well-optimized it is.
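You can reproduce the spirit of this check yourself with Python's standard-library robots.txt parser (a sketch; SEOgent's own check may differ):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "ChatGPT-User", "Google-Extended", "PerplexityBot"]

def blocked_ai_bots(robots_txt: str, url: str = "https://example.com/") -> list:
    """Return the AI crawlers that this robots.txt blocks from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]
```

A robots.txt that disallows GPTBot while leaving other agents open would report only GPTBot as blocked.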

Sitemap Validation

SEOgent verifies that your sitemap is accessible. It checks the URL from robots.txt first, then probes /sitemap.xml and /sitemap_index.xml as fallbacks.

Duplicate Detection (after analysis completes)

Once all pages are analyzed, SEOgent scans for:

  • Duplicate page titles — multiple URLs sharing the same <title>
  • Duplicate meta descriptions — multiple URLs sharing the same description

These are grouped by the duplicated value, showing which URLs share it. Duplicate titles and descriptions dilute your search presence and confuse both users and search engines about which page to rank.
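The grouping itself is a straightforward inversion of the page → value mapping. A minimal sketch using titles (descriptions work the same way):

```python
from collections import defaultdict

def find_duplicates(pages: dict) -> dict:
    """Group crawled pages by title and keep only values shared
    by two or more URLs. `pages` maps URL -> title text."""
    groups = defaultdict(list)
    for url, title in pages.items():
        groups[title].append(url)
    return {title: urls for title, urls in groups.items() if len(urls) > 1}
```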


Page-Level SEO Checks

Every crawled page is analyzed against checks organized into categories. Each check has a weight (how much it affects the overall score) and a severity (error vs. warning).

Metadata

| Check | What It Looks For | Weight |
| --- | --- | --- |
| Title tag exists | Page has a <title> tag | 10 |
| Title length | Between 30–60 characters | 5 |
| Meta description exists | Page has <meta name="description"> | 10 |
| Meta description length | Between 70–160 characters | 5 |
| Canonical URL | <link rel="canonical"> is present | 5 |
| Viewport meta tag | Mobile-responsive viewport is set | 5 |
| Language attribute | <html lang="..."> is set | 3 |
| Open Graph tags | og:title, og:description, og:image present | 3 |

Content Quality

| Check | What It Looks For | Weight |
| --- | --- | --- |
| H1 heading | Page has an H1 | 10 |
| Single H1 | No more than one H1 on the page | 5 |
| Heading hierarchy | No skipped levels (e.g., H1 → H3 without H2) | 4 |
| Content length | At least 300 words of body text | 8 |
| Internal links | Page links to other pages on the same domain | 5 |
| Image alt attributes | All images have descriptive alt text | 8 |
| Image lazy loading | Images use loading="lazy" (checked when 2+ images) | 3 |
| Orphan anchors | No <a> tags without href attributes | 3 |
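The heading-hierarchy check can be sketched as a single pass over the heading levels in document order (an illustration of the rule, not SEOgent's implementation; it also treats a page that opens with an H2 as a skip):

```python
def heading_hierarchy_ok(levels: list) -> bool:
    """Pass when no heading jumps more than one level deeper than
    the previous one, e.g. [1, 2, 3, 2] is fine but [1, 3] skips H2.
    `levels` is the sequence of heading levels as they appear."""
    previous = 0
    for level in levels:
        if level > previous + 1:
            return False
        previous = level
    return True
```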

Technical Configuration

| Check | What It Looks For | Weight |
| --- | --- | --- |
| HTTPS | Page is served over a secure connection | 5 |
| HTML5 DOCTYPE | <!DOCTYPE html> is present | 2 |
| Character encoding | <meta charset> or Content-Type header | 2 |
| No noindex | Page isn't accidentally blocked from indexing | 8 |
| No redirect chains | Fewer than 2 redirects to reach the page | 7 |
| HTTP status code | Page returns 2xx, not 4xx or 5xx | 10 |
| SEO-friendly URL | Short path, hyphens not underscores, few query params | 3 |
| Hreflang tags | International language targeting tags are present | 2 |

Structured Data (JSON-LD)

| Check | What It Looks For | Weight |
| --- | --- | --- |
| JSON-LD exists | At least one <script type="application/ld+json"> block | 5 |
| Valid JSON | All JSON-LD blocks contain parseable JSON | 5 |
| Has @type | Every block defines a schema type | 3 |
| Rich-result eligible | Uses types Google supports (Article, Product, FAQPage, etc.) | 2 |
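The first three checks can be approximated in a few lines. This sketch uses a regex to pull out script blocks, which a real implementation would replace with an HTML parser (and the exact failure semantics here are assumptions):

```python
import json
import re

def check_jsonld(html: str) -> dict:
    """Extract JSON-LD blocks, then confirm each parses as JSON
    and declares an @type. Returns one boolean per check."""
    blocks = re.findall(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.S
    )
    results = {"exists": bool(blocks), "valid_json": True, "has_type": True}
    for raw in blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            # Unparseable JSON also means we can't confirm @type.
            results["valid_json"] = results["has_type"] = False
            continue
        if "@type" not in data:
            results["has_type"] = False
    return results
```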

Answer Engine Optimization (AEO)

These checks evaluate how well your content is structured for AI-generated answers and featured snippets. They only run on pages with 300+ words of content.

| Check | What It Looks For | Weight |
| --- | --- | --- |
| Answer blocks | FAQ accordions, question headings, or definition lists | 5 |
| FAQ/HowTo schema | FAQPage, HowTo, or Question/Answer schema types in JSON-LD | 5 |
| Data tables | Structured comparison tables with headers and multiple rows | 3 |

Answer blocks matter because search engines and AI models extract direct answers from well-structured content. A page with a clear <h2>What is X?</h2> followed by a concise paragraph is far more likely to be quoted in an AI answer or featured snippet than a wall of unstructured text.

Dead Links & Broken Images (optional)

When link checking is enabled, SEOgent verifies every link and image on each page:

| Check | What It Looks For | Weight |
| --- | --- | --- |
| No dead links | Internal and external links return valid responses | 8 |
| No broken images | Images return 2xx status codes (not 4xx/5xx) | 7 |

The crawler makes HEAD requests to external URLs and cross-references internal links against all crawled pages. Results distinguish between internal dead links (pages on your site that 404) and external dead links (third-party URLs that no longer work).
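The internal/external split described above can be sketched as follows — internal links are cross-referenced against the crawled page set, while external links are queued for HEAD requests (a simplification: an internal link outside the crawl set isn't necessarily dead if the crawl was limited):

```python
from urllib.parse import urlparse

def classify_links(links: list, site_domain: str, crawled_paths: set):
    """Split a page's links into internal links not found among the
    crawled pages, and external links still needing a HEAD request."""
    internal_dead, external = [], []
    for link in links:
        parsed = urlparse(link)
        if parsed.netloc in ("", site_domain):
            # Relative or same-domain: check against crawled pages.
            if parsed.path not in crawled_paths:
                internal_dead.append(link)
        else:
            external.append(link)
    return internal_dead, external
```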


Scoring

Each check carries a weight between 2 and 10. The page score is calculated as:

score = (sum of passed check weights / total check weights) × 100

A page that passes all checks scores 100. High-weight checks like "has title" (10) and "has H1" (10) have more impact than low-weight checks like "has hreflang" (2).

Checks that fail as warnings (like title length being slightly off) still reduce the score, but they're presented separately from hard errors (like a missing title entirely) so you can prioritize fixes.
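The formula translates directly into code. A minimal sketch, where each check is a (passed, weight) pair:

```python
def page_score(checks: list) -> int:
    """score = passed weight / total weight, scaled to 100."""
    total = sum(weight for _, weight in checks)
    passed = sum(weight for ok, weight in checks if ok)
    return round(passed / total * 100) if total else 0
```

A page passing a 10-weight and a 5-weight check but failing another 5-weight check scores 15/20 × 100 = 75.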


Performance Scanning (optional)

When performance scanning is enabled, SEOgent measures Core Web Vitals and Lighthouse metrics for your pages.

What It Measures

| Metric | What It Is |
| --- | --- |
| LCP (Largest Contentful Paint) | How long until the main content is visible |
| FCP (First Contentful Paint) | Time to first rendered content |
| CLS (Cumulative Layout Shift) | Visual stability — how much the page shifts during load |
| INP (Interaction to Next Paint) | Responsiveness to user input |
| TTFB (Time to First Byte) | Server response time |
| TBT (Total Blocking Time) | How long the main thread is blocked |
| SI (Speed Index) | How quickly visible content fills the viewport |

Intelligent Page Grouping

Running Lighthouse on every page of a large site would be wasteful. Most pages built from the same template perform similarly — /products/blue-widget and /products/red-widget likely share the same layout, CSS, and JavaScript.

SEOgent uses pattern matching to group URLs by their template structure, then tests one representative page per group. Here's how it works:

| URL | Detected Pattern | Group |
| --- | --- | --- |
| / | / | Homepage |
| /products/blue-widget | /products/{slug} | Product page |
| /products/red-widget | /products/{slug} | Product page (same group) |
| /blog/2025/01/my-post | /blog/{year}/{month}/{slug} | Blog post |
| /blog/2025/03/other-post | /blog/{year}/{month}/{slug} | Blog post (same group) |
| /users/42 | /users/{id} | User profile |
| /users/187 | /users/{id} | User profile (same group) |

The pattern matcher recognizes:

  • Numeric IDs (/users/42 → /users/{id})
  • Date segments (/blog/2025/01/… → /blog/{year}/{month}/…)
  • Slug-like segments (/products/blue-widget → /products/{slug})

The result: a site with 500 pages might only need 8–12 performance tests. The homepage is always included, and groups are sorted alphabetically for consistency.
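The normalization step can be sketched with a few heuristics. SEOgent's exact rules aren't published, so the regexes below are assumptions chosen to reproduce the table above:

```python
import re

def url_pattern(path: str) -> str:
    """Collapse a URL path into a template pattern: 4-digit years
    become {year}, a short number following a year becomes {month},
    other numbers {id}, and trailing hyphenated segments {slug}."""
    parts = path.strip("/").split("/")
    out = []
    for i, seg in enumerate(parts):
        if re.fullmatch(r"(19|20)\d{2}", seg):
            out.append("{year}")
        elif seg.isdigit():
            # A 1-2 digit number right after a year reads as a month.
            is_month = out and out[-1] == "{year}" and len(seg) <= 2
            out.append("{month}" if is_month else "{id}")
        elif i == len(parts) - 1 and i > 0 and "-" in seg:
            out.append("{slug}")
        else:
            out.append(seg)
    return "/" + "/".join(out)
```

Pages whose paths normalize to the same pattern land in the same group, and only one representative per group gets a performance test.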


How Agents and Humans Use the Results

SEOgent's structured output is designed to be actionable for both human developers and AI coding agents.

For AI Agents (CLI & API)

The SEOgent CLI and REST API return machine-readable JSON that agents can parse and act on directly. A typical agent workflow:

  1. Scan — The agent starts a scan via seogent scan https://example.com or POST /api/scans
  2. Poll — It checks progress until the scan completes
  3. Triage — Results come back with every check categorized and scored. The agent filters to failed checks and warnings, sorted by weight
  4. Fix — The agent reads the relevant source files and applies fixes. For example:
    • Missing title? Add a <title> tag to the layout
    • Images without alt text? Add descriptive alt attributes
    • No JSON-LD? Generate a schema block based on page content
    • Broken links? Remove or update the href
  5. Verify — The agent runs a manual-mode scan on just the fixed URLs to confirm the issues are resolved

Every check result includes a key, category, weight, and status — so agents can programmatically decide what to fix first and how to verify the fix worked.
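The triage step (3) reduces to a filter and a sort. A sketch, assuming status values of "passed", "error", and "warning" (the actual field values come from the API):

```python
def triage(results: list) -> list:
    """Order failed checks for an agent to fix: errors before
    warnings, heavier weights first. Each result is a dict with
    at least key, status, and weight fields."""
    failed = [r for r in results if r["status"] != "passed"]
    return sorted(failed, key=lambda r: (r["status"] != "error", -r["weight"]))
```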

For Humans (Dashboard & Reports)

The web dashboard presents the same results grouped by the check categories described above — metadata, content quality, technical configuration, structured data, AEO, and (when enabled) dead links and performance — with errors and warnings separated so you can prioritize the highest-weight fixes first.
Addressing SEO, AEO, and GEO

SEOgent's checks map to three overlapping optimization strategies:

SEO (Search Engine Optimization) — The core checks. Titles, descriptions, headings, internal links, canonical URLs, structured data, and technical health. These directly affect how search engines crawl, index, and rank your pages.

AEO (Answer Engine Optimization) — The answer block and schema checks. FAQ accordions, question headings, definition lists, data tables, and FAQ/HowTo schema give AI models and featured-snippet algorithms structured content to extract answers from. A page that clearly answers "What is X?" with a concise paragraph under an H2 is far more likely to be cited.

GEO (Generative Engine Optimization) — The AI bot access check and structured data checks. If your robots.txt blocks ClaudeBot or GPTBot, your content is invisible to AI training and retrieval systems. JSON-LD structured data helps AI models understand what your page is about, not just what text it contains.

Webhook Notifications

For automated workflows, you can provide a webhook URL when starting a scan. SEOgent sends a POST request when the scan completes, including the scan ID, domain, average score, and a summary of results. This lets you trigger downstream actions — Slack notifications, CI pipeline steps, or agent-based fix workflows — without polling.
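A consumer of that webhook only needs to parse the JSON body and branch on the summary. A sketch — the field names (scan_id, domain, average_score) and the score threshold are assumptions, so check the API reference for the exact payload schema:

```python
import json

def handle_webhook(body: bytes) -> str:
    """Decide a follow-up action from a scan-completed payload.
    The 80-point threshold is an arbitrary example."""
    payload = json.loads(body)
    if payload.get("average_score", 100) < 80:
        return f"scan {payload['scan_id']}: {payload['domain']} needs fixes"
    return f"scan {payload['scan_id']}: {payload['domain']} looks healthy"
```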


Credit System

Scans are priced per page. Standard HTML analysis is the base cost, with optional add-ons for link checking and performance scanning. Performance scans use intelligent page grouping, so a 500-page site might only incur performance costs for 8–12 representative pages. See Pricing for current rates.