AEO Presence Report

Is your website in ChatGPT's training data?

Enter a URL. ACME.BOT scans CommonCrawl and FineWeb — the public datasets behind today's largest LLMs — and returns a per-URL report.

Scans CommonCrawl (2B+ URLs) and FineWeb (200M+ URLs) · Free preview, no signup required.

Sample report

What you get back

Every URL from your domain found in CommonCrawl and FineWeb, organized by dataset with crawl date and page depth.
stripe.com
48 of 120 URLs present
URL            CommonCrawl   FineWeb   Last seen
/                                      2026-03
/payments                              2026-03
/docs/api                              2026-02
/customers                             2026-01
/connect                               2025-12
/blog/launch
/pricing                               2026-03
/changelog                             2026-02
  • 1

    Per-URL breakdown

    Not just a domain-level verdict: every matched URL, so you can see exactly which pages made it in.

  • 2

    Dataset coverage

    The CommonCrawl and FineWeb columns show which dataset each URL appears in. FineWeb is the stronger training signal.

  • 3

    Freshness timestamps

    The month a URL was last crawled. Useful for spotting content that has dropped out of recent crawls.

Why it matters

Why training-data presence matters in 2026

What ChatGPT, Perplexity, and Gemini say about your brand starts with whether your pages are in their training data. Without that, models guess — and often get it wrong.

Upstream cause of AI answers

AI assistants first build their understanding of your business from training data, then layer real-time search on top. Gaps and biases in that training data still trickle down into answers.

Citations follow presence

Among the pages that Perplexity and ChatGPT Search cite, pages present in CommonCrawl and FineWeb show up disproportionately often. The first gap to close is absence.

Strategic diagnostic

Know where you stand and how much AEO content you need. Presence is the minimum; citations are the goal.

How it works

How does the scan work?

ACME.BOT searches FineWeb and CommonCrawl records, notes where and when each URL was crawled, and lists every match.
1

You submit a domain

The root URL or a full path. No login is required to run the free preview.

2

ACME.BOT queries the indexed datasets

We query two datasets: CommonCrawl, which holds nearly 2 billion URLs and is refreshed monthly, and FineWeb, an LLM-filtered subset of the CommonCrawl corpus with about 200 million URLs.
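
For context, the CommonCrawl index is itself publicly queryable, so you can spot-check a domain by hand. The Python sketch below hits one index snapshot via the public CDX endpoint; the crawl label and domain are placeholders, and it illustrates the public API rather than ACME.BOT's own pipeline.

```python
import json
import urllib.parse
import urllib.request

# Illustrative only: query one public CommonCrawl index snapshot for a domain.
# Pick any snapshot listed at https://index.commoncrawl.org/; this label is a placeholder.
CRAWL = "CC-MAIN-2024-33"
DOMAIN = "example.com"

query = urllib.parse.urlencode({
    "url": f"{DOMAIN}/*",  # every capture under the domain
    "output": "json",      # one JSON record per line
})
index_url = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"

# The endpoint returns 404 (raised as HTTPError) when the snapshot has no captures.
with urllib.request.urlopen(index_url) as resp:
    for line in resp:
        record = json.loads(line)
        # 'timestamp' is the crawl time (YYYYMMDDhhmmss), 'status' the HTTP status code.
        print(record["url"], record["timestamp"], record.get("status"))
```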

3

You get a per-URL report

Each match shows the dataset, the month it was last seen, and the URL. Export the results to CSV, or open them in ACME.BOT's Page Health View to track them over time.

Presence vs citations

Training-data presence vs. AI citations

Presence is what the model learned; citations are what it quotes today. Presence feeds citations — freshness and structure decide which passages get pulled.

What it is
  • Presence (this report): URLs found in training datasets
  • Citations (inside ACME.BOT): Passages quoted by ChatGPT, Perplexity, and Gemini today

What it tells you
  • Presence (this report): What the models learned
  • Citations (inside ACME.BOT): What the models actually repeat

How to improve it
  • Presence (this report): Get crawled (robots.txt, sitemap, fresh content, backlinks)
  • Citations (inside ACME.BOT): Publish AEO-structured content; refresh when citations drop

ACME.BOT tracks both — presence here, citations inside the product.

What to do next

What to do after you see the report

Can't find URLs in training data? Allow AI crawlers, publish new content regularly, and structure pages for easy AI extraction.

If most URLs are missing

Submit an updated sitemap, audit AI crawler access in robots.txt (GPTBot, CCBot, PerplexityBot), and get inbound links — CommonCrawl prioritizes link-discovered pages.
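
As a starting point, a robots.txt along these lines keeps the relevant crawlers unblocked. It is a minimal sketch that assumes you want these bots to see the whole site; the sitemap URL is a placeholder, and the rules should be adapted to your own access policy.

```
# Allow the crawlers behind CommonCrawl and the major AI assistants.
User-agent: CCBot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Point crawlers at the current sitemap (placeholder URL).
Sitemap: https://www.example.com/sitemap.xml
```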

If older URLs are present but new ones aren't

CommonCrawl does recrawl monthly, but new pages still need inbound links and promotion to be discovered. Keep publishing and keep promoting so fresh URLs get picked up.

If URLs are present but AI answers are still wrong

Merely being present may not be enough — freshness and structure decide what gets quoted. That's why AEO-optimized content matters.

If you want to track this over time

Each URL gets its own Page Health View showing its search ranking, training-data presence, and per-article AI mentions over time.

Dataset coverage

Which datasets does ACME.BOT scan?

Two public corpora: CommonCrawl (used by GPT-3, GPT-4, Llama, and most major LLMs) and FineWeb (the filtered subset used for modern frontier training).
2B+
CommonCrawl

The largest open web crawl, refreshed monthly. Used as a major training dataset for GPT-3, GPT-4, Claude, and Llama.

200M+
FineWeb

Hugging Face's quality-filtered CommonCrawl subset, built with the kinds of heuristics frontier labs apply internally. A solid proxy for what today's training runs ingest.
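
FineWeb is a public dataset on Hugging Face, so you can inspect it directly. The Python sketch below streams one monthly dump with the Hugging Face datasets library and prints any record whose URL contains a given domain; the dump name and domain are placeholders, and scanning a full dump this way is far slower than the precomputed index ACME.BOT queries.

```python
from datasets import load_dataset

# Illustrative only: stream one FineWeb dump and look for a domain's URLs.
DOMAIN = "example.com"  # placeholder

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="CC-MAIN-2024-10",  # one monthly dump; see the dataset card for the full list
    split="train",
    streaming=True,          # avoid downloading the whole dump up front
)

for record in fineweb:
    if DOMAIN in record["url"]:
        # Each record carries the source URL and the original crawl date.
        print(record["url"], record["date"])
```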

No absolute guarantees — but the strongest public signal

Presence in CommonCrawl or FineWeb doesn't guarantee inclusion in any specific LLM's training set — most frontier labs mix these public datasets with unpublished proprietary data. But CommonCrawl is the backbone of nearly every open-corpus training run, and FineWeb mirrors the filters frontier labs apply internally. Absence from both is almost always a red flag; presence across both is the strongest public evidence that a model had access to your content.

FAQ

Frequently asked questions

Is my website in ChatGPT's training data?
What's the difference between CommonCrawl and FineWeb?
How often are these datasets updated?
Can I remove my content from existing training datasets?
How do I get my site included in future crawls?
Why are some of my pages present but not others?
Does training-data presence guarantee AI citations?
How is this different from Profound or Indexly?
Is this free?
How accurate is the scan?

See what AI learned from your site.
Free preview. Full report inside ACME.BOT.

One agent runs the AEO Presence Report, then turns the gaps into a content plan.