Is your website in ChatGPT's training data?
Scans CommonCrawl (2B+ URLs) and FineWeb (200M+ URLs) · Free preview, no signup required.
What you get back
| URL | CommonCrawl | FineWeb | Last seen |
|---|---|---|---|
| / | 2026-03 | — | 2026-03 |
| /payments | 2026-03 | — | 2026-03 |
| /docs/api | 2026-02 | — | 2026-02 |
| /customers | — | 2026-01 | 2026-01 |
| /connect | 2025-12 | — | 2025-12 |
| /blog/launch | — | — | — |
| /pricing | 2026-03 | — | 2026-03 |
| /changelog | — | 2026-02 | 2026-02 |
1. Per-URL breakdown
Not just a domain-level verdict: every matched URL, so you can see exactly which pages made it in.
2. Dataset coverage
The CommonCrawl and FineWeb columns show which training dataset each URL appears in. FineWeb is the strongest training signal.
3. Freshness timestamps
The month a URL was last crawled. Useful for spotting content that has dropped out of the crawl.
Why training-data presence matters in 2026
Upstream cause of AI answers
AI assistants form their baseline understanding of your business from training data, then layer real-time search on top. Biases baked in during training still shape the final answers.
Citations follow presence
Among the pages that Perplexity and ChatGPT Search cite, those present in CommonCrawl and FineWeb appear disproportionately often. Absence is the first gap to close.
Strategic diagnostic
Know where you stand and how much AEO content you need. Presence is the floor; citations are the goal.
How does the scan work?
You submit a domain
A root URL or a full path. No login is required to view the results.
ACME.BOT queries the indexed datasets
We query two datasets: CommonCrawl, which contains nearly 2 billion URLs and is refreshed monthly, and FineWeb, an LLM-filtered subset of the CommonCrawl corpus containing about 200 million URLs.
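As an illustration, presence in CommonCrawl can be checked against its public CDX index API. The crawl ID below (`CC-MAIN-2026-10`) is a hypothetical placeholder, and the helper names are ours, not ACME.BOT's internals:

```python
import json
from urllib.parse import urlencode

# Hypothetical crawl ID -- real ones are listed at index.commoncrawl.org.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2026-10-index"

def build_cdx_query(url_pattern: str, limit: int = 50) -> str:
    """Build a CDX index query URL for a domain or path prefix."""
    params = {"url": url_pattern, "output": "json", "limit": str(limit)}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def last_seen_month(cdx_json_line: str) -> str:
    """Turn a CDX record's 14-digit timestamp into a YYYY-MM month."""
    ts = json.loads(cdx_json_line)["timestamp"]  # e.g. "20260315093012"
    return f"{ts[:4]}-{ts[4:6]}"
```

Fetching `build_cdx_query("example.com/*")` returns one JSON record per capture; the most recent timestamp yields the "Last seen" month.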
You get a per-URL report
For each URL you see the dataset, the date it was last seen, and the matched URL itself. You can export the data to CSV, or open it in ACME.BOT's Page Health View and track it over time.
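A minimal sketch of what that CSV export could look like, assuming one row per URL with the four report columns (the field names are our own, not ACME.BOT's export schema):

```python
import csv
import io

def report_to_csv(rows):
    """Serialize report rows (dicts) to CSV text; column names assumed."""
    fields = ["url", "commoncrawl", "fineweb", "last_seen"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Example: one row from the sample report above.
report_to_csv([{"url": "/pricing", "commoncrawl": "2026-03",
                "fineweb": "-", "last_seen": "2026-03"}])
```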
Training-data presence vs. AI citations
| | Presence (this report) | Citations (inside ACME.BOT) |
|---|---|---|
| What it is | URLs found in training datasets | Passages quoted by ChatGPT, Perplexity, and Gemini today |
| What it tells you | What the models learned | What the models actually repeat |
| How to improve it | Get crawled: robots.txt, sitemap, fresh content, backlinks | Publish AEO-structured content; refresh when citations drop |
ACME.BOT tracks both — presence here, citations inside the product.
What to do after you see the report
If most URLs are missing
Submit an updated sitemap, audit AI crawler access in robots.txt (GPTBot, CCBot, PerplexityBot), and get inbound links — CommonCrawl prioritizes link-discovered pages.
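A quick way to audit crawler access locally is Python's standard-library robots.txt parser. The user-agent strings below are the real crawler names; the helper itself is just a sketch:

```python
from urllib import robotparser

AI_CRAWLERS = ("GPTBot", "CCBot", "PerplexityBot")

def audit_robots(robots_txt: str, domain: str, paths=("/",)):
    """Report which AI crawlers may fetch which paths under robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: {path: parser.can_fetch(bot, f"https://{domain}{path}")
                  for path in paths}
            for bot in AI_CRAWLERS}
```

For example, a file with `User-agent: GPTBot` / `Disallow: /` blocks GPTBot everywhere while leaving CCBot and PerplexityBot unaffected, which this audit would surface immediately.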
If older URLs are present but new ones aren't
CommonCrawl does run monthly, but new content still needs inbound links and promotion to be discovered. Keep publishing and keep promoting your newer pages.
If URLs are present but AI answers are still wrong
Merely being present may not be enough — freshness and structure decide what gets quoted. That's why AEO-optimized content matters.
If you want to track this over time
Each URL gets its own Page Health View showing its search ranking, training-data presence, and AI mentions, article by article.
Which datasets does ACME.BOT scan?
CommonCrawl
The largest open web crawl, refreshed monthly. A major training dataset for GPT-3, GPT-4, Claude, and Llama.
FineWeb
HuggingFace's quality-filtered CommonCrawl subset, built with heuristics similar to those frontier labs use. A strong proxy for today's training runs.
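For readers who want to inspect FineWeb directly, it is streamable from the Hugging Face Hub as `HuggingFaceFW/fineweb` via the `datasets` library. The small helper below filters any iterable of FineWeb-style records, each carrying a `url` field, down to one domain; the streaming call in the docstring assumes network access and an installed `datasets` package:

```python
from urllib.parse import urlparse

def urls_for_domain(records, domain):
    """Yield record URLs whose host is `domain` or a subdomain of it.

    `records` is any iterable of dicts with a 'url' field, e.g. rows
    streamed with datasets.load_dataset("HuggingFaceFW/fineweb",
    streaming=True).
    """
    for record in records:
        host = urlparse(record["url"]).netloc
        if host == domain or host.endswith("." + domain):
            yield record["url"]
```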