June 8, 2026 · 8 min

Which sources does ChatGPT actually cite (the data)

ChatGPT, Perplexity and Google AI do not cite at random. What the studies say: Bing, earned media, schema. With the figures.

AEO · ChatGPT · Perplexity · earned media · AI citations


AI answer engines (ChatGPT, Perplexity, Google AI Overviews) cite, in the vast majority of cases, authoritative third-party sources (press, directories, comparison sites, communities, encyclopedias) rather than the website of the brand concerned. The figures converge: the Seer Interactive study (2024) shows that over 87% of SearchGPT and ChatGPT citations correspond to the Bing top 10, and Peec AI data (March 2026) indicates that about 85% of brand mentions come from third-party pages, not the brand's own domain. The implication is direct: positioning well for AI citation requires as much off-site work (external authority) as on-site work (format and markup). Any approach that relies only on your own site quickly plateaus. Here is what the data says, source by source, followed by the honest nuance that prevents over-interpretation.

What ChatGPT goes looking for: the Bing index

ChatGPT does not read the web live. To answer questions about current topics, it relies on a search layer, and that layer is built on the Bing index. The Seer Interactive study, published in 2024 and based on an analysis of hundreds of queries, measured that over 87% of the citations returned by SearchGPT and ChatGPT correspond to pages present in the Bing top 10 results. AI citation and Bing ranking are therefore strongly correlated.

The operational reading is simple. If you are not in the top ten Bing results for your customers' queries, your probability of being named by ChatGPT drops sharply. Bing ranking in turn depends on your perceived authority, and therefore on the links and mentions other sites give you. It always comes back to the same point: external authority precedes citation. Working on your Bing ranking is not a technical detail; it is a condition of entry.
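To check this correlation on your own queries, one option is to compare the URLs an engine cited with the first ten Bing results for the same query. The sketch below is only an illustration: it assumes both lists were collected by hand or with a tool of your choice, the URLs are invented, and the function names are hypothetical. It does not call any search or ChatGPT API.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def top10_overlap(cited_urls: list[str], bing_results: list[str]) -> float:
    """Share of cited URLs that also appear in the first ten Bing results."""
    if not cited_urls:
        return 0.0
    top10 = {normalize(u) for u in bing_results[:10]}
    hits = sum(1 for u in cited_urls if normalize(u) in top10)
    return hits / len(cited_urls)


# Hypothetical sample for a single query (invented URLs).
cited = [
    "https://www.example-review.com/best-crm/",
    "https://example-forum.com/t/crm-advice",
    "https://brand.example/pricing",
]
bing = [
    "https://example-review.com/best-crm",
    "https://example-forum.com/t/crm-advice",
    "https://another.example/crm",
]

share = top10_overlap(cited, bing)
print(f"{share:.0%} of citations are in the Bing top 10")
```

A single query says little. The published figure is an aggregate over hundreds of queries, so a comparison of this kind only becomes meaningful once repeated across a representative set of your customers' queries.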

The earned media bias, measured sector by sector

A study published on arXiv in 2025 (reference 2509.08919) quantified the origin of AI answer engine citations by sector, and the result is striking: between 73% and 92% of cited sources are earned media, that is, third-party content (press, articles by other players, directories, comparison sites) rather than the brand's site. The breakdown by sector is telling: about 92% in consumer electronics, about 82% in automotive, about 73% in software.

The same work notes a useful contrast: Google remains more balanced, with about 33% of citations to brand content and about 54% to earned media. In other words, the purely generative engines lean even more toward third-party sources than classic Google search. For a brand, this means that betting solely on its site amounts to fighting for the narrowest slice of the pie. Most citations are decided on pages you do not own, but which you can influence by triggering credible mentions.

The most cited domains: communities and established media

Peec AI data from March 2026, built on the analysis of about 30 million sources, draws a clear map of the domains most cited by AI engines. At the top: Reddit, YouTube, LinkedIn, Wikipedia and Forbes. These are community platforms and established media, not company sites. The same dataset confirms that about 85% of brand mentions come from third-party pages, and not from the brand's own domain.

Even more striking: about 48% of citations come from community platforms (Reddit, forums, discussion spaces). Nearly one citation in two is therefore decided where real opinions are exchanged, not in polished product pages. This does not mean you should spam Reddit, which would be counterproductive and often penalised. It means that your presence in your market's conversations, honest and useful, weighs on what the models consider a reliable source. AI citation rewards the trace you leave in the ecosystem, not only the storefront you control.
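To see how a sample of citations about your own brand splits between your domain, community platforms and other third-party pages, a simple domain-based tally is often enough for a first look. The sketch below is an assumption-laden illustration: the domain sets, the category names and the sample URLs are all hypothetical, and a real classification (the one used in the studies cited here, for instance) may draw the boundaries differently.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical domain lists: adapt them to your own brand and market.
BRAND_DOMAINS = {"brand.example"}
COMMUNITY_DOMAINS = {"reddit.com", "youtube.com", "linkedin.com"}


def host_of(url: str) -> str:
    """Lower-cased host name without a leading 'www.'."""
    return urlsplit(url).netloc.lower().removeprefix("www.")


def matches(host: str, domains: set[str]) -> bool:
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url: str) -> str:
    host = host_of(url)
    if matches(host, BRAND_DOMAINS):
        return "brand"
    if matches(host, COMMUNITY_DOMAINS):
        return "community"
    return "other third party"


def breakdown(urls: list[str]) -> dict[str, float]:
    """Share of each category among the cited URLs."""
    counts = Counter(classify(u) for u in urls)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


# Invented sample of cited URLs.
sample = [
    "https://brand.example/pricing",
    "https://old.reddit.com/r/crm/comments/abc",
    "https://www.youtube.com/watch?v=xyz",
    "https://news.example/crm-review",
]
shares = breakdown(sample)
for category, value in shares.items():
    print(f"{category}: {value:.0%}")
```

Matching on the registered domain and its subdomains keeps variants such as old.reddit.com in the right bucket without maintaining a list of every host.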

Perplexity: schema matters, backlinks much less

The case of Perplexity adds a valuable technical nuance. Available analyses show that the presence of schema.org JSON-LD markup is associated with a markedly higher top 3 citation rate. Giving the engine an explicit, structured reading of your questions and answers increases its confidence and its ability to reuse you in the generated answer.
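As an illustration of what an explicit, structured reading of questions and answers can look like, the sketch below builds a schema.org FAQPage object and wraps it in a JSON-LD script tag. FAQPage, Question, Answer, mainEntity and acceptedAnswer are existing schema.org terms, but choosing FAQPage here is an assumption on our part: the analyses mentioned above speak of schema.org JSON-LD markup in general, and nothing in them guarantees that this particular type leads to a citation. The function names and the sample question are hypothetical.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def to_script_tag(data: dict) -> str:
    """Serialize the object as a JSON-LD script tag for the page's HTML."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "<" so answer text can never close the script element early.
    payload = payload.replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical question and answer, written answer-first.
faq = faq_jsonld([
    (
        "Is my own site enough to be cited?",
        "No, in the vast majority of cases. It must be combined with external authority.",
    ),
])
tag = to_script_tag(faq)
print(tag)
```

The markup should mirror questions and answers that are actually visible on the page; it describes existing content rather than replacing it.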

Conversely, backlinks hardly predict citations on Perplexity: the majority of pages actually cited have few referring domains. This finding surprises those who reason in classic SEO logic, where the link profile remains a central signal. On Perplexity, the engine seems to favour content relevance and structure over popularity measured in links. The lesson is not that links are useless elsewhere (they feed the Bing authority on which ChatGPT depends), but that the levers differ depending on the target engine. You do not play Perplexity the way you play ChatGPT. We detail the mechanisms specific to this engine in our guide to getting cited by Perplexity.

The honest nuance: bottom-of-funnel changes things

It would be dishonest to stop at "it all happens on Reddit and Wikipedia". Search Engine Land has documented an important nuance: on bottom-of-funnel queries, of the "best software for X" or "which solution for Y" type, the weight of Reddit and Wikipedia drops markedly. On these purchase-intent queries, engines rely more on specialised niche publications and on the depth of market players' own content.

This rebalances the picture. Your site is not useless, far from it: on queries where a prospect compares precise solutions, dense, structured, expert content of your own can be cited directly. But this does not contradict the general finding. On-site content is necessary but not sufficient. You need both: a site that answers end-of-journey questions in depth, and an external authority that makes you exist on the broader queries where third-party sources dominate. Understanding why your site alone is not enough is the starting point: we explain it in our article on why your business does not appear in ChatGPT.

What the figures impose as strategy

Lever	What the data says	Expected effect
On-site answer-first format	Number 1 citation factor on Perplexity; depth is key on bottom-of-funnel queries	Direct extraction by the model
JSON-LD schema	Correlated with a better top 3 citation rate (Perplexity)	More reliable machine reading
External authority / earned media	73% to 92% of citations depending on sector (arXiv 2509.08919)	Presence on broad queries
Community presence	About 48% of citations on community platforms (Peec AI)	Source judged reliable by models
Backlinks	Hardly predictive on Perplexity	Indirect effect via Bing

The winning strategy combines two fronts. On your site: an answer-first format (direct answer first, development afterwards), clean schema markup, and dense, substantive content on your market's real questions. Off your site: external authority work (press, sector lists and comparison sites) and an honest presence in communities. Neither front is enough on its own.

We sell no citation guarantee. No one outside the model publishers controls the final output, and the figures above are averages that vary by sector, engine and query. What we work on is a probability of being cited, by methodically laying down the signals the data designates as decisive, then measuring it every month with dated screenshots. Our AI visibility diagnosis measures where you stand on these signals, for free.
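A monthly measurement of this kind comes down to simple arithmetic: for each month, the number of tracked queries on which the brand was cited divided by the number of queries checked. The sketch below assumes a hypothetical log format of (month, query, cited) entries filled in by hand from dated screenshots; the queries and values are invented.

```python
from collections import defaultdict


def monthly_citation_rate(observations):
    """Return {month: share of checked queries on which the brand was cited}."""
    totals = defaultdict(lambda: [0, 0])  # month -> [cited, checked]
    for month, _query, cited in observations:
        totals[month][1] += 1
        if cited:
            totals[month][0] += 1
    return {month: c / n for month, (c, n) in sorted(totals.items())}


# Invented log: one entry per query checked in a given month.
log = [
    ("2026-05", "best crm for small business", False),
    ("2026-05", "crm pricing comparison", True),
    ("2026-06", "best crm for small business", True),
    ("2026-06", "crm pricing comparison", True),
]

rates = monthly_citation_rate(log)
for month, rate in rates.items():
    print(f"{month}: cited on {rate:.0%} of tracked queries")
```

Keeping the same set of queries from one month to the next is what makes the rates comparable; with only a handful of queries, month-to-month differences should be read with caution.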

Frequently asked questions

Does ChatGPT invent its sources?

ChatGPT can produce inaccurate or invented citations (hallucinations), but its active citations in search mode rely mostly on real pages from the Bing index: the Seer Interactive study (2024) measures that over 87% of these citations correspond to the Bing top 10. The risk of invention mainly concerns answers generated without an active search layer. When the engine cites a URL in search mode, it generally points to an existing source, even if it still needs to be verified.

Do backlinks count for AI citation?

Not directly, or much less than in classic SEO. On Perplexity, backlinks hardly predict citations, and the majority of cited pages have few referring domains. Links keep an indirect role: they feed the perceived authority that drives the Bing ranking, from which ChatGPT draws over 87% of its citations according to Seer Interactive (2024). Do not bet your AI citation strategy on link acquisition alone: it is an indirect lever, not the central factor.

Should you be present on Reddit and Wikipedia?

It is useful, without being an absolute obligation. Peec AI data (March 2026, about 30 million sources) places Reddit and Wikipedia among the most cited domains, and about 48% of citations come from community platforms. An honest and useful presence in these spaces strengthens your trace in the ecosystem. Be careful: on bottom-of-funnel queries ("best software for X"), the weight of Reddit and Wikipedia drops in favour of niche publications, according to Search Engine Land. Adapt to your target queries, and never spam these platforms.

Does JSON-LD schema really change anything?

Yes on Perplexity, where the presence of schema.org JSON-LD markup is associated with a markedly higher top 3 citation rate. Schema gives the engine an explicit reading of your questions and answers, which makes its reuse of your content more reliable. But schema does not create authority: it makes content readable, and that content must already be good and corroborated elsewhere. It is a necessary signal, not a sufficient one on its own.

Is my own site enough to be cited?

No, in the vast majority of cases. The data converges: 73% to 92% of citations, depending on sector, are earned media (arXiv 2509.08919, 2025), and about 85% of brand mentions come from third-party pages, not the brand's own domain (Peec AI, March 2026). Your site remains necessary, especially on bottom-of-funnel queries where the depth of your own content weighs, but it is not sufficient. You must combine it with external authority work to exist across all queries.

François Kerjean · NovAI