How is AI share of voice different from SEO rank tracking?

Rank tracking measures where a URL sits in a list of ten blue links. AI engines return one synthesized answer with embedded citations whose position and existence change run to run. A brand ranking #1 in classic search can be absent from the answer a user actually reads, so SOV measures that second layer instead.

Can you measure Google AI Overviews with an API?

No. Google AI Overviews has no public API as of mid-2026. The only way to sample it is real-browser automation that triggers live Google queries and parses the result, which is slow and pushes Google's terms of service. ChatGPT, Perplexity Sonar, Gemini, and Claude all expose programmatic citation paths; AI Overviews does not.

How many prompts and runs do you need for a reliable measurement?

Most teams run 100 to 300 stratified prompts, three times per engine per cycle, in a single batched time window. A single reading is a sample of size one because LLM answers are non-deterministic. Report the median plus an interquartile range, and expect roughly plus or minus 5 to 7 points of error at a 5% true SOV with 100 prompts.

AI Share of Voice: How to Measure It Across AI Engines

Q: What is AI share of voice?

AI share of voice (AI SOV) is the weighted, prompt-stratified share of a brand's citations and mentions inside AI answer engines like ChatGPT, Perplexity, and Google AI Overviews. It weights each appearance by citation position and prompt importance. It is a visibility metric, not a rank, sentiment, or revenue metric.

SparkToro's 2024 zero-click study found that only 374 of every 1,000 U.S. Google searches end in a click to the open web. Pew Research reports Google users click less often when an AI summary appears. And Gartner predicts search engine volume will drop 25% by 2026 as chatbots absorb queries.

So your rank tracker can show you at position #1 while the paragraph a buyer actually reads never names you. AI share of voice is how you measure that second, increasingly important layer of visibility.

This guide is the practitioner version: the formula, the prompt-set design, the per-engine citation tracking, and the statistical discipline that keeps the numbers defensible.

What is AI share of voice?

AI share of voice (AI SOV), also called Share of Model, is the weighted, prompt-stratified share of a brand's citations and mentions inside AI answer engines, with position decay applied. It is a visibility metric for a world where the synthesized answer is the surface, not a ranked list of links.

It is not a rank tracker, not a sentiment score, and not a revenue metric. It answers one question: when a user asks an engine about your category, how often does the answer mention or cite you, and how prominently?

TL;DR. Classic rank tracking watches positions 1 to 10 for a keyword. AI engines return one synthesized answer with embedded citations that shift every run. To measure visibility now, you score citations and mentions across a stratified prompt set, weight them by position, sample multiple runs per engine, and report the trend, not a single snapshot.

Key takeaways

A single SOV reading is a sample of size one. LLM answers are non-deterministic; run each prompt at least 3 times per engine and report the median with an interquartile range.
Score both citation SOV (linked URLs) and combined SOV (citations plus body mentions). They answer different business questions.
The prompt set is the most consequential design choice. Stratify into branded, category, comparison, problem-led, and long-tail, with at least 30 prompts per stratum.
Google AI Overviews has no public API. Every other major engine does. Plan your tooling around that asymmetry.
The highest-ROI output is a discovery-gap audit: prompts where a competitor is cited and you are not, paired with the top-cited source to replicate.

Why does classic rank tracking break for AI answers?

Rank tracking breaks along five axes, and none of them are bugs in the old tools. They are mismatches between what those tools measure and what an AI answer is.

Synthesis, not ranking. A SERP returns ten ordered URLs. A ChatGPT or Gemini answer compresses retrieved sources into one paragraph that may cite three URLs, none of which would have appeared in the classic top 10. A tracker watching positions 1 to 10 is watching an irrelevant surface.

Non-determinism. The same prompt asked twice can return materially different answers. The drivers are retrieval-index freshness, decoding temperature, per-account personalization, and model version. The academic Benchmarking Large Language Model Volatility paper documented that production-grade LLM outputs vary substantially across runs even with greedy decoding, enough to reorder a benchmark.

Citation as foreign key. A SERP row has one URL. An answer can carry zero to a dozen citations, rendered as inline footnotes (Perplexity, Claude), numbered superscripts (ChatGPT web search), or right-rail chips (AI Overviews). Each must be resolved from URL to domain to brand before it can be counted.

Fluid position. The first citation in an answer body is the most visible; a sidebar link may not be visible at all. Position becomes a visibility-weighted ordinal, not a fixed slot.

Engine-by-engine rendering. "Be cited for CRM recommendations" becomes five separate measurement problems because each engine retrieves, ranks, and renders citations differently.

How do you calculate AI share of voice?

You sum a brand's citation appearances across the prompt set, weight each by position decay and prompt importance, then normalize so every brand's share is comparable. The dominant mid-2026 formula, which surfaces in Profound's citation-pattern research and the Princeton Generative Engine Optimization paper, reduces to this:

SOV(brand) = avg over prompts of [ avg over engines of
             ( sum over citations k of  c(k, brand) / r_k ) ]

Here c(k, brand) is 1 if the k-th citation is your brand and 0 otherwise, and r_k is the position-decay divisor, so each citation is weighted by 1/r_k. Two curves are common. With r_k = k (the 1/k curve), the first citation carries 10x the value of the tenth. With r_k = log(1+k) (the 1/log(1+k) curve), it carries about 3.5x, since log(11)/log(2) ≈ 3.46. For executive reporting, the log curve is more defensible because it does not collapse tail signals to noise.

Then normalize: divide your brand's weighted sum by the total weighted signal across every brand that appears, so the shares sum to 100%. A 2026 critique of AI-visibility dashboards found that several tools report shares that do not sum to 100%, which means they are scoring against a closed pool and hiding the missing 70% to 90% of each prompt's signal.
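The scoring and normalization steps can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: it assumes every prompt and engine carries equal importance (so the averages in the formula reduce to a constant that cancels on normalization), and the `answers` data and brand names are made up.

```python
import math
from collections import defaultdict

def log_decay(k: int) -> float:
    """Position divisor r_k for the 1/log(1+k) curve (k is 1-indexed)."""
    return math.log(1 + k)

def normalized_sov(answers, r=log_decay):
    """Compute each brand's share of the total weighted citation signal.

    answers: one list per (prompt, engine) response, holding the cited
    brands in citation order. Shares across all brands sum to 1.0.
    """
    totals = defaultdict(float)
    for cited_brands in answers:
        for k, brand in enumerate(cited_brands, start=1):
            totals[brand] += 1.0 / r(k)  # c(k, brand) / r_k
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {brand: weight / grand_total for brand, weight in totals.items()}

# Hypothetical responses: each inner list is one answer's citations in order.
answers = [
    ["hubspot", "acme", "salesforce"],
    ["salesforce", "hubspot"],
    ["acme"],
    [],  # an answer with no citations contributes nothing
]
shares = normalized_sov(answers)
for brand, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{brand}: {share:.1%}")
```

Passing `r=lambda k: k` switches to the 1/k curve. Because the denominator is every brand that appears, not a fixed competitor list, the shares sum to 100% by construction.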

A worked example

Take a 100-prompt set run against ChatGPT and Perplexity. Your brand is cited in 56 responses at positions 1 through 5, with counts of 18, 14, 11, 9, and 4, and mentioned without a link in 8 more.

Using 1/log(1+k) weights with the natural log, the position-1 responses contribute 18 × (1 / ln 2) ≈ 25.97 units, position 2 contributes 14 × (1 / ln 3) ≈ 12.74, and so on down to 4 × (1 / ln 6) ≈ 2.23 at position 5. The five contributions sum to about 54.47 across both engines, so the brand's weighted sum lands near 27.2 per engine.

If the total weighted signal across all 12 brands that ever appear is ~430, your citation SOV ≈ 6.3%. Fold in the 8 mention-only prompts and combined SOV ≈ 7.0%.

Report both numbers; citation-only hides mention wins, and combined over-weights unlinked noise.
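The citation arithmetic in this example can be checked with a short script. It uses the per-position counts and the ~430 total from the text (the counts sum to 56 cited responses) and assumes the natural log, which is what the ln 2 term implies. The combined figure is left out because the text does not state how mention-only appearances are weighted.

```python
import math

# Cited responses per citation position, taken from the worked example.
counts_by_position = {1: 18, 2: 14, 3: 11, 4: 9, 5: 4}

# Each response at position k contributes 1 / ln(1 + k).
contributions = {
    k: n / math.log(1 + k) for k, n in counts_by_position.items()
}
weighted_sum = sum(contributions.values())  # pooled over both engines
per_engine = weighted_sum / 2               # two engines in the example

total_signal = 430.0  # total weighted signal across all brands, per the text
citation_sov = per_engine / total_signal

for k, value in contributions.items():
    print(f"position {k}: {value:.2f}")
print(f"weighted sum: {weighted_sum:.2f}, per engine: {per_engine:.2f}")
print(f"citation SOV: {citation_sov:.1%}")
```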

How do you build a representative prompt set?

The prompt set determines whether your SOV reflects real buyer behavior or your own imagination. Stratify it into five categories, with at least 30 prompts each and 100 to 300 total.

Stratum	Example	What it tests
Branded	"Is Acme CRM any good?"	Does the engine know you exist?
Category	"Best CRM for a 50-person SaaS company?"	Are you in the recommendation set?
Comparison	"Acme vs HubSpot vs Salesforce"	Competitive share in head-to-heads
Problem-led	"How do I keep my pipeline clean?"	Are you tied to the job-to-be-done?
Long-tail	Paraphrased real user queries	Coverage beyond head terms

Source prompts from first-party telemetry first: site-search logs, Google Search Console "People Also Ask" data, sales-call and support-ticket text. These have the highest construct validity. Use engine-suggested completions and vendor playbooks as seeds, not truth.

And avoid head-term flooding. "CRM software", "best CRM", and "top CRM software" are three prompts to you but one intent to the model. Variance across near-duplicates is noise. Lock the set in a versioned file so month-over-month numbers stay comparable.
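A small audit script can enforce both rules before the set is locked. The sketch below is one possible approach: the filler-word list and the word-set intent key are crude assumptions for illustration, and a real pipeline would likely use embeddings or manual review to catch paraphrases.

```python
import re
from collections import Counter

STRATA = {"branded", "category", "comparison", "problem-led", "long-tail"}
MIN_PER_STRATUM = 30

# Hypothetical filler words that do not change the intent of a query.
FILLER = {"the", "a", "an", "for", "top", "best", "software"}

def intent_key(prompt: str) -> frozenset:
    """Crude intent key: the lowercase word set minus filler words."""
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    return frozenset(w for w in words if w not in FILLER)

def audit_prompt_set(prompts):
    """prompts: list of (stratum, text) pairs.

    Returns (thin_strata, near_duplicates), where thin_strata lists strata
    below the minimum count and near_duplicates pairs each repeated intent
    with the first prompt that expressed it.
    """
    counts = Counter(stratum for stratum, _ in prompts)
    thin = sorted(s for s in STRATA if counts[s] < MIN_PER_STRATUM)
    first_seen = {}
    duplicates = []
    for _, text in prompts:
        key = intent_key(text)
        if key in first_seen:
            duplicates.append((first_seen[key], text))
        else:
            first_seen[key] = text
    return thin, duplicates

sample = [
    ("category", "CRM software"),
    ("category", "best CRM"),
    ("category", "top CRM software"),
]
thin_strata, near_duplicates = audit_prompt_set(sample)
print("thin strata:", thin_strata)
print("near duplicates:", near_duplicates)
```

Running the audit in CI against the versioned prompt file is one way to keep the set from drifting between cycles.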

How do you track citations per engine?

Each engine has its own citation surface, API, and terms-of-service posture. Four of the five major engines expose a programmatic path; AI Overviews does not.

Engine	Programmatic path	Where citations live	ToS note
ChatGPT (web search)	OpenAI Responses API, `web_search` tool	`url_citation` annotations in the output array	Consumer scraping restricted; API is the supported path
Perplexity	Sonar API	`citations` (inline) and `search_results` arrays	Sonar Pro recommended for production sampling
Google Gemini	Gemini API, `google_search` tool	`groundingMetadata.groundingChunks[*].web.uri`	API permitted for measurement, rate-limited
Claude	Anthropic Messages API, `web_search_20250305`	`content[].citations[]`, 5 location types	Citations API went GA Jan 2025
Google AI Overviews	None	Right-rail chips, real-browser DOM only	Automated query sampling prohibited at scale

Anthropic's Citations API reported a 15%+ recall improvement over pre-citation RAG when it shipped, per Simon Willison's analysis. For AI Overviews (AIO), the only sampling method is real-browser automation (Playwright plus a residential proxy, or a vendor that does it for you), which is why GEO methodology pages routinely flag AIO as the most volatile surface in their reports.

Peec tracks an 86.7% AIO trigger rate on informational queries, so AI Overviews fire often but not always.
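Whatever the engine, the parsing step ends the same way: pull the cited URLs out of the response and resolve each one to a brand while preserving citation order. The sketch below does this for a Gemini-style payload using the field path from the table. The payload shape is an assumption (real responses typically nest this metadata under a candidate, and the URIs may be redirect links that need resolving first), and `DOMAIN_TO_BRAND` is a hypothetical mapping you would maintain yourself.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-brand map; maintain one per tracked competitor set.
DOMAIN_TO_BRAND = {
    "acme.com": "Acme",
    "hubspot.com": "HubSpot",
    "salesforce.com": "Salesforce",
}

def gemini_uris(response: dict) -> list:
    """Collect URIs along groundingMetadata.groundingChunks[*].web.uri."""
    chunks = response.get("groundingMetadata", {}).get("groundingChunks", [])
    return [c["web"]["uri"] for c in chunks if "uri" in c.get("web", {})]

def brand_for(url: str):
    """Resolve URL -> hostname -> brand; returns None for untracked domains."""
    host = (urlparse(url).hostname or "").lower()
    for domain, brand in DOMAIN_TO_BRAND.items():
        # Match the bare domain and any subdomain (www., blog., ...).
        if host == domain or host.endswith("." + domain):
            return brand
    return None

def cited_brands(urls):
    """Brands in citation order, so list index + 1 is the citation position."""
    return [brand_for(u) for u in urls]

# Made-up payload shaped like the path in the table above.
payload = {
    "groundingMetadata": {
        "groundingChunks": [
            {"web": {"uri": "https://www.hubspot.com/products/crm"}},
            {"web": {"uri": "https://blog.acme.com/crm-guide"}},
            {"retrievedContext": {}},  # non-web chunk, skipped
            {"web": {"uri": "https://example.org/review"}},
        ]
    }
}
brands = cited_brands(gemini_uris(payload))
print(brands)
```

Keeping `None` entries for untracked domains preserves the true position of every tracked citation, which the position-decay weights depend on.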

What is a discovery-gap audit, and why does it matter most?

A discovery-gap audit finds every (prompt, engine) pair where the engine cites at least one competitor you track and your brand is neither cited nor mentioned. It is the single output that converts a measurement into an action.

Score each gap by priority: prompt weight times the inverse rank of the competitors cited, gated on your brand's absence. Prompts that matter, and where competitors dominate, rise to the top of the content backlog. For each gap, capture the top-cited source URL. That page is what your content team studies and matches.
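One way to turn that scoring rule into code is shown below. It is a sketch under stated assumptions: "inverse rank of the competitors cited" is read here as the sum of 1/rank over tracked competitors in the answer, and the `Answer` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    prompt: str
    engine: str
    weight: float    # prompt importance
    cited: list      # brands in citation order (position 1 first)
    mentioned: set   # brands named in the body without a link
    top_source: str  # URL of the first citation, for the content team

def gap_priority(answer: Answer, brand: str, competitors: set) -> float:
    """Prompt weight x summed inverse ranks of cited competitors.

    Gated on absence: returns 0.0 if the brand is cited or mentioned.
    """
    if brand in answer.cited or brand in answer.mentioned:
        return 0.0
    inverse_ranks = sum(
        1.0 / rank
        for rank, cited_brand in enumerate(answer.cited, start=1)
        if cited_brand in competitors
    )
    return answer.weight * inverse_ranks

def discovery_gaps(answers, brand, competitors):
    """Return (priority, answer) pairs for real gaps, highest priority first."""
    scored = [(gap_priority(a, brand, competitors), a) for a in answers]
    gaps = [(p, a) for p, a in scored if p > 0]
    return sorted(gaps, key=lambda pair: pair[0], reverse=True)

# Made-up observations for illustration.
observations = [
    Answer("crm for startups", "chatgpt", 1.0, ["Other", "HubSpot"], set(),
           "https://example.org/list"),
    Answer("is acme any good", "perplexity", 1.0, ["Acme"], set(),
           "https://acme.com/"),
    Answer("best crm", "perplexity", 2.0, ["HubSpot", "Salesforce"], set(),
           "https://hubspot.com/crm"),
]
gaps = discovery_gaps(observations, "Acme", {"HubSpot", "Salesforce"})
for priority, answer in gaps:
    print(f"{priority:.2f}  {answer.engine}  {answer.prompt}  {answer.top_source}")
```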

The remediation has evidence behind it. The Princeton/IIIT-Delhi GEO paper found that adding citations, quotations, statistics, and authoritative-source language raised a page's visibility in AI answers by up to ~40% in controlled evaluation, with the largest lift from quotation and statistic additions.

In the wild, expect a visibility lift of 5% to 15% per prompt with high variance, not the headline 40%.

How do you keep the numbers defensible?

Treat every reading as one sample from a noisy process. Five disciplines matter.

Run each prompt at least 3 times per engine per cycle, with a sleep between runs, and report the median with an interquartile range. Batch every run in a single time window; index freshness can change a Perplexity or ChatGPT answer in under an hour, so a Monday-morning run and a Tuesday-evening run are not comparable.
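Summarizing repeated runs takes only the standard library. The sketch below assumes each run has already been scored into one SOV percentage; the readings are made up, and the "inclusive" quantile method is a choice, not a requirement.

```python
import statistics

def summarize_runs(sov_runs):
    """Return (median, (q1, q3)) for repeated SOV readings of one cycle."""
    if len(sov_runs) < 2:
        raise ValueError("need at least two runs to report a spread")
    median = statistics.median(sov_runs)
    # Quartiles by linear interpolation, treating the runs as the full sample.
    q1, _, q3 = statistics.quantiles(sov_runs, n=4, method="inclusive")
    return median, (q1, q3)

# Three hypothetical runs of the same prompt set on one engine, in percent.
runs = [5.8, 7.1, 6.3]
median, (q1, q3) = summarize_runs(runs)
print(f"SOV median {median:.1f}% (IQR {q1:.2f} to {q3:.2f})")
```

With only three runs the interquartile range is a rough indicator of spread; it tightens into something more meaningful as runs per cycle increase.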

Standardize on one account tier per engine and log it. ChatGPT Free, Plus, and Pro can return different answers for the same prompt, and the differences are undocumented.

Report per-stratum and per-engine SOV, never just a global average; a brand can sit at 30% on branded prompts and 2% on category prompts, and the global number hides exactly the gap worth fixing.

On sample size, a brand with 5% true SOV technically needs ~456 prompt-engine observations for a ±2-point margin at 95% confidence. Most teams relax to 100 to 200 prompts and accept ±5 to 7 points.

That is fine, as long as the report says so. Be candid about the blind spots too: English-language bias, paid-tier variation, recency effects, and the fact that a citation is neither a click nor a conversion.
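The sample-size figure comes from the standard margin-of-error formula for a proportion, n = z² · p(1 − p) / m². The sketch below reproduces it under the idealized assumption that every prompt-engine observation is an independent draw. Repeated runs of the same prompt are correlated, so margins in practice plausibly run wider than this formula suggests.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level

def required_observations(p: float, margin: float, z: float = Z_95) -> float:
    """Observations needed for a +/- margin around a true proportion p."""
    return z ** 2 * p * (1 - p) / margin ** 2

def margin_of_error(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of the confidence interval for proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

n_needed = required_observations(p=0.05, margin=0.02)
idealized_margin = margin_of_error(p=0.05, n=100)
print(f"observations for +/-2 points at 5% SOV: {n_needed:.0f}")
print(f"idealized margin at n=100: +/-{idealized_margin * 100:.1f} points")
```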

Which tool should you use?

All the trackers compute the same core primitive. They differ on prompt sets, position weights, and engine coverage. The 2026 dashboard critique warns that many vendors score against their own undisclosed prompt sets, so cross-tool comparisons are hazardous unless you supply your own prompts.

Need	Pick	Why
Entry, single brand, ~100 prompts	Peec or Otterly (~€89/mo)	Five engines, discovery-gap report included
500+ prompts, cross-engine	Profound or Scrunch (enterprise)	Deepest coverage; Scrunch adds persona maps
Free intuition-building	Knowatoa AI Search Grader, SparkToro	Zero budget, basic grading
Engineering-first / full control	geo-analyzer, geo-aeo-tracker	Open-source, no dashboard, you own the pipeline

Pricing is bifurcated: entry tools run roughly $89 to $200 a month, while Profound, AthenaHQ, Evertune, and LLMrefs sit at "Contact us." Rankscale and Nightwatch extend existing SEO subscriptions if you want one bill.

What this means for you

You can stand up a defensible monthly program in 4 to 8 hours a week, one brand and five engines, for $200 to $1,000 in tooling.

Week 1: stratify and lock a 100-to-300-prompt set from first-party telemetry.
Week 2: configure engine APIs, standardize the account tier, run three baseline passes in one window, and archive the raw JSON so reanalysis is free.
Week 3: parse responses into cited and mentioned brands, score with the 1/log(1+k) curve, and build the priority-ranked discovery-gap report.
Week 4: ship a one-page report with global, per-stratum, and per-engine SOV, hand the top-5 gaps to a content owner with the source URLs, and queue the next cycle.

Report the trend, not the snapshot. A single reading is a sample of size one, and the program's value is the line over time. The brands that wait are not pausing the metric. While they wait, the engines are writing their share of voice for them.
