SparkToro's 2024 zero-click study found that only 360 of every 1,000 U.S. Google searches end in a click to the open web (the EU figure is 374). Pew Research reports Google users click less often when an AI summary appears. And Gartner predicts search engine volume will drop 25% by 2026 as chatbots absorb queries.
So your rank tracker can show you at position #1 while the paragraph a buyer actually reads never names you. AI share of voice is how you measure that second, increasingly important layer of visibility.
This guide is the practitioner version: the formula, the prompt-set design, the per-engine citation tracking, and the statistical discipline that keeps the numbers defensible.
What is AI share of voice?
AI share of voice (AI SOV), also called Share of Model, is the weighted, prompt-stratified share of a brand's citations and mentions inside AI answer engines, with position decay applied. It is a visibility metric for a world where the synthesized answer is the surface, not a ranked list of links.
It is not a rank tracker, not a sentiment score, and not a revenue metric. It answers one question: when a user asks an engine about your category, how often does the answer mention or cite you, and how prominently?
TL;DR. Classic rank tracking watches positions 1 to 10 for a keyword. AI engines return one synthesized answer with embedded citations that shift every run. To measure visibility now, you score citations and mentions across a stratified prompt set, weight them by position, sample multiple runs per engine, and report the trend, not a single snapshot.
Key takeaways
- A single SOV reading is a sample of size one. LLM answers are non-deterministic; run each prompt at least 3 times per engine and report the median with an interquartile range.
- Score both citation SOV (linked URLs) and combined SOV (citations plus body mentions). They answer different business questions.
- The prompt set is the most consequential design choice. Stratify into branded, category, comparison, problem-led, and long-tail, with at least 30 prompts per stratum.
- Google AI Overviews has no public API. Every other major engine does. Plan your tooling around that asymmetry.
- The highest-ROI output is a discovery-gap audit: prompts where a competitor is cited and you are not, paired with the top-cited source to replicate.
Why does classic rank tracking break for AI answers?
Rank tracking breaks along five axes, and none of them are bugs in the old tools. They are mismatches between what those tools measure and what an AI answer is.
Synthesis, not ranking. A SERP returns ten ordered URLs. A ChatGPT or Gemini answer compresses retrieved sources into one paragraph that may cite three URLs, none of which would have appeared in the classic top 10. A tracker watching positions 1 to 10 is watching an irrelevant surface.
Non-determinism. The same prompt asked twice can return materially different answers. The drivers are retrieval-index freshness, decoding temperature, per-account personalization, and model version. The academic Benchmarking Large Language Model Volatility paper documented that production-grade LLM outputs vary substantially across runs even with greedy decoding, enough to reorder a benchmark.
Citation as foreign key. A SERP row has one URL. An answer can carry zero to a dozen citations, rendered as inline footnotes (Perplexity, Claude), numbered superscripts (ChatGPT web search), or right-rail chips (AI Overviews). Each must be resolved from URL to domain to brand.
Fluid position. The first citation in an answer body is the most visible; a sidebar link may not be visible at all. Position becomes a visibility-weighted ordinal, not a fixed slot.
Engine-by-engine rendering. "Be cited for CRM recommendations" becomes five separate measurement problems because each engine retrieves, ranks, and renders citations differently.
How do you calculate AI share of voice?
You sum a brand's citation appearances across the prompt set, weight each by position decay and prompt importance, then normalize so every brand's share is comparable. The dominant mid-2026 formula, which surfaces in Profound's citation-pattern research and the Princeton Generative Engine Optimization paper, reduces to this:
SOV(brand) = avg over prompts of [ avg over engines of
( sum over citations k of c(k, brand) / r_k ) ]
Here c(k, brand) is 1 if the k-th citation is your brand and 0 otherwise, and r_k is the position-decay divisor, so each citation carries a weight of 1/r_k. Two curves are common. With r_k = k (a 1/k curve), the first citation is worth 10x the tenth. With r_k = log(1+k) (a 1/log(1+k) curve), it is worth about 3.5x, since log(11)/log(2) ≈ 3.46. For executive reporting, the log curve is more defensible because it does not collapse tail signals to noise.
Then normalize: divide your brand's weighted sum by the total weighted signal across every brand that appears, so the shares sum to 100%. A 2026 critique of AI-visibility dashboards found that several tools report shares that do not sum to 100%, which means they are scoring against a closed pool and hiding the missing 70% to 90% of each prompt's signal.
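The scoring and normalization steps can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names, the brand names, and the input shape (one ordered list of cited brands per prompt-engine answer) are all assumptions. It also pools weighted sums instead of averaging per prompt and per engine, which gives the same normalized shares only when every prompt is run on the same number of engines, and it leaves out prompt-importance weights.

```python
import math
from collections import defaultdict


def decay(k: int, curve: str = "log") -> float:
    """Position weight 1/r_k for the k-th citation (k starts at 1)."""
    if curve == "log":
        return 1.0 / math.log(1 + k)  # r_k = log(1+k)
    return 1.0 / k                    # r_k = k


def share_of_voice(answers, curve="log"):
    """answers: one list per (prompt, engine) answer, holding the cited
    brands in citation order. Returns {brand: share}, summing to 1."""
    totals = defaultdict(float)
    for cited_brands in answers:
        for k, brand in enumerate(cited_brands, start=1):
            totals[brand] += decay(k, curve)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Normalize against every brand that appeared, not a closed pool.
    return {brand: weight / grand_total for brand, weight in totals.items()}


# Hypothetical answers, for illustration only.
answers = [
    ["Acme", "HubSpot", "Salesforce"],
    ["HubSpot", "Acme"],
    ["Salesforce"],
]
shares = share_of_voice(answers)
for brand, share in sorted(shares.items()):
    print(f"{brand}: {share:.1%}")
```

Because the denominator includes every brand that appears in any answer, the shares always sum to 100%, which is the property the normalization step is meant to guarantee.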
A worked example
Take a 100-prompt set run against ChatGPT and Perplexity. Your brand is cited in 56 prompts at positions 1 through 5, with prompt counts of 18, 14, 11, 9, and 4, and mentioned without a link in 8 more.
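Using 1/log(1+k) weights with the natural log, the position-1 prompts contribute 18 × (1 / ln 2) ≈ 25.97 units, position 2 contributes 14 × (1 / ln 3) ≈ 12.74, and so on down to 4 × (1 / ln 6) ≈ 2.23 at position 5. The five contributions total ≈ 54.47, so the brand's weighted sum lands near 27.23 per engine across the two engines.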
If the total weighted signal across all 12 brands that ever appear is ~430, your citation SOV ≈ 6.3%. Fold in the 8 mention-only prompts and combined SOV ≈ 7.0%.
Report both numbers; citation-only hides mention wins, and combined over-weights unlinked noise.
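The arithmetic in this example is easy to check. The snippet below reproduces the citation-only figures, using the position counts and the ~430 pool from the text. It assumes the per-engine figure is the pooled sum divided across the two engines. It does not reproduce the combined figure, because the text does not state the weight given to an unlinked mention.

```python
import math

# Prompts in which the brand was cited, keyed by citation position.
counts_by_position = {1: 18, 2: 14, 3: 11, 4: 9, 5: 4}
engines = 2                    # ChatGPT and Perplexity
total_weighted_signal = 430.0  # pool across all 12 brands, as given in the text

# Each prompt at position k contributes 1 / ln(1 + k).
contributions = {
    k: n / math.log(1 + k) for k, n in counts_by_position.items()
}
weighted_sum = sum(contributions.values())
per_engine = weighted_sum / engines
citation_sov = per_engine / total_weighted_signal

for k, value in contributions.items():
    print(f"position {k}: {value:.2f}")
print(f"per engine: {per_engine:.2f}")
print(f"citation SOV: {citation_sov:.1%}")
```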
How do you build a representative prompt set?
The prompt set determines whether your SOV reflects real buyer behavior or your own imagination. Stratify it into five categories, with at least 30 prompts each and 100 to 300 total.
| Stratum | Example | What it tests |
|---|---|---|
| Branded | "Is Acme CRM any good?" | Does the engine know you exist? |
| Category | "Best CRM for a 50-person SaaS company?" | Are you in the recommendation set? |
| Comparison | "Acme vs HubSpot vs Salesforce" | Competitive share in head-to-heads |
| Problem-led | "How do I keep my pipeline clean?" | Are you tied to the job-to-be-done? |
| Long-tail | Paraphrased real user queries | Coverage beyond head terms |
Source prompts from first-party telemetry first: site-search logs, Google Search Console "People Also Ask" data, sales-call and support-ticket text. These have the highest construct validity. Use engine-suggested completions and vendor playbooks as seeds, not truth.
And avoid head-term flooding. "CRM software", "best CRM", and "top CRM software" are three prompts to you but one intent to the model. Variance across near-duplicates is noise. Lock the set in a versioned file so month-over-month numbers stay comparable.
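Both checks above, stratum size and near-duplicate prompts, are cheap to automate before you lock the file. The sketch below is one possible approach, not a standard: the record shape, the stopword list, and the 0.8 Jaccard threshold are assumptions to tune against your own prompts.

```python
import re
from collections import Counter

STRATA = {"branded", "category", "comparison", "problem-led", "long-tail"}
# Illustrative filler words that rarely change the intent of a head term.
STOPWORDS = {"the", "a", "an", "for", "top", "best", "software"}


def tokens(prompt: str) -> frozenset:
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def thin_strata(prompts, minimum=30):
    """Return {stratum: count} for every stratum below the minimum."""
    counts = Counter(p["stratum"] for p in prompts)
    return {s: counts.get(s, 0) for s in STRATA if counts.get(s, 0) < minimum}


def near_duplicates(prompts, threshold=0.8):
    """Flag prompt pairs whose token sets overlap (Jaccard) above threshold."""
    pairs = []
    for i, a in enumerate(prompts):
        for b in prompts[i + 1:]:
            ta, tb = tokens(a["text"]), tokens(b["text"])
            union = ta | tb
            if union and len(ta & tb) / len(union) >= threshold:
                pairs.append((a["text"], b["text"]))
    return pairs


sample = [
    {"stratum": "category", "text": "CRM software"},
    {"stratum": "category", "text": "best CRM"},
    {"stratum": "category", "text": "top CRM software"},
    {"stratum": "problem-led", "text": "How do I keep my pipeline clean?"},
]
print(near_duplicates(sample))
print(thin_strata(sample))
```

Token overlap is a crude proxy for intent. It catches head-term variants like the three above but will miss paraphrases that share no words, so treat the output as a review list, not an automatic filter.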
How do you track citations per engine?
Each engine has its own citation surface, API, and terms-of-service posture. Four of the five major engines expose a programmatic path; AI Overviews does not.
| Engine | Programmatic path | Where citations live | ToS note |
|---|---|---|---|
| ChatGPT (web search) | OpenAI Responses API, web_search tool | url_citation annotations in the output array | Consumer scraping restricted; API is the supported path |
| Perplexity | Sonar API | citations (inline) and search_results arrays | Sonar Pro recommended for production sampling |
| Google Gemini | Gemini API, google_search tool | groundingMetadata.groundingChunks[*].web.uri | API permitted for measurement, rate-limited |
| Claude | Anthropic Messages API, web_search_20250305 | content[*].citations[*], 5 location types | Citations API went GA Jan 2025 |
| Google AI Overviews | None | Right-rail chips, real-browser DOM only | Automated query sampling prohibited at scale |
Anthropic reported that its Citations API improved recall accuracy by up to 15% over custom, prompt-based citation implementations when it shipped, per Simon Willison's analysis. For AI Overviews, the only sampling method is real-browser automation (Playwright plus a residential proxy, or a vendor that does it for you), which is why GEO methodology pages routinely flag AIO as the most volatile surface in their reports.
Peec tracks an 86.7% AIO trigger rate on informational queries, so it fires often but not always.
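Whichever engine you sample, the parsing step has the same shape: pull the citation URLs out of the response, then resolve URL to domain to brand. The sketch below does this for a dictionary shaped like the Gemini path in the table above (groundingMetadata.groundingChunks[*].web.uri). The sample payload, the brand-to-domain mapping, and the helper names are hypothetical, and no live API is called.

```python
from urllib.parse import urlparse

# Hypothetical mapping from owned domains to tracked brands.
BRAND_DOMAINS = {
    "acme.example": "Acme",
    "hubspot.com": "HubSpot",
}


def domain_of(url: str) -> str:
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def brand_of(url: str):
    """Resolve a URL to a tracked brand, or None for untracked sources."""
    host = domain_of(url)
    for domain, brand in BRAND_DOMAINS.items():
        # Match the domain itself and any subdomain (blog., docs., ...).
        if host == domain or host.endswith("." + domain):
            return brand
    return None


def gemini_citation_urls(grounding_metadata: dict) -> list:
    """Collect web URIs from groundingChunks, preserving citation order."""
    chunks = grounding_metadata.get("groundingChunks", [])
    return [c["web"]["uri"] for c in chunks if "uri" in c.get("web", {})]


# Hand-written payload for illustration, not a captured API response.
sample = {
    "groundingChunks": [
        {"web": {"uri": "https://www.hubspot.com/products/crm"}},
        {"web": {"uri": "https://blog.acme.example/crm-guide"}},
        {"web": {"uri": "https://reviews.example.org/best-crm"}},
    ]
}
cited = [brand_of(u) for u in gemini_citation_urls(sample)]
print(cited)
```

Keeping the untracked sources as None preserves position, so the decay weights still line up with what the user saw. In practice the URIs an engine returns may be redirect links that have to be resolved to the destination page before the domain lookup means anything.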
What is a discovery-gap audit, and why does it matter most?
A discovery-gap audit finds every (prompt, engine) pair where the engine cites at least one competitor you track and your brand is neither cited nor mentioned. It is the single output that converts a measurement into an action.
Score each gap by priority: prompt weight times the inverse rank of the competitors cited, gated on your brand's absence. Prompts that matter, and where competitors dominate, rise to the top of the content backlog. For each gap, capture the top-cited source URL. That page is what your content team studies and matches.
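One way to turn that scoring rule into code is shown below. It is a sketch under stated assumptions: the record fields are hypothetical, and "inverse rank of the competitors cited" is read here as the sum of 1/rank over every tracked competitor in the answer (taking only the best-ranked competitor would be an equally reasonable reading).

```python
COMPETITORS = {"HubSpot", "Salesforce"}  # hypothetical tracked competitors


def gap_priority(prompt_weight, citations, brand, competitors, mentioned=False):
    """citations: brands in citation order, None for untracked sources.
    Returns 0 when the brand is cited or mentioned (no gap)."""
    if mentioned or brand in citations:
        return 0.0
    inverse_rank_sum = sum(
        1.0 / rank
        for rank, cited in enumerate(citations, start=1)
        if cited in competitors
    )
    return prompt_weight * inverse_rank_sum


# Hypothetical (prompt, engine) observations.
runs = [
    {"prompt": "Best CRM for a 50-person SaaS company?", "engine": "perplexity",
     "weight": 2.0, "citations": ["HubSpot", None, "Salesforce"],
     "mentioned": False, "top_source": "https://www.hubspot.com/products/crm"},
    {"prompt": "Acme vs HubSpot vs Salesforce", "engine": "chatgpt",
     "weight": 1.0, "citations": ["Acme", "HubSpot"],
     "mentioned": True, "top_source": "https://acme.example/compare"},
    {"prompt": "How do I keep my pipeline clean?", "engine": "gemini",
     "weight": 1.0, "citations": ["Salesforce"],
     "mentioned": False, "top_source": "https://www.salesforce.com/blog/pipeline"},
]

scored = [
    {**run, "priority": gap_priority(
        run["weight"], run["citations"], "Acme", COMPETITORS, run["mentioned"])}
    for run in runs
]
# Keep only real gaps, highest priority first.
gaps = sorted((g for g in scored if g["priority"] > 0),
              key=lambda g: g["priority"], reverse=True)
for g in gaps:
    print(f'{g["priority"]:.2f}  {g["engine"]:<10}  {g["prompt"]}  ->  {g["top_source"]}')
```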
The remediation has evidence behind it. The Princeton/IIT Delhi GEO paper found that adding citations, quotations, statistics, and authoritative-source language raised a page's visibility in AI answers by up to ~40% in controlled evaluation, with the largest lift from quotation and statistic additions.
In the wild, expect a lift of 5% to 15% per prompt with high variance, not the headline 40%.
How do you keep the numbers defensible?
Treat every reading as one sample from a noisy process. Five disciplines matter.
Run each prompt at least 3 times per engine per cycle, with a sleep between runs, and report the median with an interquartile range. Batch every run in a single time window; index freshness can change a Perplexity or ChatGPT answer in under an hour, so a Monday-morning run and a Tuesday-evening run are not comparable.
Standardize on one account tier per engine and log it. ChatGPT Free, Plus, and Pro can return different answers for the same prompt, and the differences are undocumented.
Report per-stratum and per-engine SOV, never just a global average; a brand can sit at 30% on branded prompts and 2% on category prompts, and the global number hides exactly the gap worth fixing.
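The repeat-run and per-stratum reporting disciplines combine naturally into one aggregation step. The sketch below groups hypothetical SOV readings by (engine, stratum) and reports the median with an interquartile range for each group. The row shape and the sample numbers are illustrative only, and each group needs at least two readings for the quartiles to be defined.

```python
import statistics
from collections import defaultdict


def summarize(readings):
    """Median and interquartile range for one group of repeat runs."""
    q1, _, q3 = statistics.quantiles(readings, n=4, method="inclusive")
    return {
        "median": statistics.median(readings),
        "iqr": (q1, q3),
        "runs": len(readings),
    }


def report(rows):
    """rows: (engine, stratum, sov) tuples, one per run."""
    groups = defaultdict(list)
    for engine, stratum, sov in rows:
        groups[(engine, stratum)].append(sov)
    return {key: summarize(values) for key, values in groups.items()}


# Three runs per cell, all from the same time window (made-up values).
rows = [
    ("perplexity", "branded", 0.28),
    ("perplexity", "branded", 0.30),
    ("perplexity", "branded", 0.35),
    ("perplexity", "category", 0.01),
    ("perplexity", "category", 0.02),
    ("perplexity", "category", 0.04),
]
summary = report(rows)
for (engine, stratum), stats in summary.items():
    low, high = stats["iqr"]
    print(f"{engine}/{stratum}: median {stats['median']:.1%} "
          f"(IQR {low:.1%} to {high:.1%}, n={stats['runs']})")
```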
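On sample size, a brand with 5% true SOV technically needs ~456 prompt-engine observations for a ±2-point margin at 95% confidence (n = 1.96² × 0.05 × 0.95 / 0.02²). Most teams relax to 100 to 200 prompts and accept roughly ±3 to ±4 points at that 5% level, a margin nearly as large as the estimate itself.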
That is fine, as long as the report says so. Be candid about the blind spots too: English-language bias, paid-tier variation, recency effects, and the fact that a citation is neither a click nor a conversion.
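The sample-size figure comes from the standard normal-approximation formula for a proportion, which can be checked directly. The sketch below assumes independent observations, which repeat runs of the same prompt are not, so treat its output as a lower bound on the sampling you need. The function names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% confidence


def required_observations(p: float, margin: float, z: float = Z_95) -> float:
    """Observations needed to estimate a share p within +/- margin."""
    return z ** 2 * p * (1 - p) / margin ** 2


def margin_of_error(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of the confidence interval for share p with n observations."""
    return z * math.sqrt(p * (1 - p) / n)


n_needed = required_observations(p=0.05, margin=0.02)
print(f"observations for 5% SOV at +/-2 points: {n_needed:.0f}")

# Margin you actually get at smaller sample sizes, for the same 5% share.
for n in (100, 200, 456):
    print(f"n={n}: +/-{margin_of_error(0.05, n) * 100:.1f} points")
```

Both functions take the assumed true SOV as an input, so rerun them with your own baseline share before quoting a margin in a report.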
Which tool should you use?
All the trackers compute the same core primitive. They differ on prompt sets, position weights, and engine coverage. The 2026 dashboard critique warns that many vendors score against their own undisclosed prompt sets, so cross-tool comparisons are hazardous unless you supply your own prompts.
| Need | Pick | Why |
|---|---|---|
| Entry, single brand, ~100 prompts | Peec or Otterly (~€89/mo) | Five engines, discovery-gap report included |
| 500+ prompts, cross-engine | Profound or Scrunch (enterprise) | Deepest coverage; Scrunch adds persona maps |
| Free intuition-building | Knowatoa AI Search Grader, SparkToro | Zero budget, basic grading |
| Engineering-first / full control | geo-analyzer, geo-aeo-tracker | Open-source, no dashboard, you own the pipeline |
Pricing is bifurcated: entry tools run roughly $89 to $200 a month, while Profound, AthenaHQ, Evertune, and LLMrefs sit at "Contact us." Rankscale and Nightwatch extend existing SEO subscriptions if you want one bill.
What this means for you
You can stand up a defensible monthly program in 4 to 8 hours a week, one brand and five engines, for $200 to $1,000 in tooling.
- Week 1: stratify and lock a 100-to-300-prompt set from first-party telemetry.
- Week 2: configure engine APIs, standardize the account tier, run three baseline passes in one window, and archive the raw JSON so reanalysis is free.
- Week 3: parse responses into cited and mentioned brands, score with the 1/log(1+k) curve, and build the priority-ranked discovery-gap report.
- Week 4: ship a one-page report with global, per-stratum, and per-engine SOV, hand the top-5 gaps to a content owner with the source URLs, and queue the next cycle.
Report the trend, not the snapshot. A single reading is sample size one, and the program's value is the line over time. The brands that wait are not pausing the metric. While they wait, the engines are writing their share of voice for them.
Sources
- SparkToro 2024 Zero-Click Search Study
- Pew Research: Google users click less when an AI summary appears
- Gartner: search volume to drop 25% by 2026
- Benchmarking Large Language Model Volatility (arXiv:2311.15180)
- GEO: Generative Engine Optimization (arXiv:2311.09735)
- Profound: AI Platform Citation Patterns
- AI Visibility Dashboards Grade Their Own Homework
- Anthropic's new Citations API (Simon Willison)
- Perplexity Sonar API docs
- Gemini API grounding docs
- OpenAI Responses / deep research API
- MaxAEO: AI Share of Voice
- Peec AI
- geo-analyzer (GitHub)
