Run the same multi-tool agent trace through TruLens, RAGAS, and DeepEval and you can get three different verdicts: roughly 0.9, 0.8, and 0.7, respectively. None of the frameworks is broken. They just disagree about what "correct tool call" means.
That disagreement is the single most useful thing to understand before you pick an LLM evaluation framework in 2026. We built a small reproducible harness, ran all three on the same RAG-plus-tool-calling task, and the pattern is consistent: the frameworks converge on retrieval questions and diverge hard on agentic ones.
TL;DR
- DeepEval (v4.0.5, ~16.1k GitHub stars) has the most complete packaged agent metrics and first-party pytest CI support.
- RAGAS (v0.4.3, ~14.3k stars) remains the reference vocabulary for RAG metrics, with agent metrics added but no first-party CI tooling.
- TruLens (v2.8.1, ~3.4k stars, Snowflake-owned) wins on step-level instrumentation and ships the only built-in Bias provider.
- All three agree within about ±0.05 on retrieval-only questions and diverge most on mixed retrieval-plus-tool questions.
- Gate CI on faithfulness ≥ 0.85 and tool-selection accuracy ≥ 0.95, not on tighter thresholds the judge noise can't support.
If you only take one sentence from this comparison, take this one:
The same multi-tool trace can score 0.9 in TruLens, 0.8 in RAGAS, and 0.7 in DeepEval, not because one framework is wrong, but because each defines "correct tool call" differently.
What separates the three LLM evaluation frameworks in 2026?
RAGAS, TruLens, and DeepEval all run LLM judges over your outputs, but they differ on agent support, CI integration, and hosting. DeepEval pairs open source with the Confident AI platform, RAGAS is pure open source from Vibrant Labs, and TruLens is MIT-licensed tooling now operated under Snowflake.
Here's the matrix, built from first-party docs as of June 2026:
| Capability | RAGAS | TruLens | DeepEval |
|---|---|---|---|
| Version (June 2026) | 0.4.3 | 2.8.1 | 4.0.5 |
| License | Apache-2.0 | MIT | Apache-2.0 |
| Backed by | Vibrant Labs | Snowflake (via Truera) | Confident AI (YC W25) |
| Faithfulness / groundedness | Yes | Yes (Groundedness) | Yes |
| Context recall | Yes | No first-class metric | Yes |
| Tool-call scoring | Tool Call Accuracy | Custom Feedback | ToolCorrectness |
| Trajectory scoring | Agent Goal Accuracy | Per-step composition | PlanAdherence + trajectory ToolCorrectness |
| Position-swap mitigation | User builds it | User builds it | Built-in (Arena G-Eval) |
| First-party CI | No (community) | Snowflake CLI + GitHub Action | pytest plugin, `deepeval test run` |
| Hosted tier | None | Free-beta cloud + local Streamlit | Free / $19.99 / $49 / Team |
Adoption is lopsided: DeepEval sits at roughly 16.1k GitHub stars, RAGAS at 14.3k, and TruLens at 3.4k. Star counts are self-reported from each project's README, so treat them as a proxy, but the gap is real.
TruLens's smaller star count undersells it. Its step-level instrumentation model, where every LLM call, retrieval, and tool invocation can be wrapped in a `Feedback` and scored individually, is the most flexible architecture of the three. It just makes you do more assembly.
The benchmark: one agent, three judges
We designed a deliberately small task: internal Q&A over five fictional policy documents for "Acme Logistics," plus one tool, `get_employee_record(employee_id)`. Thirty questions, split three ways: 10 retrieval-only, 10 tool-only, and 10 mixed questions requiring both a document and a tool call.
That split is the point. The frameworks are weakest on tool-only and mixed questions, and that's exactly where they diverge.
Each question gets scored on eight layers: faithfulness, answer relevancy, context precision, context recall, hallucination, tool-selection accuracy, trajectory match, and outcome correctness. The harness boils down to one evaluator factory per framework, all consuming the same per-case shape:
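```python
# bench/run_benchmark.py
# pip install ragas trulens deepeval
def make_evaluator(framework: str, judge_model: str = "gpt-4o"):
    if framework == "deepeval":
        # The remaining metric imports serve the other scoring layers (omitted here).
        from deepeval.metrics import (FaithfulnessMetric, AnswerRelevancyMetric,
                                      ContextualPrecisionMetric, ContextualRecallMetric,
                                      HallucinationMetric, ToolCorrectnessMetric)
        from deepeval.test_case import LLMTestCase, ToolCall

        def run(ds):
            rows = []
            for d in ds:
                # Record a tool call only when the case actually has one.
                called = ([ToolCall(name=d["tool_name"],
                                    input_parameters=d.get("tool_args", {}))]
                          if d.get("tool_name") else [])
                tc = LLMTestCase(
                    input=d["q"], actual_output=d.get("a", ""),
                    context=d.get("ctxs", []),            # read by HallucinationMetric
                    retrieval_context=d.get("ctxs", []),  # read by FaithfulnessMetric
                    tools_called=called,
                    # Assumption: "expected_tool" is a ground-truth key that is not in
                    # the original case shape; ToolCorrectnessMetric compares against it.
                    expected_tools=([ToolCall(name=d["expected_tool"])]
                                    if d.get("expected_tool") else []))
                rows.append({
                    "qid": d["qid"],
                    "faithfulness": FaithfulnessMetric(model=judge_model).measure(tc),
                    "tool_correctness": (ToolCorrectnessMetric().measure(tc)
                                         if called else None),
                })
            return rows
        return run
    # ragas: evaluate(ds, metrics=[faithfulness, answer_relevancy, ...])
    # trulens: Feedback(provider.groundedness_measure_with_cot_reasons), per step
    raise NotImplementedError(f"{framework} branch omitted from this excerpt")
```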
Every case is a dict with `q`, `a`, `ctxs`, `tool_name`, and `tool_args`. Swap the stub agent for your real pipeline, keep the shape, and you get one CSV per framework. Seed everything (`random.seed(0)`) and pin the judge model, or your "regression" will be judge drift.
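As a rough illustration of that per-case shape, the sketch below types the dict and writes one framework's scores to CSV. The key names come from the harness; the `Case` type, the `write_scores` helper, and the sample values are hypothetical and not part of the benchmark itself.

```python
import csv
import io
import random
from typing import Any, TypedDict


class Case(TypedDict):
    """Per-case shape shared by every evaluator (key names from the harness)."""
    qid: str
    q: str                      # question
    a: str                      # agent's answer
    ctxs: list[str]             # retrieved contexts
    tool_name: str              # "" when no tool was called
    tool_args: dict[str, Any]


def write_scores(rows: list[dict[str, Any]], out) -> None:
    """Write one framework's score rows to a CSV file-like object."""
    if not rows:
        return
    # Union of keys in first-seen order, so sparse rows still fit the header.
    fields = list(dict.fromkeys(k for r in rows for k in r))
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)  # None is written as an empty cell


random.seed(0)  # seed any sampling or shuffling before the run

# Hypothetical sample case; the values are made up for illustration.
case: Case = {
    "qid": "q01",
    "q": "Which team does employee E-1042 belong to?",
    "a": "Employee E-1042 belongs to the dispatch team.",
    "ctxs": [],
    "tool_name": "get_employee_record",
    "tool_args": {"employee_id": "E-1042"},
}

buf = io.StringIO()
write_scores(
    [{"qid": case["qid"], "faithfulness": 0.9, "tool_correctness": None}], buf
)
print(buf.getvalue())
```

Writing to a file-like object keeps the helper testable; in a real run you would pass an open file per framework instead of a `StringIO`.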
Where the frameworks agree and where they fall apart
On retrieval-only questions, all three frameworks land within about ±0.05 of each other on faithfulness, context precision, and answer relevancy. The judge prompts are similar and the contexts are short, so the scores converge. Pick whichever framework fits your stack; the numbers won't care.
Mixed questions are a different story. Tool-selection scoring is the biggest source of disagreement:
- DeepEval's `ToolCorrectness` checks both tool name and argument equality, and returns an LLM-generated reason string alongside the score.
- RAGAS's `ToolCallAccuracy` is more permissive, matching on the canonical tool name.
- TruLens doesn't score tools at all unless you wrap each invocation in a `Feedback`.
Same trace, three definitions, three scores. The practical rule: if your downstream decision is coarse ("merge if average faithfulness ≥ 0.85"), all three frameworks will pass or fail together.
If your decision is fine-grained ("ship the new retriever if context recall improves by 0.02"), framework disagreement will swamp your delta. Fix the judge before you touch the retriever.
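To make the divergence concrete, here is a minimal sketch of how a strict and a permissive definition of "correct tool call" can score the same trace differently. Both scoring functions are illustrative stand-ins written for this example, not the frameworks' actual implementations, and the `Call` type and argument values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Call:
    """A single tool invocation (hypothetical type for this sketch)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def name_only_score(called: list[Call], expected: list[Call]) -> float:
    """Permissive: an expected call counts if a call with that tool name exists."""
    if not expected:
        return 1.0
    called_names = {c.name for c in called}
    return sum(e.name in called_names for e in expected) / len(expected)


def name_and_args_score(called: list[Call], expected: list[Call]) -> float:
    """Strict: an expected call counts only if name AND arguments match exactly."""
    if not expected:
        return 1.0
    hits = sum(
        any(c.name == e.name and c.args == e.args for c in called)
        for e in expected
    )
    return hits / len(expected)


# Same trace, two definitions: the agent picked the right tool
# but formatted the employee ID differently from the reference.
expected = [Call("get_employee_record", {"employee_id": "E-1042"})]
called = [Call("get_employee_record", {"employee_id": "1042"})]

print(name_only_score(called, expected))      # 1.0
print(name_and_args_score(called, expected))  # 0.0
```

Neither score is wrong. Whether a malformed argument should fail the case is a product decision, which is why it is worth writing your definition down before comparing numbers across frameworks.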
How do you evaluate LLM agents with tool calls?
Split agent evaluation into three layers: tool-selection accuracy, step-level scoring, and trajectory or outcome scoring. Most "agent eval" content in 2026 is RAG eval repackaged, and the repackaging skips the layer where agents actually fail.
Tool-selection accuracy is the easy win. It's a deterministic check (tool name plus argument equality), not an LLM-judge call, which is why a ≥ 0.95 CI gate on it is reasonable where a 0.95 gate on faithfulness is fantasy. All three frameworks can run this check (in TruLens, through a custom Feedback); only DeepEval explains its score.
Step-level scoring catches "the agent called the wrong tool on step 2" before the final answer collapses. TruLens is strongest here by design. RAGAS gives you per-step `ToolCallAccuracy`.
Trajectory and outcome scoring catches the subtler failure: right tools, wrong order, brittle answer the judge happens to accept. Only DeepEval packages this as named metrics (PlanAdherence and PlanQuality). RAGAS offers AgentGoalAccuracy at the outcome level, with LangGraph and LlamaIndex recipes. TruLens makes you aggregate per-step scores yourself.
If you wire up exactly one agent metric, make it trajectory-level `ToolCorrectness` in DeepEval or `AgentGoalAccuracy` in RAGAS. Both are outcome-level but penalize bad intermediate steps on the way.
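The three layers can be sketched without any framework. The example below scores one trace for tool selection (right set of tools), step-level match (right tool at each position), and an order-aware trajectory score based on the longest common subsequence. The scoring rules and the tool names other than `get_employee_record` are assumptions chosen for illustration, not any framework's definition.

```python
def selection_ok(called: list[str], expected: list[str]) -> bool:
    """Tool selection: were exactly the expected tools used, in any order?"""
    return set(called) == set(expected)


def step_matches(called: list[str], expected: list[str]) -> list[bool]:
    """Step level: compare position by position; extra or missing steps are misses."""
    n = max(len(called), len(expected))
    return [
        i < len(called) and i < len(expected) and called[i] == expected[i]
        for i in range(n)
    ]


def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def trajectory_score(called: list[str], expected: list[str]) -> float:
    """Trajectory level: share of expected steps that appear in the right order."""
    if not expected:
        return 1.0
    return lcs_len(called, expected) / len(expected)


# Right tools, wrong order (search_policy and compose_answer are hypothetical).
expected = ["search_policy", "get_employee_record", "compose_answer"]
called = ["get_employee_record", "search_policy", "compose_answer"]

print(selection_ok(called, expected))      # True
print(step_matches(called, expected))      # [False, False, True]
print(trajectory_score(called, expected))  # 2 of 3 expected steps in order
```

The same trace passes the selection check, fails two of three steps, and lands in between on the trajectory score, which is the "right tools, wrong order" failure that a selection-only metric misses.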
LLM-as-a-judge reliability: the biases that move your scores
Every metric above inherits the biases of the judging model, and three of them are well measured. Zheng et al. (2023) first quantified position, verbosity, and self-enhancement bias, while showing GPT-4-class judges can exceed 80% agreement with human preferences. The judge is useful and biased at the same time.
Shi et al. (2024) confirmed position bias across more than 150,000 evaluation instances, 15 judges, and 22 tasks, finding it varies strongly by judge and by the quality gap between candidates. Saito et al. (2023) formalized verbosity bias and reported GPT-4 prefers longer answers more than humans do. And Wataoka et al. (2024) traced self-preference bias to perplexity: judges favor text that reads like their own.
The mitigations map unevenly onto the frameworks. Position swapping (run twice with candidates reversed, average) is built into DeepEval's Arena G-Eval as a blinded, randomized n-pairwise comparison; in RAGAS and TruLens you compose it yourself.
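If you have to compose position swapping yourself, a minimal sketch looks like the following. It assumes a pairwise judge that returns the probability that the first answer shown is better; the `Judge` signature and the biased stub are hypothetical stand-ins for a real LLM call, and this is the plain swap-and-average idea rather than DeepEval's Arena G-Eval procedure.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> probability in [0, 1] that the FIRST answer shown is the better one.
Judge = Callable[[str, str, str], float]


def position_swapped_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_first = judge(question, a, b)  # score refers to `a`
    b_first = judge(question, b, a)  # score refers to `b`, so invert it
    return (a_first + (1.0 - b_first)) / 2.0


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Stand-in judge that rates the answers as equal but adds a
    +0.2 bonus to whichever answer is shown first (not a real model)."""
    true_preference = 0.5
    return min(1.0, true_preference + 0.2)


single_pass = biased_stub_judge("q", "answer A", "answer B")
swapped = position_swapped_preference(biased_stub_judge, "q", "answer A", "answer B")
print(single_pass)  # 0.7: looks like a real preference for A
print(round(swapped, 2))  # 0.5: the position bonus cancels out
```

The swap doubles judge calls per comparison, so it is usually reserved for pairwise decisions (model A versus model B) rather than applied to every single-answer metric.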
Length normalization lives in G-Eval's rubric and evaluation steps, in RAGAS's SimpleCriteria and rubric metrics, and in TruLens via prompt edits. The deepest research technique, PORTIA's split-and-merge alignment, reports a 47.46% average relative improvement in judge consistency, though it's a research artifact, not a built-in anywhere.
The cheapest mitigation is also the most ignored: use a judge from a different model family than your generator. With Claude Fable 5 shipping June 9 at vendor-stated $10/$50 per million tokens (roughly 2× Opus 4.8), the economics push the same direction anyway.
Generate with the expensive model, judge with a cheaper one from a different family, and you've dodged self-preference bias and a 2-3× eval bill in one move.
CI/CD evaluation gates that won't fight you
Set blocking gates your judge's noise floor can actually support. A judge can drift 0.05 between adjacent model versions, so a 0.95 faithfulness gate measures noise, not quality.
| Metric | Blocking gate | Why |
|---|---|---|
| Faithfulness score | ≥ 0.85 | Below this, context violations leak into answers; tighter gates drown in judge noise |
| Answer relevancy | ≥ 0.80 | Relevance judges confuse topicality with correctness; triage misses manually |
| Context precision | ≥ 0.70 | Hardest metric to disambiguate; aggressive gates create false alarms |
| Context recall | ≥ 0.75 | Ground-truth context is itself noisy |
| Hallucination | ≤ 0.10 | Same signal as faithfulness, inverted |
| Tool-selection accuracy | ≥ 0.95 | Deterministic check, so a high bar is fair |
| Outcome correctness | ≥ 0.85 | The gate that matters most |
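The table translates directly into a small gate check. This is a sketch under assumptions: the thresholds are copied from the table, while the metric key names and the `failed_gates` helper are hypothetical and would need to match whatever your harness emits.

```python
import operator

# Thresholds from the table above; metric key names are assumed.
BLOCKING_GATES = {
    "faithfulness": (operator.ge, 0.85),
    "answer_relevancy": (operator.ge, 0.80),
    "context_precision": (operator.ge, 0.70),
    "context_recall": (operator.ge, 0.75),
    "hallucination": (operator.le, 0.10),  # lower is better
    "tool_selection_accuracy": (operator.ge, 0.95),
    "outcome_correctness": (operator.ge, 0.85),
}


def failed_gates(scores: dict[str, float]) -> list[str]:
    """Return metrics that miss their blocking gate; a missing metric also fails."""
    failures = []
    for metric, (compare, threshold) in BLOCKING_GATES.items():
        value = scores.get(metric)
        if value is None or not compare(value, threshold):
            failures.append(metric)
    return failures


# Hypothetical run: everything passes except hallucination (0.12 > 0.10).
scores = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.82,
    "context_precision": 0.74,
    "context_recall": 0.79,
    "hallucination": 0.12,
    "tool_selection_accuracy": 0.97,
    "outcome_correctness": 0.86,
}
print(failed_gates(scores))  # ['hallucination']
```

In CI, a non-empty result would translate into a non-zero exit code. Treating a missing metric as a failure is a deliberate choice here, so a silently skipped evaluator cannot pass the build.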
Three patterns make these gates trustworthy. Run nondeterministic metrics (faithfulness, relevancy, hallucination) three times and average; a single-trial 0.87 is not a signal. Add a warning gate 0.05 stricter than each blocking gate (for example, warn below 0.90 faithfulness while blocking below 0.85) that comments on the PR without blocking it.
And for high-stakes systems, run two judges from different families and fail the build if they disagree by more than 0.10 on any metric. That last pattern is your tripwire for self-preference inflation.
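A minimal sketch of those three patterns follows: trial averaging, a warning band on the stricter side of each blocking gate, and a two-judge disagreement tripwire. The helper names, the example scores, and the idea of returning a status string are assumptions for illustration; the 0.05 and 0.10 margins come from the text.

```python
from statistics import mean


def gate_status(score: float, blocking: float, margin: float = 0.05,
                higher_is_better: bool = True) -> str:
    """Return 'block', 'warn', or 'pass' for one averaged metric."""
    if not higher_is_better:
        # Flip signs so the same comparisons work for "lower is better" metrics.
        score, blocking = -score, -blocking
    if score < blocking:
        return "block"
    if score < blocking + margin:
        return "warn"  # comment on the PR, do not fail the build
    return "pass"


def disagreements(judge_a: dict[str, float], judge_b: dict[str, float],
                  tolerance: float = 0.10) -> list[str]:
    """Metrics on which two judges differ by more than `tolerance`."""
    shared = sorted(judge_a.keys() & judge_b.keys())
    return [m for m in shared if abs(judge_a[m] - judge_b[m]) > tolerance]


# Pattern 1: average three trials of a nondeterministic metric.
faithfulness = mean([0.87, 0.84, 0.89])

# Pattern 2: blocking gate at 0.85, warning band up to 0.90.
print(gate_status(faithfulness, blocking=0.85))                  # warn
print(gate_status(0.08, blocking=0.10, higher_is_better=False))  # warn
print(gate_status(0.70, blocking=0.85))                          # block

# Pattern 3: two judges from different families (scores are made up).
judge_a = {"faithfulness": 0.91, "answer_relevancy": 0.84}
judge_b = {"faithfulness": 0.78, "answer_relevancy": 0.82}
print(disagreements(judge_a, judge_b))  # ['faithfulness']
```

Scores that sit exactly on a threshold are sensitive to floating-point rounding, so it is worth rounding averaged scores to two decimals before gating if your thresholds are stated to two decimals.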
The uncomfortable part: your golden evaluation dataset matters more
Here's the honest caveat to this whole comparison. Every framework above runs a judge over your data, and the judge is only as informative as the golden evaluation dataset it scores. Thirty unrepresentative questions written by one engineer in an afternoon will produce confident-looking numbers about the dataset, not the system.
The practitioner consensus, and we agree with it, is that golden-dataset curation consumes more engineering time than the framework, and switching frameworks is cheap relative to rebuilding a representative test set. There's a sharper version of this position too: a hand-rolled pytest harness with a thin judge wrapper gives small teams most of the value at a fraction of the maintenance cost.
That's a community argument from collective experience, not a controlled study. But it should be on the table before you commit.
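For a sense of scale, a hand-rolled harness of that kind can be very small. The sketch below assumes a judge is just a function from an answer and its contexts to a score in [0, 1]; the token-overlap judge is a placeholder for a real LLM call, and the golden case is invented for illustration.

```python
from statistics import mean
from typing import Callable

# Thin judge wrapper: (answer, contexts) -> score in [0, 1].
Judge = Callable[[str, list[str]], float]


def token_overlap_judge(answer: str, contexts: list[str]) -> float:
    """Placeholder for an LLM judge: share of answer tokens found in the contexts."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(" ".join(contexts).lower().split())
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


# Hypothetical golden case; a real set needs representative, reviewed examples.
GOLDEN = [
    {
        "q": "How long is parental leave?",
        "ctxs": ["Acme Logistics offers 16 weeks of parental leave."],
        "a": "Acme Logistics offers 16 weeks of parental leave.",
    },
]


def faithfulness_scores(judge: Judge, cases: list[dict]) -> list[float]:
    """Score every golden case with the supplied judge."""
    return [judge(c["a"], c["ctxs"]) for c in cases]


def test_faithfulness_gate():
    # pytest collects functions named test_*; the gate is a plain assert.
    assert mean(faithfulness_scores(token_overlap_judge, GOLDEN)) >= 0.85
```

Because the judge is injected as a plain callable, swapping the placeholder for a model-backed judge, or running two judges side by side, does not change the test.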
What this means for you
Start with the decision, not the framework. If you need agent metrics and CI gates working this week, DeepEval is the shortest path: pytest plugin, packaged `ToolCorrectness`, built-in position-swap mitigation. If you live in the RAG-metric vocabulary and want pure open source, RAGAS is the standard. If you need to instrument and score every individual step of a complex agent, TruLens is the right substrate, and you'll write more glue.
Then do the three things that matter more than the choice: build a representative golden dataset with retrieval-only, tool-only, and mixed cases; put a different model family on the judge's bench than the generator's; and set gates at 0.85, not 0.95. Our harness is above.
Swap in your agent, run all three, and trust the comparison only where they agree.
Sources
- Agentic and tool-use metrics, RAGAS docs, Tool Call Accuracy and Agent Goal Accuracy definitions
- Evaluate an AI Agent, RAGAS tutorials, LangGraph/LlamaIndex agent recipes
- Align an LLM as a Judge, RAGAS docs, judge alignment and rubric metrics
- Tool Correctness, DeepEval docs, turn- and trajectory-level tool scoring
- Arena G-Eval, DeepEval docs, built-in blinded, position-randomized pairwise judging
- G-Eval, DeepEval docs, criteria, evaluation steps, and bias-mitigating rubrics
- TruLens and the RAG Triad, context relevance, groundedness, answer relevance
- Zheng et al., 2023, arXiv:2306.05685, MT-Bench; named position, verbosity, and self-enhancement biases
- Shi et al., 2024, arXiv:2406.07791, systematic position-bias study across 15 judges and 22 tasks
- Saito et al., 2023, arXiv:2310.10076, verbosity bias in LLM preference labeling
- Wataoka et al., 2024, arXiv:2410.21819, self-preference bias traced to perplexity
- Li et al., 2024, arXiv:2310.01432, PORTIA split-and-merge position-bias alignment
