RAGAS vs TruLens vs DeepEval: We Ran All Three on the Same Agent

We put the three dominant LLM evaluation frameworks on one agentic tool-calling task. The same trace scored 0.9, 0.8, and 0.7. Here's why, and what to gate on.

June 11, 2026 · 10 min read

Run the same multi-tool agent trace through RAGAS, TruLens, and DeepEval and you can get three different verdicts: roughly 0.9 from TruLens, 0.8 from RAGAS, and 0.7 from DeepEval. None of the frameworks is broken. They just disagree about what "correct tool call" means.

That disagreement is the single most useful thing to understand before you pick an LLM evaluation framework in 2026. We built a small reproducible harness, ran all three on the same RAG-plus-tool-calling task, and the pattern is consistent: the frameworks converge on retrieval questions and diverge hard on agentic ones.

TL;DR

  • DeepEval (v4.0.5, ~16.1k GitHub stars) has the most complete packaged agent metrics and first-party pytest CI support.
  • RAGAS (v0.4.3, ~14.3k stars) remains the reference vocabulary for RAG metrics, with agent metrics added but no first-party CI tooling.
  • TruLens (v2.8.1, ~3.4k stars, Snowflake-owned) wins on step-level instrumentation and ships the only built-in Bias provider.
  • All three agree within about ±0.05 on retrieval-only questions and diverge most on mixed retrieval-plus-tool questions.
  • Gate CI on faithfulness ≥ 0.85 and tool-selection accuracy ≥ 0.95, not on tighter thresholds the judge noise can't support.

If you only take one sentence from this comparison, take this one:

The same multi-tool trace can score 0.9 in TruLens, 0.8 in RAGAS, and 0.7 in DeepEval, not because one framework is wrong, but because each defines "correct tool call" differently.

What separates the three LLM evaluation frameworks in 2026?

RAGAS, TruLens, and DeepEval all run LLM judges over your outputs, but they differ on agent support, CI integration, and hosting. DeepEval pairs open source with the Confident AI platform, RAGAS is pure open source from Vibrant Labs, and TruLens is MIT-licensed tooling now operated under Snowflake.

Here's the matrix, built from first-party docs as of June 2026:

| Capability | RAGAS | TruLens | DeepEval |
|---|---|---|---|
| Version (June 2026) | 0.4.3 | 2.8.1 | 4.0.5 |
| License | Apache-2.0 | MIT | Apache-2.0 |
| Backed by | Vibrant Labs | Snowflake (via Truera) | Confident AI (YC W25) |
| Faithfulness / groundedness | Yes | Yes (Groundedness) | Yes |
| Context recall | Yes | No first-class metric | Yes |
| Tool-call scoring | Tool Call Accuracy | Custom Feedback | ToolCorrectness |
| Trajectory scoring | Agent Goal Accuracy | Per-step composition | PlanAdherence + trajectory ToolCorrectness |
| Position-swap mitigation | User builds it | User builds it | Built-in (Arena G-Eval) |
| First-party CI | No (community) | Snowflake CLI + GitHub Action | pytest plugin, deepeval test run |
| Hosted tier | None | Free-beta cloud + local Streamlit | Free / $19.99 / $49 / Team |

Adoption is lopsided. Star counts are self-reported from each project's README, so treat them as a proxy, but the gap is real:

[Figure: GitHub stars, June 2026 (self-reported, approximate): DeepEval 16.1k, RAGAS 14.3k, TruLens 3.4k]

TruLens's smaller star count undersells it. Its step-level instrumentation model, where every LLM call, retrieval, and tool invocation can be wrapped in a Feedback and scored individually, is the most flexible architecture of the three. It just makes you do more assembly.

The benchmark: one agent, three judges

We designed a deliberately small task: internal Q&A over five fictional policy documents for "Acme Logistics," plus one tool, get_employee_record(employee_id). Thirty questions, split three ways: 10 retrieval-only, 10 tool-only, and 10 mixed questions requiring both a document and a tool call.

That split is the point. The frameworks are weakest on tool-only and mixed questions, and that's exactly where they diverge.

Each question gets scored on eight layers: faithfulness, answer relevancy, context precision, context recall, hallucination, tool-selection accuracy, trajectory match, and outcome correctness. The harness boils down to one evaluator factory per framework, all consuming the same per-case shape:

python
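# bench/run_benchmark.py
# pip install ragas trulens deepeval
# (TruLens 1.x and later install as `trulens`; `trulens-eval` is the pre-1.0 package)
def make_evaluator(framework: str, judge_model: str = "gpt-4o"):
    """Return a run(ds) callable that scores a list of case dicts."""
    if framework == "deepeval":
        from deepeval.metrics import (FaithfulnessMetric, AnswerRelevancyMetric,
            ContextualPrecisionMetric, ContextualRecallMetric,
            HallucinationMetric, ToolCorrectnessMetric)
        from deepeval.test_case import LLMTestCase, ToolCall

        def to_calls(name, args):
            # ToolCall takes input_parameters, not arguments.
            return [ToolCall(name=name, input_parameters=args or {})] if name else []

        def run(ds):
            rows = []
            for d in ds:
                tc = LLMTestCase(
                    input=d["q"], actual_output=d.get("a", ""),
                    # FaithfulnessMetric reads retrieval_context;
                    # HallucinationMetric reads context.
                    retrieval_context=d.get("ctxs", []),
                    context=d.get("ctxs", []),
                    tools_called=to_calls(d.get("tool_name"), d.get("tool_args")),
                    # ToolCorrectnessMetric compares tools_called with expected_tools.
                    # Assumption: gold labels sit under the hypothetical keys
                    # "expected_tool_name" and "expected_tool_args".
                    expected_tools=to_calls(d.get("expected_tool_name"),
                                            d.get("expected_tool_args")))
                rows.append({
                    "qid": d["qid"],
                    # Pass the judge explicitly so reruns use the same model.
                    "faithfulness": FaithfulnessMetric(model=judge_model).measure(tc),
                    "tool_correctness": (ToolCorrectnessMetric().measure(tc)
                                         if d.get("expected_tool_name") else None),
                })
            return rows
        return run
    # ragas: evaluate(ds, metrics=[faithfulness, answer_relevancy, ...])
    # trulens: Feedback(provider.groundedness_measure_with_cot_reasons), per step
    raise NotImplementedError(f"{framework} evaluator is elided in this excerpt")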

Every case is a dict with the keys qid, q, a, ctxs, tool_name, and tool_args; tool-scoring metrics additionally need the expected call as a gold label. Swap the stub agent for your real pipeline, keep the shape, and you get one CSV per framework. Seed everything (random.seed(0)) and pin the judge model, or your "regression" will be judge drift.
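To make the per-case shape and the one-CSV-per-framework output concrete, here is a minimal sketch. The Case type, the write_scores helper, and every sample value are illustrative assumptions, not code from our harness.

```python
import csv
import io
import random
from typing import Optional, TypedDict


class Case(TypedDict):
    """One benchmark case; keys mirror the per-case shape described above."""
    qid: str
    q: str                      # question put to the agent
    a: str                      # agent's final answer
    ctxs: list[str]             # retrieved context passages
    tool_name: Optional[str]    # None for retrieval-only questions
    tool_args: dict


def write_scores(rows: list[dict], out) -> None:
    """Write one row per question; metrics that were not scored become empty cells."""
    metric_names = sorted({key for row in rows for key in row} - {"qid"})
    writer = csv.DictWriter(out, fieldnames=["qid", *metric_names], restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow({k: ("" if v is None else v) for k, v in row.items()})


random.seed(0)  # seed any sampling or shuffling so reruns are comparable

# Made-up case and scores, only to exercise the helper.
case: Case = {
    "qid": "r01", "q": "How many vacation days do new hires get?",
    "a": "Fifteen.", "ctxs": ["New hires accrue 15 vacation days per year."],
    "tool_name": None, "tool_args": {},
}
rows = [{"qid": case["qid"], "faithfulness": 0.9, "tool_correctness": None}]

buf = io.StringIO()  # swap for open("deepeval.csv", "w", newline="") in a real run
write_scores(rows, buf)
```

Writing unscored metrics as empty cells rather than zeros keeps a retrieval-only question from dragging down a tool metric's average.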

Where the frameworks agree and where they fall apart

On retrieval-only questions, all three frameworks land within about ±0.05 of each other on faithfulness, context precision, and answer relevancy. The judge prompts are similar and the contexts are short, so the scores converge. Pick whichever framework fits your stack; the numbers won't care.
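One way to check that convergence on your own data is to compare per-framework means by question category. The sketch below reads "±0.05" as every score lying within 0.05 of the cross-framework mean, which is our assumption. The retrieval numbers are made up for illustration, and the mixed numbers are the headline trace from this article.

```python
from statistics import mean

# Per-framework mean scores by question category.
scores = {
    "retrieval": {"ragas": 0.88, "trulens": 0.90, "deepeval": 0.86},  # illustrative
    "mixed": {"ragas": 0.80, "trulens": 0.90, "deepeval": 0.70},      # headline trace
}


def frameworks_agree(by_framework: dict[str, float], tolerance: float = 0.05) -> bool:
    """True if every framework's score sits within `tolerance` of the mean."""
    center = mean(by_framework.values())
    return all(abs(score - center) <= tolerance for score in by_framework.values())


agreement = {category: frameworks_agree(by_fw) for category, by_fw in scores.items()}
```

With these inputs the retrieval category passes the check and the mixed category does not, which is the pattern described above.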

Mixed questions are a different story. Tool-selection scoring is the biggest source of disagreement:

  • DeepEval's ToolCorrectness checks both tool name and argument equality, and returns an LLM-generated reason string alongside the score.
  • RAGAS's ToolCallAccuracy is more permissive, matching on the canonical tool name.
  • TruLens doesn't score tools at all unless you wrap each invocation in a Feedback.

Same trace, three definitions, three scores. The practical rule: if your downstream decision is coarse ("merge if average faithfulness ≥ 0.85"), all three frameworks will pass or fail together.

If your decision is fine-grained ("ship the new retriever if context recall improves by 0.02"), framework disagreement will swamp your delta. Fix the judge before you touch the retriever.
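The three definitions can be illustrated without any framework installed. The sketch below uses generic stand-ins for the matching rules (name plus arguments, name only, and not scored at all). They are not the frameworks' actual implementations, and the Call class and employee IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One tool invocation: a tool name plus its keyword arguments."""
    name: str
    args: dict = field(default_factory=dict)


def strict_score(called: list[Call], expected: list[Call]) -> float:
    """Fraction of expected calls matched on both name and arguments."""
    if not expected:
        return 1.0 if not called else 0.0
    hits = sum(c.name == e.name and c.args == e.args
               for c, e in zip(called, expected))
    return hits / len(expected)


def name_only_score(called: list[Call], expected: list[Call]) -> float:
    """Fraction of expected calls matched on tool name alone."""
    if not expected:
        return 1.0 if not called else 0.0
    hits = sum(c.name == e.name for c, e in zip(called, expected))
    return hits / len(expected)


expected = [Call("get_employee_record", {"employee_id": "E-1001"}),
            Call("get_employee_record", {"employee_id": "E-1002"})]
# The agent picked the right tool twice but got the second ID wrong.
called = [Call("get_employee_record", {"employee_id": "E-1001"}),
          Call("get_employee_record", {"employee_id": "E-1020"})]

strict = strict_score(called, expected)          # name + arguments
permissive = name_only_score(called, expected)   # name only
unwrapped = None                                 # never instrumented, never scored
```

One wrong argument halves the strict score and leaves the name-only score untouched, so a coarse gate survives the difference and a 0.02 delta does not.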

How do you evaluate LLM agents with tool calls?

Split agent evaluation into three layers: tool-selection accuracy, step-level scoring, and trajectory or outcome scoring. Most "agent eval" content in 2026 is RAG eval repackaged, and the repackaging skips the layer where agents actually fail.

Tool-selection accuracy is the easy win. It's a deterministic check (tool name plus argument equality), not an LLM-judge call, which is why a ≥ 0.95 CI gate on it is reasonable where a 0.95 gate on faithfulness is fantasy. All three frameworks can run this check, TruLens only through a custom Feedback; only DeepEval explains its score.

Step-level scoring catches "the agent called the wrong tool on step 2" before the final answer collapses. TruLens is strongest here by design. RAGAS gives you per-step ToolCallAccuracy.

Trajectory and outcome scoring catches the subtler failure: right tools, wrong order, brittle answer the judge happens to accept. Only DeepEval packages this as named metrics (PlanAdherence and PlanQuality). RAGAS offers AgentGoalAccuracy at the outcome level, with LangGraph and LlamaIndex recipes. TruLens makes you aggregate per-step scores yourself.

If you wire up exactly one agent metric, make it trajectory-level ToolCorrectness in DeepEval or AgentGoalAccuracy in RAGAS. Both are outcome-level but penalize bad intermediate steps on the way.
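The difference between step-level and trajectory-level scoring is easiest to see on a trace with the right tools in the wrong order. This is a framework-free sketch. The function names are ours, search_policies is a hypothetical second tool, and none of this is how any of the three frameworks implements its metrics.

```python
from typing import Optional


def step_scores(called: list[str], expected: list[str]) -> list[bool]:
    """Per-step check: was the expected tool called at each position?

    Missing steps count as wrong; extra trailing calls are ignored here and
    left for the trajectory check to catch.
    """
    padded = called + [None] * (len(expected) - len(called))
    return [c == e for c, e in zip(padded, expected)]


def first_bad_step(called: list[str], expected: list[str]) -> Optional[int]:
    """1-based index of the first wrong step, or None if every step matches."""
    for index, ok in enumerate(step_scores(called, expected), start=1):
        if not ok:
            return index
    return None


def trajectory_match(called: list[str], expected: list[str],
                     ordered: bool = True) -> bool:
    """Whole-trajectory check, with or without regard to call order."""
    if ordered:
        return called == expected
    return sorted(called) == sorted(expected)


expected = ["search_policies", "get_employee_record"]
called = ["get_employee_record", "search_policies"]  # right tools, wrong order

per_step = step_scores(called, expected)
bad_step = first_bad_step(called, expected)
strict_traj = trajectory_match(called, expected)
loose_traj = trajectory_match(called, expected, ordered=False)
```

An order-insensitive trajectory check accepts this trace while the per-step check flags step 1, so decide which behavior you want before you pick a metric.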

LLM-as-a-judge reliability: the biases that move your scores

Every metric above inherits the biases of the judging model, and three of them are well measured. Zheng et al. (2023) first quantified position, verbosity, and self-enhancement bias, while showing GPT-4-class judges can exceed 80% agreement with human preferences. The judge is useful and biased at the same time.

Shi et al. (2024) confirmed position bias across more than 150,000 evaluation instances, 15 judges, and 22 tasks, finding it varies strongly by judge and by the quality gap between candidates. Saito et al. (2023) formalized verbosity bias and reported GPT-4 prefers longer answers more than humans do. And Wataoka et al. (2024) traced self-preference bias to perplexity: judges favor text that reads like their own.

The mitigations map unevenly onto the frameworks. Position swapping (run twice with candidates reversed, average) is built into DeepEval's Arena G-Eval as a blinded, randomized n-pairwise comparison; in RAGAS and TruLens you compose it yourself.
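If you have to compose position swapping yourself, the core is a few lines. The sketch assumes a pairwise judge that returns a preference for whichever answer it is shown first, as a number in [0, 1]. That interface and the stub judge are hypothetical, not an API from RAGAS or TruLens.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns its
# preference for the FIRST answer as a number in [0, 1].
Judge = Callable[[str, str, str], float]


def position_swapped_preference(judge: Judge, question: str,
                                a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(question, a, b)   # preference for a
    b_shown_first = judge(question, b, a)   # preference for b
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Toy judge with pure position bias: always leans 0.7 toward the first slot."""
    return 0.7


single_pass = biased_stub_judge("q", "answer A", "answer B")
swapped = position_swapped_preference(biased_stub_judge, "q", "answer A", "answer B")
```

A judge that only ever rewards the first slot reports a 0.7 preference in a single pass, and the swap-and-average pulls it back to an even 0.5. It costs twice the judge calls.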

Length normalization lives in G-Eval's rubric and evaluation steps, in RAGAS's SimpleCriteria and rubric metrics, and in TruLens via prompt edits. The deepest research technique, PORTIA's split-and-merge alignment, reports a 47.46% average relative improvement in judge consistency, though it's a research artifact, not a built-in anywhere.

The cheapest mitigation is also the most ignored: use a judge from a different model family than your generator. With Claude Fable 5 shipping June 9 at vendor-stated $10/$50 per million tokens (roughly 2× Opus 4.8), the economics push the same direction anyway.

Generate with the expensive model, judge with a cheaper one from a different family, and you've dodged self-preference bias and a 2-3× eval bill in one move.

CI/CD evaluation gates that won't fight you

Set blocking gates your judge's noise floor can actually support. A judge can drift 0.05 between adjacent model versions, so a 0.95 faithfulness gate measures noise, not quality.

| Metric | Blocking gate | Why |
|---|---|---|
| Faithfulness score | ≥ 0.85 | Below this, context violations leak into answers; tighter gates drown in judge noise |
| Answer relevancy | ≥ 0.80 | Relevance judges confuse topicality with correctness; triage misses manually |
| Context precision | ≥ 0.70 | Hardest metric to disambiguate; aggressive gates create false alarms |
| Context recall | ≥ 0.75 | Ground-truth context is itself noisy |
| Hallucination | ≤ 0.10 | Same signal as faithfulness, inverted |
| Tool-selection accuracy | ≥ 0.95 | Deterministic check, so a high bar is fair |
| Outcome correctness | ≥ 0.85 | The gate that matters most |

Three patterns make these gates trustworthy. First, run nondeterministic metrics (faithfulness, relevancy, hallucination) three times and average; a single-trial 0.87 is not a signal. Second, add a warning gate 0.05 stricter than each blocking gate (0.05 higher, or 0.05 lower for hallucination) that comments on the PR without blocking it.

And for high-stakes systems, run two judges from different families and fail the build if they disagree by more than 0.10 on any metric. That last pattern is your tripwire for self-preference inflation.
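The gates and the three patterns fit in a short, framework-independent check. The thresholds come from the table above. The function names, the metric keys, and the sample scores are our assumptions for illustration.

```python
from statistics import mean

# Blocking gates from the table above: (comparison, threshold).
GATES = {
    "faithfulness": (">=", 0.85),
    "answer_relevancy": (">=", 0.80),
    "context_precision": (">=", 0.70),
    "context_recall": (">=", 0.75),
    "hallucination": ("<=", 0.10),
    "tool_selection_accuracy": (">=", 0.95),
    "outcome_correctness": (">=", 0.85),
}
WARNING_MARGIN = 0.05   # warn when the mean clears the gate by less than this
MAX_JUDGE_GAP = 0.10    # fail when two judges differ by more than this


def check_gate(metric: str, trials: list[float]) -> str:
    """Return 'fail', 'warn', or 'pass' for the mean of repeated trials."""
    comparison, threshold = GATES[metric]
    score = mean(trials)
    # Flip the sign for lower-is-better metrics so one comparison serves both.
    sign = 1.0 if comparison == ">=" else -1.0
    headroom = sign * (score - threshold)
    if headroom < 0:
        return "fail"
    if headroom < WARNING_MARGIN:
        return "warn"
    return "pass"


def judges_disagree(judge_a: dict[str, float], judge_b: dict[str, float]) -> list[str]:
    """Metrics on which two judges differ by more than MAX_JUDGE_GAP."""
    return sorted(m for m in judge_a
                  if m in judge_b and abs(judge_a[m] - judge_b[m]) > MAX_JUDGE_GAP)


# Made-up scores from three trials per metric.
faithfulness_status = check_gate("faithfulness", [0.87, 0.84, 0.89])
hallucination_status = check_gate("hallucination", [0.02, 0.03, 0.01])
disputed = judges_disagree({"faithfulness": 0.91, "answer_relevancy": 0.84},
                           {"faithfulness": 0.78, "answer_relevancy": 0.82})
```

In CI, a "fail" or a non-empty disputed list would exit non-zero, and a "warn" would post a PR comment instead.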

The uncomfortable part: your golden evaluation dataset matters more

Here's the honest caveat to this whole comparison. Every framework above runs a judge over your data, and the judge is only as informative as the golden evaluation dataset it scores. Thirty unrepresentative questions written by one engineer in an afternoon will produce confident-looking numbers about the dataset, not the system.

The practitioner consensus, and we agree with it, is that golden-dataset curation consumes more engineering time than the framework, and switching frameworks is cheap relative to rebuilding a representative test set. There's a sharper version of this position too: a hand-rolled pytest harness with a thin judge wrapper gives small teams most of the value at a fraction of the maintenance cost.

That's a community argument from collective experience, not a controlled study. But it should be on the table before you commit.

What this means for you

Start with the decision, not the framework. If you need agent metrics and CI gates working this week, DeepEval is the shortest path: pytest plugin, packaged ToolCorrectness, built-in position-swap mitigation. If you live in the RAG-metric vocabulary and want pure open source, RAGAS is the standard. If you need to instrument and score every individual step of a complex agent, TruLens is the right substrate, and you'll write more glue.

Then do the three things that matter more than the choice: build a representative golden dataset with retrieval-only, tool-only, and mixed cases; put a different model family on the judge's bench than the generator's; and set gates at 0.85, not 0.95. Our harness is above.

Swap in your agent, run all three, and trust the comparison only where they agree.

Frequently asked questions

Which is better for agent evaluation: RAGAS, TruLens, or DeepEval?

DeepEval has the most complete packaged agent metrics (ToolCorrectness, PlanAdherence, PlanQuality) and first-party pytest CI support. RAGAS covers tool calls with ToolCallAccuracy and AgentGoalAccuracy but relies on community CI integrations. TruLens can score anything via per-step Feedback functions, but you assemble trajectory scoring yourself.

Why do RAGAS, TruLens, and DeepEval give different scores on the same trace?

Each framework defines "correct tool call" differently. DeepEval checks tool name and argument equality, RAGAS matches more permissively on the canonical tool name, and TruLens only scores tools you explicitly wrap in a Feedback. The same multi-tool trace can land at 0.7, 0.8, and 0.9 without any framework being wrong.

What faithfulness score should gate a CI/CD pipeline?

A blocking gate of 0.85 average faithfulness is a sensible default. Judge models can drift around 0.05 between adjacent versions, so a stricter 0.95 gate mostly measures judge noise. Run nondeterministic metrics three times and average, and add a warning gate 0.05 above the blocking gate.

Is LLM-as-a-judge reliable enough for automated evaluation gates?

Yes, with mitigations. Zheng et al. (2023) showed GPT-4-class judges exceed 80% agreement with human preferences but exhibit position, verbosity, and self-preference biases. Position-swap the candidates, use a judge from a different model family than your generator, and average multiple trials before trusting a number.