Key findings from the Hugging Face Agentic Evals Workshop. Five speakers, five perspectives on why evaluation must evolve. From eval transparency to reliability gaps to dynamic benchmarks — this is what the community found.
Evaluation nuances are hidden in the fine print. Model developers are becoming less transparent, not more.
Positive signal: third-party evaluations are increasing in both quality and quantity.
A unified open data format collecting ALL first- and third-party evaluations on Hugging Face. Schema requires: source provenance, model specification (quantization, version), evaluation library, instance-level results. Goal: “Eval Cards” — visit a website, click any model, see all of its evaluations organized in one place.
Governance requires session data. Just giving a score is not enough to audit agent behavior. White-box system records are needed to identify which actor was responsible for a harm.
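A white-box session record might look like an append-only event log where every action carries an actor label, so harms can be attributed after the fact. A minimal sketch, with hypothetical actor and action names:

```python
import time

# Hypothetical white-box session log: every event records which actor acted.
session: list[dict] = []

def log_event(actor: str, action: str, payload: dict) -> None:
    session.append({"ts": time.time(), "actor": actor,
                    "action": action, "payload": payload})

log_event("planner", "tool_call", {"tool": "send_email", "to": "user@example.com"})
log_event("executor", "tool_result", {"status": "sent"})

# Audit question: which actor initiated each side-effecting call?
responsible = [e["actor"] for e in session if e["action"] == "tool_call"]
```

A bare score cannot answer the audit question; the event log can.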
AI agents crush capability benchmarks. Yet there is no measurable GDP impact. What’s going on?
The Wright brothers demonstrated that planes can fly; getting aviation to one error per trillion miles took most of a century. AI agent reliability may face a similar timeline.
When benchmarks “saturate,” maybe it’s the metric that’s saturated, not the task suite. Planning a Reliability Index to track all 12 metrics over time.
Real agents work in a changing world. Emails arrive, prices shift, meetings get cancelled. GAIA 2 simulates interconnected apps where the world changes during the task.
Ambiguity: Agent must recognize when to stop and ask clarifying questions before acting.
Time-based tasks: near zero percent success across all frontier models.
Moved away from rubric-based judging — too expensive and too dependent on LLM capability. Hard verifiers are the default.
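A hard verifier checks the final environment state programmatically rather than asking an LLM judge to grade a transcript, making grading cheap, deterministic, and independent of judge capability. A minimal sketch with a hypothetical calendar task (the task and state layout are my own illustration, not GAIA 2's):

```python
# Hypothetical task: the agent should have rescheduled "standup" to 10:00.
def verify_meeting_moved(state: dict) -> bool:
    # Deterministic check on final world state: no LLM judge involved.
    meeting = state.get("calendar", {}).get("standup")
    return meeting is not None and meeting["time"] == "10:00"

success_state = {"calendar": {"standup": {"time": "10:00", "attendees": 5}}}
failure_state = {"calendar": {}}  # agent never touched the calendar
```

The trade-off: hard verifiers require tasks with machine-checkable success conditions, which constrains task design.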
Like Hiccup in How to Train Your Dragon — the real transition happens when you start to observe and truly understand your agent before trying to train it.
Environment = Sandbox containing dependencies, state of the world, tools, and data. Formats: Docker/Harbor, OpenEnv.
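The four ingredients of an environment can be bundled into a single object. A minimal sketch assuming nothing about the actual Harbor or OpenEnv wire formats:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Environment:
    # Hypothetical structure: the named formats define their own layouts.
    state: dict                               # world state (inbox, calendar, files, ...)
    tools: dict[str, Callable] = field(default_factory=dict)  # tools exposed to the agent
    data: dict = field(default_factory=dict)  # seed data / fixtures

def read_inbox(env: Environment) -> list[str]:
    return env.state.get("inbox", [])

env = Environment(state={"inbox": ["invoice.pdf"]})
env.tools["read_inbox"] = lambda: read_inbox(env)
```

Packaging all four pieces together is what makes a run reproducible: the same sandbox yields the same starting world for every rollout.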
Once the environment, tasks, and graders are solid, run RL algorithms such as GRPO to optimize. Measure improvements, then deploy.
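GRPO's core idea is to normalize each sampled completion's reward against its group's mean and standard deviation, replacing a learned value function. A sketch of that advantage computation (the surrounding policy-gradient machinery is omitted):

```python
# GRPO advantage: z-score each rollout's reward within its sampled group.
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect rollouts from a hard verifier:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

This is why solid graders come first: the rewards feeding this computation are exactly the verifier outputs from the previous step.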
Never evaluate via inference providers: you end up evaluating the provider’s serving configuration, not the model itself.
Whether a benchmark is public or private has no strong correlation with how quickly it saturates. The real challenge is multi-agent evaluation: adding more agents creates a combinatorial explosion of parameters.