Key findings from the Hugging Face Agentic Evals Workshop. Five speakers, five perspectives on why evaluation must evolve. From eval transparency to reliability gaps to dynamic benchmarks — this is what the community found.
Evaluation nuances are hidden in the fine print. Model developers are becoming less transparent, not more.
Positive signal: third-party evaluations are increasing in both quality and quantity.
A unified open data format collecting ALL first- and third-party evaluations on Hugging Face. Schema requires: source provenance, model specification (quantization, version), evaluation library, instance-level results. Goal: “Eval Cards” — visit a website, click any model, see all of its evaluations organized in one place.
Governance requires session data. Just giving a score is not enough to audit agent behavior. White-box system records are needed to identify which actor was responsible for a harm.
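A white-box session record might look like an append-only event log where every action carries an actor label, so harms can be attributed after the fact. A minimal sketch, with hypothetical actor and action names:

```python
import time

# Hypothetical white-box session log: every event records which actor acted.
session: list[dict] = []

def log_event(actor: str, action: str, payload: dict) -> None:
    session.append({"ts": time.time(), "actor": actor,
                    "action": action, "payload": payload})

log_event("planner", "tool_call", {"tool": "send_email", "to": "user@example.com"})
log_event("executor", "tool_result", {"status": "sent"})

# Audit question: which actor initiated each side-effecting call?
responsible = [e["actor"] for e in session if e["action"] == "tool_call"]
```

A bare score cannot answer the audit question; the event log can.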
AI agents crush capability benchmarks. Yet there is no measurable GDP impact. What’s going on?
The Wright brothers demonstrated that planes can fly; getting aviation to one error per trillion miles took most of a century. AI agent reliability may face a similar timeline.
When benchmarks “saturate,” maybe it’s the metric that’s saturated, not the task suite. Planning a Reliability Index to track all 12 metrics over time.
Real agents work in a changing world. Emails arrive, prices shift, meetings get cancelled. GAIA 2 simulates interconnected apps where the world changes during the task.
Ambiguity: Agent must recognize when to stop and ask clarifying questions before acting.
Time-based tasks: near zero percent success across all frontier models.
Moved away from rubric-based judging — too expensive and too dependent on LLM capability. Hard verifiers are the default.
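A hard verifier checks the final environment state programmatically rather than asking an LLM judge to grade a transcript, making grading cheap, deterministic, and independent of judge capability. A minimal sketch with a hypothetical calendar task (the task and state layout are my own illustration, not GAIA 2's):

```python
# Hypothetical task: the agent should have rescheduled "standup" to 10:00.
def verify_meeting_moved(state: dict) -> bool:
    # Deterministic check on final world state: no LLM judge involved.
    meeting = state.get("calendar", {}).get("standup")
    return meeting is not None and meeting["time"] == "10:00"

success_state = {"calendar": {"standup": {"time": "10:00", "attendees": 5}}}
failure_state = {"calendar": {}}  # agent never touched the calendar
```

The trade-off: hard verifiers require tasks with machine-checkable success conditions, which constrains task design.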
Like Hiccup in How to Train Your Dragon — the real transition happens when you start to observe and truly understand your agent before trying to train it.
Environment = Sandbox containing dependencies, state of the world, tools, and data. Formats: Docker/Harbor, OpenEnv.
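The four ingredients of an environment can be bundled into a single object. A minimal sketch assuming nothing about the actual Harbor or OpenEnv wire formats:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Environment:
    # Hypothetical structure: the named formats define their own layouts.
    state: dict                               # world state (inbox, calendar, files, ...)
    tools: dict[str, Callable] = field(default_factory=dict)  # tools exposed to the agent
    data: dict = field(default_factory=dict)  # seed data / fixtures

def read_inbox(env: Environment) -> list[str]:
    return env.state.get("inbox", [])

env = Environment(state={"inbox": ["invoice.pdf"]})
env.tools["read_inbox"] = lambda: read_inbox(env)
```

Packaging all four pieces together is what makes a run reproducible: the same sandbox yields the same starting world for every rollout.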
Once the environment, tasks, and graders are solid, run RL algorithms such as GRPO to optimize. Measure improvements, then deploy.
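GRPO's core idea is to normalize each sampled completion's reward against its group's mean and standard deviation, replacing a learned value function. A sketch of that advantage computation (the surrounding policy-gradient machinery is omitted):

```python
# GRPO advantage: z-score each rollout's reward within its sampled group.
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect rollouts from a hard verifier:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

This is why solid graders come first: the rewards feeding this computation are exactly the verifier outputs from the previous step.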
Never evaluate via inference providers: you end up evaluating the provider’s serving configuration, not the model itself.
Whether a benchmark is public or private has no strong correlation with how quickly it saturates. The real challenge is multi-agent evaluation: adding more agents creates a combinatorial explosion of parameters.