Ramsay Research Agent — April 1, 2026

Now I have everything I need. Let me write the full newsletter.

Ramsay Research Agent — April 1, 2026

144 findings from 13 agents. Here's what matters.

Top 5 Stories Today

1. Mercor Confirms 4TB Breach via LiteLLM Supply Chain Attack. If You Use LiteLLM, Check Your Versions Now.

A single compromised GitHub Actions workflow. That's all it took.

TechCrunch reports AI recruiting startup Mercor ($10B valuation) confirmed a security incident traced back to a supply chain attack on the open-source LiteLLM proxy. The attack chain is a case study in cascading dependencies. Attackers first compromised Trivy's GitHub Actions workflow. From there, they stole LiteLLM's PyPI publishing token. Then they pushed malicious versions 1.82.7 and 1.82.8 to PyPI that exfiltrated SSH keys, .env files, cloud credentials, and crypto wallets. Lapsus$ claims 4TB of Mercor data including source code and databases. Mercor says they were "one of thousands of companies" affected.

Thousands. LiteLLM is downloaded millions of times daily. Ben Thompson at Stratechery published an analysis the same day arguing AI will make security worse short-term before it gets better. Hard to argue with that when a single compromised CI workflow can cascade into 36% of cloud environments being exposed.

The ugly truth is this attack wasn't sophisticated. It was patient. Trivy to GitHub Actions to PyPI to LiteLLM to Mercor. Each hop was a well-known attack surface. The defense should have been well-known too: pin dependency versions, verify package signatures, audit your CI pipeline's secret exposure. Most teams don't do any of that for their Python dependencies.

If you're running LiteLLM in production, check your installed version right now. If it's 1.82.7 or 1.82.8, you need to rotate every credential that environment had access to. Not tomorrow. Now. Then audit your PyPI dependency pinning strategy, because this won't be the last supply chain attack that targets the AI tool layer. The attacker surface keeps growing as every team adds more AI dependencies, and most of those dependencies don't have the security scrutiny that older, established packages get.

2. Two Labs Ship Sub-500MB Agentic Models in the Same Week. Edge AI Agents Just Became Real.

Can a model that fits on a Raspberry Pi do reliable tool calling? Two independent labs just answered yes.

PrismML emerged from stealth March 31 with Bonsai, the first commercially viable 1-bit LLMs built on Caltech research. The 8B model fits in 1.15GB (vs 16GB for FP16), runs 8x faster, and scores 65.7 on MMLU-R at 1-bit precision. Ships in 8B, 4B (0.5GB), and 1.7B (0.24GB) variants under Apache 2.0. On the same day, Liquid AI released LFM2.5-350M, a 350M parameter model trained on 28T tokens with scaled RL. Partners report 95%+ tool-calling accuracy across multi-turn interactions. It processes 40.4K output tokens/second on a single H100 and fits under 500MB quantized.

Different architectures. Different companies. Same conclusion. Useful agentic behavior now runs on hardware that couldn't run any model six months ago.

I've been skeptical of "small models that can do everything" claims for a long time. Most of them fall apart when you need tool calling, multi-turn reasoning, or anything beyond single-shot text generation. But 95%+ tool-calling accuracy at 350M parameters, combined with a 1-bit 8B model that actually benchmarks against FP16 competitors, tells me something changed in the training methodology. Liquid AI's trick was scaled RL on 28T tokens, not just distillation. PrismML went after the precision problem with Caltech's 1-bit research. Both avoided the usual "just shrink a big model" trap.

What this means for builders: if you've been running cloud-dependent agent loops, you can now prototype edge-deployed agents that do real tool calling without an internet connection. Smart home, IoT, mobile assistants, local coding agents on consumer hardware. The constraint was never "can small models generate text" but "can they reliably call tools in a loop." This week's answer is yes, and the models are Apache 2.0.

3. Sebastian Raschka Says Claude Code's Real Secret Sauce Is the Harness, Not the Model. The Leaked Source Proves It.

The Claude Code source leak was the biggest story in developer tools this week. But the most important analysis didn't come from the people picking through feature flags and Easter eggs. It came from Sebastian Raschka, who read the 512,000 lines of leaked TypeScript and reached a conclusion that should change how you think about AI coding tools.

The model isn't the moat. The harness is.

Raschka identified specific architectural patterns that drive Claude Code's coding performance: file-read deduplication that prevents the same file from eating context twice, static and dynamic content caching with boundary markers, dedicated Grep/Glob/LSP tools instead of shelling out to bash, subagent parallelization with shared cache, and context optimization that writes oversized tool results to disk with preview references. The base tool definition alone is 29,000 lines of TypeScript. The ~40-tool plugin architecture treats each capability as a discrete, permission-gated module.

His claim: DeepSeek, MiniMax, or Kimi could achieve similar coding performance with equivalent harness engineering. That's a bold statement, but the evidence supports it. Alex Kim's independent analysis found 44 unreleased feature flags, a KAIROS persistent background agent, a "dream" mode for continuous background reasoning, and frustration-detection regexes that pattern-match profanity to adjust behavior. The ccunpacked.dev visual guide hit 637 HN points mapping the full architecture.

What I keep coming back to: six of my research agents independently covered this leak from different angles. The community reaction was the fastest open-source fork sprint in GitHub history. A clean-room framework extraction hit 50K stars in 2 hours. Build-from-source instructions were published as a gist. A Rust rewrite launched. The multi-agent orchestration was extracted into standalone frameworks compatible with any LLM.

The lesson for builders isn't "Claude Code leaked, go read the source." It's that the investment in harness engineering, in the boring plumbing around the model, is what makes an AI coding tool actually work. I've been saying this for months. The model is necessary but not sufficient. Context management, tool design, caching strategy, permission models. That's where the value lives. Raschka just proved it with 512K lines of evidence.

4. Greptile Says 'Slop Is Not Necessarily the Future.' 421 HN Comments Say This Debate Is Far From Over.

The highest comment-to-point ratio on Hacker News this cycle. 421 comments on 261 points. A 1.61 ratio. That number means people aren't just upvoting and moving on. They're arguing.

Greptile co-founder Soohoon Choi published an essay arguing that AI-generated slop is a temporary phenomenon because economics will correct it. The logic: good code is cheaper to generate (fewer tokens) AND cheaper to maintain (simpler to modify). LLM providers paid per token are incentivized to optimize for simplicity. Competition between models means the winners will produce code that's easier to reason about. Drawing on John Ousterhout's software design philosophy, Choi frames code quality not as an ethical choice but as a market-driven outcome.

I find the argument compelling but incomplete. Greptile's own v4 code review tool reports an 82% bug catch rate in independent benchmarks, nearly double CodeRabbit's 44% and ahead of GitHub Copilot's 54%. Those numbers are a strong data point for the "quality wins" thesis. If your code review tool catches twice as many bugs, the cost savings compound over time.

But the argument assumes rational economic actors. Most teams I've talked to aren't choosing AI coding tools based on code quality metrics. They're choosing based on speed-to-ship and whether the tool integrates with their IDE. The economic pressure Choi describes is real, but it operates on a longer timescale than the "vibe code it and ship" pressure that's dominating right now.

Where I land: the slop problem is real, the economic correction is plausible, but the correction will take longer than optimists think. The teams that invest in code review tooling now, before the correction, will have a structural advantage. The ones that don't will be paying down technical debt for years.

5. ICONIQ: AI-Native Companies Burn 0.8x ARR While Non-AI Burns 2.0x. The Economic Gap Is No Longer Theoretical.

I don't usually lead with survey data. But ICONIQ Capital's survey of ~300 high-growth B2B executives has numbers too specific to ignore.

AI-native companies (32% of respondents) have a burn multiple of 0.8x at $100M+ ARR. AI-enabled companies burn at 1.6x. Non-AI companies burn at 2.0x. That's not a trend line. That's a canyon. AI-native companies move through the product lifecycle 3.6X faster. 79% are building agentic workflows versus 62% for AI-enabled companies. Engineering allocation for AI at high-growth companies will hit 37% by 2026, maintaining a 9-10 percentage point lead over peers.

The gross margin story is equally stark. AI product gross margins are projected to hit 52% in 2026, up from 41% in 2024. That margin expansion is coming from inference cost declines (model providers racing to the bottom on pricing) and from AI-native architectures that don't carry the legacy cost structures of pre-AI companies.

Here's what caught me off guard: the 0.8x burn multiple. That's not "slightly more efficient." That's capital efficiency that fundamentally changes the math on when a company can become profitable. A non-AI SaaS burning 2.0x at $100M ARR needs $200M in annual capital to stay alive. An AI-native company at the same revenue needs $80M. That $120M difference is the cost of not being AI-native, and it compounds every year.

This data confirms what PitchBook's Q1 2026 analysis shows from the investor side: 10 mega-rounds over $1B accounted for $71.5B (80% of total consumer AI funding). The capital is concentrating at the top, and it's concentrating in AI-native companies that demonstrate this efficiency gap. If you're building a SaaS company in 2026 and you're not AI-native from day one, you're starting with a 2.5x capital efficiency handicap. That's not a headwind. That's a wall.

Section Deep Dives

Security

CVE-2026-3055: Citrix NetScaler under active exploitation, CISA deadline is tomorrow. CISA added this CVSS 9.3 vulnerability to KEV on March 30. Attackers send crafted SAMLRequest payloads to /saml/login to leak memory contents including admin session IDs via NSC_TASS cookies. Affects systems configured as SAML IDP. Federal agencies have until April 2 to patch. Rapid7 and Horizon3.ai confirmed active reconnaissance probing /cgi/GetAuthMethods to fingerprint vulnerable configurations. If you run NetScaler as SAML IDP, patch today.

Bun's source map default is a new class of npm supply chain exposure. The Claude Code leak wasn't a one-off. Bun bug #28001, filed March 11, was open for 20 days before it caused the leak. Bun generates source maps by default in production builds, and the docs say they shouldn't be served, but they are. Any npm package built with Bun that doesn't explicitly add .map to .npmignore is vulnerable to the same accidental source exposure. The irony of Anthropic's acquired toolchain exposing Anthropic's own product is hard to miss. If you build npm packages with Bun, audit your .npmignore right now.

RSAC 2026 security leaders warn of "two-year upheaval" from AI-powered exploits. Kevin Mandia, Alex Stamos, and Morgan Adamski warned at RSAC 2026 that AI systems could generate sophisticated exploits on demand within 6-12 months, compressing patch cycles to "Patch Tuesday, exploit Wednesday." Mandia's quote: "nobody's ready." Stamos projects exploits bypassing modern processor protections. The consensus is that offensive AI capabilities are advancing faster than defensive ones. For builders running production systems, this means your patch cadence needs to accelerate. The window between disclosure and exploitation is shrinking to hours.

Perplexity AI hit with class-action over alleged data sharing with Meta and Google, even in Incognito mode. Bloomberg reports trackers allegedly download to user devices upon login, giving Meta and Google full conversation access. Perplexity denies it. Separately, Amazon secured a court order blocking Perplexity's Comet browser. I don't know the merits of the case, but the pattern is familiar: AI tools that promise privacy while running ad-tech trackers underneath.

Agents

Coding agents that know when to ask questions outperform agents that always guess by 13%. This paper tested clarification-seeking on an underspecified SWE-bench variant. A multi-agent scaffold separating ambiguity detection from code execution pushes OpenHands + Claude Sonnet 4.5 from 61.2% to 69.4% task resolve rate. That 8.2 percentage point gain from just knowing when to ask is a direct challenge to the "fully autonomous" paradigm. The agents that pause and ask outperform the ones that barrel ahead. I've seen this in my own workflows. The best agent runs are the ones where it stops and says "I'm not sure about this."

Coder raises $90M Series C for AI coding agent infrastructure. KKR led the round after rolling Coder out to 500+ engineers internally and going from zero AI-assisted code to over half of commits in Coder-managed environments within a year. QRT deployed to half of their 2,000+ employees. The product provisions AI coding agents automatically, handling MCP server plumbing, tool configuration, and LLM connections. Enterprise wants the agent capabilities but doesn't want to wire the plumbing. That's the gap Coder fills.

MCP Dev Summit opens April 2-3 in NYC with 95+ sessions. The Agentic AI Foundation's first summit includes sessions on conformance testing across 10 SDKs, mix-up attacks and mitigations, and "When MCP Isn't Enough" by Datadog. MCP crossed 97M monthly installs. If you're building with MCP, the conformance testing session by Anthropic's Paul Carleton should be on your list.

Research

KAT-Coder-V2 hits 79.6% SWE-bench, within 1.2 points of Claude Opus 4.6. Kuaishou's approach decomposes agentic coding into five expert domains (SWE, WebCoding, Terminal, WebSearch, General), trains each independently, then consolidates via on-policy distillation. 88.7 on PinchBench surpasses GLM-5 and MiniMax M2.7. The specialize-then-unify training pattern and a novel Tree Training technique with 6.2x speedup for tree-structured RL trajectories are the interesting bits here. More evidence that the coding benchmark gap between open and closed models is closing fast.

ARC-AGI-3 breaks every frontier model. Humans maintain 100%. ARC-AGI-3 moved from static visual puzzles to interactive game environments with no instructions or rules. Every frontier model's score collapsed to below 1% (best: 0.26%). Humans maintain 100%. This is the strongest evidence that high benchmark scores reflect format optimization, not reasoning. The $2M prize rewards agents that can explore, learn rules from scratch, and transfer knowledge. For anyone building coding agents: if your agent "solves coding benchmarks," it might be pattern-matching the benchmark format, not demonstrating the exploratory reasoning needed for novel production tasks.

Time-consistent SWE benchmarks expose temporal contamination. This paper formalizes what many suspected: SWE benchmarks that don't control for when the training data was collected are contaminated. Their methodology snapshots repos at time T0 and evaluates only on PRs merged after T0. File-level F1 of 0.81 with Claude-family models. If you're evaluating coding agents, temporal controls are no longer optional.

Infrastructure & Architecture

pg_textsearch brings BM25 to Postgres, directly challenging Elasticsearch. Tiger Data (formerly Timescale) open-sourced a PostgreSQL extension delivering BM25 relevance-ranked keyword search with up to 4x faster top-k queries via Block-Max WAND optimization. Built in C on Postgres's storage layer using an LSM-tree architecture. For RAG builders already using pgvector, this eliminates the need for a separate search cluster entirely. One database for vectors and keyword search. That's a real simplification.

Ollama 0.19 adds MLX backend, 93% faster decode on Apple Silicon. Version 0.19 in preview uses Apple's ML framework as the backend, delivering 57% faster prefill and 93% faster decode compared to v0.18. On M5 chips: 1,851 tokens/s prefill and 134 tokens/s decode with int4 quantization. Currently supports Qwen3.5-35B-A3B with more models coming. Requires 32GB+ unified memory. If you run local models on a Mac, this is a significant jump.

Huawei Ascend 950PR debuts at 2.87x H20 performance. ByteDance and Alibaba are ordering. CNBC confirmed orders. The only Chinese AI chip with FP4 inference, delivering 1.56 PFLOPS at 600W TDP. Priced at ~$6,900 (DDR) to ~$9,600 (HBM), with 750K units planned for shipment this year. Whether these chips match their specs in production is an open question, but the pricing undercuts Nvidia significantly.

Tools & Developer Experience

Claude Code v2.1.89 ships defer permissions, PermissionDenied hooks, and non-blocking MCP. The April 1 release adds a "defer" permission decision to PreToolUse hooks. Headless sessions can now pause at tool calls and resume with --resume for human-in-the-loop approval. MCP_CONNECTION_NONBLOCKING=true skips MCP wait in -p mode. Fixes for Windows CRLF doubling, 50% StructuredOutput schema cache failures, and memory leaks from LRU cache keys. The defer mechanism is the most interesting addition. It turns headless mode from "fully autonomous or nothing" into "autonomous with human checkpoints." That's the pattern enterprise deployments actually need.

Simon Willison ships 8 coordinated releases in 48 hours across the llm/datasette ecosystem. llm 0.30 adds register_models hooks with cross-plugin awareness. datasette-llm 0.1a5 introduces purpose-based API key routing: enrichments use one model/key, extraction uses another. llm-all-models-async wraps sync models with thread pools. This is the foundation for Datasette as an LLM-native data platform. Willison's velocity is genuinely impressive, and the purpose-based key routing pattern is something more tools should adopt.

Elgato Stream Deck 7.4 adds MCP support. AI agents can now press physical buttons. The Verge reports this is the first consumer hardware product to ship native MCP agent integration. Agents can trigger OBS scenes, smart home devices, and macOS shortcuts through physical hardware. I didn't have "AI agent presses a physical button" on my 2026 bingo card, but here we are.

Models

LongCat-Next: Meituan open-sources native multimodal with 291 HuggingFace upvotes. The DiNA framework tokenizes text, vision, and audio into a shared discrete space under a single autoregressive objective. The key innovation is dNaViT, a visual transformer doing tokenization at arbitrary resolutions with hierarchical discrete tokens. 291 HF upvotes is the most-upvoted daily paper this cycle. Native multimodal (vs bolting vision onto a text model) keeps showing performance gains. The model and tokenizers are open-sourced.

GPT Realtime API reaches GA with native MCP + SIP phone calling. OpenAI's update lets voice agents access enterprise tools via MCP AND make phone calls via SIP, connecting directly to PBX systems and desk phones. The new gpt-realtime model improves tool calling precision and natural speech. MCP+SIP means voice agents can now be both smart (tool access) and connected (phone systems). The plumbing for autonomous voice agent workflows is now live.

Google ships Veo 3.1 Lite at $0.05/second as OpenAI shuts down Sora. Available now on Gemini API and AI Studio. $0.05/sec for 720p, $0.08/sec for 1080p, text-to-video and image-to-video. Strategic timing. OpenAI exits video generation, Google cuts prices. If you're building anything with generated video, the economics just got significantly better.

Vibe Coding

JSSE becomes the first JavaScript engine to pass 100% of test262, built entirely by Claude Code in YOLO mode. A developer built a complete JavaScript engine in Rust using Claude Code autonomously. All 98,426 test262 non-staging test scenarios pass. The developer didn't write a single line of Rust. Rust was chosen specifically because its strict type system serves as a "second feedback signal" alongside test262, letting the agent self-correct more effectively. This is the strongest proof-of-concept for agent-built complex systems I've seen. A full spec-compliant JS engine isn't a CRUD app. It's a serious piece of systems software.

Solo builder ships 516-panel financial terminal in 3 weeks. Neuberg covers fixed income, derivatives, commodities, equities, credit, macro, and alternative assets with 516 draggable panels, real-time data, and AI news sentiment analysis. 40 HN points with a 0.93 comment ratio. Whether it's a "vibe-coded Bloomberg terminal" or a polished prototype is debatable. Either way: one person, three weeks, 516 panels. That velocity number is real.

Local Qwen3.5-27B outperforming GPT-5.3 and Gemini 3.1 Pro for iterative coding. A practitioner with 270 upvotes on r/LocalLLaMA argues proprietary models optimize for autonomous end-to-end completion, which makes them worse at the iterative human-steered loop where you want the model to do exactly what you ask. A steerable local model may beat a smarter but opinionated cloud model. I haven't verified this myself, but the argument tracks with what I've seen: the best model depends entirely on your workflow pattern.

Hot Projects & OSS

OpenScreen hits 11.9K stars at +2,533/day as a free Screen Studio alternative. No subscriptions, no watermarks, commercial-use license. Auto and manual zoom, mic and system audio, customizable backgrounds, annotation tools. Available macOS 13+ and Linux. If you're paying $89 for Screen Studio, this is worth trying.

VideoLingo: Netflix-level AI video subtitle pipeline at 16.4K stars. One-click automated subtitle cutting, translation, alignment, and dubbing. Full pipeline from raw video to dubbed output in multiple languages. Content creators needing multilingual video localization without manual timing work should look at this.

Eigent: open-source "Cowork Desktop" for local agent orchestration at 13.4K stars. TypeScript-based desktop environment for running and coordinating AI agents without cloud dependency. Actively developed, pushed April 1. If you want multi-agent collaboration but prefer local-first infrastructure, this is the leading option.

SaaS Disruption

Outcome-based pricing is shipping everywhere. The transition is no longer theoretical. Chargebee published an AI agent pricing playbook documenting three production models: outcome-based (Intercom's Fin at $0.99/resolution), action-based (n8n per workflow execution), and hybrid (Lovable per-user + credits). IDC predicts pure seat-based pricing will be obsolete by 2028, with 70% of software vendors refactoring around new value metrics. Gartner projects 40% outcome-based by end of 2026. When the billing infrastructure vendors, the SaaS incumbents, and the analyst firms all converge on the same pricing shift in the same quarter, it's not a trend anymore.

Salesforce turns Slackbot into an enterprise AI agent with 30 new features. Marc Benioff unveiled reusable AI skills, MCP client integration connecting to Agentforce, meeting transcription with action items, and a desktop monitoring feature where the bot watches screen activity and drafts follow-ups. That last one is going to be controversial. The "AI watches your screen and takes notes" feature is the most aggressive enterprise agent deployment from a major platform I've seen. Whether teams adopt or reject the monitoring angle will say a lot about where the line is.

Policy & Governance

Quantum threat timeline compresses again. Google shows ECC breakable with 20x fewer qubits. Google Quantum AI published research showing elliptic curve cryptography could be broken with fewer than 500,000 physical qubits, a 20-fold reduction from the previous 9 million estimate. A separate Caltech/Oratomic paper found it could be done with as few as 10,000 qubits. Under idealized conditions, Google estimates a 41% probability of deriving a private key before a Bitcoin transaction confirms. Three papers in three months rewriting the threat timeline. Post-quantum cryptography migration is no longer a 2030 problem.

Red Hat leaked memo mandates "Agentic SDLC" for all engineering. The Register reports an internal memo signed by CTO Chris Wright and CPO Ashesh Badani requiring a mandatory shift to an "agent-first development model" measured by cycle time and defect rate. Observers noted the memo's repetitive language suggests it was itself AI-generated. This is one of the first major enterprise mandates explicitly requiring AI adoption across all engineering workflows, going beyond optional tooling to structural process change.

Dario Amodei flies to Canberra, signs AI safety MOU with Australia, warns of "panopticon" risk. Capital Brief reports Amodei warned that AI in the hands of sophisticated surveillance states could create "a panopticon." He stated "the technology is moving faster than we'd like it" while governments lack the tools to respond. This is Amodei's most explicit geopolitical risk framing aimed at allied democracies.

Skills of the Day

Audit your .npmignore if you build with Bun. Bun generates source maps by default in production builds, and bug #28001 means they get served despite the docs saying otherwise. Adding *.map to your .npmignore takes 5 seconds and prevents accidental source exposure. The Claude Code leak happened because of this exact gap.
Use cross-encoder reranking after BM25 keyword retrieval in your RAG pipeline. pg_textsearch brings BM25 directly into Postgres, eliminating Elasticsearch. Pair BM25 retrieval with a cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) for 18-42% precision improvement over vector-only or keyword-only approaches.
Implement clarification-seeking in your coding agents. The Ask or Assume paper shows a simple multi-agent scaffold, one agent detecting ambiguity, another executing, boosts resolve rates by 13%. Add an uncertainty classifier before your code generation step. Agents that ask when confused beat agents that always guess.
Set MCP_CONNECTION_NONBLOCKING=true in your headless Claude Code pipelines. Claude Code v2.1.89 lets you skip MCP server connection waits in -p mode. If your pipeline runs in CI/CD and doesn't need MCP tools for every run, this environment variable eliminates the connection timeout penalty.
Pin your LiteLLM version and rotate credentials if you ran 1.82.7 or 1.82.8. This isn't just a LiteLLM problem. Any PyPI dependency in your AI stack could be the next supply chain target. Use pip freeze > requirements.txt with exact versions, enable pip's --require-hashes flag, and audit which CI jobs have access to publishing tokens.
Try TurboQuant for local model deployment on consumer GPUs. The attn-rot PR nearing merge in llama.cpp applies Walsh-Hadamard rotation before KV cache quantization. Benchmarks show Qwen3.5-35B-A3B matching q8 quality at q4/q5 bit widths. If you're running 27B-class models locally, this can cut your VRAM requirements enough to fit on a 16GB card.
Build purpose-based API key routing into your LLM tool calls. Simon Willison's datasette-llm 0.1a5 shows the pattern: enrichments use one model/key, extraction uses another. Different tasks have different cost/quality requirements. Route expensive queries to capable models, cheap queries to fast ones. One API key per purpose, not one key for everything.
Use the defer permission hook for human-in-the-loop agent approval. Claude Code's new PreToolUse hook lets headless sessions pause and resume with --resume. Build approval workflows where the agent runs autonomously for safe operations but pauses for destructive ones (file deletes, git pushes, database writes). This is the pattern enterprise deployments need.
Test your coding agents against temporally-controlled benchmarks. Standard SWE-bench evaluations suffer from temporal contamination. Snapshot your test repos at a fixed date and only evaluate on PRs merged after that date. If your agent's performance drops significantly, you've been measuring data leakage, not capability.
Deploy sub-500MB agentic models for edge tool-calling use cases. Liquid AI's LFM2.5-350M reports 95%+ tool-calling accuracy under 500MB quantized. PrismML's Bonsai 1.7B fits in 0.24GB under Apache 2.0. If you have IoT, mobile, or edge deployments where cloud latency is a problem, these models are now viable for local agentic loops that don't need an internet connection.