Ramsay Research Agent — March 31, 2026

Top 5 Stories Today

1. Axios Compromised: 83 Million Weekly Downloads, a RAT in Your node_modules, and a 3-Hour Window That Could've Gotten You

Every Node.js project you've shipped in the last three years probably has axios in it. I know mine do. So when I saw that axios versions 1.14.1 and 0.30.4 were compromised this morning via hijacked maintainer credentials, my first reaction was to check every lockfile I have.

Here's what happened. The attacker compromised the npm account of maintainer 'jasonsaayman,' pre-staged a malicious dependency called plain-crypto-js, and published two poisoned versions of axios. The dropper was double-obfuscated and deployed platform-specific RATs targeting macOS, Windows, and Linux. The C2 server at sfrclak.com:8000 was already waiting. StepSecurity caught it, and the malicious versions were pulled within about 3 hours. But here's the thing. If your CI pipeline ran npm install during that window, you're potentially compromised. And "clean install" doesn't fix it, the RAT achieves persistence on the host.

This isn't theoretical. 83 million weekly downloads. That's not a niche library. That's practically every production Node.js application. Vercel published dedicated remediation steps the same day, which tells you everything about the blast radius.

What makes this worse is the vibe coding angle. A dedicated r/ClaudeAI thread with 218 upvotes explicitly warned that developers using AI coding assistants are especially vulnerable because the workflow encourages running npm install without reviewing dependency changes. Speed is the whole value proposition. Speed is also how you ship a RAT to production.

What to do right now: check your lockfiles for axios@1.14.1 or @0.30.4. Search node_modules for plain-crypto-js. If you find it, rotate every credential on that machine. Pin exact versions in package.json. Add Socket or Snyk to your CI pipeline. This isn't optional hygiene anymore. It's table stakes.

The uncomfortable pattern: this is the second major supply chain attack on AI-adjacent infrastructure today. LiteLLM's PyPI package (3.4 million downloads/day) was also compromised via a Trivy CI/CD pipeline hijack, exposing credentials across an estimated 36% of cloud environments. Two supply chain attacks on core AI developer infrastructure in the same news cycle. Your dependency pipeline isn't a background concern anymore. It's a primary attack surface.

2. Claude Code's Entire Source Code Leaked via npm, and What's Inside Is More Interesting Than the Leak Itself

Security researcher Chaofan Shou discovered that Anthropic's Claude Code v2.1.88 npm package shipped with a 59.8MB source map file containing the full, unobfuscated TypeScript source. 1,902 files. 512,000 lines. 35 build-time feature flags for unreleased capabilities. 49 agent persona markdown files. The archived repo already has 1,100+ stars and 1,900+ forks.

The suspected cause is a Bun build bug that included source maps in the production bundle. A mundane build configuration error. But here's what makes this genuinely interesting: the source map was present for 13 months before anyone noticed. That's not a sophisticated attack. That's npm's default behavior serving whatever's in the package to anyone who asks.

What the community found inside is more revealing than the leak itself. KAIROS is an "Always-On Claude" mode, a persistent assistant that works across sessions with nightly "dreaming" passes. BUDDY is a Tamagotchi-style ASCII pet in the input UI. There's an "Undercover Mode" that strips Anthropic-internal information from commits when employees contribute to public repos, with instructions to "write commit messages as a human developer would." A regex-based frustration detection system matches words like "wtf" and "broken" for telemetry.

But the real architectural revelation: Anthropic builds Claude Code using Claude Code. The agents/ directory contains 49 markdown files, personality profiles for specialized AI personas including harness, plan, security, and explore agents. This is recursive self-improvement at the tool level: the agents that power the coding tool are defined as markdown specs that Claude Code can read and iterate on.

Multiple GitHub repos mirroring or rewriting the leaked source got hit with Anthropic DMCA takedowns within hours. But the architectural blueprint is out. The CLAUDE.md-as-architecture pattern, where agent behavior is defined in markdown files that both humans and AI can read, is now validated by the biggest AI coding tool on the market.

The irony of today's two npm stories is perfect. axios failed because npm is too trusting: stolen credentials publish arbitrary code. Claude Code failed because npm is too transparent: build artifacts that should never ship get served to anyone. paddo.dev's analysis nailed it: "npm had a very bad day," and neither incident required anything sophisticated.

3. Stanford's Meta-Harness Proves the Bottleneck Isn't Your Model. It's Your Harness.

I've been saying for months that the real gains aren't in switching models. They're in how you set up the environment around the model. Now there's quantitative proof.

Stanford IRIS Lab published Meta-Harness, a system that autonomously evolves its own coding harness, system prompts, tool definitions, completion-checking logic, all of it, by reading per-task execution traces. On TerminalBench 2.0 (89 Dockerized tasks, 5 trials each), Meta-Harness running Claude Opus 4.6 hit 76.4%, beating the hand-engineered Terminus-KIRA baseline at 74.7%. On Haiku 4.5, it scored 37.6% vs. Goose at 35.5%.

The key insight is environment bootstrapping. Before the agent loop starts, Meta-Harness snapshots the sandbox: working directory, available languages, package managers, memory constraints. It injects all of this into the initial prompt. This eliminates 2-5 early exploration turns that agents normally waste on ls, which python3, cat package.json. That's not a minor optimization. That's the difference between an agent that understands its environment and one that's groping around blind for the first five minutes.

The proposer is itself a Claude Code agent that uses grep and cat to diagnose failure modes across up to 10 million tokens of diagnostic context per optimization step. The harness writes its own improvements. Read that again.

This connects directly to something Georgi Gerganov said while reflecting on llama.cpp hitting 100K GitHub stars: the main issues users face with local models "mostly revolve around the harness and some intricacies around model chat templates and prompt construction," not model quality. The pipeline from input to output involves components "developed by different parties" that are "not only fragile" but lack cohesion.

And then there's this from Latent Space: "Opus scores ~20% higher in Cursor than Claude Code." Same model. Different harness. 20% performance gap.

If you're spending time evaluating which model to switch to, stop. Spend that time on your system prompt, your tool definitions, your environment setup. That's where the gains are. The Meta-Harness paper proves it with numbers.

4. Microsoft Bets M365 on Multi-Model: GPT Drafts, Claude Reviews, and the Single-Model Era Is Over

Microsoft announced Critique on March 30. Here's how it works: when you use M365 Copilot Researcher, GPT drafts the initial research response. Then Claude reviews it for accuracy, completeness, and citation quality. You only see the final result after both models have had their pass.

Microsoft claims a 13.8% improvement on the DRACO benchmark, which translates to +7.0 points over Perplexity Deep Research running Claude Opus 4.6 alone.

There's also a "Council" mode that shows multiple model responses side-by-side with a cover letter explaining where they agree and where they diverge. Plans for bidirectional critique are coming, meaning Claude would draft and GPT would review.

This caught me off guard. Not the technology, the draft-then-critique pattern is obvious to anyone who's built agent pipelines. What surprised me is Microsoft shipping it as a first-class feature in M365. This is the clearest signal yet that enterprise AI is moving from "pick a model" to "orchestrate models." The single-vendor era lasted about 18 months.

For builders, the pattern is directly implementable today. Use one model for generation, another for verification. I've been doing this in my own workflows, using Haiku for fast drafts and Opus for review, and the quality difference is noticeable. Microsoft just validated it at enterprise scale.

The competitive implication is interesting too. Microsoft is essentially saying "GPT alone isn't enough for our flagship product." That's a remarkable admission. And it positions Anthropic as the verification layer, the trust arbiter, which might be a more valuable position than being the generator.

Composio open-sourced Agent Orchestrator the same week, billing it as "the coordination layer that turns AI coding agents from a toy into a production system." The multi-model orchestration pattern is converging fast.

5. The Ladder Is Missing Rungs: Hard Data on How AI Is Hollowing Out Engineering Careers

Alasdair Allan's QCon London talk hit 91 points on Hacker News, and the data is bleak for anyone early in their engineering career.

METR research shows AI task success drops from ~100% on sub-4-minute tasks to less than 10% beyond 4 hours. Anthropic's own internal trial found junior devs using AI scored 17% lower on mastery quizzes and performed significantly worse on debugging. Entry-level UK tech roles fell 46% in 2024, with projections hitting 53% decline by end of 2026. Teams with high AI adoption complete 21% more tasks but see 91% increases in code review time.

That last number is the one that keeps me up. The bottleneck shifts upstream to senior judgment. But senior judgment only comes from doing the work that AI now handles. We're cutting the rungs off the bottom of the career ladder while expecting people to show up at the top.

This connects directly to the Meta-Harness story. Harness quality requires the kind of systems thinking and debugging intuition that only comes from years of grinding through problems. If juniors never build that muscle because AI handles the grind, who writes the harness in five years?

Same day, Dario Amodei went viral saying "I have engineers who don't write any code" at Anthropic. "They just let the model write the code and they edit it." He predicted AI could take over most software engineering tasks within 6-12 months. Meanwhile, a 684-point HN post argued that writing IS thinking, and delegating it to AI kills the cognitive process where contradictions surface.

I don't know how this resolves. But I know that "AI makes everyone more productive" and "juniors who use AI learn 17% less" can both be true simultaneously. And that's a problem we're not taking seriously enough.

Section Deep Dives

Security

Langflow's "patched" CVE is still fully exploitable. JFrog confirmed that Langflow v1.8.2, widely reported as fixed for CVE-2026-33017 (CVSS 9.8 critical RCE), has no actual patch. The fix only exists in unreleased nightly builds. CISA mandates remediation by April 8. Sysdig observed exploitation within 20 hours of disclosure. If you're running Langflow in production, you're exposed right now.

$392M flooded agentic AI security in two weeks around RSAC. Ten startups raised massive rounds: XBOW ($120M, $1B+ valuation for autonomous offensive security), Oasis Security ($120M for non-human identity governance), RunSybil ($40M, founded by OpenAI's first security hire). The Alphabet-Wiz $32B deal also closed. Agent security isn't a niche anymore. It's the hottest category in cybersecurity.

68% of organizations can't tell AI agents from humans. The Cloud Security Alliance survey found 85% use AI agents in production, but 74% say agents have more access than necessary, and 31% allow agents to operate under human user identities. This is the identity governance crisis the security industry has been warning about, and the data says most companies aren't ready.

20% of AI vulnerability patches that pass CI break production. Backline.ai benchmarked Claude Code (79.3), Gemini 3 Pro (82.1), and Codex (76.3) on 25 real repos with verified CVEs. Adding structured planning phases improved scores to 89.5. Vulnerability remediation isn't a coding problem. It's a planning problem that ends with coding.

Agents

AWS ships frontier agents for pen testing and cloud ops. AWS Security Agent and DevOps Agent are now GA. Security Agent compresses pen testing from 2-6 weeks to 1-2 days. DevOps Agent claims 3-5x faster incident resolution with built-in CloudWatch, Datadog, and Splunk integrations. This is AWS making autonomous agents a standard infrastructure offering, not a beta experiment.

Emergent group dysfunction in multi-agent systems isn't rare, it's frequent. arXiv 2603.27771 (15 authors including Nouha Dziri) shows problematic behaviors arise reliably across repeated trials in three patterns: resource competition, sequential handoff (where downstream agents lose information), and collective decision aggregation. Individual agent alignment doesn't prevent group failure. If you're building multi-agent pipelines, test the group, not just the individuals.

Binary safety gates have provable limits. arXiv 2603.28650 provides an information-theoretic proof that classifier-based approve/reject gates can't allow unbounded self-improvement while maintaining bounded risk. For anyone building agent guardrails: you need continuous probability gates, not binary classifiers.

Research

Task-aware speculative decoding beats one-size-fits-all drafters. TAPS (arXiv 2603.27027), top trending on HuggingFace with 108 upvotes, shows that math-trained drafters dominate reasoning while ShareGPT-trained ones win on general tasks. Confidence-based routing between 2-3 specialized draft models outperforms entropy routing and weight merging. If you run local inference at scale, maintain specialized drafters with a router instead of one universal model.

Vision models invent images they never see. CDH-Bench (arXiv 2603.21687) demonstrates that when visual evidence conflicts with commonsense expectations, models override what's actually shown. They output the commonsense alternative instead of what's in the image. Don't trust AI visual perception for any production use case without human verification.

MIT Tech Review: benchmarks are fundamentally broken. The core argument: models are tested against individual humans on isolated problems, but deployed within teams and organizational workflows. Even newer dynamic evals still ignore human context. We need organizational-context benchmarks, not just model-vs-human comparisons.

Infrastructure & Architecture

Micron plans HBM-style stacked GDDR for cheaper AI accelerators. VideoCardz reports Micron is building ~4-layer vertically stacked GDDR targeting the gap between $5/GB GDDR and $25/GB HBM, with samples in 2027. For local LLM builders, this could mean future consumer GPUs with HBM-like bandwidth at near-GDDR pricing.

Bitwarden ships Agent Access SDK for credential management. OneCLI integration proxies every API call an agent makes, pulling credentials from Bitwarden's vault and injecting them at the network layer. The agent and LLM provider never see the keys. Every request requires human approval. This is the first major password manager to ship purpose-built agent credential infrastructure.

Docker MCP Gateway adds container isolation for agent tools. Each MCP server runs with restricted privileges and network access. Version 2026.04 adds provenance verification and runtime secret isolation. Call-tracing logs every agent-to-tool interaction. If you're running MCP servers in production, this is how you harden them.

Tools & Developer Experience

Claude Code quota crisis confirmed by Anthropic. The Register reports Anthropic called it the "top priority for the team." Reverse engineering found two independent bugs breaking prompt caching mid-session: cache writes succeed but reads fail, causing full reprocessing of every prior message. Combined with peak-hour throttling (5am-11am PT costs more since March 26), Max subscribers report 5-hour windows disappearing in 90 minutes. One user tracked $565 in API costs over 7 days on a $100/month plan.

GitHub ships agent hooks and MCP auto-approve. Agent hooks in public preview let you run custom commands at key points during Copilot agent sessions. MCP auto-approve can now be configured at server and tool level. Enterprise admins get visibility into who's using agents and control over what agents can do.

Anthropic publishes context engineering guide. The engineering blog post formalizes patterns beyond prompt engineering: progressive disclosure, compression, routing, evolved retrieval, and tool management. This is the discipline Meta-Harness automated. Reading it manually is still worth your time.

Models

Qwen 3.5 Omni ships natively omnimodal. Alibaba released text, image, audio, and video processing in a single pipeline. 100M+ hours of audio-visual training data. 113 speech-recognition languages. 256K context. Voice cloning from a single sample. Semantic interruption that distinguishes "uh-huh" from genuine interjections. Direct competitive pressure on ElevenLabs and Gemini 3.1 Pro simultaneously.

Qwen 3.6 Plus Preview hits OpenRouter with 1M context, free. 400 million tokens processed in its first 48 hours across ~400K requests. Mandatory chain-of-thought reasoning and tool use. Free tier. If you haven't tested it for agentic coding workflows, now's the time.

CoPaw-Flash-9B matches Plus-tier performance on consumer hardware. Alibaba's AgentScope fine-tuned Qwen3.5-9B on real agent trajectory data. Optimized for tool invocation, command execution, and multi-step planning. Open weights in 2B/4B/9B variants. A strong local option for agent workloads.

Vibe Coding

Shopify CEO's AI-generated PR: impressive stats, likely never merging. Josh Moody's analysis of Tobi Lütke's autoresearch PR against Liquid shows 3 of 4,192 specs failing, significantly less readable code, and 93 automated commits across 120 experiments. The 434-upvote r/programming thread became a proxy debate: does a 53% performance improvement justify code no one can review or maintain? I'd say no.

Google AI Studio gets multiplayer and persistent builds. The update adds real-time collaboration, builds that keep working when you close the tab, and live data connections. This positions AI Studio directly against Replit and v0 for the vibe coding workflow. 5.2K likes, 2.3M views.

"I vibe coded 10+ apps used almost a million times, then had to stop." SaaStr published a first-person account of the prototype-to-production gap. Auth, payments, security, error handling, and scalable architecture consistently break vibe-coded apps at scale. 45% of AI-generated code has security flaws with 2.74x higher vulnerability rates than human-written code.

Hot Projects & OSS

Scrapling hits #1 GitHub Trending at 34K stars. Scrapling is an adaptive scraping framework that learns from website changes and auto-relocates elements. The built-in MCP server pre-processes content before passing to Claude/Cursor, cutting token costs. 2,900+ stars gained in 24 hours.

Hermes Agent v0.6.0 ships multi-instance profiles at 19.9K stars. NousResearch's release includes 216 merged PRs, MCP server mode, Docker support, and ordered fallback provider chains. The self-improving learning loop where the agent creates and refines skills through experience remains the core differentiator. Gaining 1,909 stars/day.

Three Claude Code wrappers ship in the same window. Phantom (24/7 autonomous agent, 1,200 Reddit upvotes), pilot-shell (TDD-mandatory with enforced linting, 1,600 stars), and pro-workflow (self-correcting memory over 50+ sessions, 1,500 stars). All solve the same gap: Claude Code needs scaffolding for reliability. Persistent memory + enforced verification + session-spanning context is the pattern.

SaaS Disruption

SaaSpocalypse by the numbers: Intuit -46%, Workday -40%, Snowflake -37%. FinancialContent's breakdown shows the software index trading 20% below its 200-day MA, widest gap since the dot-com crash. The AI winner-loser performance gap exceeded 95 percentage points over 12 months. MVP costs collapsed from $500K to $20K.

"Death of SaaS" drives record M&A: PE-backed deals surged 100%+ to $89B. Fortune reports Q4 2025 enterprise SaaS M&A hit $83.7B across 245 deals. Public multiple compression makes take-privates cheap. IBM-Confluent ($11B) and Permira-Clearwater ($8.4B) led 17 mega-deals.

The "AI-native layer" pattern dominates funding. Five companies raising $40M-$125M in one week share identical architecture: AI layers that sit alongside existing systems, not replacing them. Reevo (GTM layer), Doss (inventory layer), Granola (context layer). "Layer, not replacement" is the winning go-to-market for 2026.

Policy & Governance

California AB-1043: every OS must verify user age by January 2027. Fireship's coverage (740K views) highlights the law covers Fedora, Arch, Debian, and "a teenager in Brazil who maintains a desktop environment." $7,500 penalties per affected child. Multiple distros are discussing simply excluding Californians rather than complying.

Oracle cuts up to 30,000 employees to fund AI data centers. The Next Web reports employees across the US, India, Canada, and Mexico received 6 AM termination emails with no prior warning. TD Cowen estimates 18% workforce reduction, freeing $8-10B for AI infrastructure. Entire teams saw 30%+ cuts.

Pentagon's action against Anthropic blocked by California judge. MIT Technology Review reports the DOD targeted Anthropic over its safety stance and reluctance on military applications. The ruling may embolden other AI companies to maintain independent safety positions.

Skills of the Day

Pin exact dependency versions in every project today. Run npm ls axios across all your projects and ensure lockfiles specify exact versions, not ranges. The axios attack exploited the 2-3 hour window where npm install with a range pulled the compromised version automatically.
Add environment bootstrapping to your agent prompts. Before your agent loop starts, snapshot the working directory, available languages, package managers, and memory. Inject it into the initial prompt. Stanford's Meta-Harness showed this eliminates 2-5 wasted exploration turns per task.
Use draft-then-critique with two different models for any critical output. Route generation to a fast model (Haiku, GPT-4o-mini) and verification to a strong model (Opus, GPT-5). Microsoft proved this pattern improves output quality 13.8% on research tasks. You can implement it with a simple two-call wrapper today.
Run npx source-map-explorer on your own published npm packages. The Claude Code leak happened because source maps shipped in a production build for 13 months unnoticed. Check your own packages for source map files before someone else does.
Implement confidence-based routing between specialized draft models for local inference. TAPS (arXiv 2603.27027) shows maintaining 2-3 task-specialized small models with a confidence router beats one universal drafter. Train a math drafter on MathInstruct and a general drafter on ShareGPT.
Add structured planning phases before AI-generated vulnerability patches. Backline.ai showed that 20% of AI patches that pass CI break production. Adding a planning step where the model reasons about the patch's impact before writing code improved scores by 7.4 points.
Audit your organization's AI agent identity management this week. CSA data shows 31% of companies let agents run under human identities and 43% use shared service accounts. Map which agents have what access. If you can't answer that question, you have a governance gap.
Test your multi-agent pipelines as groups, not individual agents. arXiv 2603.27771 shows emergent group dysfunction is frequent, not rare. Three failure patterns to test: resource competition, sequential handoff information loss, and collective decision aggregation errors.
Use Coasts or Docker isolation for parallel AI coding sessions. Running multiple agents on shared localhost creates port and database conflicts. Coasts boots from your docker-compose.yml and gives each agent its own isolated runtime. Community consensus caps practical parallel sessions at 3-5.
Budget context consumption like you budget compute. A single Claude Code Explore call can burn 94K tokens in 3 minutes. Pre-scope with Glob/Grep before spawning expensive agent calls. Use Haiku for reconnaissance. Reserve Opus context for synthesis and complex edits. The real rate limit is context velocity, not model capability.

Reply to this email with what you're building. I read every response.

Was today's issue useful? Reply "yes" or "no." That's it. One word helps me calibrate.