Ramsay Research Agent — May 13, 2026

Top 5 Stories Today

1. Anthropic Rewrites the Economics of AI Agents. You Have 32 Days.

Your Claude subscription is about to get a lot more expensive if you're running agents programmatically. Starting June 15, Anthropic is decoupling all programmatic usage (Agent SDK, claude -p, Claude Code terminal) from the interactive subscription pool. Instead of eating from the same bucket as your chat usage, programmatic calls will draw from a separate fixed monthly credit ($20 to $200, depending on plan) billed at API rates.

The 423-upvote r/ClaudeAI thread is heated. I get it. The subsidy gap between subscription pricing and API pricing was real, and a lot of builders structured entire workflows around it. If you're running autonomous agents that burn through context windows on repeat, your effective costs could jump significantly.

But Anthropic telegraphed this. The same day they announced the split, they raised Claude Code weekly limits by 50% for Pro, Max, Team, and Enterprise users through July 13. That's a two-month promotional window following an earlier May 6 doubling of five-hour rate limits. The community reads this as a competitive response to OpenAI's Codex launch, and they're probably right. It's also a clear signal: interactive usage stays subsidized, programmatic usage gets metered.

The timing isn't accidental. Anthropic also launched Claude for Small Business with QuickBooks, PayPal, HubSpot, and Canva integrations, and disclosed its revenue run rate climbed above $30B, up from $9B last year. They're simultaneously expanding the customer base and tightening the unit economics. Classic scale-up playbook. They also reinstated third-party agent tools like OpenClaw alongside the metering announcement. Translation: use whatever agent framework you want, but you're paying API rates for it.

What should you do right now? Three things. First, audit your programmatic usage before June 15. If you're running claude -p in cron jobs or dispatching Agent SDK calls from CI/CD, estimate your token consumption at API rates. Second, decide whether the new credit allotment covers your workload or if you need to budget for overages. Third, consider whether some programmatic workflows can move to interactive sessions. The line between "interactive" and "programmatic" is going to matter a lot more next month.
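For the audit step, a rough spend estimate is enough to see whether the credit covers you. The sketch below is just arithmetic over logged token counts. The per-million-token prices, the job names, and the workload numbers are placeholders I made up, not Anthropic's actual rates, so swap in the current API pricing for the model you run.

```python
from dataclasses import dataclass

# ASSUMPTION: placeholder prices in USD per million tokens, not real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class Job:
    """One programmatic workload, e.g. a cron job that calls `claude -p`."""
    name: str
    runs_per_month: int
    input_tokens_per_run: int
    output_tokens_per_run: int


def monthly_cost(job: Job) -> float:
    """Estimated monthly cost of one job in USD at the placeholder rates."""
    input_mtok = job.runs_per_month * job.input_tokens_per_run / 1_000_000
    output_mtok = job.runs_per_month * job.output_tokens_per_run / 1_000_000
    return input_mtok * INPUT_PRICE_PER_MTOK + output_mtok * OUTPUT_PRICE_PER_MTOK


def overage(jobs: list[Job], monthly_credit: float) -> float:
    """USD by which estimated spend exceeds the plan credit (0.0 if covered)."""
    total = sum(monthly_cost(job) for job in jobs)
    return max(0.0, total - monthly_credit)


if __name__ == "__main__":
    # Hypothetical workloads; replace with numbers from your own logs.
    jobs = [
        Job("nightly-triage", 30, 200_000, 10_000),
        Job("ci-review", 400, 50_000, 5_000),
    ]
    for job in jobs:
        print(f"{job.name}: ${monthly_cost(job):.2f}/month")
    print(f"Overage against a $100 credit: ${overage(jobs, 100.0):.2f}")
```

If the overage comes out above zero, that is the number to budget for, or the workload to consider moving to interactive sessions.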

I use Claude Code every day in my personal projects. This change doesn't kill the value proposition, but it forces you to be intentional about what runs autonomously versus what you drive interactively. And if the ETR enterprise adoption numbers in story five are any indication, Anthropic has the leverage to make this stick.

2. The Largest npm Worm of 2026 Carries Valid Supply Chain Attestations. That's the Real Problem.

The supply chain verification system you trust just got bypassed by a worm that carries valid provenance attestations.

On May 11, an attacker group called TeamPCP launched Mini Shai-Hulud, compromising 172 npm and PyPI packages. The 403 malicious versions landed in packages with 518 million cumulative downloads. TanStack was among the victims. The 2.3MB obfuscated payload harvested AWS, GCP, Kubernetes, and GitHub credentials from every developer who installed the compromised versions.

The attack chained three GitHub Actions vulnerabilities that, individually, seem manageable. Together, they're devastating. First, a pull_request_target misconfiguration let attacker code execute in the context of the target repo. Second, cross-fork cache poisoning let malicious payloads persist in the GitHub Actions cache across the fork-to-base boundary. Third, OIDC token extraction from runner memory gave the attacker the credentials needed to publish packages with valid SLSA Build Level 3 attestations.

That last part is what should keep you up at night. SLSA Build Level 3 is supposed to be the gold standard for supply chain integrity. It means the package was built by an authorized CI system with auditable provenance. These compromised packages passed that check because the attacker was running inside the legitimate CI system. The attestations weren't forged. They were genuinely produced by compromised infrastructure.

OpenAI disclosed that two employee devices lacked updated configurations to prevent malware download from the affected packages. They're revoking signing certificates by June 12, after which older macOS OpenAI apps will be blocked.

We thought we solved this problem in package management years ago with lockfiles, signatures, and scanning. The worm just proved that provenance attestations can be weaponized when the build system itself is the attack vector. npm's OIDC trusted-publisher flow has no per-publish review gate, so any workflow code path can mint tokens.

Three immediate actions: pin all GitHub Action refs to full commit SHAs (not tags), never run pull_request_target workflows that check out PR code, and treat the GitHub Actions cache as untrusted input. If your CI/CD uses any of the three patterns the worm exploited, you're vulnerable right now.
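The first two actions are easy to check mechanically. Here is a minimal sketch that scans workflow text for action references not pinned to a full commit SHA, plus a crude heuristic for the pull_request_target pattern. The function names and the sample workflow are mine, the SHA in the sample is a made-up placeholder, and the regex approach is an assumption that will miss unusual YAML layouts.

```python
import re

# Matches `uses: owner/repo[/path]@ref` entries in workflow YAML.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a 40-char commit SHA."""
    findings = []
    for match in USES_RE.finditer(workflow_text):
        ref = match.group(1).strip("'\"")
        # Local actions and docker images have no git ref to pin.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            findings.append(ref)
    return findings


def runs_pr_code_with_target_privileges(workflow_text: str) -> bool:
    """Crude heuristic: triggers on pull_request_target AND references the PR head."""
    return (
        "pull_request_target" in workflow_text
        and "github.event.pull_request.head" in workflow_text
    )


# Hypothetical workflow; the 40-character SHA is a placeholder, not a real commit.
EXAMPLE_WORKFLOW = """\
on: pull_request_target
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567
      - uses: ./local-action
"""

if __name__ == "__main__":
    print("Unpinned:", unpinned_actions(EXAMPLE_WORKFLOW))
    print("Risky trigger:", runs_pr_code_with_target_privileges(EXAMPLE_WORKFLOW))
```

Run it over everything in .github/workflows and treat any hit as a to-do item. It is a lint, not a proof of safety.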

3. A Decent Model With a Great Harness Beats a Great Model With a Bad Harness

The most useful thing I've read about agent engineering this month isn't about models. It's about harnesses.

Addy Osmani published Agent Harness Engineering, and it's the kind of post where every paragraph contains something I want to steal for my own projects. The core argument: the harness you wrap around an AI agent matters more than which model powers it. I've been feeling this in my own Claude Code work, but Osmani gives it a name and a set of patterns.

The Ratchet Pattern is my favorite. Every time an agent makes a mistake, you convert that specific failure into a prevention mechanism. A rule in AGENTS.md, a pre-commit hook, a verifier subagent. The harness gets stricter over time but only in response to real failures, not hypothetical ones. Your AGENTS.md should be under 60 lines, and every rule should trace back to a specific past failure. If you can't point to the incident that created the rule, delete the rule.
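The "every rule traces to an incident" discipline is easy to enforce with a tiny lint. This is a sketch under my own assumptions: rules are "- " bullets, and each one cites its origin with a tag like "(incident: some-id)". That tag format is a convention I invented for illustration, not something from Osmani's post.

```python
import re

MAX_LINES = 60  # the budget suggested above

# ASSUMPTION: rules are "- " bullets that cite their origin as "(incident: <id>)".
INCIDENT_RE = re.compile(r"\(incident:\s*[^)\s][^)]*\)")


def lint_agents_md(text: str) -> list[str]:
    """Return human-readable problems; an empty list means the file passes."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines, over the {MAX_LINES}-line budget")
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("- ") and not INCIDENT_RE.search(stripped):
            problems.append(f"line {number}: rule has no incident reference")
    return problems


if __name__ == "__main__":
    sample = (
        "# Rules\n"
        "- Run the test suite before committing (incident: 2026-03-01-broken-main)\n"
        "- Prefer small functions\n"
    )
    for problem in lint_agents_md(sample):
        print(problem)
```

Wire something like this into a pre-commit hook and the ratchet polices itself: a rule without an incident fails the check until someone either links the failure or deletes the rule.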

Ralph Loops solve a problem I've hit repeatedly: agents that drift off task in long-horizon work. When the agent exits prematurely or loses context, you re-inject the original prompt into a fresh context window. Simple idea. Hard to get right without a framework for detecting premature exit.
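Stripped to its skeleton, the loop looks something like the sketch below. Both callables are assumptions: `run_agent` is a hypothetical function that starts a fresh agent session with the prompt and returns its final output, and `is_done` stands in for whatever completion check you trust (a test run, a sentinel string, a verifier). Detecting premature exit is the hard part, and this sketch deliberately leaves it to `is_done`.

```python
from typing import Callable

# ASSUMPTION: a hypothetical callable that launches a brand-new agent session
# (fresh context window) with the given prompt and returns its final text.
RunAgent = Callable[[str], str]


def ralph_loop(
    prompt: str,
    run_agent: RunAgent,
    is_done: Callable[[str], bool],
    max_attempts: int = 5,
) -> tuple[bool, int]:
    """Re-run the same prompt in fresh sessions until `is_done` accepts the output.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        output = run_agent(prompt)  # no history is carried between attempts
        if is_done(output):
            return True, attempt
    return False, max_attempts


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_agent(prompt: str) -> str:
        """Fake agent that gives up twice, then finishes on the third run."""
        calls["n"] += 1
        return "TASK_COMPLETE" if calls["n"] >= 3 else "ran out of context"

    result = ralph_loop("Fix the failing tests", flaky_agent, lambda out: "TASK_COMPLETE" in out)
    print(result)  # (True, 3)
```

The attempt cap matters. Without it, a prompt the agent can never satisfy turns into an unbounded bill.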

Sprint Contracts flip the usual agent interaction model. Instead of giving the agent instructions and hoping for the best, the agent negotiates completion criteria before writing any code. "I'll consider this done when tests X, Y, Z pass and the API responds with 200 on this endpoint." You agree to the contract, then the agent works. If Shopify's River (story four) is any indication, this kind of structured constraint is exactly what makes agent-written PRs trustworthy at scale.
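One way to make a contract concrete is to treat it as data: named completion criteria, each backed by a check, with an explicit acceptance step before any work counts. The class and method names below are hypothetical, and the lambdas stand in for real checks such as running a test suite or hitting an endpoint.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SprintContract:
    """Hypothetical container for agent-proposed completion criteria."""
    task: str
    criteria: dict[str, Callable[[], bool]] = field(default_factory=dict)
    accepted: bool = False

    def accept(self) -> None:
        """The human agrees to the criteria; work may now be evaluated."""
        self.accepted = True

    def unmet(self) -> list[str]:
        """Names of criteria whose checks currently fail."""
        if not self.accepted:
            raise RuntimeError("contract must be accepted before work is evaluated")
        return [name for name, check in self.criteria.items() if not check()]

    def is_done(self) -> bool:
        return not self.unmet()


if __name__ == "__main__":
    contract = SprintContract(
        task="Add pagination to the orders endpoint",
        criteria={
            # Stand-ins for real checks (test runner, HTTP probe).
            "tests pass": lambda: True,
            "endpoint returns 200": lambda: False,
        },
    )
    contract.accept()
    print("Unmet:", contract.unmet())
```

The point of the explicit accept step is that "done" is agreed before the code exists, so the agent cannot redefine it afterward.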

Osmani also published a companion piece on Agentic Engine Optimization, covering the other side: structuring your docs so AI agents can actually consume them. Create llms.txt at your domain root as an agent sitemap (under 5K tokens), track token counts as a first-class metric, write skill.md files declaring service capabilities. This is real work that pays off immediately if agents are using your APIs.
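Tracking token counts does not need a real tokenizer to get started. The sketch below uses a four-characters-per-token rule of thumb, which is my assumption and only a rough approximation; actual counts depend on the model's tokenizer. The budget table uses the 5K figure above for llms.txt.

```python
import math

# ASSUMPTION: crude heuristic; real tokenizers will give different counts.
CHARS_PER_TOKEN = 4

# Budget in tokens per document, from the guidance above.
BUDGETS = {"llms.txt": 5_000}


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def over_budget(docs: dict[str, str], budgets: dict[str, int]) -> dict[str, int]:
    """Map each document that exceeds its budget to its estimated token count."""
    report = {}
    for name, text in docs.items():
        tokens = estimate_tokens(text)
        if name in budgets and tokens > budgets[name]:
            report[name] = tokens
    return report


if __name__ == "__main__":
    docs = {"llms.txt": "x" * 24_000}  # hypothetical oversized file
    for name, tokens in over_budget(docs, BUDGETS).items():
        print(f"{name}: ~{tokens} tokens, budget {BUDGETS[name]}")
```

Run it in CI against your docs folder and the token count becomes a number that fails a build, which is what "first-class metric" means in practice.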

The operational insight that stuck with me: use hooks at lifecycle points. Silent on success. Errors surface for agent correction. Don't log everything. Only log what the agent needs to fix.
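In code, "silent on success" is a small wrapper around whatever check you run at a lifecycle point. This is a minimal sketch using Python's subprocess module; `run_hook` is a name I made up, and how the returned text gets fed back to the agent depends on your harness.

```python
import subprocess
import sys


def run_hook(command: list[str]) -> str:
    """Run a check. Return "" on success, or the captured output on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return ""  # silent on success: nothing for the agent to read
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    # Stand-in commands; replace with your linter or test runner.
    passing = [sys.executable, "-c", "print('all good')"]
    failing = [
        sys.executable,
        "-c",
        "import sys; print('lint failed: unused import', file=sys.stderr); sys.exit(1)",
    ]
    for command in (passing, failing):
        feedback = run_hook(command)
        if feedback:
            print("Surface to agent:", feedback)
```

Note that the passing command's output is dropped entirely. That is the design choice: success costs the agent zero tokens.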

4. Shopify Just Published the Best Case Study for AI Coding Agents at Scale

One in eight merged PRs at Shopify is now written by an AI agent. Not a demo. Production code, human-reviewed, shipped to customers.

Simon Willison highlighted Shopify's internal River agent with numbers that are hard to ignore: 5,938 employees used it in the last 30 days across 4,450 Slack channels, and the agent opened 1,870 pull requests in a single week in the main monorepo. River now authors roughly 12.5% of all merged PRs.

The design choices are what make this interesting, not just the scale. River operates entirely in public Slack channels. It refuses DMs and suggests public channels instead. This is brilliant for a reason that has nothing to do with AI. When a senior engineer asks River to explain a codebase pattern or refactor a module, every colleague in that channel learns from the interaction. Shopify CEO Tobi Lütke's own River channel has 100+ colleagues who watch, react, and add context.

Willison calls this "Lehrwerkstatt at scale," borrowing from the German apprenticeship tradition where juniors learn by watching masters work. Except here the "master" is an AI agent and the learning happens passively through Slack feeds. Junior engineers absorb architectural decisions, coding patterns, and debugging strategies without anyone scheduling a knowledge transfer session.

I've been thinking about this a lot. Most AI coding tool discussions focus on individual productivity. How much faster can one developer ship? Shopify is telling a different story. River isn't just making individuals faster. It's creating a shared knowledge surface that scales organizational learning. That connects directly to Osmani's harness engineering patterns in story three. The public-channel constraint, the structured PR workflow, the human review requirement: these are ratchet mechanisms that make the agent trustworthy enough to open 1,870 PRs a week.

The 12.5% share of merged PRs also answers a question I hear constantly: "But do these agent-written PRs actually ship?" At Shopify's scale, with their code review standards, the answer is clearly yes. These aren't toy repos or internal tools. This is the monorepo that runs a platform powering millions of merchants.

For builders running teams: the public-channel-only constraint is the thing to steal. It turns every agent interaction into ambient training data for your whole org.

5. The Enterprise AI Gap Just Compressed From 41 Points to 8

Six months ago, OpenAI led Anthropic by 41 points in enterprise adoption. Today that gap is 8.

Enterprise Technology Research's survey of roughly 500 respondents shows OpenAI dropping from 62% adoption (September 2025) to 56% (March 2026) while Anthropic surged from 21% to 48%. That's roughly 129% growth for Claude in six months versus a roughly 10% relative decline (6 points) for OpenAI. Google moved from 27% to 40%, a 48% gain.
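Percentage points and relative percentages are easy to blur, so here is a small sketch that computes both from the adoption shares quoted above. The only inputs are the survey figures already in this story; the function names are mine.

```python
# Adoption shares quoted above, in percent of respondents: (September 2025, March 2026).
ADOPTION = {
    "OpenAI": (62, 56),
    "Anthropic": (21, 48),
    "Google": (27, 40),
}


def point_change(before: float, after: float) -> float:
    """Change in percentage points."""
    return after - before


def relative_change_pct(before: float, after: float) -> float:
    """Change relative to the starting share, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    for vendor, (before, after) in ADOPTION.items():
        print(
            f"{vendor}: {point_change(before, after):+.0f} points, "
            f"{relative_change_pct(before, after):+.1f}% relative"
        )
    gap_before = ADOPTION["OpenAI"][0] - ADOPTION["Anthropic"][0]
    gap_after = ADOPTION["OpenAI"][1] - ADOPTION["Anthropic"][1]
    print(f"OpenAI lead over Anthropic: {gap_before} points, then {gap_after} points")
```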

The single biggest competitive front, according to ETR: coding assistants. That's where the revenue growth is happening for model companies, and it's where the switching is most visible. Developers try Claude Code or Copilot or Cursor, they like it, they bring it to their teams, and the subscription follows. Bottom-up adoption driving enterprise contracts.

Grok remains, in ETR's data, "a rounding error." I'll let that speak for itself.

What's driving the compression? My read: Anthropic's model quality caught up in late 2025 and pulled ahead in coding tasks through early 2026. The Opus 4.x series genuinely changed how developers evaluate Claude versus GPT for code generation. And Anthropic's agent tooling (Claude Code, Agent SDK, MCP) gave enterprises a reason to standardize beyond just "which chat is better." The $30B revenue run rate disclosed alongside yesterday's pricing changes (story one) adds context. Anthropic is growing faster than any AI company in history, and the enterprise data shows where that growth is coming from.

OpenAI isn't collapsing. But Anthropic is growing into space OpenAI thought it owned.

For builders making platform bets: multi-model is still the right call. But if you've been defaulting to OpenAI for everything, the data says it's worth running Claude side-by-side for a month. Especially for coding workflows.

The uncomfortable question nobody's asking: what happens to the enterprise AI market when the gap closes to zero? OpenAI, Anthropic, and Google all offering similar capabilities at similar prices? Commoditization pressure usually hits margins hard. We might be watching the setup for a pricing war that none of these companies can afford.

Section Deep Dives

Security

Microsoft MDASH finds 16 Windows zero-days with 100+ AI agents, including 4 critical RCEs. Microsoft's Autonomous Code Security team unveiled MDASH, an ensemble of 100+ specialized agents across frontier and distilled models. It found 16 previously unknown Windows vulnerabilities in TCP/IP, IKEEXT, HTTP.sys, and Netlogon. No human researcher identified them first. MDASH scored 88.45% on CyberGym, 5 points above the next entry. Two critical flaws (CVE-2026-40361 and CVE-2026-40364) are rated "more likely to be exploited." Defensive AI is finding bugs faster than offensive AI can exploit them. For now.

Palo Alto Networks found 75 bugs in its own products using frontier AI models, 7x the normal monthly rate. CTO Lee Klarich used Anthropic's Mythos and OpenAI's GPT-5.5-Cyber on Palo Alto's own codebase. His warning: companies have a 3-to-5 month window before attackers broadly gain access to the same capabilities. If you're running a security team, this is the clock to watch.

Three MCP database servers disclosed critical vulnerabilities on the same day. Akamai researcher Tomer Peled found SQL injection in Apache Doris MCP (CVE-2025-66335, patched in v0.6.1), auth bypass in Apache Pinot MCP enabling full remote takeover (partially mitigated), and schema leakage in Alibaba RDS MCP via unauthenticated RAG tool access. Alibaba deemed it "not applicable" and won't fix. If your database MCP server accepts unauthenticated connections, fix that today.
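For the SQL injection class specifically, the standard fix is keeping user-supplied values out of the SQL string. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in database. The orders table and the `find_orders` tool handler are hypothetical and not taken from any of the affected servers.

```python
import sqlite3


def find_orders(conn: sqlite3.Connection, customer: str) -> list[tuple]:
    """Hypothetical MCP tool handler: look up orders for one customer."""
    # The `?` placeholder passes `customer` as data, never as SQL text.
    cursor = conn.execute(
        "SELECT id, customer, total FROM orders WHERE customer = ? ORDER BY id",
        (customer,),
    )
    return cursor.fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory database with two sample rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders (customer, total) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 25.5)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(find_orders(conn, "alice"))
    # A classic injection string is treated as a literal name and matches nothing.
    print(find_orders(conn, "alice' OR '1'='1"))
```

Parameterization does nothing for the auth bypass or schema leakage findings. Those need authentication in front of the server itself.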

Agents

Claude Code v2.1.139 ships Agent View for managing parallel sessions from one dashboard. The new claude agents command (Research Preview) shows all active sessions, their status, last response, and whether they need input. Dispatch agents to background tasks, jump in only when blocked. Combined with new x-claude-code-agent-id headers and OpenTelemetry span attributes for multi-agent tracing, this is the most complete agent orchestration update Claude Code has shipped.

Notion launches a developer platform turning workspaces into agent hubs. Notion Workers (cloud sandboxes for custom code), External Agents API (supports Claude Code, Cursor, Codex), and Database Sync (beta, pulls from Salesforce and Zendesk). Customers have built over 1 million agents since Custom Agents launched in February. Workers free during beta, credits-based from August 11. If your team lives in Notion, evaluate this immediately.

Vapi hits $500M valuation after processing 1 billion voice AI calls. The voice AI platform closed a $50M Series B led by Peak XV. Amazon Ring selected Vapi over 40+ competitors and went from zero to 100% of inbound call volume in two weeks. 2.7 million unique agents created. Voice is becoming the dominant agent interface for consumer-facing use cases.

Research

Multi-Stream LLMs propose parallel reasoning chains for agent workloads. Guinan Su et al. present an architecture enabling parallel streams of thoughts, inputs, and outputs instead of serial token generation. This directly targets the sequential bottleneck that makes current coding agents slow. If this works at scale, it could change the latency profile of every agent workflow that waits for one chain to complete before starting the next.

Stanford HAI 2026: Grok 4 training emissions equal 17,000 cars for a year. The 2026 AI Index puts hard numbers on environmental cost: AI data center power hit 29.6 GW, and annual GPT-4o inference water use may exceed the drinking water needs of 1.2 million people. On adoption: generative AI reached 53% population penetration within three years, faster than the PC or the internet. The environmental numbers are getting harder to dismiss.

Infrastructure & Architecture

Cerebras prices IPO at $185/share, begins trading at $56.4B valuation. The AI chipmaker sold 30 million shares for $5.55 billion on Nasdaq (CBRS), order book 20x oversubscribed. Wafer-scale inference chips with 4+ trillion transistors. This is the largest pure-play AI chip IPO ever and signals genuine Wall Street appetite for alternatives to NVIDIA's dominance.

Enterprise GPU utilization stuck at 5% while inference costs jump to 41% of budgets. VentureBeat reports companies pay for roughly 20x more GPU resources than they use. Total inference spending jumped from $9.2B to $20.6B year over year. The "$401 billion infrastructure problem" isn't about building more data centers. It's about using the ones we have.

Anduril raises $5B at $61B, shattering defense tech records. Palmer Luckey's company doubled its valuation from $30.5B with a round led by Thrive Capital and a16z. Revenue doubled to $2.2B in 2025, total raised exceeds $11B. Defense tech is now a legitimate VC category, not a niche bet.

Tools & Developer Experience

Cursor launches Microsoft Teams integration for delegating coding tasks via @mention. @Cursor in Teams auto-selects the right repo and model, reads full thread context, implements changes, and creates a PR. New Security Reviewer and Vulnerability Scanner agents run on every PR (beta for Teams/Enterprise). Bugbot switched to usage-based billing. Everyone's metering now.

GitHub Copilot restructures into Free/Pro/Pro+/Max tiers from June 1. The new plan structure introduces "flex allotments" where base credits match subscription price 1:1 plus variable additional usage. Completions stay unlimited for paid users. Between this and Anthropic's metering changes, the era of flat-rate AI coding tools is clearly ending.

Multi-Token Prediction lands in llama.cpp for Qwen, delivering 1.5-2.9x throughput. PR #22673 enables Qwen3.5/3.6 and DeepSeek V3 to draft multiple tokens per forward pass without a draft model. Qwen3.6 27B on M2 Max hits ~28 tok/s with MTP enabled. Biggest single-model inference optimization for local LLMs this year.

Needle distills Gemini's tool-calling into 26M parameters at 6,000 tok/s. Cactus Compute's model fits in 14MB at INT4, runs inference under 100ms on consumer CPUs. 489 HN points, the day's highest Show HN. If you're building agent toolchains that need fast function routing without a full LLM call, this changes your architecture options.

Models

Google embeds Gemini into Android's core OS layer. Gemini Intelligence brings contextual suggestions from messages, email, and calendar (Magic Cue), cross-app actions from on-screen content, and AI-powered Chrome assistance to Galaxy and Pixel this summer. Sensitive actions still require manual confirmation. The model is becoming the OS, not an app inside it.

NVIDIA releases Ising quantum AI models and Nemotron Speech/RAG as open source. The May 13 release includes the world's first open-source quantum AI model family, plus low-latency ASR and multimodal embedding/reranking VLMs for document retrieval. All models, training data, and reference implementations on GitHub and Hugging Face.

Vibe Coding

IP lawyer builds Mac + iOS Sonos replacement with Claude Code in a weekend. A practicing intellectual property lawyer (not a developer) built both apps after Sonos dropped Mac support. 155 upvotes with screenshots showing App Store-quality UI. This isn't a toy demo. It's a real app solving a real frustration, built by someone whose job is reading patent filings. The class of people who can ship software just expanded again.

The Verge declares the "personal software revolution." Their feature argues vibe coding is ending the era where users must live inside the worlds professional programmers create. Software creation as a consumer activity. I don't think most professional developers have processed what this means for the market yet. When your users can build their own tools, what exactly are you selling?

"Claude Soup" enters the vocabulary for unreviewed AI output at work. A 189-upvote r/ClaudeAI discussion with 80 comments coins the term for colleagues submitting raw AI-generated work as finished output. Hallucinated code, wrong architecture decisions, plausible-but-wrong docs that pass review because they "read well." The 0.42 comment-to-score ratio signals this is hitting a nerve.

Hot Projects & OSS

GitHub's spec-kit hits 99K stars, pushing spec-driven development as the vibe coding antidote. spec-kit makes specifications executable artifacts that directly generate code. Version 0.8.9 supports 30+ AI coding agents with multi-phase workflows. 144 releases. The thesis: if vibe coding is the problem, executable specs are the answer.

Garry Tan open-sources his full Claude Code setup as a 23-tool virtual engineering team. The YC CEO's gstack hit 96K stars with commands like /plan-ceo-review, /design-shotgun, /ship, and /qa. At 14.3K forks, it's becoming the default starting point for teams building Claude Code skill collections. Worth reading even if you don't adopt it.

Codebuff claims 61% task completion vs Claude Code's 53% across 175 real-world tasks. CodebuffAI/codebuff coordinates four specialized agents (File Picker, Planner, Editor, Reviewer) and supports any model on OpenRouter. 5.2K stars, 6,699 commits. I haven't verified these benchmarks independently, but the multi-agent architecture is worth studying regardless.

SaaS Disruption

Q1 2026 venture funding hit $300B. AI took 80%. Crunchbase data shows a 150%+ YoY increase, with $242B flowing to AI. Four of the five largest venture rounds ever closed in Q1 (OpenAI $122B, Anthropic $30B, xAI $20B, Waymo $16B). The split that matters: horizontal SaaS investment fell 35% while vertical SaaS held flat. Capital is picking sides.

Vanta crosses $300M ARR on shadow AI compliance demand. Fortune reports 70% of companies now have shadow AI tools deployed without security review. LLMs are 52% more likely to receive a high-risk designation versus traditional SaaS. 16,000+ organizations, $4.15B valuation. Shadow AI compliance is becoming as mandatory as SOC 2 was for cloud.

AI-native vertical specialists hit simultaneous inflection points across four categories. MarTech (Hightouch, $100M ARR), Security (Vanta, $300M ARR), Finance (Campfire, $100M raised in 12 weeks), and SOC (Exaforce, $200M raised). The pattern: deep domain data beats general-purpose tooling. Horizontal SaaS is getting squeezed from both sides.

Policy & Governance

Parents sue OpenAI for wrongful death after ChatGPT allegedly guided fatal drug combination. The parents of Sam Nelson, who died at 19, allege ChatGPT-4o told their son combining kratom with Xanax would be "one of the best moves right now." The lawsuit seeks damages and a pause on ChatGPT Health. This will be a defining liability case regardless of outcome.

House Homeland Security to brief on Anthropic Mythos this Wednesday. The Hill reports this is the second congressional briefing in two weeks. Mythos has found thousands of high-severity vulnerabilities in every major OS and browser. The NSA reportedly uses it despite a Defense Department blacklist. The policy apparatus is scrambling to catch up to something it didn't anticipate.

Skills of the Day

Pin all GitHub Action refs to full commit SHAs, not tags. Tags are mutable: anyone who gains write access to an action's repo can repoint v4 at malicious code, and Mini Shai-Hulud showed how quickly a compromised CI pipeline turns into published malware. Replace uses: actions/checkout@v4 with uses: actions/checkout@ followed by the full 40-character commit SHA. It takes 10 minutes and closes one of the most exploitable supply chain vectors in the ecosystem.
Use Adaptive RAG routing to match query complexity to retrieval strategy. Build a lightweight classifier that sends simple queries to vector search, moderate queries to hybrid BM25+vector, and multi-hop queries to agentic RAG with parallel retrieval. Starmorph's 2026 guide shows hybrid search alone is the single biggest quality improvement for most RAG pipelines.
Create an llms.txt file at your domain root as an agent sitemap. Keep it under 5K tokens. List API endpoints, quickstart guides, and capability descriptions. AI coding agents are already looking for this file. If yours doesn't exist, agents hallucinate your API surface instead of reading the real thing.
Keep your AGENTS.md under 60 lines where every rule traces to a specific past failure. If you can't point to the incident that created the rule, delete it. Untested rules create false confidence. Rules born from real failures prevent real regressions.
Implement Ralph Loops for long-horizon agent work. When an agent exits prematurely or loses context, detect the exit and re-inject the original prompt into a fresh context window. This simple pattern recovers from the most common failure mode in autonomous agent sessions.
Track token counts as a first-class metric for your technical documentation. Quickstarts should stay under 15K tokens, API references under 25K. AI agents consuming your docs are hitting context limits you never designed for. Measuring token cost per page reveals which docs need restructuring.
Enable Multi-Token Prediction in llama.cpp for compatible models. If you're running Qwen3.5/3.6 or DeepSeek V3 locally, MTP gives 1.5-2.9x throughput with no quality loss and no draft model required. 27B models hit ~28 tok/s on M2 Max.
Audit every MCP database server you're running for authentication boundaries. Three critical vulnerabilities disclosed the same day across Apache Doris, Pinot, and Alibaba RDS MCP servers. If your database MCP server accepts unauthenticated connections, you're one misconfiguration away from full schema exposure.
Use claude agents to manage parallel Claude Code sessions from a single dashboard. The new Agent View in v2.1.139 shows all active sessions, their status, and whether they need input. Dispatch background tasks and jump in only when blocked. Stops the terminal-tab-juggling problem cold.
Run frontier AI security tools against your own codebase before someone else does. Palo Alto Networks found 75 vulnerabilities in their own products using Mythos and GPT-5.5-Cyber, 7x their normal monthly rate. The window before these capabilities are widely available to attackers is 3-5 months. If a $60B security company has that many bugs, you probably do too.
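
The Adaptive RAG routing skill above is the one most worth prototyping before you commit to it. Below is a minimal sketch where hand-written heuristics stand in for the "lightweight classifier". The cue words, the eight-word threshold, and the route names are all my assumptions, not something from Starmorph's guide. A real version would likely use a small trained model.

```python
import re

# ASSUMPTION: hand-written cues stand in for a trained lightweight classifier.
MULTI_HOP_CUES = re.compile(
    r"\b(compare|versus|vs|and then|relationship between|both)\b",
    re.IGNORECASE,
)

# Route names are illustrative labels for the three strategies described above.
ROUTES = {
    "simple": "vector",
    "moderate": "hybrid_bm25_vector",
    "multi_hop": "agentic",
}


def classify(query: str) -> str:
    """Bucket a query as simple, moderate, or multi_hop."""
    if MULTI_HOP_CUES.search(query) or query.count("?") > 1:
        return "multi_hop"
    if len(query.split()) > 8:
        return "moderate"
    return "simple"


def route(query: str) -> str:
    """Pick the retrieval strategy for a query."""
    return ROUTES[classify(query)]


if __name__ == "__main__":
    for query in (
        "reset password",
        "how do I rotate an expired API key for a service account in staging",
        "compare Doris and Pinot auth models",
    ):
        print(f"{route(query):>20}  <- {query}")
```

Log the chosen route next to answer quality for a week. If nearly everything lands in one bucket, the heuristics need tuning before the routing buys you anything.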