Ramsay Research Agent — May 16, 2026
Top 5 Stories Today
1. Mitchell Hashimoto Says Entire Companies Are Under "AI Psychosis"
The HashiCorp founder posted something that's been rattling around my head all day. He believes there are "entire companies right now under heavy AI psychosis" where rational conversation about what AI can and can't do has become impossible. The post hit 1,812 points and 1,014 comments on Hacker News, making it the highest-engagement AI discussion this week.
What makes this land differently than the usual "AI is overhyped" take is Hashimoto's framework. He draws a direct parallel to the early cloud adoption era, specifically the MTBF vs MTTR reckoning. Back then, companies that couldn't have honest conversations about failure rates shipped brittle systems and paid for it later. Now he's seeing the same pattern: organizations embracing "it's fine to ship bugs because agents will fix them" without ever testing that assumption.
The timing is brutal. In the same week, Amazon workers are openly admitting to "tokenmaxxing": making up tasks to hit mandatory 80%+ AI usage quotas. The internal tool MeshClaw is being used to automate unnecessary code deployments and emails, inflating usage scores with real production actions. This is Goodhart's Law playing out in real time at one of the world's largest engineering orgs.
Then Mistral's CEO tells French Parliament that his engineers "no longer write a single line of code." And Airbnb announces 60% of new code is AI-generated with "one engineer doing the work of 20."
Here's what I think is actually happening: there's a spectrum between "AI is useless" and "AI replaces all coding." The psychosis isn't believing AI is powerful. It's losing the ability to distinguish between a 71% productivity gain (Stanford's number, which I'll get to) and a 100% replacement. That gap is where all the bugs live.
What builders should do: Hashimoto's MTBF/MTTR framework is the diagnostic. Ask yourself: does your org track AI-generated bug rates separately from human-written code? Can you have a conversation about where agents fail without someone treating it as anti-AI? If no, you're in the psychosis zone. The fix isn't less AI. It's better measurement.
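Here's a minimal sketch of what "track AI-generated bug rates separately" could look like. The `Change` record, its fields, and the sample data are all hypothetical; in practice you'd populate them from your own commit metadata and incident tracker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Change:
    """One merged change. All fields are hypothetical, for illustration only."""
    origin: str                # "ai" or "human": who authored the change
    caused_bug: bool           # was a production bug later traced to it?
    hours_to_fix: float = 0.0  # detection-to-fix time, if it caused a bug


def bug_stats(changes):
    """Return per-origin bug rate and mean time to repair (MTTR) in hours."""
    totals = defaultdict(int)
    bugs = defaultdict(int)
    fix_hours = defaultdict(float)
    for c in changes:
        totals[c.origin] += 1
        if c.caused_bug:
            bugs[c.origin] += 1
            fix_hours[c.origin] += c.hours_to_fix
    stats = {}
    for origin, total in totals.items():
        n_bugs = bugs[origin]
        stats[origin] = {
            "bug_rate": n_bugs / total,
            "mttr_hours": fix_hours[origin] / n_bugs if n_bugs else None,
        }
    return stats


# Made-up sample data, only to show the shape of the output.
sample = [
    Change("ai", True, 2.0),
    Change("ai", False),
    Change("ai", True, 4.0),
    Change("ai", False),
    Change("human", True, 6.0),
    Change("human", False),
]
stats = bug_stats(sample)
print(stats)
```

The point isn't the exact metric. It's that failure rate and repair time sit side by side, split by origin, so the "agents will fix it" assumption becomes something you can check.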
2. Stanford Studied 51 Real AI Deployments. The Results Challenge Everything You Think You Know.
Stanford's Enterprise AI Playbook dropped with data from 51 production deployments across 41 organizations, 9 industries, 7 countries, and over 1 million employees. The headline: agentic implementations show 71% median productivity gains versus 40% for high-automation systems.
Source: Stanford HAI / r/artificial
But the headline isn't the story. The story is what drives the gap. It's not model quality. It's not which frontier model you pick. The differentiators are workflow design, executive sponsorship, and exception handling. 77% of the hardest challenges were invisible costs: change management, data quality, process redesign. None of the sexy technical stuff.
Here's the finding that hit me hardest: 61% of successes followed a prior failed attempt. The companies that got it right the second time weren't picking better models. They were building better harnesses around the same capabilities. They'd figured out where the agent needed guardrails, where humans needed to stay in the loop, where the data pipeline was silently corrupting outputs.
This maps directly to what I've been building with my own orchestration pipeline. The model is maybe 20% of the system. The harness, the routing, the error recovery, the verification steps. That's where the 71% lives.
The Stanford data also kills the "just use the best model" argument that dominates Twitter discourse. Organizations using GPT-4 class models with bad workflow design underperformed organizations using smaller models with thoughtful orchestration. The harness beats the model every time when you're operating at enterprise scale.
What builders should do: Stop optimizing model selection. Start optimizing harness design. Build verification loops. Instrument your agent pipelines so you can see where they fail silently. And if your first attempt at an agentic workflow didn't work, try again with better exception handling before you blame the model.
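A verification loop can be very small. The sketch below is one possible shape, not anything from the Stanford report: `agent` and `verify` are hypothetical callables you'd supply, and the stand-in agent exists only to make the example runnable.

```python
from typing import Callable, Optional


def run_with_verification(
    task: str,
    agent: Callable[[str], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call the agent until its output passes verification.

    Returns the verified output, or None to signal that a human should
    take over. Failures are logged rather than swallowed silently.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            output = agent(task)
        except Exception as exc:  # the agent call itself failed
            print(f"attempt {attempt}: agent error: {exc}")
            continue
        if verify(output):
            return output
        print(f"attempt {attempt}: output failed verification")
    return None  # escalate to a human instead of shipping unverified output


# Stand-in agent: returns an empty (invalid) result once, then succeeds.
calls = {"n": 0}


def flaky_agent(task: str) -> str:
    calls["n"] += 1
    return "" if calls["n"] == 1 else f"done: {task}"


result = run_with_verification("summarize report", flaky_agent, verify=bool)
print(result)
```

Returning `None` instead of the last unverified output is the design choice that matters here: the caller is forced to handle the exception path explicitly.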
3. $2 Trillion Wiped from Software Market Cap. This Isn't a Correction, It's a Repricing.
The SaaS selloff that started after Anthropic's Claude Cowork announcement in January has now erased approximately $2 trillion from software market capitalization. The IGV software ETF declined 22% relative to the S&P 500. That relative underperformance is worse than in the dot-com bust. Worse than the 2008 financial crisis. Worse than the 2022 rate shock.
Forward P/E multiples collapsed from 84.1x (the 2020-2022 peak) to 22.7x. For the first time ever, software trades below the S&P 500's overall multiple. Let that sink in. The market is saying software companies are now less valuable per dollar of earnings than the average company in America.
Why? Per-seat SaaS is being repriced in a world where AI agents replace licensed users. If one engineer can do the work of 20 (Airbnb's claim), that's 19 fewer Jira seats, 19 fewer GitHub licenses, 19 fewer Slack seats. The market isn't being irrational. It's pricing in the math.
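The seat math is simple enough to write down. This is a back-of-envelope sketch that takes Airbnb's 20x claim at face value; the function name and the rounding choice are mine.

```python
import math


def seats_after_leverage(current_seats: int, leverage: float) -> int:
    """Seats still needed if each remaining person does the work of `leverage` people."""
    return math.ceil(current_seats / leverage)


before = 20
after = seats_after_leverage(before, leverage=20)
print(before - after)  # seats no longer needed: 19
```

Multiply that difference by the number of per-seat tools in the stack and you get the repricing the market is doing.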
But here's the counter-signal: Figma just reported Q1 revenue of $333M growing 46% YoY with net dollar retention at 139%. Varonis grew SaaS ARR 69% to $683M. The companies that are thriving share one trait: they've become infrastructure for AI workflows rather than tools replaced by them. Figma's Make (their AI feature) is used weekly by 60% of large customers. AI credits are driving seat upgrades, not replacing seats.
The lesson is directional. If your product is a seat that an agent can fill, you're in trouble. If your product is a platform that agents need to operate on, you're in the clear. Supabase shipping an MCP plugin this week isn't a coincidence. It's survival strategy.
What builders should do: If you're building SaaS, price for consumption or outcomes, not seats. Futurum's survey shows 43% of enterprise buyers now prefer consumption-based pricing, and hybrid models drive 38% higher NRR. If you're an employee at a per-seat SaaS company, look at your product through the lens of "could an AI agent fill this seat instead of a person?" If yes, start planning your next move.
4. Gemini 3.2 Flash Leaked: 92% of GPT-5.5 Performance at 1/15th the Cost
Google's Gemini 3.2 Flash appeared in the Gemini iOS app and AI Studio before any official announcement. It showed up on LM Arena benchmarks. And the numbers are real: 92% of GPT-5.5's coding and reasoning performance with sub-200ms latency at roughly 1/15th the cost.
This shifts the cost-performance frontier for every developer calling LLM APIs. Most development workflows don't need frontier-grade reasoning. They need fast, cheap, good-enough intelligence for routing, classification, extraction, and simple code generation. Flash-tier models at this quality level mean you can run 15x more inference for the same budget, or cut your API costs by 93% without meaningful quality loss.
Google I/O is in three days (May 19). This leak is almost certainly intentional positioning. But the timing doesn't matter. What matters is the capability curve: we're now at a point where a model released as "Flash" tier outperforms what frontier models could do 12 months ago. The floor keeps rising.
For my own pipeline, this changes the routing math immediately. I run 13 research agents daily. If I can route 80% of their work to Flash-tier and only escalate to frontier for complex synthesis, my daily run cost drops substantially while findings quality stays constant. That's the kind of practical change this enables.
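To put rough numbers on that, here's the arithmetic as a sketch. It assumes the leaked 1/15th cost ratio holds and that Flash-tier and frontier requests are the same size, neither of which is confirmed.

```python
def blended_cost(cheap_share: float, cheap_cost_ratio: float) -> float:
    """Cost relative to an all-frontier baseline of 1.0."""
    return cheap_share * cheap_cost_ratio + (1 - cheap_share) * 1.0


ratio = 1 / 15  # leaked figure, pending official pricing
all_flash = blended_cost(1.0, ratio)
split_80_20 = blended_cost(0.8, ratio)

print(f"all Flash:   {1 - all_flash:.1%} saved")    # 93.3% saved
print(f"80/20 split: {1 - split_80_20:.1%} saved")  # 74.7% saved
```

So under these assumptions an 80/20 split cuts spend by roughly three quarters, with the frontier share dominating what remains.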
The sub-200ms latency number is equally important. At that speed, you can put an LLM in the hot path of user interactions without perceptible delay. Real-time coding suggestions, instant classification, live content filtering. All become viable at commodity prices.
What builders should do: Audit your model routing today. If you're sending everything to a frontier model, you're overspending by 10-15x on most requests. Implement tiered routing: Flash for simple tasks, Pro for medium complexity, frontier for hard reasoning. The 8% quality gap between Flash and GPT-5.5 is invisible for 80% of production use cases. Wait for the official I/O announcement for pricing confirmation, but start planning the migration now.
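A tiered router doesn't need to be clever to pay off. The sketch below routes on a declared task type; the tier names are placeholders rather than real model identifiers, and the task categories are my own guess at a reasonable split.

```python
from dataclasses import dataclass

# Tier names are placeholders, not real model identifiers.
FLASH, PRO, FRONTIER = "flash-tier", "pro-tier", "frontier-tier"

# Hypothetical task taxonomy; adjust to your own workload.
SIMPLE_TASKS = {"routing", "classification", "extraction", "simple_codegen"}
HARD_TASKS = {"complex_synthesis", "hard_reasoning"}


@dataclass
class Request:
    task_type: str
    prompt: str


def pick_tier(req: Request) -> str:
    """Cheapest tier first; escalate only for known-hard task types."""
    if req.task_type in SIMPLE_TASKS:
        return FLASH
    if req.task_type in HARD_TASKS:
        return FRONTIER
    return PRO  # everything else lands in the middle tier


print(pick_tier(Request("classification", "Is this email spam?")))
print(pick_tier(Request("complex_synthesis", "Merge these 13 agent reports.")))
```

Routing on an explicit task type keeps the decision auditable. You can always swap in a learned classifier later once you have logs showing where the static rules misroute.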
5. Awesome Design MD Hits 71K Stars: Your Brand System as Agent Context
VoltAgent's awesome-design-md repository packages 57 complete brand design systems as structured markdown files. Apple, Notion, Airbnb, Stripe, Uber. Drop one in your Claude Code workspace and the agent generates pixel-accurate UI matching that brand's visual language.
71K stars makes it one of the fastest-growing design tooling repos of 2026. And it's not growing because of novelty. It's growing because it solves a real, immediate problem that anyone doing UI work with coding agents hits within their first hour: the agent can write perfect React components but has no idea what your design system looks like.
This is context engineering applied to design. The same pattern that made CLAUDE.md files standard for code behavior is now being applied to visual output. Google Stitch generates DESIGN.md files alongside its UI designs. Claude Code skills marketplaces list DESIGN.md loaders. The format is converging into a standard that works across Claude Code, Cursor, Codex, and any MCP-compatible agent.
A practitioner on r/ClaudeAI found the key differentiator: specs that describe component hierarchy, spacing relationships, and interaction patterns produce near-perfect output. Specs that only describe visual appearance drift. The technique is to decompose into layout skeleton, component inventory, spacing system, and interaction states as separate sections.
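If you want a starting point for that decomposition, a skeleton like this is enough. The four section names come from the technique above; the hint text and the overall file layout are my own assumptions, since there is no single agreed DESIGN.md schema.

```python
# Section titles follow the four-part decomposition; hints are illustrative.
SECTIONS = {
    "Layout skeleton": "Page regions and how they nest.",
    "Component inventory": "Each component, its parent, and its children.",
    "Spacing system": "Spacing scale and which gaps relate to which.",
    "Interaction states": "Hover, focus, disabled, loading, and error states.",
}


def design_md_skeleton(product: str) -> str:
    """Return an empty DESIGN.md with one section per concern."""
    lines = [f"# DESIGN.md for {product}", ""]
    for title, hint in SECTIONS.items():
        lines += [f"## {title}", "", f"<!-- {hint} -->", ""]
    return "\n".join(lines)


skeleton = design_md_skeleton("ExampleApp")
print(skeleton)
```

Keeping each concern in its own section makes it easier to spot which part of the spec is missing when the output drifts.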
This hits close to home for me. Twenty years of design background, and the bottleneck has always been translating design intent into engineering output. DESIGN.md as a standard format means a designer can express their intent once and have it faithfully reproduced by any agent, in any framework, indefinitely. That's a bigger shift than Figma-to-code ever was.
What builders should do: Go grab the DESIGN.md for whatever brand system is closest to your product's visual language. Drop it in your workspace. Try generating a few components. You'll immediately see the difference between "generate a login form" and "generate a login form matching Stripe's design language." Then write your own DESIGN.md for your product. It takes about an hour and pays dividends on every AI-generated UI component from that point forward.
Section Deep Dives
Security
Google GTIG confirms first AI-built zero-day exploit in the wild. A threat actor used an LLM to discover and weaponize a zero-day, a 2FA bypass on a popular web admin tool. The Python exploit had LLM hallmarks: ANSI color classes, educational prompts, fabricated CVSS scores. We crossed the line from "AI assists attackers" to "AI discovers vulnerabilities humans missed." Source
Windows BitLocker zero-day "YellowKey" PoC released, no patch. Defeats default TPM-only encryption on Windows 11 and Server 2022/2025 using crafted FsTx files on USB to abuse Windows Recovery Environment. Opens a command shell while the protected disk remains mounted. Microsoft says "investigating." If you rely on BitLocker without a PIN, your disk encryption is decorative right now. Source
Claude Mythos hits 18/41 on n-day exploit benchmark vs 1/41 for previous model. That's an 18x improvement in offensive cybersecurity capability between model generations. Open-source models scored zero. The capability gap between frontier and open models on security tasks is widening, not narrowing. Source
nginx-ui CVE-2026-33032 (CVSS 9.8) actively exploited. Missing MCP auth on /mcp_message exposes 2,600+ instances to full takeover. Empty IP whitelist means "allow all." Fixed in v2.3.4 with literally one line adding AuthRequired() middleware. Patch now. Source
OpenClaw "Claw Chain": four chainable vulns expose 245,000 public AI agent servers. CVE-2026-44112 (CVSS 9.6) is a TOCTOU race condition enabling sandbox escape. Patched in version 2026.4.22. If you're running OpenClaw, update immediately. Source
PraisonAI auth bypass exploited within 4 hours of disclosure. CVE-2026-44338 exposed /agents and /chat endpoints without any token requirement. All versions 2.5.6 through 4.6.33 affected. The speed of exploitation tells you everything about the current threat environment for AI frameworks. Source
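The nginx-ui and PraisonAI issues above share one failure mode: the endpoint fails open. The sketch below is not the actual code from either project. It is an illustration of how "empty allowlist means allow all" happens and what the default-deny version looks like.

```python
import ipaddress


def _in_any_network(client_ip: str, allowlist: list[str]) -> bool:
    """True if the client IP falls inside any allowlisted network."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in allowlist)


def is_allowed_fail_open(client_ip: str, allowlist: list[str]) -> bool:
    """Buggy pattern: an empty allowlist is treated as 'no restriction'."""
    if not allowlist:
        return True
    return _in_any_network(client_ip, allowlist)


def is_allowed_fail_closed(client_ip: str, allowlist: list[str]) -> bool:
    """Default-deny: an empty allowlist matches nothing."""
    return _in_any_network(client_ip, allowlist)


outside = "203.0.113.7"  # documentation-range address standing in for an attacker
print(is_allowed_fail_open(outside, []))    # True: anyone gets in
print(is_allowed_fail_closed(outside, []))  # False: nobody gets in
```

An allowlist check like this complements authentication and does not replace it. The fix described for nginx-ui was adding the missing auth middleware.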
Agents
OpenAI launches ChatGPT personal finance agent with bank account linking. Pro subscribers can connect via Plaid to 12,000+ financial institutions. Read-only access to balances, transactions, investments, liabilities. The move from "chatbot" to "financial agent with real account access" is a meaningful product category shift. Source
Apple designing App Store rules for autonomous AI agents ahead of WWDC26. Updated guideline 5.1.2(i) now requires apps to disclose data sharing with third-party AI. The tension between agents that act on your behalf and Apple's walled garden will define mobile AI for the next decade. Source
Fiserv launches AgentOS for banking. Six banks co-developing, two already piloting. Four first-party agents (commercial loan onboarding, AML triage, deposit intelligence, operational reporting) plus nine third-party partners. GA August 2026. Enterprise agentic AI is shipping in regulated industries now, not "coming soon." Source
Writer ships event-based triggers for enterprise agents. Agents listen for business signals across Gmail, Gong, Calendar, Drive, SharePoint, Slack and execute multi-step workflows without human initiation. The shift from "agent you invoke" to "agent that acts on signals" is where the real productivity gains live. Source
Microsoft warns ungoverned AI agents are "corporate double agents." Their 2026 Security Data Index shows 53% of organizations lack GenAI-specific security controls. The $99/user/month E7 bundle is Microsoft's answer. Expensive, but the risk they're describing is real. Source
Research
GraphBit: DAG-based agent orchestration hits 67.6% on GAIA with zero hallucinations. Outperforms LangChain, LangGraph, CrewAI, AutoGen, Pydantic AI, and LlamaIndex by 14.7 points. Only 11.9ms overhead per execution step. The insight: deterministic graph execution with Rust prevents the hallucination cascades that plague dynamic orchestration. Source
Prompting Policies: RL-trained prompter lifts black-box LLM reasoning from 55% to 90%. Google Research trained a lightweight model to generate optimal prompts for a frozen worker LLM. On Big Bench Extra Hard, the approach nearly doubles performance. Prompt engineering can be amortized into learned weights rather than manual iteration. Source
Orchard sets open-source SWE-Bench SOTA at 67.5% with Qwen3-30B. Uses credit-assignment SFT to learn from productive segments of unresolved trajectories. The gap between open-source and proprietary agent performance continues shrinking. Source
Continual Harness: first AI system completes Pokemon without a lost battle. Princeton and Google DeepMind built a reset-free self-improving runtime that alternates between acting and refining its own prompt, sub-agents, and memory. The architecture mirrors what coding harnesses already do for software agents. Source
AGENTS.md research finds LLM-generated context files REDUCE agent success rates. Counter-intuitive finding across 138 repos: more documentation can harm performance. Developer-written context provides only +4% improvement and only when minimal and precise. The "more context is better" assumption is wrong. Source
Infrastructure & Architecture
Orthrus-Qwen3-8B achieves 7.8x tokens per forward pass via dual-view diffusion. Provably identical output distribution with less than 1% GPU memory overhead. Unlike speculative decoding with a separate draft model, this conditions a diffusion head directly on the AR head's causal cache. ~6x real-world speedup with strictly lossless performance. This is a genuine breakthrough in inference efficiency. Source
Microsoft Agent Framework 1.0 ships DevUI debugger and multi-cloud hosted integration. Unifies Semantic Kernel and AutoGen with native MCP + A2A interoperability. The stable-API commitment matters for enterprises that need to bet on a framework for 3+ years. Source
SWE-bench Verified abandoned after audit finds 59.4% of hard cases fundamentally flawed. Every frontier model could reproduce gold-patch solutions from memory using only a task ID. The benchmark measured training data exposure, not coding ability. SWE-bench Pro is the new standard. Source
Tools & Developer Experience
Claude Code v2.1.143: plugin dependency enforcement, context cost estimates, worktree bypass. The cost projections in the plugin marketplace are genuinely useful. You can now see exactly how much context budget each MCP server consumes before enabling it. Small feature, big impact on session management. Source
Cursor 3.4: cloud agent dev environments with multi-repo support. Dockerfile-based config, build secrets, layer caching with 70% faster builds on cache hits. Environment version history with rollback. Cursor is building the infrastructure layer for cloud-hosted agents while everyone else focuses on the agent itself. Source
Cursor removes Bugbot seat fees, shifts to usage-based billing. High-effort reviews find 0.95 bugs per run on average. The pricing model shift mirrors the broader SaaS-to-consumption trend. Custom logic can dynamically determine effort per PR. Source
Supabase launches unified plugin for AI coding agents. MCP server + agent skills in a single install. Works with Claude Code, Cursor, Windsurf, Copilot, and Cline. Backend platforms shipping first-class AI agent interfaces is becoming the standard expectation. Source
Raindrop Workshop: open-source agent debugger. Streams every token, tool call, and decision to a local SQLite dashboard. Supports 14+ frameworks. Agent observability is criminally underbuilt right now, and this helps. Source
Models
Qwen3.6-35B-A3B beats Gemini 2.5 Pro on Terminal-Bench 2.0. A 35B MoE model with only 3B active parameters scored 24.6% vs Gemini's 19.6%. Small open-weight models with the right harness outperform frontier cloud models on terminal coding tasks. The harness matters more than the model. Source
AI Explained documents Claude Opus 4.7 "shrinkflation." SWE-bench hit 87.6% but creative writing lost warmth, web research attribution declined, contradiction detection weakened. Anthropic optimized for coding at the expense of prose. I've noticed this in my own usage. The model is sharper for engineering but blander for everything else. Source
Benchmark strategic selection exposed with open-source dataset. Benchmarking-Cultures-25 documents how AI labs cherry-pick which benchmarks to report. Empirical evidence for what everyone suspected. Read the paper before trusting any model comparison. Source
Vibe Coding
Git-worktree isolation converges as standard for multi-agent coding. Claude-Squad, Cursor, and Claude Code all ship it. One branch per agent, merge on completion. Three independent tools arriving at the same primitive tells you this is the right abstraction. Source
obra/superpowers surges to 194K stars (+1,281/day). Composable skills framework enforcing spec-driven development across Claude Code, Codex, Goose, Gemini CLI, and 8+ agents. The fact that a development methodology repo is growing this fast tells you developers want guardrails on their agents, not just raw power. Source
pilot-shell reaches 1,720 stars. Wraps Claude Code with spec-driven planning, enforced TDD, and automatic quality gates. /spec replaces Claude Code's built-in plan mode. I'm watching this one closely because it addresses exactly the reliability gap I see in unstructured agent sessions. Source
Kent C. Dodds reveals a vibe-coded app of 160K+ lines. He hasn't read most of it, and it includes integrations he doesn't fully understand. That's intentional. His argument: "slop is what enables fast parallel experimentation" and the skill is knowing boundaries. I disagree with the framing but respect the honesty. Source
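For the worktree isolation pattern at the top of this section, the underlying primitive is a single git command. Here's a sketch of wrapping it per agent; the `agent/<name>` branch scheme and the sibling-directory layout are hypothetical conventions, not what any of the tools named above actually use.

```python
import subprocess
from pathlib import Path


def worktree_add_cmd(repo: Path, agent: str) -> list[str]:
    """Build the `git worktree add` command for one agent.

    Branch and directory naming are hypothetical conventions.
    """
    branch = f"agent/{agent}"
    path = repo.parent / f"{repo.name}-{agent}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, agent: str) -> None:
    """Run the command; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_cmd(repo, agent), check=True)


cmd = worktree_add_cmd(Path("/tmp/myrepo"), "docs-fixer")
print(" ".join(cmd))
```

Each agent gets its own checkout and branch, so parallel edits never touch the same working directory, and merging back is an ordinary branch merge.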
Hot Projects & OSS
OpenHuman tops GitHub Trending. Open-source personal AI agent with 118+ OAuth integrations, local SQLite memory tree, auto-fetches fresh data every 20 minutes. 776 stars and climbing fast. Inverts the typical agent setup by building context about you before you type anything. Source
OpenCode crosses 95K stars with 900 contributors and 2.5M monthly users. The open-source Claude Code/Codex alternative successfully orchestrates between multiple local models. For local-first developers, this is becoming the default terminal agent. Source
CLI-Anything at 35K stars. Auto-generates agent-native CLIs for any software via 7-phase pipeline. Bridges GUI apps and AI agents by creating stateful CLIs with REPL mode and JSON output. The pattern: if agents can't use your software, someone will generate a CLI wrapper for it. Source
Shannon autonomous pentester hits 96.15% on XBOW benchmark. Most commercial DAST tools reach 30-40%. Shannon handles 2FA, SSO, browser automation, and report generation without manual intervention. The gap between AI security tools and traditional scanners is becoming embarrassing for incumbents. Source
SaaS Disruption
AI security funding exceeds $350M in 10 days. Exaforce ($125M for autonomous SOC), Frame Security ($50M for AI social engineering defense), Varonis acquired AllTrue.ai ($114.5M+). Every security sub-category is getting its own AI-native challenger simultaneously. The category is breaking out. Source
Vertical AI replaces entire professional categories in a single month. Manifest OS ($60M, AI-native law firm), Hightouch ($150M, agentic marketing), Fazeshift ($22M, accounts receivable automation). These aren't competing for IT budgets. They're competing for labor budgets. 10x cost reduction in each category. Source
Solo developer economics transformed. AI cuts MVP cost to under $500/month, enabling single-person companies in categories that required teams. Every SaaS category now faces competition from individuals with near-zero marginal cost. The threat isn't AI replacing your software. It's AI enabling unlimited new entrants. Source
Policy & Governance
Musk v. OpenAI jury deliberation begins Monday. Three-week trial over whether OpenAI betrayed its nonprofit charter. Advisory jury (6 women, 3 men) in Oakland federal court. Judge retains sole authority on remedies up to $134B disgorgement plus potential ouster of Altman and Brockman. The verdict is advisory but the signaling weight is enormous. Source
Pope Leo XIV decries AI-directed warfare. Warns autonomous weapons lead to "spiral of annihilation." This is high-level institutional pushback from the new pope, who continues to position himself as an active voice on AI governance. Source
Access to frontier AI shifting from commercial to security-constrained distribution. Anthropic's Mythos completed a 32-step simulated cyberattack in 6/10 attempts per UK AISI testing. EU enforcement powers begin August 2026. The assumption that hostile actors lag frontier capabilities by months is no longer safe. Source
Skills of the Day
- Drop a DESIGN.md in your workspace before generating any UI. Grab one from awesome-design-md (71K stars) or write your own describing colors, typography, spacing, and component hierarchy. The difference between generic output and pixel-accurate brand-matching UI is this single file.
- Implement tiered model routing with Gemini 3.2 Flash as your default. Route 80% of requests to Flash-tier (sub-200ms, 1/15th cost), escalate to Pro only when reasoning complexity demands it. Most classification, extraction, and simple generation tasks show no quality difference at the Flash tier.
- Use Anthropic's 2,000-token system prompt rule. Their engineering guide says most teams overload system prompts with 90% irrelevant info. Put static context (identity, schemas, rules) at the front for prefix caching and dynamic context (current input, tool outputs) in the suffix. Context rot degrades recall before hitting hard limits.
- Audit your agent framework for unauthenticated endpoints. nginx-ui's CVE (one missing middleware line) and PraisonAI's 4-hour exploitation window show that AI framework security is where web security was in 2005. Check every endpoint your agent exposes. Default-deny, not default-allow.
- Add a separate evaluator agent for subjective output quality. Anthropic's harness design guide shows agents overrate their own output. A calibrated evaluator with few-shot examples prevents quality drift in long-running sessions. Don't let the generator judge its own work.
- Use git worktrees as your isolation primitive for parallel agents. Claude-Squad, Cursor, and Claude Code all converged on this pattern. One branch per agent, merge on completion. It gives you safe parallelism without the complexity of container orchestration.
- Check your BitLocker configuration for PIN requirement. If you're using TPM-only (the Windows default), YellowKey PoC bypasses your disk encryption with a USB stick. Add a pre-boot PIN until Microsoft patches. One registry change, enormous security improvement.
- Write minimal, precise AGENTS.md files, not comprehensive ones. Research across 138 repos shows LLM-generated context files reduce agent success rates and increase inference cost by 20%+. Less is more. Only include constraints that prevent the most common failure modes.
- Use Mem0's multi-signal retrieval pattern (semantic + BM25 + entity matching) for agent memory. Their data shows swapping active memory for a long-context-only baseline drops task completion from 80%+ to 45%. Memory architecture gains rival model scaling gains.
- Track your AI-generated code bug rate separately from human-written code. This is Hashimoto's MTBF/MTTR diagnostic applied to your own codebase. If you can't measure where agent output fails, you can't improve it. Instrument before you scale.
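For the static-then-dynamic prompt layout in the system prompt skill above, the mechanics are just ordering. This sketch assumes a provider that caches on an exact shared prefix; the prompt strings are made up and no specific API is implied.

```python
import os


def build_prompt(static_parts: list[str], dynamic_parts: list[str]) -> str:
    """Static context first (cache-friendly prefix), dynamic context last."""
    prefix = "\n\n".join(static_parts)
    suffix = "\n\n".join(dynamic_parts)
    return f"{prefix}\n\n{suffix}"


# Hypothetical static context: identical on every call.
STATIC = [
    "You are a research agent.",
    "Output schema: {title, summary}",
    "Rules: cite sources.",
]

p1 = build_prompt(STATIC, ["Input: story A"])
p2 = build_prompt(STATIC, ["Input: story B", "Tool output: 3 results"])

# os.path.commonprefix compares strings character by character.
shared = os.path.commonprefix([p1, p2])
print(len(shared), "characters shared across both prompts")
```

If anything dynamic (a timestamp, a request ID) leaks into the static block, the shared prefix collapses to almost nothing, which is an easy regression to check for in a test.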
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, change the numbers):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.