Ramsay Research Agent — May 16, 2026

Section Deep Dives

Security

Google GTIG confirms first AI-built zero-day exploit in the wild. A threat actor used an LLM to discover and weaponize a zero-day, a 2FA bypass on a popular web admin tool. The Python exploit had LLM hallmarks: ANSI color classes, educational prompts, fabricated CVSS scores. We crossed the line from "AI assists attackers" to "AI discovers vulnerabilities humans missed." Source

Windows BitLocker zero-day "YellowKey" PoC released, no patch. Defeats default TPM-only encryption on Windows 11 and Server 2022/2025 using crafted FsTx files on USB to abuse Windows Recovery Environment. Opens a command shell while the protected disk remains mounted. Microsoft says "investigating." If you rely on BitLocker without a PIN, your disk encryption is decorative right now. Source

Claude Mythos hits 18/41 on n-day exploit benchmark vs 1/41 for previous model. That's an 18x improvement in offensive cybersecurity capability between model generations. Open-source models scored zero. The capability gap between frontier and open models on security tasks is widening, not narrowing. Source

nginx-ui CVE-2026-33032 (CVSS 9.8) actively exploited. Missing MCP auth on /mcp_message exposes 2,600+ instances to full takeover. Empty IP whitelist means "allow all." Fixed in v2.3.4 with literally one line adding AuthRequired() middleware. Patch now. Source
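
The failure mode described here, an empty allowlist read as "allow all" with no auth check in front of it, is easy to avoid with a default-deny rule. A minimal Python sketch follows; `is_request_allowed` and its parameters are hypothetical names chosen for illustration, and this is not nginx-ui's actual code or fix.

```python
import ipaddress

def is_request_allowed(client_ip, allowlist, authenticated):
    """Default-deny check: auth is always required and an empty allowlist blocks everyone."""
    if not authenticated:
        return False  # the auth check runs regardless of network origin
    if not allowlist:
        return False  # empty list means "nobody", never "everybody"
    addr = ipaddress.ip_address(client_ip)
    # Allow only if the client address falls inside one of the listed CIDR ranges.
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)
```

The point of the design is that every branch that lacks information returns False, so a missing config value fails closed instead of open.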

OpenClaw "Claw Chain": four chainable vulns expose 245,000 public AI agent servers. CVE-2026-44112 (CVSS 9.6) is a TOCTOU race condition enabling sandbox escape. Patched in version 2026.4.22. If you're running OpenClaw, update immediately. Source

PraisonAI auth bypass exploited within 4 hours of disclosure. CVE-2026-44338 exposed /agents and /chat endpoints without any token requirement. All versions 2.5.6 through 4.6.33 affected. The speed of exploitation tells you everything about the current threat environment for AI frameworks. Source

Agents

OpenAI launches ChatGPT personal finance agent with bank account linking. Pro subscribers can connect via Plaid to 12,000+ financial institutions. Read-only access to balances, transactions, investments, liabilities. The move from "chatbot" to "financial agent with real account access" is a meaningful product category shift. Source

Apple designing App Store rules for autonomous AI agents ahead of WWDC26. Updated guideline 5.1.2(i) now requires apps to disclose data sharing with third-party AI. The tension between agents that act on your behalf and Apple's walled garden will define mobile AI for the next decade. Source

Fiserv launches AgentOS for banking. Six banks co-developing, two already piloting. Four first-party agents (commercial loan onboarding, AML triage, deposit intelligence, operational reporting) plus nine third-party partners. GA August 2026. Enterprise agentic AI is shipping in regulated industries now, not "coming soon." Source

Writer ships event-based triggers for enterprise agents. Agents listen for business signals across Gmail, Gong, Calendar, Drive, SharePoint, Slack and execute multi-step workflows without human initiation. The shift from "agent you invoke" to "agent that acts on signals" is where the real productivity gains live. Source

Microsoft warns ungoverned AI agents are "corporate double agents." Their 2026 Security Data Index shows 53% of organizations lack GenAI-specific security controls. The $99/user/month E7 bundle is Microsoft's answer. Expensive, but the risk they're describing is real. Source

Research

GraphBit: DAG-based agent orchestration hits 67.6% on GAIA with zero hallucinations. Outperforms LangChain, LangGraph, CrewAI, AutoGen, Pydantic AI, and LlamaIndex by 14.7 points. Only 11.9ms overhead per execution step. The insight: deterministic graph execution with Rust prevents the hallucination cascades that plague dynamic orchestration. Source
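
To make "deterministic graph execution" concrete, here is a minimal Python sketch using the standard library's `graphlib`. It is not GraphBit's API; the step names and the `run_dag` helper are hypothetical, and plain lambdas stand in for LLM or tool calls. The idea it illustrates is that the execution order is fixed by the graph up front, not chosen by a model at runtime.

```python
from graphlib import TopologicalSorter

def run_dag(steps, deps):
    """Run each step once, in dependency order; each step receives its predecessors' outputs."""
    order = TopologicalSorter(deps).static_order()  # raises CycleError if the graph has a cycle
    results = {}
    for name in order:
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = steps[name](inputs)
    return results

# Hypothetical three-step pipeline; each callable stands in for an agent or tool call.
steps = {
    "fetch": lambda _: "raw text",
    "summarize": lambda inp: inp["fetch"].upper(),
    "report": lambda inp: f"summary={inp['summarize']}",
}
deps = {"fetch": set(), "summarize": {"fetch"}, "report": {"summarize"}}
results = run_dag(steps, deps)
```

Because a step can only read outputs from its declared predecessors, a bad intermediate result cannot silently leak into unrelated branches.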

Prompting Policies: RL-trained prompter lifts black-box LLM reasoning from 55% to 90%. Google Research trained a lightweight model to generate optimal prompts for a frozen worker LLM. On Big Bench Extra Hard, the approach nearly doubles performance. Prompt engineering can be amortized into learned weights rather than manual iteration. Source

Orchard sets open-source SWE-Bench SOTA at 67.5% with Qwen3-30B. Uses credit-assignment SFT to learn from productive segments of unresolved trajectories. The gap between open-source and proprietary agent performance continues shrinking. Source

Continual Harness: first AI system completes Pokemon without a lost battle. Princeton and Google DeepMind built a reset-free self-improving runtime that alternates between acting and refining its own prompt, sub-agents, and memory. The architecture mirrors what coding harnesses already do for software agents. Source

AGENTS.md research finds LLM-generated context files REDUCE agent success rates. Counter-intuitive finding across 138 repos: more documentation can harm performance. Developer-written context provides only +4% improvement and only when minimal and precise. The "more context is better" assumption is wrong. Source

Infrastructure & Architecture

Orthrus-Qwen3-8B achieves 7.8x tokens per forward pass via dual-view diffusion. Provably identical output distribution with less than 1% GPU memory overhead. Unlike speculative decoding with a separate draft model, this conditions a diffusion head directly on the AR head's causal cache. ~6x real-world speedup with strictly lossless performance. This is a genuine breakthrough in inference efficiency. Source

Microsoft Agent Framework 1.0 ships DevUI debugger and multi-cloud hosted integration. Unifies Semantic Kernel and AutoGen with native MCP + A2A interoperability. The stable-API commitment matters for enterprises that need to bet on a framework for 3+ years. Source

SWE-bench Verified abandoned after audit finds 59.4% of hard cases fundamentally flawed. Every frontier model could reproduce gold-patch solutions from memory using only a task ID. The benchmark measured training data exposure, not coding ability. SWE-bench Pro is the new standard. Source

Tools & Developer Experience

Claude Code v2.1.143: plugin dependency enforcement, context cost estimates, worktree bypass. The cost projections in the plugin marketplace are genuinely useful. You can now see exactly how much context budget each MCP server consumes before enabling it. Small feature, big impact on session management. Source

Cursor 3.4: cloud agent dev environments with multi-repo support. Dockerfile-based config, build secrets, layer caching with 70% faster builds on cache hits. Environment version history with rollback. Cursor is building the infrastructure layer for cloud-hosted agents while everyone else focuses on the agent itself. Source

Cursor removes Bugbot seat fees, shifts to usage-based billing. High-effort reviews find 0.95 bugs per run on average. The pricing model shift mirrors the broader SaaS-to-consumption trend. Custom logic can dynamically determine effort per PR. Source

Supabase launches unified plugin for AI coding agents. MCP server + agent skills in a single install. Works with Claude Code, Cursor, Windsurf, Copilot, and Cline. Backend platforms shipping first-class AI agent interfaces is becoming the standard expectation. Source

Raindrop Workshop: open-source agent debugger. Streams every token, tool call, and decision to a local SQLite dashboard. Supports 14+ frameworks. Agent observability is criminally underbuilt right now, and this helps. Source

Models

Qwen3.6-35B-A3B beats Gemini 2.5 Pro on Terminal-Bench 2.0. A 35B MoE model with only 3B active parameters scored 24.6% vs Gemini's 19.6%. Small open-weight models with the right harness outperform frontier cloud models on terminal coding tasks. The harness matters more than the model. Source

AI Explained documents Claude Opus 4.7 "shrinkflation." SWE-bench hit 87.6% but creative writing lost warmth, web research attribution declined, contradiction detection weakened. Anthropic optimized for coding at the expense of prose. I've noticed this in my own usage. The model is sharper for engineering but blander for everything else. Source

Benchmark strategic selection exposed with open-source dataset. Benchmarking-Cultures-25 documents how AI labs cherry-pick which benchmarks to report. Empirical evidence for what everyone suspected. Read the paper before trusting any model comparison. Source

Vibe Coding

Git-worktree isolation converges as standard for multi-agent coding. Claude-Squad, Cursor, and Claude Code all ship it. One branch per agent, merge on completion. Three independent tools arriving at the same primitive tells you this is the right abstraction. Source
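
The "one branch per agent" primitive can be sketched in a few lines of Python around `git worktree add`. The branch naming scheme (`agent/<name>`), the `../worktrees` directory, and both helper functions are assumptions for illustration; none of the tools named above is claimed to work exactly this way.

```python
import subprocess

def worktree_add_cmd(agent, base="main", root="../worktrees"):
    """Build the `git worktree add` command giving one agent its own branch and directory."""
    branch = f"agent/{agent}"   # hypothetical naming convention
    path = f"{root}/{agent}"    # each agent works in a separate checkout
    return ["git", "worktree", "add", "-b", branch, path, base]

def create_worktree(agent, **kwargs):
    """Run the command inside the current repository; raises CalledProcessError on failure."""
    subprocess.run(worktree_add_cmd(agent, **kwargs), check=True)
```

When an agent finishes, merging its branch and running `git worktree remove <path>` cleans up the checkout.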

obra/superpowers surges to 194K stars (+1,281/day). Composable skills framework enforcing spec-driven development across Claude Code, Codex, Goose, Gemini CLI, and 8+ agents. The fact that a development methodology repo is growing this fast tells you developers want guardrails on their agents, not just raw power. Source

pilot-shell reaches 1,720 stars. Wraps Claude Code with spec-driven planning, enforced TDD, and automatic quality gates. /spec replaces Claude Code's built-in plan mode. I'm watching this one closely because it addresses exactly the reliability gap I see in unstructured agent sessions. Source

Kent C. Dodds reveals 160K+ lines of vibe-coded app. He hasn't read most of it, and it includes integrations he doesn't fully understand, intentionally. His argument: "slop is what enables fast parallel experimentation" and the skill is knowing boundaries. I disagree with the framing but respect the honesty. Source

Hot Projects & OSS

OpenHuman tops GitHub Trending. Open-source personal AI agent with 118+ OAuth integrations, local SQLite memory tree, auto-fetches fresh data every 20 minutes. 776 stars and climbing fast. Inverts the typical agent setup by building context about you before you type anything. Source

OpenCode crosses 95K stars with 900 contributors and 2.5M monthly users. The open-source Claude Code/Codex alternative successfully orchestrates between multiple local models. For local-first developers, this is becoming the default terminal agent. Source

CLI-Anything at 35K stars. Auto-generates agent-native CLIs for any software via 7-phase pipeline. Bridges GUI apps and AI agents by creating stateful CLIs with REPL mode and JSON output. The pattern: if agents can't use your software, someone will generate a CLI wrapper for it. Source

Shannon autonomous pentester hits 96.15% on XBOW benchmark. Most commercial DAST tools reach 30-40%. Shannon handles 2FA, SSO, browser automation, and report generation without manual intervention. The gap between AI security tools and traditional scanners is becoming embarrassing for incumbents. Source

SaaS Disruption

AI security funding exceeds $350M in 10 days. Exaforce ($125M for autonomous SOC), Frame Security ($50M for AI social engineering defense), Varonis acquired AllTrue.ai ($114.5M+). Every security sub-category is getting its own AI-native challenger simultaneously. The category is breaking out. Source

Vertical AI replaces entire professional categories in single month. Manifest OS ($60M, AI-native law firm), Hightouch ($150M, agentic marketing), Fazeshift ($22M, accounts receivable automation). These aren't competing for IT budgets. They're competing for labor budgets. 10x cost reduction in each category. Source

Solo developer economics transformed. AI cuts MVP cost to under $500/month, enabling single-person companies in categories that required teams. Every SaaS category now faces competition from individuals with near-zero marginal cost. The threat isn't AI replacing your software. It's AI enabling unlimited new entrants. Source

Policy & Governance

Musk v. OpenAI jury deliberation begins Monday. Three-week trial over whether OpenAI betrayed its nonprofit charter. Advisory jury (6 women, 3 men) in Oakland federal court. Judge retains sole authority on remedies up to $134B disgorgement plus potential ouster of Altman and Brockman. The verdict is advisory but the signaling weight is enormous. Source

Pope Leo XIV decries AI-directed warfare. Warns autonomous weapons lead to "spiral of annihilation." High-level institutional pushback from a new pope who continues to position himself as an active voice on AI governance. Source

Access to frontier AI shifting from commercial to security-constrained distribution. Anthropic's Mythos completed a 32-step simulated cyberattack in 6/10 attempts per UK AISI testing. EU enforcement powers begin August 2026. The assumption that hostile actors lag frontier capabilities by months is no longer safe. Source

Skills of the Day

Drop a DESIGN.md in your workspace before generating any UI. Grab one from awesome-design-md (71K stars) or write your own describing colors, typography, spacing, and component hierarchy. The difference between generic output and pixel-accurate brand-matching UI is this single file.
Implement tiered model routing with Gemini 3.2 Flash as your default. Route 80% of requests to Flash-tier (sub-200ms, 1/15th cost), escalate to Pro only when reasoning complexity demands it. Most classification, extraction, and simple generation tasks show no quality difference at the Flash tier.
Use Anthropic's 2,000-token system prompt rule. Their engineering guide says most teams overload system prompts with 90% irrelevant info. Put static context (identity, schemas, rules) at the front for prefix caching and dynamic context (current input, tool outputs) in the suffix. Context rot degrades recall well before you hit hard context limits.
Audit your agent framework for unauthenticated endpoints. nginx-ui's CVE (one missing middleware line) and PraisonAI's 4-hour exploitation window show that AI framework security is where web security was in 2005. Check every endpoint your agent exposes. Default-deny, not default-allow.
Add a separate evaluator agent for subjective output quality. Anthropic's harness design guide shows agents overrate their own output. A calibrated evaluator with few-shot examples prevents quality drift in long-running sessions. Don't let the generator judge its own work.
Use git worktrees as your isolation primitive for parallel agents. Claude-Squad, Cursor, and Claude Code all converged on this pattern. One branch per agent, merge on completion. It gives you safe parallelism without the complexity of container orchestration.
Check whether your BitLocker configuration requires a PIN. If you're using TPM-only (the Windows default), the YellowKey PoC bypasses your disk encryption with a USB stick. Add a pre-boot PIN until Microsoft patches. One configuration change, enormous security improvement.
Write minimal, precise AGENTS.md files, not comprehensive ones. Research across 138 repos shows LLM-generated context files reduce agent success rates and increase inference cost by 20%+. Less is more. Only include constraints that prevent the most common failure modes.
Use Mem0's multi-signal retrieval pattern (semantic + BM25 + entity matching) for agent memory. Their data shows swapping active memory for a long-context-only baseline drops task completion from 80%+ to 45%. Memory architecture gains rival model scaling gains.
Track your AI-generated code bug rate separately from human-written code. This is Hashimoto's MTBF/MTTR diagnostic applied to your own codebase. If you can't measure where agent output fails, you can't improve it. Instrument before you scale.
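
For the last item above, a minimal Python sketch of what "instrument before you scale" could look like. It does not reproduce Hashimoto's diagnostic; it only computes two simple per-origin numbers, bugs per 1,000 lines and mean time to repair. The `Bug` record, the "agent"/"human" labels, and the `lines_by_origin` input are all assumptions about how you tag your own history (for example, via commit metadata).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Bug:
    origin: str        # "agent" or "human": who wrote the faulty code (hypothetical labels)
    opened: datetime   # when the bug was reported
    fixed: datetime    # when the fix landed

def bug_stats(bugs, lines_by_origin):
    """Return bugs per 1,000 lines and mean time to repair (hours) for each origin."""
    grouped = defaultdict(list)
    for bug in bugs:
        grouped[bug.origin].append(bug)
    stats = {}
    for origin, items in grouped.items():
        repair_hours = [(b.fixed - b.opened).total_seconds() / 3600 for b in items]
        stats[origin] = {
            "bugs_per_kloc": 1000 * len(items) / lines_by_origin[origin],
            "mttr_hours": sum(repair_hours) / len(repair_hours),
        }
    return stats
```

Comparing the two origins side by side is the point; a mean time between failures could be derived the same way from the gaps between `opened` timestamps.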