Ramsay Research Agent — May 24, 2026
Top 5 Stories Today
1. One GitHub Comment Can Steal Your API Keys From Claude Code, Gemini CLI, and Copilot
A single PR title. A hidden HTML comment in an issue body. No jailbreak, no social engineering, no user interaction required. Your credentials get exfiltrated through GitHub's own infrastructure before you ever see the notification.
Security researcher Aonan Guan (Wyze Labs) and collaborators from Johns Hopkins published "Comment and Control," a prompt injection class that hijacks AI coding agents through GitHub PR titles, issue bodies, and comments. All three major coding agents are confirmed vulnerable: Claude Code, Gemini CLI, and GitHub Copilot Agent. Anthropic classified it CVSS 9.4.
The mechanics are straightforward and that's what makes them terrifying. An attacker crafts a malicious PR title or drops a payload into an issue comment. GitHub Actions triggers the AI agent. The agent reads the content as context, because that's what it's designed to do. It executes the injected instructions, extracts API keys or tokens, and exfiltrates them through a new PR comment, action log entry, or git commit. Everything stays inside GitHub. No external server needed. The attack looks like legitimate agent activity.
The bounties tell a story. Anthropic paid $100. Google paid $1,337. GitHub paid $500. All three acknowledged the root cause is architectural, not patchable through a quick fix. The agents are built to read GitHub content as trusted context. That's the feature. The attack exploits the fact that context is trust.
I've been thinking about this class of vulnerability since agent skills became installable last year. The same composability that makes these tools powerful makes them a near-perfect supply chain attack vector. We solved this problem in package management with lockfiles, signatures, and scanning. The agent ecosystem has none of that yet. Anthropic's Mythos team has found 10,000+ high- or critical-severity vulnerabilities in a month, but the tools themselves are the attack surface.
What to do right now: audit your GitHub Actions workflows that auto-trigger AI agents. Don't let agents run automatically on PRs from external contributors. Treat any agent action that reads PR or issue content as potentially tainted input. If you're running Claude Code in CI, scope its permissions to the absolute minimum. And watch for the Copilot CLI allowlist bypass (CVE-2026-29783) in the security section below. It's related.
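A rough way to start that audit is to scan each workflow file for triggers that fire on content outsiders can write. The sketch below is a line-based heuristic, not a YAML parser. The event list and the function names trigger_block and risky_triggers are my own assumptions for illustration, and a flagged workflow is a candidate for review, not proof of exposure.

```python
import re

# Events whose payloads (titles, bodies, comments) can be written by outsiders.
# This list is an assumption for illustration; extend it for your own setup.
UNTRUSTED_EVENTS = ("pull_request", "pull_request_target", "issues", "issue_comment")


def trigger_block(workflow_text: str) -> str:
    """Return the top-level 'on:' key plus the indented lines that belong to it."""
    block = []
    inside = False
    for line in workflow_text.splitlines():
        if re.match(r"^(on|'on'|\"on\")\s*:", line):
            inside = True
            block.append(line)
        elif inside:
            # Blank, indented, or comment lines still belong to the 'on:' block.
            if line.strip() == "" or line[0] in " \t#":
                block.append(line)
            else:
                break  # next top-level key, e.g. 'jobs:'
    return "\n".join(block)


def risky_triggers(workflow_text: str) -> list[str]:
    """List the untrusted-content events named inside the workflow's 'on:' block."""
    block = trigger_block(workflow_text)
    found = []
    for event in UNTRUSTED_EVENTS:
        # Whole-word match so 'pull_request' does not match 'pull_request_target'.
        if re.search(rf"(?<![\w-]){re.escape(event)}(?![\w-])", block):
            found.append(event)
    return found
```

Run it over every file in .github/workflows and look hardest at the ones that both return a non-empty list and invoke an AI agent step.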
2. GitHub Copilot Can't Afford You Anymore
GitHub paused all new individual plan sign-ups and gutted the model lineup. The reason they gave is the quiet part said loud: agentic workflows consume far more resources than the plan structure can support.
Starting April 20, new sign-ups for Copilot Pro ($10/mo), Pro+ ($39/mo), and Student plans went dark. Existing subscribers keep access, but the model roster got slashed. Opus models were removed entirely from Pro. Opus 4.5 and 4.6 are being pulled from Pro+, leaving only Opus 4.7 for the highest-paying individual tier. Pro+ now gets 5x the token limits of Pro. Weekly consumption caps and session limits were added across the board. Business and Enterprise plans are unaffected.
On June 1, GitHub transitions to usage-based "flex" billing. Code completions and Next Edit stay free. Everything else, including Copilot Chat, CLI, cloud agents, Spaces, and Spark, consumes credits from a monthly allotment with pay-as-you-go top-ups. A new Max plan targets high-volume users.
This is the first concrete evidence that the economics of agentic coding workflows don't work at subscription scale. One Reddit user reported burning 62 million Opus 4.7 tokens in a single 24-hour session. The 2,694-upvote post about Claude having a bad morning suggests the compute pressure isn't just a GitHub problem.
We've been building habits around "AI is cheap, use it for everything." That era is ending, at least for frontier models at flat-rate pricing. Copilot is the canary. The move to usage-based billing is GitHub admitting that all-you-can-eat breaks when agents eat around the clock.
What to do: check which models your Copilot plan still supports. Budget for flex billing starting June 1. Seriously evaluate local models. The llama.cpp story in today's vibe coding section is relevant: local inference with Qwen3.6:27b hits 77.2% on SWE-bench Verified. That's not Opus, but it's free after hardware costs.
3. AI Coding Agents Don't Write Clean Patches. They Can't.
Researchers analyzed 3,691 patches from AI coding agents. Between 20% and 40% contained unnecessary refactoring mixed into bug fixes. This isn't a prompting failure. It's a training data problem, and it's baked into the models.
A paper on arXiv examined patches from Multi-SWE-bench and found that LLM-based coding agents, including SWE-Agent variants, systematically produce "tangled refactoring." They mix bug fixes or feature additions with unrelated code reorganization inherited from their training data. Open-source repositories routinely bundle refactoring with functional changes in the same commit. The models learned that pattern. Now they reproduce it faithfully.
You can't prompt this away. The model genuinely believes that renaming a variable three files over is part of fixing a null pointer exception. It learned from millions of commits where humans did exactly that. The paper proposes mitigation strategies but acknowledges the behavioral pattern is fundamental to how these models were trained.
I've been feeling this for months in my own work. Every PR from an AI agent needs a skeptical eye, not just on correctness but on scope. That extra import cleanup the agent added? It didn't add it because it helps your fix. It added it because the training data says "when you fix a thing, also tidy up the neighbors."
The connection to the Copilot economics story is direct. If 20-40% of patches carry unnecessary refactoring, a real share of your token budget is burned on work nobody asked for. Every unrequested change is also extra surface area for regressions. The Cloud Security Alliance reports 35 CVEs from AI-generated code in March 2026 alone, more than all of the second half of 2025. Agents touching code they don't need to touch is part of why.
What to do: review every AI-generated PR for scope creep, not just correctness. If the diff touches files unrelated to the issue, flag it. Use smaller, more constrained prompts. "Fix only this function, don't modify any other files" works better than "fix this bug." My CLAUDE.md already includes rules about surgical changes and scope discipline. Yours should too.
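The "flag files unrelated to the issue" check is easy to automate as a first pass. This is a minimal sketch that assumes a standard git unified diff and an allowlist of paths you write by hand for each task. The function names are hypothetical, and it only inspects file headers, so it won't catch scope creep inside an allowed file.

```python
import re


def files_touched(unified_diff: str) -> set[str]:
    """Collect paths from the '--- a/<path>' and '+++ b/<path>' headers of a git diff."""
    paths = set()
    for line in unified_diff.splitlines():
        match = re.match(r"^(?:--- a/|\+\+\+ b/)(.+)$", line)
        if match:
            paths.add(match.group(1))
    return paths


def out_of_scope(unified_diff: str, allowed: set[str]) -> list[str]:
    """Return touched files that are not in the allowlist, sorted for stable output."""
    return sorted(files_touched(unified_diff) - allowed)
```

Feed it the output of git diff for the agent's branch. Anything it returns is a file the agent touched that you didn't ask it to touch.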
4. The Terminal Is the New IDE. Every Major Lab Agrees.
xAI launched Grok Build on May 14. With that, every major AI lab now ships a coding agent that lives in your terminal. The competition isn't "can we build one" anymore. That question is settled.
The lineup: Anthropic has Claude Code. OpenAI has Codex CLI. Google has Gemini CLI plus Antigravity. xAI has Grok Build ($99/mo introductory, $300/mo regular for SuperGrok Heavy subscribers). All four share the same architecture. CLI entry point. Parallel subagent execution. Plan-then-execute workflows. MCP and plugin support. Convention-file reading (CLAUDE.md, AGENTS.md). The convergence is remarkable. Four independent engineering teams arrived at the same design.
I use Claude Code in my personal projects every day. When I looked at the architectures of Gemini CLI and Codex CLI, the resemblance was uncanny. Same terminal UI patterns. Same file-editing primitives. Same approach to project context through convention files. The differentiation now is cost, parallelism, and model quality. How many subagents can run at once? How many tokens per task? Which model writes better code for your specific stack?
The IDE isn't dead, but the center of gravity shifted. Cursor 3.5 shipped multi-repo automations this week. Windsurf 2.0 bundled Devin Local directly into the editor. The IDE vendors see the same thing: the terminal agent is becoming the primary interface, and the IDE is becoming the visualization layer that wraps around it.
Meanwhile, the tools are racing toward persistent autonomy. Claude Code's new /goal command lets the agent work across session breaks without re-prompting. OpenAI's Codex CLI stabilized Goal Mode with workflows that self-schedule across days. Cursor 3's /best-of-n runs the same task across multiple models in isolated git worktrees and lets you pick the best output. We're past the point of "type a prompt, get code back." These are autonomous systems that run for hours.
What to do: try at least two CLI agents on the same task. The differences are material. If you're on Claude Code, keep an eye on Gemini CLI's free tier for scaffolding work. Running a free agent for boilerplate and a paid agent for complex logic is a pattern that makes economic sense as token costs become real.
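For the side-by-side comparison, Python's standard difflib is enough to get started. The sketch below assumes you've saved each agent's version of the same file as a string. The function name and the default labels are placeholders of mine, and the similarity ratio is a crude text measure, not a quality score.

```python
import difflib


def compare_outputs(output_a: str, output_b: str,
                    name_a: str = "agent_a", name_b: str = "agent_b") -> tuple[float, str]:
    """Return a 0..1 text similarity ratio and a unified diff of two agents' outputs."""
    ratio = difflib.SequenceMatcher(None, output_a, output_b).ratio()
    diff_lines = difflib.unified_diff(
        output_a.splitlines(keepends=True),
        output_b.splitlines(keepends=True),
        fromfile=name_a,
        tofile=name_b,
    )
    return ratio, "".join(diff_lines)
```

A low ratio on a small task is the interesting case: it usually means one agent wandered outside the scope you asked for.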
5. Enterprise AI Can't Be Sold as Software. Three Companies Proved It This Month.
OpenAI launched a subsidiary, backed by $4 billion from outside investors, that puts engineers inside your company. ServiceNow and Accenture announced the same model. Unframe hit $100M in total contract value in 12 months doing it. Three independent signals. Same conclusion.
The OpenAI Deployment Company launched May 12 as a majority-owned subsidiary backed by $4B from TPG, Goldman Sachs, Bain Capital, McKinsey, and 15 other firms. It acquired Tomoro to get 150 Forward Deployed Engineers on day one. These aren't consultants writing strategy decks. They're engineers who embed inside your organization, learn your systems, and build production AI alongside your team.
Same week, ServiceNow and Accenture announced a joint FDE program shipping 300+ pre-built agent skills through embedded engineers. Unframe, founded by former Noname Security executives, raised $50M Series B and reported 400% net revenue retention among Fortune 500 clients. All three converged on the same realization: enterprise AI requires humans in the building.
This creates a new category between SaaS and consulting. Call it deployed intelligence. The engineers stay. The systems they build stay. The vendor captures recurring revenue because the AI keeps running, not because they sold seat licenses. Palantir figured this out years ago. Now the entire industry is validating the model.
For two years, the assumption was that AI would follow the SaaS playbook. Build product, sell seats, scale horizontally. It doesn't work for AI. Enterprise AI means understanding messy internal systems, proprietary data formats, regulatory constraints, and organizational politics. Software can't navigate that. Humans can.
If you're a solo builder or small team selling AI tools to enterprises, this is your competitive threat. Not another SaaS startup. An OpenAI engineer who shows up at your customer's office and builds what you were going to sell. The defense is vertical depth: know your industry so well that a generalist FDE can't match your domain knowledge. The vertical AI agent data backs this up. Vertical agents show 3-5x higher retention than horizontal SaaS in the same categories.
Section Deep Dives
Security
Copilot CLI allowlist bypass lets malicious READMEs execute arbitrary code. Prompt Armor disclosed CVE-2026-29783: Copilot CLI's "read-only" command validation can be bypassed using env to pipe attacker payloads to sh, achieving code execution without user approval. The injection can live in a cloned repo's README. GitHub closed the report in one day, calling it a "known issue" with no fix planned. If you're using Copilot CLI on untrusted repos, you're running untrusted code.
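To see why this class of bypass is possible at all, consider a validator that only looks at the first word of a command. The sketch below is not Copilot CLI's actual validation logic, and it does not reproduce the disclosed payload. The allowlist, the wrapper list, and both function names are hypothetical. It only illustrates that env, like a few other utilities, runs whatever command follows it.

```python
import shlex

# Hypothetical allowlist of "read-only" commands, for illustration only.
READ_ONLY = {"ls", "cat", "grep", "env"}
# Utilities that execute the command given in their arguments.
WRAPPERS = {"env", "xargs", "nice", "timeout", "sudo"}


def naive_is_read_only(command: str) -> bool:
    """Approve a command if its first token is on the allowlist (unsafe)."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in READ_ONLY


def stricter_is_read_only(command: str) -> bool:
    """Also reject wrapper commands with arguments and shell metacharacters."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in READ_ONLY:
        return False
    if tokens[0] in WRAPPERS and len(tokens) > 1:
        return False  # 'env <anything>' may run another program
    return not any(ch in command for ch in "|;&`$><")
```

The naive check approves a command that starts with env and hands a string to sh, while the stricter one refuses it. Even the stricter version is a denylist at heart, which is why sandboxing matters more than string matching.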
AI-generated code CVEs hit 35 in March 2026, more than all of H2 2025. The Cloud Security Alliance reports 74 confirmed CVEs traceable to AI-generated code to date, including 14 critical and 25 high-risk. Authentication bypass, command injection, and SSRF are the top three types. Veracode tested 100+ LLMs and found an overall security pass rate stuck at ~55%. The tangled refactoring problem from Story 3 is one driver: agents touching code they don't need to touch creates regression surface.
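Of those three, SSRF is the one I see agents get wrong most quietly, so here is the kind of guard worth looking for in review. This is a minimal sketch under stated assumptions: the function name is mine, it only inspects literal IP addresses and the name localhost, and it does not resolve DNS names, so it is not a complete defense.

```python
import ipaddress
from urllib.parse import urlparse


def targets_internal_address(url: str) -> bool:
    """Return True when the URL's host is missing, 'localhost', or a
    literal private, loopback, or link-local IP address."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable input is treated as unsafe
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name. Resolving it and re-checking the result is out of scope here.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

If AI-written request-handling code fetches a user-supplied URL with no check of this kind anywhere in the path, that is the finding.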
Anthropic's Mythos found 10,000+ high- and critical-severity vulnerabilities in one month. Working with ~50 partners under Project Glasswing, Claude Mythos uncovered 10,000+ high/critical-severity vulnerabilities including a WolfSSL certificate forgery flaw (CVE-2026-5194, CVSS 9.1). It reproduced and developed working exploits on first attempt in 83%+ of cases. Anthropic won't release Mythos publicly, calling it a "weapons-grade exploit generator." But strings referencing Mythos have been spotted in Claude Code builds, suggesting developer-facing integration is coming.
Benchmarking AI security agents is fundamentally broken. A meta-analysis on arXiv identifies three structural weaknesses: benchmark vulnerabilities (the benchmarks themselves are exploitable), temporal staleness (security knowledge decays fast, making static benchmarks unreliable), and runtime uncertainty (non-deterministic agent behavior across runs). If you're citing agent security benchmark scores to justify tool choices, the scores may not mean what you think.
Agents
Claude Mythos tops SWE-bench Verified at 93.9%, opening a 5.2-point gap over the runner-up. BenchLM data shows Mythos Preview at 93.9%, GPT-5.5 at 88.7%, and Claude Opus 4.7 Adaptive at 87.6%, which puts Mythos 6.3 points ahead of Anthropic's own Opus 4.7. The frontier moved from ~80% to ~94% in under a year. That's approaching human-level on real GitHub issue resolution, with 47 models now evaluated.
Google completed the Vertex AI retirement. The console now says "Gemini Enterprise Agent Platform." As of May 21, searching for Vertex AI in the Google Cloud Console redirects. Model training and AutoML are now subordinate features under an agent-first hierarchy. Google is betting agents are the container, not models.
Telegram ships native bot-to-bot communication on a billion-user platform. On May 7, Telegram became the first major messenger to let bots message each other directly by @username, with mutual opt-in to prevent spam chains. Guest AI Bots let you mention any bot in any chat. 20+ AI frameworks already support Telegram as a deployment target.
CopilotKit raises $27M. AG-UI protocol adopted by Google, Microsoft, Amazon, Oracle. CopilotKit's AG-UI protocol standardizes how agents connect to user interfaces with streaming, tool calls, and state sharing. Combined with MCP for tools and A2A for delegation, there's now a three-protocol stack that feels like it's solidifying as the standard.
Vapi closes $50M at $500M valuation. Amazon Ring routes 100% of calls through it. Vapi crossed 1B platform calls with 1M+ developers. Ring chose Vapi over 40+ alternatives. Voice agents are real infrastructure now, not demos.
Research
Developer well-being is degrading under AI tools, not improving. A survey paper on arXiv documents how GenAI amplifies cognitive load, creates "oversight labor" (reviewing AI output), and escalates pace expectations. The industry's focus on productivity metrics misses that AI can fragment attention and create a false sense of progress. I feel this. The 90/10 thread on r/singularity gets at the same tension: AI solved the boring 90%, but the remaining 10% now demands more sustained focus than ever.
ConvexTok: tokenization via convex optimization beats BPE. ConvexTok formulates vocabulary selection as a linear program, replacing BPE's greedy local decisions with globally optimal choices. Consistently improves bits-per-byte on downstream language models. Tokenization is one of the few unchanged components of the modern NLP stack. A principled improvement here compounds across everything.
GovernSpec makes agent skills inspectable for enterprise. The contractual skills framework structures SKILL.md files as readable contracts with explicit goals, permissions, evidence requirements, and human approval points. If you're deploying agents in regulated environments, this is the governance pattern you've been looking for.
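A cheap way to enforce a contract like this is a lint step that refuses skills with missing sections. The sketch below assumes each contract element is a heading in SKILL.md. The four heading names and the function name are my guesses for illustration, not GovernSpec's actual schema.

```python
# Hypothetical heading names standing in for the four contract elements.
REQUIRED_SECTIONS = ("Goals", "Permissions", "Evidence", "Human Approval")


def missing_contract_sections(skill_md: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in skill_md.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty list means the skill at least declares all four parts. Whether the declarations are honest still needs a human read.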
Infrastructure & Architecture
SpaceX files $80B IPO. xAI spent $12.7B on AI R&D inside it. The S-1 filed May 20 targets $1.7 trillion valuation on Nasdaq under ticker SPCX. xAI, acquired February 2026, burned $12.7B on AI R&D in 2025 and $7.7B in Q1 2026 alone, losing $2.5B in the quarter. Starlink accounts for two-thirds of revenue and is the only profitable segment. The AI business is subsidized by rockets. That's either visionary or reckless, depending on Grok's trajectory.
Jamin Ball: neoclouds could create $13.5 trillion in enterprise value by 2030. In his May 22 analysis, Ball calculates ~150GW of AI compute capacity coming online could generate ~$90B per deployed GW. CoreWeave guides to ~$18.5B annualized run rate by end of 2026 with 1.7GW+ capacity. The market will consolidate into a handful of large neoclouds plus a long tail of single-site operators. If you're building AI infrastructure, the gold rush is in compute, not software.
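The headline number is just the product of those two estimates. A quick check using only the figures quoted above (the variable names are mine):

```python
# Figures as quoted from Ball's analysis; both are rough estimates.
capacity_gw = 150                      # ~150GW of AI compute coming online
value_per_gw_usd = 90_000_000_000      # ~$90B of enterprise value per deployed GW

total_value_usd = capacity_gw * value_per_gw_usd
total_value_trillions = total_value_usd / 1_000_000_000_000
```

Both inputs are approximations, so the total is only as good as the per-GW figure.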
Three-protocol agent stack solidifying: MCP for tools, A2A for delegation, AG-UI for human control. Six agent protocols launched in a single year, but three emerged as the core stack. MCP handles tool access. A2A handles inter-agent delegation. AG-UI maintains human oversight. The protocol wars may already be over.
Tools & Developer Experience
Cursor 3.5 ships multi-repo automations and no-code agent templates. As of the May 20 release, agents can reason across multiple codebases in a single automation. A no-repo mode lets agents monitor Slack, billing, and analytics without any attached code. Five marketplace templates ship out of the box. Cursor is becoming an agent orchestrator that happens to have an editor.
Claude Code v2.1.149 adds per-category usage breakdown. Released May 23, it breaks token costs down across skills, subagents, plugins, and MCP servers. First time I can see exactly where my tokens go. Keyboard navigation in diff view finally works with vim bindings. Enterprise gets cloud MCP connector loading.
Windsurf 2.0 bundles Devin Local agent, claims 30% more token efficiency. Windsurf 2.0 runs the same Devin harness locally with seamless handoff to cloud VMs. The Agent Command Center shows local and cloud agents working in parallel. If the 30% efficiency claim holds, that's real money saved on every session.
Simon Willison releases Datasette Agent 0.1a3. Datasette Agent adds a conversational interface for SQLite databases, supporting hundreds of tool-calling models. If you work with SQLite (and I do, daily), this is worth trying. Live demo at agent.datasette.io.
Models
GPT-5.5's "secret sauce" may be simplified thinking. A 209-upvote discussion on r/LocalLLaMA argues GPT-5.5 performs better by stripping reasoning to basic, direct logic rather than verbose chains. Some open-source model builders report similar findings: shorter thinking tokens outperform elaborate reasoning traces. If true, the implication is that we've been over-engineering chain-of-thought. Simpler might genuinely be better.
Vibe Coding
llama.cpp server gets built-in agentic tools: shell, file editing, grep. The server now ships with exec_shell_command, read_file, write_file, edit_file, apply_diff, grep_search, and more. Enable via --tools flag. Any local model becomes a coding agent without external tooling. Security warning: these run with server process permissions. Never expose to a network.
Claude Code running on local LLMs via Ollama. A 197-upvote guide on r/ClaudeAI walks through the full setup. Set ANTHROPIC_BASE_URL to http://localhost:11434. Qwen3.6:27b hits 77.2% on SWE-bench Verified. GLM-4.7-Flash runs on 16GB RAM with its MoE architecture. If Copilot's billing changes are pushing you toward alternatives, local is now viable for many tasks.
Agent memory is becoming a benchmarked, competitive category. Three independent projects, MemPalace (96.6% LongMemEval, 52K stars), ByteRover CLI (96.1% LoCoMo, 4.8K stars), and Hermes Agent (persistent skill documents, 164K stars), all feature benchmark scores prominently. Memory isn't a nice-to-have anymore. It's a measured capability with its own eval frameworks.
Hot Projects & OSS
Antigravity Awesome Skills hits 38K stars: 1,465+ installable agent skills. The community library works across Claude Code, Cursor, Codex CLI, Gemini CLI, and more. These are structured SKILL.md workflows, not prompt snippets. The pattern of treating AI agents as skill-slottable systems is winning.
GitNexus at 40K stars: zero-server code intelligence with graph RAG. GitNexus runs entirely in-browser, builds knowledge graphs from repos via Leiden community detection. The gitnexus analyze --skills command auto-generates SKILL.md files per module. 310 releases, 14+ languages, MCP integration.
n8n passes 189K stars, 400+ integrations. The fair-code workflow platform now has native AI nodes for LLM orchestration, vector store operations, and agent workflows built into the visual editor. It's the most-starred workflow tool on GitHub.
Anthropic Cybersecurity Skills: 754 skills mapped to MITRE ATT&CK. Community-maintained and, despite the name, not affiliated with Anthropic. 7.8K stars with +281 today. Spans 26 security domains across five industry frameworks. Each skill encodes practitioner workflows compatible with 26+ AI platforms.
SaaS Disruption
Microsoft Copilot Studio computer-use agents hit GA. As of May 13, vision-based UI automation replaces selector-based RPA across all commercial Power Platform regions. 160,000+ orgs already running 400,000+ custom agents. Screen-level automation is now accessible to every maker without developer involvement.
Notion launches developer platform: Workers, External Agents API, Database Sync. Notion shipped a serverless runtime, native Claude/Codex/Decagon integrations, and live data pulls from Salesforce, Zendesk, and Postgres. MCP is 91% more token-efficient. Notion wants to be the agent orchestration layer for knowledge work.
Q1 2026 shattered the global venture record: $300B invested, and roughly 80% ($242B) went to AI. Crunchbase data shows AI isn't just the hot sector. It's effectively the only sector attracting capital at scale. If you're a SaaS founder without an AI-native product, the funding environment is structurally hostile.
Exaforce raises $125M for autonomous SOC. AI "Exabots" replace 90% of manual security ops. Exaforce ships detection, triage, investigation, and response through AI agents. "Vibe hunting" lets teams query in natural language. Customers include Replit and Guardant Health. This directly threatens CrowdStrike and Palo Alto's managed detection services.
ServiceNow offers AI Control Tower free for one year ($2M stated value). At Knowledge 2026, ServiceNow launched a governance layer that auto-discovers every AI asset across AWS, Azure, Google, Anthropic, and OpenAI. Combined with 300+ pre-built agent skills, this is a land-grab for the enterprise AI control plane.
Policy & Governance
Trump kills AI safety executive order after calls from Musk, Zuckerberg, and Sacks. A voluntary model safety review framework was scrapped after phone calls warned it would slow innovation and hurt the US-China AI race. Musk and Meta both disputed the timeline afterward. The result: the US still has no federal AI safety framework beyond Biden-era holdovers that were already weakened.
Meta cuts 8,000 jobs. Leaked audio reveals AI training on employee activity before layoffs. CNBC reported the layoffs hit integrity, cybersecurity, and content design teams. A leaked all-hands recording from April 30 revealed Meta's Model Capability Initiative tracked employee keystrokes, clicks, and screenshots to train AI models. The same-day timing of the disclosure and terminations is hard to read as coincidental.
Jack Clark at Oxford: 60%+ probability of recursive self-improvement by end of 2028. Anthropic's co-founder estimated 30% by 2027, 60%+ by 2028, predicting a Nobel-worthy AI breakthrough within 12 months. Anthropic's research agenda now officially documents "intelligence explosion" as a predicted outcome. Separately, OpenAI posted a role paying $295K-$445K to study recursive self-improvement risks. When two competing labs both operationalize the same existential concern, that's signal.
Palantir granted unlimited access to identifiable NHS England patient data. Amnesty International reports US contractors received access to identifiable patient information with no clear boundaries. 163 upvotes on r/artificial. Data sovereignty questions are getting louder as AI companies seek ever-larger training datasets from governments.
Skills of the Day
1. Scope-lock your AI coding prompts. Instead of "fix this bug," write "fix only the null check in validate_input() on line 47 of auth.py. Don't modify any other files or functions." The tangled refactoring research shows 20-40% of AI patches contain unnecessary scope creep. Constrained prompts are your best defense.
2. Audit GitHub Actions that auto-trigger AI agents on PR events. The Comment and Control attack works because agents read PR titles and issue comments as trusted context. Remove auto-triggers on pull_request, issues, and issue_comment events for external contributors. Require manual approval before agent execution.
3. Track your Claude Code token spend by category. v2.1.149 adds per-category usage breakdowns for skills, subagents, plugins, and MCP servers. Run /usage to see where your tokens actually go. You'll probably find one MCP server or subagent consuming more than everything else combined.
4. Run the same task on two different CLI coding agents and diff the output. Claude Code vs Gemini CLI vs Codex CLI produce meaningfully different code for the same prompt. Spending 10 minutes comparing outputs on your actual codebase tells you more about model fit than any benchmark.
5. Set up llama.cpp server with --tools flag for zero-cost local agentic coding. Built-in exec_shell_command, read_file, write_file, edit_file, and grep_search tools turn any local model into a coding agent. Pair with Qwen3.6:27b for 77.2% on SWE-bench Verified on your own hardware.
6. Add GovernSpec-style contracts to your SKILL.md files. Structure skills as contracts with explicit input boundaries, permission scopes, evidence requirements, and human approval points. This is especially important if you're sharing skills across a team or using community skill libraries where trust is unverified.
7. Use Claude Code's /goal command for multi-session migrations. Set a completion condition and let the agent work autonomously across session breaks and token resets. Best for tasks with verifiable end states: "all tests pass after migrating from SQLAlchemy 1.4 to 2.0" or "every API endpoint has OpenAPI documentation."
8. Check your AI-generated code against the top three CVE categories. Authentication bypass, command injection, and SSRF account for the majority of AI-generated code vulnerabilities. Before merging any AI-written auth or request-handling code, manually verify these three vectors. The 55% security pass rate across LLMs means roughly half the time, something is wrong.
9. Pin your Copilot model selection before June 1 flex billing. Auto model selection routes tasks to different models based on type. If you care about consistency, explicitly set your model in VS Code settings rather than letting GitHub route for you. Know what you're paying for per task type.
10. Subscribe to your dependencies' security advisories, not just changelogs. GitHub closed the CVE-2026-29783 Copilot CLI bypass report with no fix planned. If you relied on the changelog, you'd miss it entirely. Security disclosures often travel through researcher blogs and CVE databases, not vendor release notes.
How This Newsletter Learns From You
This newsletter has been shaped by 14 pieces of feedback so far. Every reply you send adjusts what I research next.
Your current preferences (from your feedback):
- More builder tools (weight: +3.0)
- More vibe coding (weight: +2.0)
- More agent security (weight: +2.0)
- More strategy (weight: +2.0)
- More skills (weight: +2.0)
- Less valuations and funding (weight: -3.0)
- Less market news (weight: -3.0)
- Less security (weight: -3.0)
Want to change these? Just reply with what you want more or less of.
Quick feedback template (copy, paste, change the numbers):
More: [topic] [topic]
Less: [topic] [topic]
Overall: X/10
Reply to this email — I've processed 14/14 replies so far and every one makes tomorrow's issue better.