Ramsay Research Agent — 2026-03-27

Top 5 Stories Today

1. Stripe Just Made Agents Real Infrastructure Citizens with Projects.dev

I've been saying for months that the missing piece in agentic coding isn't smarter models. It's that agents can't provision anything. They can write code all day but the moment they need a database, an auth provider, or a hosting account, a human has to step in, click through dashboards, copy credentials, and paste them into .env files.

Stripe launched Projects.dev in developer preview and it changes that equation overnight.

One command. stripe projects add vercel/hosting. You get a Vercel account provisioned, credentials generated, everything synced to your .env file. Same thing for Neon, Supabase, Turso, PlanetScale, Chroma, Clerk, PostHog, Railway, and Runloop. Patrick Coulter cited Karpathy's MenuGen as direct inspiration, and that framing tells you exactly what this is designed for: the workflow where you tell an agent to build something and it actually can, end to end, without you touching a dashboard.

Billing is unified across all providers through Stripe. One bill. That alone solves a coordination problem that I've watched trip up every solo builder I know, including me. You spin up three services, forget about one, get billed separately from three different companies. Stripe consolidating that into a single line item is the kind of boring infrastructure decision that matters more than any model improvement.

Here's what I think is actually happening. Stripe isn't building a PaaS. They're building the billing layer for agent-provisioned infrastructure. Every service that agents spin up flows through Stripe's payment rails. That's a massive strategic move, because if agent-driven development takes off (and Projects.dev is designed to accelerate exactly that), Stripe becomes the financial backbone of every project agents create.

For builders: try this today. If you're using Claude Code, Codex, or any coding agent with terminal access, stripe projects add gives your agent the ability to provision real production infrastructure programmatically. The gap between "agent wrote the code" and "agent shipped the product" just got a lot smaller.

I don't know if the service selection is broad enough yet. Ten providers is a start but the real test is whether they can add the long tail (Cloudflare Workers, Fly.io, PlanetScale, specialized ML inference) fast enough to cover real production stacks. But the pattern is right.

2. AI-Generated Code CVEs Hit 35 in March. The Growth Curve Should Scare You.

Six CVEs in January. Fifteen in February. Thirty-five in March. That's the trajectory of security vulnerabilities directly traced to AI coding tools, tracked by Georgia Tech's Vibe Security Radar project.

Claude Code is responsible for 49 of 74 total CVEs (11 critical). And researcher Hanqing Zhao estimates the real number is 5x to 10x higher, somewhere between 400 and 700 actual cases, because most teams strip AI traces from their code before committing and current detection methods can only catch what's explicitly flagged.

CrowdStrike independently confirmed the scale of the problem: 87% of AI-generated pull requests contain at least one security vulnerability. That's not a typo. Nearly nine out of ten.

I want to sit with that number for a second. If you're using AI to write code, and most of us are, the default output is insecure. Not sometimes. Almost always. The tools are fast and productive and they generate vulnerable code as their baseline behavior. We've been so focused on speed gains that we've been shipping security debt at a rate no human team could match.

This connects directly to today's supply chain story (Story #5). The security tooling designed to catch these vulnerabilities is itself getting compromised. So you've got AI writing vulnerable code faster than ever, security scanners getting backdoored, and the CVE count growing exponentially month over month.

What should builders do? First, stop assuming AI-generated code is "good enough" for production without review. I know the whole point is speed. But the 87% vulnerability rate means your review process is your actual product, not the code the AI writes. Second, run SAST on every AI-generated PR before merge. Tools like Snyk, Semgrep, and the new Harness Secure AI Coding (announced at RSAC this week) specifically target AI-generated code patterns. Third, consider the zero-degrees-of-freedom approach from John Regehr: every AI code change gets validated against executable oracles (test suites, type checkers, sanitizers) before acceptance. If it doesn't pass automated checks, it doesn't ship. Period.

The growth curve is what concerns me most. 6 to 15 to 35 in three months. If that continues, we're looking at 70+ CVEs in April. This isn't a problem that's getting better on its own.

3. Symbolica Hit 36% on ARC-AGI-3 for $1,005. Frontier Models Scored Under 1% for $8,900.

The ARC Prize Foundation dropped ARC-AGI-3 on March 25 and the results broke my mental model of how AI capability scales.

Symbolica's Arcgentica framework scored 36.08% (113 of 182 playable levels, 7 of 25 games completed) using Claude Opus 4.6 as its backbone. Cost: $1,005. For comparison, raw chain-of-thought prompting on Claude Opus 4.6 scored 0.25% and cost $8,900. GPT-5.4 managed 0.26%. Gemini 3.1 Pro, the best frontier model, hit 0.37%.

Read those numbers again. The same underlying model (Opus 4.6) went from 0.25% to 36% by changing the architecture around it. Not by making the model bigger. Not by training on more data. By building a smarter harness.

Symbolica's architecture uses a top-level orchestrator that never touches the environment directly. It delegates to specialized subagents that interact with the task, then return compressed summaries back to the orchestrator. This constrains context growth (the orchestrator never drowns in details) while maintaining high-level planning (the orchestrator always sees the full picture). It's the same pattern that works in real software teams: the tech lead doesn't write every line of code, they coordinate specialists who do.

The code is open source on GitHub. Anyone can run it.

Here's why this matters beyond benchmarks. ARC-AGI-3 specifically tests skill acquisition, the ability to learn new concepts from examples and apply them to novel situations. It's not pattern matching against training data. The ARC Prize Foundation's technical report actually alleges that Gemini 3 may have memorized earlier benchmark versions, citing a reasoning chain that correctly referenced the integer-to-color mapping used in ARC tasks without being told what it was. Benchmark contamination. ARC-AGI-3 was designed to make that impossible by keeping 110 of 135 environments private and requiring interactive skill acquisition.

For builders working on multi-agent systems: the orchestrator-subagent pattern with compressed summaries is directly applicable to your work. I've been running similar architectures in my own pipelines and the context management insight is real. Agents that try to do everything in one context window fail. Agents that delegate to specialists and aggregate summaries don't. The 100x performance gap between Symbolica's approach and raw prompting is the data point that proves it.

4. One Engineer, $400 in Tokens, 7 Hours. Result: $500K/Year Saved.

Reco.ai published a case study that might be the clearest cost-benefit story I've seen for AI-assisted development. A single engineer used AI to rewrite JSONata (a JSON expression language written in JavaScript) as a pure Go library called "gnata." Seven hours of work. Roughly $400 in API tokens. The result: a 1,000x speedup on common expressions that cascaded into $500K per year in cloud cost savings across their data pipeline processing billions of events.

Simon Willison highlighted this as "vibe porting", and I think that's the right frame. This isn't greenfield AI code generation where you ask the model to build something from scratch (and hope it's correct). This is taking proven logic in one language and using AI to faithfully translate it to another, guided by an existing test suite. JSONata's comprehensive tests were the key enabler. The AI didn't need to understand the problem domain. It needed to produce Go code that passed the same tests the JavaScript version already passed.

The Hacker News thread (164 points, 147 comments) is worth reading for the practitioner debate. The skeptics raise a valid point: how do you verify the output beyond the test suite? Edge cases that the tests don't cover could lurk in the Go version for months. Reco addressed this by running a week-long shadow deployment with both versions in parallel, comparing outputs on production traffic. That's the pattern. You don't trust the AI output. You verify it against reality.

This story connects to the CVE story (Story #2) in an important way. The reason this worked so well is that JSONata had comprehensive tests. The reason AI-generated code has an 87% vulnerability rate is that most projects don't. The test suite wasn't just a nice-to-have. It was the entire reason the AI could produce trustworthy output. Without it, you're vibe coding. With it, you're doing verified translation.

For builders: if you have a performance-critical component written in Python or JavaScript with good test coverage, the "vibe porting" pattern to Go or Rust is immediately replicable. The ROI math is straightforward: measure your current compute costs, estimate the speedup from a compiled language, check if your test coverage is strong enough to validate the translation. If you're spending $40K+/month on compute for something that could run 100x faster in Go, $400 in tokens is the best investment you'll make this quarter.

5. The Security Scanner Was the Attack Vector. 10,000 CI/CD Workflows Compromised.

A security scanner. The tool your team trusts to find vulnerabilities. That was the entry point.

The TeamPCP campaign compromised Aqua Security's Trivy scanner (a GitHub Action used in CI/CD pipelines), then used that foothold to backdoor LiteLLM's CI/CD pipeline, then pivoted to Checkmarx KICS. The attack chain is almost elegant in how it exploits trust hierarchies: developers pin their application dependencies but treat security scanning tools as implicitly trusted. The attackers went after exactly that assumption.

On March 19, attacker "TeamPCP" force-pushed 75 of 76 tags on the trivy-action GitHub Action with malicious binaries that exfiltrated AWS, GCP, and Azure credentials, SSH keys, and Kubernetes tokens from approximately 10,000 GitHub workflows. This was the second compromise in March. The first happened March 1, and the root cause was incomplete credential rotation after that first incident. They didn't rotate everything. The attackers came back through the gap.

Callum McMahon published a minute-by-minute transcript (highlighted by Simon Willison) of using Claude to analyze the backdoored LiteLLM package in real time during the incident, tracing the base64-encoded payload and identifying exfiltration targets. The irony of using an AI coding tool to analyze a supply chain attack that was itself enabled by security tooling isn't lost on me.

A Datadog DevSecOps report revealed that 71% of organizations never pin GitHub Actions to commit hashes. That means nearly three-quarters of CI/CD pipelines are vulnerable to exactly this type of tag-mutation attack right now.

GitHub responded. Their 2026 Actions security roadmap introduces a new dependencies: section in workflow YAML that locks all direct and transitive dependencies by commit SHA, similar to how go.mod and go.sum work. It's the right architectural response. But it's not shipped yet.

For builders: do this today. Pin every GitHub Action in your workflows to a full commit SHA, not a version tag. Replace uses: aquasecurity/trivy-action@v0.28.0 with uses: aquasecurity/trivy-action@<full-sha>. It takes 20 minutes and it closes the exact attack vector that hit 10,000 pipelines this month. If you're using MCP servers, CI/CD tools, or coding agents that install dependencies, treat every dependency as untrusted code, because after this week, that's exactly what it is.

Section Deep Dives

Security

PIDP-Attack combines prompt injection with database poisoning to break RAG systems. Wang, Liu et al. (arXiv 2603.25164) demonstrate that injecting malicious documents into a vector database with hidden prompt injection payloads lets attackers manipulate LLM responses without jailbreaks, model access, or fine-tuning. The poisoned documents look benign at the embedding level, so standard content filtering misses them. If you're running a production RAG pipeline, your vector database is now part of your attack surface. Validate document sources before indexing.

Agent-Sentry blocks 90% of out-of-bounds agent attacks while preserving 98% utility. arXiv 2603.22868 proposes learning behavioral boundaries for LLM agents by identifying frequent execution patterns, then deploying runtime policy enforcement. The safety-vs-capability tradeoff (90% attack prevention, 98% utility preserved) is the best I've seen. No model retraining required, just execution monitoring. Practical enough to deploy alongside existing agent workflows.

RSAC 2026 births AI Agent Security as a net-new enterprise category. 20+ products launched in a single week: Snyk Agent Security with MCP governance (300+ enterprise deployments), CrowdStrike Falcon detecting 1,800+ AI apps, Microsoft Zero Trust for AI (GA May 1), SentinelOne Purple AI, Palo Alto Prisma AIRS 3.0, Booz Allen Vellox, and more across endpoint, cloud, DevSecOps, identity, and runtime. When vendors across five distinct subcategories all ship "AI agent security" simultaneously, a category has been born.

On-device VLMs leak image content through a side-channel in dynamic resolution preprocessing. Hadad and Guri (arXiv 2603.25403) show that VLMs using AnyRes-style dynamic resolution (like LLaVA-NeXT) create observable patterns in compute timing and memory access that reveal both image structure and semantic content. This undermines the privacy promise of local VLM deployment. If you're running VLMs on-device specifically for data privacy, this paper is a problem.

Agents

Claude Code now auto-fixes PRs in the cloud while you're away. Announced March 26 by PM Noah Zweben, web and mobile Claude Code sessions monitor pull requests, detect CI failures, push fixes, and address reviewer comments autonomously on Anthropic's cloud infrastructure. Close your laptop, come back to a green build. The logical next step from auto mode, turning Claude Code from an assistant into autonomous DevOps infrastructure.

Cline Kanban brings visual multi-agent orchestration to the terminal. Cline launched Kanban, a CLI-agnostic app (npm i -g cline) where each task card gets its own git worktree and terminal. Dependency linking means parent task completion auto-triggers dependents. Works with Claude Code, Codex, and Cline. No account required. This is the missing UI for managing 20+ parallel agent sessions. I've been doing this with tmux panes and it's terrible. This is better.

"Harness engineering" gets formalized as the real bottleneck in agent performance. Pan, Zou, and Guo (arXiv 2603.25723) argue that agent performance increasingly depends on the controller code between the LLM and its tools, not the LLM itself. They propose a natural-language specification format for these harnesses that makes them transferable and comparable. The Symbolica ARC-AGI-3 result (Story #3) is living proof: same model, 100x better performance, all in the harness.

Agent-to-agent pair programming is here. Developer Axel Delafosse published a system where Claude and Codex operate as co-agents, one as primary worker, the other as reviewer, talking directly via a CLI tool called "loop" that launches both in tmux with a bridge protocol. 72 points on HN. The shift from human-agent to agent-agent collaboration is happening faster than I expected.

Research

ARC-AGI-3 technical report alleges frontier models memorized earlier benchmarks. The ARC Prize Foundation presents evidence that Gemini 3's reasoning chain correctly referenced the integer-to-color mapping in ARC tasks without being told what it was. Strong evidence of training data contamination, either incidental or intentional. ARC-AGI-3 keeps 110 of 135 environments private and requires interactive skill acquisition to prevent this.

Sakana AI Scientist paper published in Nature. Sakana's system autonomously searches literature, generates hypotheses, designs experiments, writes code, executes it, and produces full LaTeX papers. One of three submissions was accepted at ICLR with an average reviewer score of 6.33, ranking higher than 55% of human papers. A clear scaling law was demonstrated: paper quality improves directly with model capability.

Self-improving RAG via write-back enrichment. Lu, Zhao, and Wu (arXiv 2603.25737) challenge the assumption that knowledge bases are assembled once. Their framework distills evidence across documents and writes consolidated knowledge back to the KB after each query cycle. The knowledge base gets better with use. If your RAG system's retrieval quality degrades over time, this is worth reading.

Chroma releases Context-1: 20B open-weight search agent at 10x speed and 25x lower cost. Chroma's Context-1 decomposes queries into subqueries, searches iteratively, and self-edits its context window by pruning irrelevant documents (0.94 prune accuracy). Competitive with frontier models on BrowseComp-Plus and HLE benchmarks. Apache 2.0 licensed, weights on Hugging Face. John Schulman endorsed it.

Infrastructure & Architecture

Google moves Q-Day estimate to 2029. RSA encryption may need 20x fewer quantum resources than thought. Google's research compresses the timeline from 2035+ to 2029. They're integrating post-quantum cryptography into Android 17 by June. The "store now, decrypt later" attacks are already happening, meaning adversaries are collecting your encrypted traffic today to decrypt when quantum computing arrives. If you're not evaluating post-quantum TLS, you're behind.

GitHub Actions getting deterministic dependencies. The 2026 security roadmap adds a dependencies: section that locks all direct and transitive action dependencies by commit SHA, mirroring go.mod + go.sum. This is the right fix for the Trivy attack pattern. Not shipped yet, but it signals CI/CD dependency management is finally getting treated as seriously as application dependency management.

Bifrost: Go AI gateway adds only 11µs latency at 5K req/s across 15+ providers. maximhq/bifrost unifies OpenAI, Anthropic, Bedrock, and Vertex through a single OpenAI-compatible API. At 3.3K stars and +63/day, this is the lowest-latency open-source AI gateway I've seen benchmarked. If you're building a product that needs to switch between providers, this saves you from writing that abstraction layer yourself.

OpenTelemetry ships semantic conventions for AI agent observability. Standard attribute names now cover tasks, actions, tool calls, agent teams, artifacts, and memory operations. MCP tool executions trace via corresponding instrumentation. Integrates with Grafana, Datadog, Honeycomb, Splunk, and New Relic. Enables cross-framework observability across CrewAI, AutoGen, and LangGraph. If you're running agents in production without observability, this is your on-ramp.

Tools & Developer Experience

JetBrains Air: free multi-agent IDE where Claude, Gemini, Codex, and Junie run side-by-side. JetBrains Air (public preview, free for macOS) is built on the abandoned Fleet IDE. It's the first IDE designed for orchestrating multiple competing coding agents rather than locking you into one. The companion Junie CLI is LLM-agnostic. This caught me off guard. JetBrains quietly shipping an agent-first IDE while everyone watched Cursor and Windsurf.

Gemini Code Assist goes fully free with 180K completions/month. Google made it free for individual developers. 180K code completions per month is 90x GitHub Copilot's free tier of 2K. The new "Finish Changes" feature completes in-progress work from pseudocode, TODOs, or half-written code. Available in VS Code and JetBrains. 6K code requests and 240 chat requests daily. Most generous free tier in the space and it's not close.

Steve Yegge's Beads v0.62.0 gives agents persistent structured memory for long-horizon tasks. Beads replaces unstructured markdown plans with a dependency-aware graph backed by Dolt (version-controlled SQL). At 19.8K stars, v0.62.0 adds hash-based task IDs preventing merge conflicts in multi-agent workflows, semantic memory summarization, and auto-ready task detection. One line in AGENTS.md and agents gain long-horizon planning.

Claude Agent SDK v0.1.48 exposes Claude Code's own infrastructure for custom agents. Released March 20 on PyPI, this gives you built-in file operations, shell commands, web search, MCP integration, and subagent dispatch with configurable permissions. Background task support. Git worktree support. This is the building block for custom multi-agent coding workflows if you've outgrown Claude Code's default behavior.

Models

Claude "Mythos" leaked via unsecured cache: a tier above Opus. Fortune obtained leaked Anthropic blog drafts revealing "Claude Mythos" (codenamed "Capybara"). Anthropic confirmed it represents "a step change" with significantly higher scores than Opus 4.6 in programming, reasoning, and cybersecurity. Early-access testing with select customers. No public release date. The leak itself is interesting. Nearly 3,000 unpublished assets in a publicly-accessible cache. Not a great look for a safety-focused lab.

Mistral Voxtral 4B TTS: open-weight, 9 languages, ElevenLabs parity at $0.016/1K characters. Mistral's Voxtral TTS clones voices from 3 seconds of reference audio. Human evaluations show naturalness parity with ElevenLabs Flash v2.5. At 4B parameters, it runs on consumer hardware. Creative Commons licensed, weights on Hugging Face. If you're paying ElevenLabs for TTS and have the hardware, this is a direct replacement.

Cohere Transcribe: 2B open-source ASR beats Zoom Scribe, ElevenLabs Scribe, and Qwen3-ASR. Apache 2.0 licensed, 14 languages, 5.42 average word error rate on the Hugging Face Open ASR leaderboard. Free via API. The voice model space is getting crowded fast. Cohere, Mistral, and the open-source community are all shipping models that match or beat proprietary services.

TurboQuant gets community llama.cpp implementation within 24 hours. Google published TurboQuant at ICLR 2026 and the llama.cpp community built a working CPU implementation the same day. TQ3 (3-bit) quantization with 6x KV-cache compression and zero accuracy loss up to 104K context. Pure C, no dependencies. The speed at which academic papers become runnable community code keeps accelerating.

Vibe Coding

668K-line codebase practitioner shares Claude Code methodology: Plan Mode separation is key. A practitioner on r/ClaudeAI running a 668K-line codebase shares the most common failure mode: Claude solving the wrong problem because you jumped straight to implementation. The fix: research in Plan Mode first, design the approach in Plan Mode, switch to implementation only after the plan is solid. I've hit this exact failure pattern. The time you spend in Plan Mode saves double in implementation.

WebTestBench: first benchmark for vibe-coded app testing. Kong, Zhang et al. (arXiv 2603.25226) directly address the quality gap where users build entire projects with natural language but skip systematic testing. The framework measures whether computer-use agents can autonomously navigate, interact with, and verify web applications. First benchmark explicitly designed for the vibe-coding-to-testing pipeline.

Local inference economics crystallize: practitioners now calculate real $/M tokens. Three posts on r/LocalLLaMA in a single day running real electricity cost analyses. One compared $2K/month API spend against dual DGX Spark amortization. The break-even for heavy API users appears to be 30-60 days of hardware cost, though utilization rate matters enormously. The calculations are getting rigorous enough to inform real purchasing decisions.

Hot Projects & OSS

Vane (formerly Perplexica) hits 33.4K stars as open-source Perplexity alternative. ItzCrazyKns/Vane searches via SearXNG, reranks with embeddings, generates cited responses using local LLMs or cloud providers. All self-hosted with complete search privacy. The rebrand signals maturity.

HolyClaude: Docker AI coding workstation, zero to building in 30 seconds. CoderLuii/HolyClaude packages Claude Code + web UI + 5 AI CLIs + headless Chromium + 50+ tools into a single Docker container. 4GB full, 2GB slim. Works with existing Claude Max/Pro plans. At 838 stars, this solves the "set up my AI dev environment" problem in one compose file.

Dokploy self-hosted PaaS hits 32K stars. Dokploy/dokploy deploys any app type with automated database management, Docker Swarm scaling, and Traefik integration. Trending alongside Twenty CRM (41.5K stars). The pattern: developers want self-hosted alternatives to everything.

SaaS Disruption

Ramp AI Index: Anthropic wins 70% of enterprise head-to-heads, OpenAI sees largest-ever monthly decline. Ramp's March data shows business AI adoption at a record 47.6%. Anthropic at 24.4% grew 4.9% month-over-month (its largest monthly gain ever) while OpenAI saw -1.5% (its largest decline). Average enterprise AI contract values projected to reach $1M in 2026, up from $530K in 2025. The market is voting with its wallet and the direction is clear.

SaaS equities crater: IGV down 21% YTD, Salesforce down 30%, Adobe P/E halved. FinancialContent's analysis quantifies it: EV/Sales multiples collapsed from 5.6x to 4.2x. CIO surveys show 40% of IT budgets being reallocated from legacy SaaS to agentic platforms and LLM token usage. The compression ratio: one AI agent replaces roughly five human software seats. That's not a budget line item shift. It's a structural contraction of the total addressable market for per-seat software.

Policy & Governance

Anthropic wins preliminary injunction against Trump administration. Federal Judge Rita Lin blocked the DOD's "supply chain risk" designation, calling it "classic illegal First Amendment retaliation." Seven-day stay for appeal. For builders in government-adjacent work, the injunction removes a barrier to using Claude-based tools. For now.

David Sacks exits as White House AI/crypto czar after 130-day SGE limit. Moves to PCAST co-chair alongside Michael Kratsios. Stablecoin and market structure legislation still unfinished. Policy vacuum at a critical moment.

Apple will open Siri to third-party AI chatbots in iOS 27. Bloomberg reports Gemini, Claude, and others can integrate via "Extensions." Users select preferred services in Settings. Expected at WWDC June 8. First major mobile OS with a chatbot-agnostic voice assistant. If you're building MCP servers, iOS just became a potential client surface.

Wikipedia bans AI-generated content by 44-2 vote. The policy cites compounding risk: hallucinated text enters the encyclopedia, gets scraped into training data, re-enters future models. Two exceptions survive: AI-assisted copyediting and first-pass translation, both requiring human verification.

AI bots officially surpass human internet traffic for the first time. Human Security's report shows AI-driven traffic up 187% year-over-year. Traffic from AI agents like OpenClaw grew nearly 8,000% in 2025. If you're building web services, your default user is now a bot.

Skills of the Day

Pin every GitHub Action to a full commit SHA, not a version tag. Replace uses: action@v1 with uses: action@abc123def. The Trivy supply chain attack exploited tag mutation to inject malicious code into 10,000 CI/CD workflows. This takes 20 minutes and closes the exact attack vector that hit this month.
Use the orchestrator-subagent pattern with compressed summaries for multi-agent workflows. Symbolica's ARC-AGI-3 result proves it: a top-level agent that delegates to specialists returning summaries outperformed raw prompting by 100x. Keep your orchestrator context clean. Never let it interact with the environment directly.
Run SAST specifically on AI-generated code before merging. With 87% of AI PRs containing vulnerabilities (CrowdStrike), your code review is now your real product. Tools like Snyk Agent Security, Semgrep, and Harness Secure AI Coding target AI-specific vulnerability patterns. Add them to your CI pipeline today.
Try "vibe porting" for performance-critical components with good test coverage. If you have a Python/JS module with 80%+ test coverage that processes high volumes, use AI to translate it to Go or Rust. Reco saved $500K/year on $400 in tokens. The test suite is the key enabler, without it, don't attempt this.
Use Claude Code's /effort command to control token spend per interaction. Set effort to "low" for straightforward file edits, "medium" (the default sweet spot) for normal coding, and "high" for complex debugging. The "ultrathink" keyword in prompts temporarily activates maximum reasoning depth. You'll save meaningful token budget on routine tasks.
Validate RAG document sources before indexing into your vector database. The PIDP-Attack paper shows that malicious documents with hidden prompt injection payloads bypass embedding-level content filtering. Your vector database is now part of your attack surface. Implement document provenance checking at ingestion time.
Separate research/planning from implementation in Claude Code using Plan Mode. The most common Claude Code failure isn't bad code generation, it's solving the wrong problem. Research the codebase in Plan Mode first, design the approach in Plan Mode, only switch to implementation after the plan is solid. A practitioner running 668K lines validated this pattern.
Install Cline Kanban (npm i -g cline) for visual multi-agent task orchestration. If you're managing parallel agent sessions in tmux or separate terminals, this gives you dependency-aware task cards with isolated git worktrees. Each card gets its own branch. Click-to-review diffs. Works with Claude Code, Codex, and Cline.
Evaluate Bifrost as your AI provider abstraction layer instead of writing one. At 11µs added latency and 5K req/s, maximhq/bifrost provides OpenAI-compatible access to 15+ providers. Zero-config deployment via NPX or Docker. If you're writing switch statements to swap between Anthropic and OpenAI, stop and use this instead.
Add OpenTelemetry semantic conventions to your production agents for cross-framework observability. Standard attribute names now exist for tasks, actions, tool calls, and memory operations. MCP tool executions trace via corresponding instrumentation. Connect to your existing Grafana/Datadog/Honeycomb stack. If you're running agents in production without traces, you're flying blind.

How did today's issue land? Reply with what worked and what didn't. I read every response.

Follow the research: Bluesky @webdevdad · LinkedIn