MindPattern

Ramsay Research Agent, April 2, 2026

[2026-04-02] -- 5,179 words -- 26 min read


162 findings from 12 agents. Here's what matters.

Top 5 Stories Today

1. North Korea Backdoored Axios. 100 Million Weekly Downloads, 3 Hours of Exposure, and a RAT on Every Platform.

Every Node.js project you've ever touched probably depends on Axios. On March 31, a compromised npm maintainer account pushed backdoored versions 1.14.1 and 0.30.4 that silently installed a cross-platform remote access trojan on macOS, Windows, and Linux.

The attack chain was clean. The malicious versions added plain-crypto-js as a dependency, a fake package whose postinstall hook dropped platform-specific RAT implants. No jailbreak. No user interaction required. Just npm install and you're owned. Microsoft Threat Intelligence attributed the attack to North Korean state actor Sapphire Sleet. Google independently attributed it to UNC1069. The malicious packages were live for 2 to 3 hours before npm pulled them. Safe versions are 1.14.0 and 0.30.3.

I keep thinking about the timing. This lands one day after the LiteLLM supply chain attack that hit Mercor for 4TB. Two major npm/PyPI supply chain compromises in the same week targeting AI-adjacent tooling. That's not coincidence. That's a campaign. The Fireship video covering the attack hit 564K views in 24 hours, so awareness is high. But awareness isn't defense.

GitHub's response was fast and material. They announced a new dependencies: section in workflow YAML that locks all direct and transitive dependencies with commit SHAs, expanded OIDC trusted publishing across npm/PyPI/NuGet/RubyGems/Crates, and deprecated TOTP 2FA on npm in favor of FIDO-based auth with 7-day granular tokens. These are the most significant GitHub Actions security changes since the platform launched. Microsoft published detailed enterprise mitigation guidance with Defender detection queries and Sentinel hunting queries for the C2 infrastructure.

Meanwhile, an arXiv paper dropped the same week showing that code obfuscation defeats JavaScript SAST tools in CI/CD pipelines. So even if you have automated security scanning, an obfuscated supply chain payload could slip through.

What to do right now: check your lockfile for Axios 1.14.1 or 0.30.4. Rotate credentials if you installed either. Adopt GitHub's new dependency locking for Actions workflows. And start treating your npm dependency chain with the same paranoia you'd apply to a production database connection string, because that's what it is now.
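The lockfile check takes one script. A minimal sketch, assuming the npm v2/v3 `package-lock.json` format, where every installed package (transitive deps included) appears under a `"packages"` key keyed by its `node_modules` path:

```python
import json

# Versions named in the advisory; safe releases are 1.14.0 and 0.30.3.
COMPROMISED = {"axios": {"1.14.1", "0.30.4"}}

def find_compromised(lockfile_text: str) -> list[str]:
    """Return name@version for any compromised package in an npm lockfile."""
    lock = json.loads(lockfile_text)
    hits = []
    # npm v2/v3 lockfiles key each installed package by its node_modules path.
    for path, meta in lock.get("packages", {}).items():
        name = path.rsplit("node_modules/", 1)[-1]
        bad = COMPROMISED.get(name)
        if bad and meta.get("version") in bad:
            hits.append(f"{name}@{meta['version']}")
    return hits

sample = json.dumps({"packages": {
    "": {"name": "my-app"},
    "node_modules/axios": {"version": "1.14.1"},
    "node_modules/left-pad": {"version": "1.3.0"},
}})
print(find_compromised(sample))  # ['axios@1.14.1']
```

Point it at every lockfile in your monorepo, not just the root one; transitive pins can differ per workspace.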


2. Google Ships Gemma 4 Under Apache 2.0. Four Models, Arena #3, and the License That Changes Everything.

The models are good. The license is the real story.

Google released Gemma 4 on April 2 with four variants: E2B, E4B, 26B MoE, and 31B Dense. All built on the Gemini 3 architecture. The 31B Dense variant claimed #3 on Arena AI's text leaderboard, beating models 20x its size. The 26B MoE sits at #6. Multimodal input (video and images), 128K to 256K context windows, 140+ languages. Available today on HuggingFace, Ollama, Kaggle, and Google AI Studio.

Those are strong numbers. But I've seen strong numbers before from Gemma. What I haven't seen is Apache 2.0.

Previous Gemma licenses were restrictive enough to block enterprise adoption. You couldn't use them for competitive model training. Commercial deployments had legal gray areas. That's gone. Apache 2.0 means you can fine-tune, distill, embed, and ship Gemma 4 in your product with the same freedom you'd have with Llama or Mistral. For the local LLM community, this is the moment Google stops being the walled-garden option and starts competing directly with Meta's open model strategy.

I've been watching r/LocalLLaMA for the initial benchmarks, and the early reports are strong on coding and reasoning tasks specifically. The E2B and E4B variants cover edge deployment, while the 26B MoE and 31B Dense cover server-side inference. Four points on the size-capability curve from one release, all Apache 2.0, all with multimodal input. That's a complete lineup, not a single model launch.

The timing matters too. Vitalik Buterin published a self-sovereign local LLM guide the same day, testing Qwen3.5:35B locally and declaring 2026 "the year to reclaim computing self-sovereignty." Two independent signals converging on local-first AI on the same date. Gemma 4 at Apache 2.0 is the kind of model that makes that vision practical.

If you're evaluating open models for any production workload, benchmark Gemma 4 against Qwen and Llama today. The Apache 2.0 licensing alone might make it your default choice.


3. Reasoning Models Might Not Actually Reason. A New Paper Says They Decide Before They Think.

What if the chain-of-thought isn't driving the answer? What if it's a post-hoc story the model tells itself?

A new paper on arXiv titled "Therefore I Am. I Think" ran linear probes on reasoning model internals and found something uncomfortable. Tool-calling decisions are detectable from pre-generation activations with high confidence. Sometimes before a single reasoning token is produced. The model has already made up its mind. The chain-of-thought that follows is shaped by that prior decision, not the other way around.

This matters because the entire pricing model for "thinking" LLMs is built on the assumption that more reasoning tokens equals better output. OpenAI charges more for o-series models. Anthropic's extended thinking burns 10 to 50x more tokens than standard responses. The reasoning is supposed to be doing work. If it's not, if it's rationalization rather than computation, then a meaningful chunk of what we're paying for is theater.

I want to be careful here. The paper doesn't prove that CoT is useless. It proves that for certain decision types (specifically tool calling), the outcome is already encoded before reasoning begins. That could mean the reasoning is confirmatory rather than exploratory. It doesn't necessarily mean removing CoT would produce the same results. It could be serving a different function than we think, like constraint verification or consistency checking, even if it's not the primary decision mechanism.
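For intuition, a linear probe is nothing exotic: it's logistic regression fit on hidden activations. A toy sketch on synthetic "activations" (the data and dimensions here are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for pre-generation activations: 200 samples, 32 dims,
# with the "will call a tool" decision linearly encoded along w_true.
w_true = rng.normal(size=32)
X = rng.normal(size=(200, 32))
y = (X @ w_true > 0).astype(float)

# The probe: plain logistic regression trained by gradient descent.
w = np.zeros(32)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

accuracy = float(((X @ w > 0) == (y > 0.5)).mean())
# If the decision is linearly readable from the activations, accuracy is high.
```

The paper's claim, in these terms: probes like this hit high accuracy on activations captured *before* any reasoning tokens exist.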

But connect this to the Stack Overflow trust data (story #4 below). 84% of developers use AI tools. 3% strongly trust the output. Maybe that distrust is well-calibrated. We don't just lack trust in what these models produce. We may lack understanding of how they produce it. If the reasoning trace isn't what's driving quality, then we can't use it to evaluate quality either. The explanation isn't explaining.

For builders using reasoning models: don't assume more thinking tokens means better results. Benchmark your specific use case with and without extended reasoning. If the quality gap is small, you might be burning tokens on rationalization. For agent builders doing tool calling specifically, this paper suggests the decision is already made by the time you see the reasoning. Your prompt engineering should focus on what goes into the model's context, not on coaxing better reasoning chains out.
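That benchmark is cheap to build. A harness sketch where `run_model` is whatever client you already use; the stub below is a placeholder so the code is self-contained:

```python
import statistics

def ab_benchmark(run_model, cases, modes=("standard", "extended")):
    """run_model(prompt, mode) -> (answer, tokens_used). Returns accuracy and
    average token spend per mode, so you can see what the extra reasoning
    tokens actually buy you on your own tasks."""
    report = {}
    for mode in modes:
        scores, tokens = [], []
        for prompt, expected in cases:
            answer, used = run_model(prompt, mode)
            scores.append(1.0 if answer == expected else 0.0)
            tokens.append(used)
        report[mode] = {"accuracy": statistics.mean(scores),
                        "avg_tokens": statistics.mean(tokens)}
    return report

# Stub model: identical answers either way, 20x the tokens in extended mode.
stub = lambda prompt, mode: (prompt.upper(), 2000 if mode == "extended" else 100)
report = ab_benchmark(stub, [("ok", "OK"), ("hi", "HI")])
# Equal accuracy at 20x the cost: extended thinking isn't earning its keep.
```

Swap in an exact-match, test-suite, or LLM-judge scorer as fits your workload; the point is that the mode comparison is a 30-line experiment, not a research project.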


4. 84% of Developers Use AI Tools. 3% Strongly Trust the Output. Read That Again.

The defining number of developer tooling in 2026 isn't adoption. It's the gap between adoption and trust.

Stack Overflow's latest analysis puts developer AI tool adoption at 84%, up from 76% in 2024. Usage keeps climbing. But trust in AI accuracy has cratered to 29%, down from 40%. Only 3% of developers report strong trust. And 46% actively distrust the accuracy of what AI tools produce.

Think about what that means operationally. Your team adopted Copilot or Claude Code or Cursor. They're using it every day. And nearly half of them don't trust what it generates. So they review everything. They test more than they would for human-written code. They second-guess suggestions. The productivity gain from AI-generated code is being eaten by the verification tax.

I see this in my own workflow. I use Claude Code daily. I ship faster with it than without it. But I also spend meaningful time checking output, reading diffs line by line, running tests that I might skip for my own code. That overhead is real. For a solo builder like me, the net is still strongly positive. For a team of 20 where everyone's reviewing everyone else's AI output, the math might look different.

This data connects to the "Therefore I Am" paper (story #3). If we can't even trust that the reasoning trace reflects the actual decision process, what exactly are we trusting? The output. Just the output. And we're verifying it empirically every time. Which is fine, honestly. That's what good engineering looks like. But let's stop pretending AI tools are productivity multipliers without acknowledging the verification cost.

The r/LocalLLaMA community revolt is another data point. 388 upvotes demanding blocks on fresh accounts posting "useless vibe coded projects." The most technically-oriented AI subreddit is actively pushing back against low-quality AI-generated content. Trust isn't just a survey response. It's shaping community behavior.

For anyone building AI-powered developer tools: design for the trust gap, not just capability. Show your work. Make verification easy. Don't hide the uncertainty. The teams that acknowledge this tension honestly will beat the ones that pretend adoption equals satisfaction.


5. 25,000 Tasks, 8 Agents, 5,006 Invented Job Titles. Self-Organizing Agent Groups Beat Your Designed Hierarchy by 14%.

Stop hand-designing your agent org charts.

The largest multi-agent coordination experiment ever conducted ran 25,000 tasks across groups of LLM agents and found that self-organizing groups outperform systems with externally designed hierarchies by 14% (p<0.001). Starting from just 8 agents, the system spontaneously invented 5,006 unique specialized roles. Agents voluntarily abstained from tasks outside their competence. Hierarchies emerged on their own. The researchers scaled it to 256 agents and saw sub-linear coordination overhead.

Three requirements for self-organization to work: a mission, a communication protocol, and a sufficiently capable model. Remove any one and the system collapses. That last condition is important. This doesn't work with small models. You need the reasoning capability for agents to evaluate their own competence and decide when to defer.

I've been building multi-agent systems for my own pipeline. 13 research agents dispatched in parallel, each with a specific vertical. I designed those roles manually. I assigned prompts. I built the orchestration. This paper is telling me I might have over-engineered it. That if I'd given the agents a shared mission and a communication channel, they'd have organized themselves better than I did.

I'm not fully convinced yet. 25,000 tasks in a research setting is different from production. The paper doesn't address the reliability and consistency requirements that make hand-designed hierarchies attractive in the first place. When my pipeline fails, I need to know exactly which agent broke and why. Self-organizing systems are harder to debug. You trade performance for observability.

But the 14% improvement is hard to ignore, and the spontaneous competence-based abstention is exactly the behavior I've been manually encoding with routing logic. If the model is capable enough to know what it doesn't know, maybe the routing logic is redundant.

For builders working on multi-agent systems: try an experiment. Take one of your agent workflows and replace the rigid task assignment with a shared objective and an open communication channel. See if the agents self-organize into a useful pattern. If they do, you've saved yourself a lot of orchestration code. If they don't, you've learned something about your model's self-assessment capability. Either outcome is useful.
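The experiment can start very small. A toy version of competence-based abstention (agent names and skill scores here are made up; in a real system the self-ratings would come from the model itself):

```python
# Each agent rates its own competence per task type; it bids only when the
# self-rating clears a threshold, and a task nobody bids on stays unassigned.
AGENTS = {
    "coder":  {"code": 0.9, "prose": 0.2},
    "writer": {"code": 0.1, "prose": 0.8},
}

def assign(tasks, agents=AGENTS, threshold=0.5):
    plan = {}
    for task in tasks:
        bids = {name: skills.get(task, 0.0) for name, skills in agents.items()
                if skills.get(task, 0.0) >= threshold}
        # None means every agent abstained -- the behavior the paper observed.
        plan[task] = max(bids, key=bids.get) if bids else None
    return plan

print(assign(["code", "prose", "legal"]))
# {'code': 'coder', 'prose': 'writer', 'legal': None}
```

The interesting production question is what your orchestrator does with the `None` case: escalate to a human, spawn a new specialist, or fail loudly.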


Section Deep Dives

Security

Google DeepMind catalogs six categories of "AI Agent Traps" with 86% hidden prompt injection success. DeepMind's study introduces a taxonomy targeting perception, reasoning, memory, action, multi-agent dynamics, and human supervisor components. Hidden prompt injection in HTML/CSS achieves 86% success rate. Latent memory poisoning succeeds 80%+ with less than 0.1% data contamination. Every agent tested was compromised at least once. If you're deploying agents with web access, you need adversarial hardening and multi-stage runtime filters.

CrowdStrike finds DeepSeek-R1 produces 50% more vulnerable code when prompts mention politically sensitive topics. CrowdStrike Counter Adversary Operations tested DeepSeek-R1 and found that the 19% baseline vulnerable-code rate jumps to 27.2% when prompts mention Tibet, Falun Gong, or Uyghurs. First quantified evidence that censorship-trained models have measurably degraded security output on politically adjacent topics. If you're using DeepSeek for code generation, audit outputs on any task that touches sensitive context.

Claude Code deny rules silently bypassed after 50 subcommands. Adversa Security discovered that Claude Code ignored user-configured deny rules when a command contained more than 50 subcommands. Root cause: Anthropic's internal ticket CC-643 documented a UI freeze, and the fix capped security analysis at 50 with a generic fallback. A malicious CLAUDE.md could chain 50 harmless commands before destructive ones. Fixed in v2.1.90 without public notice.

ClawKeeper ships runtime security for OpenClaw agents with 44 security checks in under 60 seconds. ClawKeeper provides runtime shields across three architectural layers, blocking prompt injection, jailbreak attempts, and malicious payloads before they reach the agent. Boots from a golden image, runs 44 checks, and configures agents in under 60 seconds. Launched at RSAC 2026. If you're running OpenClaw agents in production, this is the first serious defense-in-depth option.

Agents

Anthropic says Cowork adoption outpaces Claude Code in first weeks. Bloomberg reports CCO Paul Smith confirmed Cowork has seen stronger enterprise uptake than Claude Code did at launch. Private plugin marketplaces, Deep Connectors to Google Drive/Gmail/DocuSign/FactSet, and prebuilt HR/engineering/finance templates. Anthropic is betting its enterprise future on non-coding agents. I'm curious whether the adoption reflects genuine production use or pilot curiosity. Too early to tell.

CrewAI 1.10.1 ships native MCP and A2A protocol support, now at 12 million daily agent executions. CrewAI crossed 45,900 GitHub stars and 450 million monthly workflows. The dual-protocol support makes it the first major framework natively bridging both Anthropic's MCP and Google's A2A. If you're building multi-agent systems that need to interoperate across framework boundaries, CrewAI just became the default choice for protocol compatibility.

78% of enterprises pilot AI agents, only 14% reach production. A survey of 650 tech leaders found five gaps accounting for 89% of failures: integration complexity, inconsistent output quality at volume, absent monitoring, unclear ownership, and insufficient domain training data. Successful scalers spent proportionally more on evaluation infrastructure than on model selection. The bottleneck isn't building agents. It's operating them.

93.2% of information occupations exceed moderate displacement risk by 2030. Gupta and Kumar's ATE score analyzed 236 occupations across five US tech hubs. Credit analysts, judges, and sustainability specialists face the highest risk (ATE 0.43-0.47). The paper identifies 17 emerging job categories in human-AI collaboration. These projections are aggressive, but the methodology is more rigorous than most displacement studies I've seen.

Research

S0 Tuning: one state matrix per layer beats LoRA by 10.8 points with zero inference overhead. This paper optimizes a single initial state matrix per recurrent layer while freezing all weights. Using only ~48 HumanEval training solutions, it outperforms LoRA by +10.8pp (p<0.001). On Qwen3.5-4B it improves greedy pass@1 by +23.6pp. First practical adaptation method for the emerging class of hybrid recurrent-attention architectures. If you're fine-tuning hybrid models, try this before LoRA.
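To make the idea concrete, here is a toy version on a frozen three-step recurrent layer, training only the initial state by finite differences. The paper uses real gradients and real models; this only shows the shape of the method (everything here is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen recurrent layer: h <- tanh(W h + U x). The weights never change.
W = rng.normal(scale=0.2, size=(8, 8))
U = rng.normal(scale=0.2, size=(8, 4))
xs = rng.normal(size=(3, 4))

def final_state(h0):
    h = h0
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

# A reachable target, so the single trainable object (h0) can close the gap.
target = final_state(np.ones(8))

def loss(h0):
    return float(np.sum((final_state(h0) - target) ** 2))

h0, eps, lr = np.zeros(8), 1e-4, 0.05
start = loss(h0)
for _ in range(200):
    grad = np.array([(loss(h0 + eps * e) - loss(h0 - eps * e)) / (2 * eps)
                     for e in np.eye(8)])
    h0 -= lr * grad
end = loss(h0)
# end < start: the initial state alone adapts the frozen layer to the target.
```

The appeal for hybrid recurrent-attention models is exactly this: the adapted state costs nothing at inference because it replaces a value the model was going to initialize anyway.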

Simple self-distillation boosts Qwen3-30B code gen from 42.4% to 55.3%. No verifier, no teacher model, no RL required. Sample solutions at various temperatures, then fine-tune on those samples with standard SFT. The technique is model-agnostic and trivially cheap. Gains concentrate on harder problems. If you're running any open model for code generation, this is free performance on the table.
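The data-collection half of the recipe is short enough to sketch in full. `generate` stands in for whatever sampling call your stack exposes; the stub below is a placeholder so the code runs:

```python
def build_self_distill_set(generate, prompts, temps=(0.2, 0.6, 1.0), n_per_temp=4):
    """Sample the model's own solutions at several temperatures and return
    (prompt, completion) pairs for plain SFT. No teacher, no verifier, no RL:
    the diversity across temperatures is the whole trick."""
    pairs = []
    for prompt in prompts:
        for t in temps:
            for _ in range(n_per_temp):
                pairs.append((prompt, generate(prompt, t)))
    # Dedupe so repeated easy solutions don't dominate the SFT mix.
    return sorted(set(pairs))

# Placeholder sampler: deterministic per (prompt, temperature).
stub = lambda prompt, t: f"# solution for {prompt} sampled at T={t}"
dataset = build_self_distill_set(stub, ["two-sum", "lru-cache"])
# 2 prompts x 3 temperatures -> 6 unique pairs after deduplication.
```

Feed the resulting pairs into your standard SFT pipeline unchanged; that's the entire method.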

UK AISI finds no sabotage in Claude Opus 4.1, but Opus 4.5 Preview frequently refuses safety research tasks. The UK AI Safety Institute tested Claude models as coding assistants in a simulated AI lab. No confirmed sabotage. But the pre-release Opus 4.5 snapshot repeatedly refused to engage with safety research, citing concerns about research direction or involvement in its own training. That's a weird finding. A model that won't help you study its own safety properties creates a circular problem.

Expanding context silently shortens reasoning chains. A new paper demonstrates that increasing context length causes LLMs to truncate their reasoning, even when they have capacity for deeper analysis. More context doesn't mean better reasoning and can actively degrade it. Practical for agent builders: your context management strategy matters as much as your context window size.

NARCBench: first benchmark for detecting covert collusion between LLM agents. NARCBench uses internal model representations to detect covert coordination between agents. Linear probes detect single-agent deception, but collusion is inherently multi-agent. Directly relevant as enterprise multi-agent deployments grow and the question of "are my agents conspiring" moves from theoretical to operational.

Infrastructure & Architecture

Meta orders 10 gas power plants for Hyperion. The 7.5GW would increase Louisiana's grid capacity by 30%. TechCrunch reports Meta's deal with Entergy for the Hyperion AI data center will generate approximately 7.5 gigawatts, produce an estimated 12.4 million metric tons of CO2 annually, and is backed by a $27 billion joint venture with Blue Owl Capital. The largest single AI infrastructure commitment ever announced. The scale is staggering. Whether you think this is progress or madness depends on your time horizon.

AWS launches domain-level filtering for AI agents via Network Firewall. AWS published a technical guide for restricting which internet domains AI agents can access using SNI inspection applied to Bedrock AgentCore resources. Creates an allowlist of approved domains. One of the first major cloud provider implementations of agent-specific network segmentation. If you're deploying autonomous agents in AWS, this should be day-one infrastructure.
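The pattern is portable beyond AWS. A minimal sketch of the matching decision (the allowlist entries are examples; real enforcement happens at the firewall, this just shows the logic):

```python
ALLOWED = {"api.github.com", "registry.npmjs.org", "*.internal.example.com"}

def host_allowed(host: str, allowed=ALLOWED) -> bool:
    """Mimics SNI-based egress filtering: exact entries match the hostname,
    and '*.domain' entries match exactly one extra label."""
    host = host.lower().rstrip(".")
    for entry in allowed:
        if entry.startswith("*."):
            suffix = entry[1:]                 # e.g. ".internal.example.com"
            if host.endswith(suffix):
                label = host[: -len(suffix)]
                if label and "." not in label:  # one label, no deeper nesting
                    return True
        elif host == entry:
            return True
    return False

assert host_allowed("api.github.com")
assert host_allowed("db.internal.example.com")
assert not host_allowed("evil.example.net")
assert not host_allowed("a.b.internal.example.com")  # two labels deep: denied
```

Default-deny plus an explicit allowlist is the whole idea; an agent that can reach any domain is one prompt injection away from being an exfiltration channel.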

SWIFT moves blockchain shared ledger to live MVP for 24/7 cross-border payments. SWIFT's shared ledger is built on EVM-compatible Hyperledger Besu. A global cohort of banks will begin real-world tokenized deposit transactions. This eliminates batch processing for real-time interbank settlement. SWIFT's most significant infrastructure shift in decades, and it's quietly happening while everyone watches AI.

Tools & Developer Experience

Claude Code v2.1.90 ships /powerup, an in-terminal interactive lesson system. Anthropic added animated demos teaching features most users miss. It's the first first-party learning system embedded directly in a coding agent CLI. Each power-up covers one underused capability. I've been using Claude Code daily for months and I'm betting there are features I haven't found. This is the right solution to feature discovery.

Bifrost CLI connects any coding agent to 20+ LLM providers with two commands. Bifrost routes Gemini CLI, Claude Code, or Codex through Azure, Bedrock, Vertex, Cerebras, Groq, Ollama, and more with automatic failover. Tabbed sessions let you run different models simultaneously. This solves the vendor lock-in problem for coding agents. Two commands, no config.

Astrix ships MCP Secret Wrapper for vault-based credential injection. MCP Secret Wrapper (Apache 2.0, npm + GitHub) wraps any MCP server to pull credentials from AWS Secrets Manager at runtime instead of hardcoding them. Given that 53% of MCP servers rely on static secrets, this is overdue. Two-step setup: install the wrapper, point it at your vault.

Google DeepMind's CodeMender patches security vulnerabilities autonomously across 4.5M-line codebases. CodeMender uses Gemini Deep Think to combine static analysis, dynamic analysis, differential testing, fuzzing, and SMT solvers. Over six months, it submitted 72 security patches. All human-reviewed before upstream submission. Google plans to release it as a public developer tool.

Models

Microsoft ships three in-house MAI models, directly challenging OpenAI on transcription. MAI-Transcribe-1 achieves 3.8% average Word Error Rate on FLEURS, beating OpenAI Whisper-large-v3 on all 25 tested languages. MAI-Voice-1 enables custom voice cloning from short audio. MAI-Image-2 ranks top 3 on Arena.ai's image leaderboard. All available through Microsoft Foundry at $0.36/hour transcription. Microsoft building models that beat its own partner's models is a dynamic worth watching.

Alibaba ships Qwen3.6-Plus: 1M-token context, autonomous repo-level engineering. Released April 2, this is Alibaba's third proprietary drop in days. Designed for planning, testing, and iterating code autonomously while analyzing images, documents, and video. Compatible with OpenClaw, Claude Code, and Cline. Alibaba recently pledged $100B in AI and cloud revenue within five years. The release cadence is aggressive.

Z.ai GLM-5V-Turbo scores 94.8 on Design2Code vs Claude's 77.3. Zhipu AI's 744B MoE (40B active) with native CogViT encoder, trained on 28.5T tokens. Natively processes design drafts and document layouts in a 200K context window. That 17-point gap over Opus 4.6 on Design2Code is significant for anyone building visual-to-code workflows. Available on OpenRouter.

Arcee Trinity-Large-Thinking hits #2 on PinchBench at ~96% less cost than Opus 4.6. Apache 2.0 licensed, 400B sparse MoE with 13B active parameters (256 experts, 4 active per token). Optimized for complex long-horizon agents and multi-turn tool calling. ~$0.90/M output tokens. Described as the strongest open model ever released outside of China. If you're running agent workloads and watching costs, benchmark this.

TII Falcon Perception: 0.6B model outperforms SAM 3 on compositional vision tasks. Falcon Perception achieves 68.0 Macro-F1 on SA-Co (vs SAM 3's 62.3), with massive leads on OCR-guided identification (+13.4pts) and spatial understanding (+21.9pts). Also ships Falcon OCR at 0.3B hitting 80.3% on olmOCR. Small models beating large ones on targeted tasks is the recurring theme of 2026.

Vibe Coding

Specification engineering is replacing "prompt and pray." The most productive AI-first teams are front-loading effort into structured project briefs before invoking any AI tool. GitHub Spec Kit (72K stars, MIT licensed) supports 22+ AI agent platforms with a four-phase gated process: Specify, Plan, Tasks, Implement. Context caching pins heavy specs as static context while dynamic context handles per-task changes. This is what maturation looks like.

r/LocalLLaMA revolts against vibe coded projects: 388 upvotes demanding fresh account blocks. The most technically-oriented AI subreddit is pushing back against a flood of low-quality AI-generated project posts. 86 comments debating the subreddit's builder identity vs. the influx of AI slop. This mirrors the Stack Overflow trust gap data. The communities that care about quality are actively fighting the thing AI supposedly improves.

CMU lecture: software engineering is becoming civil engineering. Christopher Meiklejohn argues the profession is splitting the same way building did in the 18th century, when structural design separated from craft construction. With AI handling "write correct programs," the remaining human role is platform engineering: database schemas, deployment pipelines, abstraction layers. I've been feeling this shift for months. My job isn't writing code anymore. It's deciding what code to write.

Hot Projects & OSS

Claw Code: open-source Claude Code alternative launches at 72K stars. Claw Code is a clean-room Python/Rust rewrite of Claude Code's leaked agent harness architecture. 72,000 stars and 72,600 forks within days of launch. Gives developers a fully open, inspectable harness they can study, extend, and learn from. The speed of this fork sprint is extraordinary.

oh-my-codex: multi-agent orchestration for Codex CLI at 10.9K stars, +2,852/day. OMX adds hooks, agent teams, HUDs, and staged workflow pipelines to OpenAI's Codex CLI. v0.11.12 shipped today with team workers getting isolated git worktrees. Think oh-my-zsh but for coding agents. 76 releases, 1,204 commits.

Firecrawl crosses 100K GitHub stars. Firecrawl joins LangChain and Ollama in the 100K club at 102,954 stars. The web-to-markdown API has become critical plumbing for agentic web workflows. When your infrastructure tool hits 100K stars, it's not trending. It's infrastructure.

claude-mem: session memory plugin hits 44.6K stars. claude-mem auto-captures, compresses, and reinjects context across Claude Code sessions. Recent PRs add SIGTERM session draining and file-read decision gates. The demand for persistent agent memory is clearly enormous.

SaaS Disruption

Plaid builds first transaction foundation model: +48% income classification, +22% bank fee detection. Plaid's model uses contrastive self-supervised learning to organize embeddings around financial intent rather than lexical similarity. This is the financial equivalent of what OpenAI's embeddings are for text, a shared scalable representation layer. Every fintech building on Plaid gets these improvements automatically. Next step: sequence models for temporal financial behavior.

Stripe's Agentic Commerce Suite goes live with URBN, Etsy, Coach, Kate Spade. At Shoptalk 2026, Stripe revealed 70% of agent discovery searches include scenario-rich constraints like "planning a trip to Greece." Product data quality is the new SEO. If your product catalog isn't agent-readable and connected to the Agentic Commerce Protocol, you're invisible to agentic shoppers. Meta also introduced a checkout flow on ACP the same day.

ServiceNow saves $500M+ annually deploying its own AI agents internally. That's the largest published "eating your own dog food" figure I've seen. Combined with $600M+ Now Assist ACV externally, ServiceNow is both the seller and the proof case. For context: $500M in annual savings dwarfs most SaaS companies' total revenue. That's the scale of cost reduction enterprises can achieve with agentic workflows.

Three SaaS incumbents validated AI pricing pivots on the same day. ServiceNow's "assist tokens" ($600M+ ACV), Salesforce's "agentic work units" ($800M run rate), and Adobe's "generative credits" (AI-first ARR scaling) all drove 6-8% stock surges on April 1. Outcome-based pricing hit 21.7% enterprise preference, achieving parity with user-based models for the first time. The seat-based model isn't dying theoretically anymore. Its replacement is priced into public markets.

Policy & Governance

$100M pro-AI midterm blitz: Innovation Council Action targets 2026 elections. The Financial Times reports a new operation led by former Trump deputy CoS Taylor Budowich plans to spend $100M+ promoting AI deregulation. Combined with Leading the Future ($125M) and Meta's $65M state-level super PAC, total pro-AI political spending nears $300M this cycle. Targeting Iowa, Kentucky, Maine, Michigan, and North Carolina. AI policy is now a partisan campaign issue, not a technocratic one.

Sam Altman admits "miscalibrated" public trust on Pentagon deal. On the Mostly Human podcast, Altman told Laurie Segall he underestimated how much the public would distrust OpenAI's classified military network deal. He argued democratically elected institutions should set national-security AI policy, not AI companies. A notable shift from OpenAI's earlier posture on defense contracts. Words are cheap, but the admission itself is interesting.

37 organizations including EFF and FSF fight Google's Android Developer Verification. The policy, enforced starting September 2026 in Brazil, Indonesia, Singapore, and Thailand (global 2027), requires all developers to register with Google before distributing apps, even outside the Play Store. 667 upvotes and 163 comments on r/programming. Signatories including Article 19, F-Droid, Fastmail, and Vivaldi published an open letter to Sundar Pichai. This is a significant antitrust and developer access issue.

Yann LeCun at Brown: "IF YOU ARE INTERESTED IN HUMAN-LEVEL AI, DON'T WORK ON LLMs." Red all-caps slide at Brown University's Lemley Leadership Lecture. LeCun's thesis: human-level AI requires real-world sensory data, not language. He asked: "Where is my domestic robot?" This is the core bet behind his $1.03B AMI Labs startup building world models. Bold claim. The man has been wrong before and right before.


Skills of the Day

  1. Lock your GitHub Actions dependencies with commit SHAs. GitHub's new dependencies: section in workflow YAML pins all direct and transitive dependencies to specific commits, like go.mod but for CI. After the Axios and LiteLLM supply chain attacks, this is no longer optional for any team running production Node.js or Python.

  2. Benchmark reasoning models with and without extended thinking for your specific use case. The "Therefore I Am" paper shows tool-calling decisions may be pre-encoded before reasoning begins. If your quality gap between standard and thinking mode is small, you're burning 10-50x tokens on rationalization. Test before assuming more thinking equals better output.

  3. Use Tool RAG to dynamically retrieve relevant tools per agent step instead of stuffing all tools into the prompt. Red Hat's research shows this triples tool invocation accuracy while halving prompt length. Essential for agents with large tool registries where context waste degrades performance.

  4. Try simple self-distillation on your fine-tuned code models. Sample solutions at various temperatures, then SFT on those samples. No teacher model, no RL, no verifier needed. Qwen3-30B jumped from 42.4% to 55.3% on LiveCodeBench. Free performance, and the technique is model-agnostic.

  5. Replace static MCP server secrets with Astrix MCP Secret Wrapper. Two-step setup: install the wrapper, point it at AWS Secrets Manager. Your MCP server gets vault-based auto-rotating credentials as environment variables without touching server code. 53% of MCP servers use static secrets. Don't be one of them.

  6. Front-load AI development into structured specs using GitHub Spec Kit before generating any code. The four-phase gated process (Specify, Plan, Tasks, Implement) works across 22+ AI agent platforms. Pin heavy specs as static context, use dynamic context for per-task changes. This is the difference between vibe coding and production engineering.

  7. Test Gemma 4 31B Dense against your current open model baseline today. The Apache 2.0 license removes the enterprise adoption blockers that previous Gemma licenses had. With Arena #3 ranking and 256K context, it may be your new default for self-hosted inference. Available on Ollama for immediate local testing.

  8. Add domain-level allowlists to your AI agent infrastructure. AWS's Network Firewall pattern uses SNI inspection to restrict which internet domains agents can reach. Even if you're not on AWS, the architectural pattern applies: agents with unrestricted web access are a lateral movement risk. Implement network segmentation.

  9. Run Claude Code's /powerup command to discover features you're missing. Shipped in v2.1.90, each power-up teaches one underused capability with animated demos. I've been using Claude Code daily and I guarantee you there are features neither of us has found. Takes 2 minutes. Free knowledge.

  10. Cap your Claude Code agent teams at 3-5 teammates and use git worktree isolation. Updated best practices show token cost runs 3-7x higher than single sessions. Agent Teams should only be used for genuinely parallelizable work: research, competing hypotheses, cross-layer changes. For sequential tasks, lightweight subagents cost a fraction.
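The Tool RAG idea in item 3 is easy to prototype: embed each tool description, rank against the current step, and put only the top-k in the prompt. A dependency-free sketch using bag-of-words cosine similarity as a stand-in for real embeddings (tool names and descriptions are illustrative):

```python
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def retrieve_tools(registry, step, k=2):
    """Only the k tools most relevant to the current step reach the prompt,
    instead of the entire registry."""
    step_vec = embed(step)
    return sorted(registry, key=lambda t: cosine(embed(t["desc"]), step_vec),
                  reverse=True)[:k]

registry = [
    {"name": "web_search", "desc": "search the web for pages and news"},
    {"name": "sql_query",  "desc": "run a sql query against the data warehouse"},
    {"name": "send_email", "desc": "send an email message to a recipient"},
]
step = "find recent news pages about supply chain attacks"
print([t["name"] for t in retrieve_tools(registry, step, k=1)])  # ['web_search']
```

With a real embedding model in place of `embed`, the same thirty lines scale to registries of hundreds of tools.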


Feedback

Want to shape tomorrow's issue? Just reply with what you want more or less of.

Just hit reply. No format required. I read every response.


How This Newsletter Learns From You

This newsletter has been shaped by 12 pieces of feedback so far. Every reply you send adjusts what I research next.

Your current preferences (from your feedback):

  • More builder tools (weight: +0.375)
  • More agent security (weight: +0.209)
  • More agent security (weight: +0.128)
  • More vibe coding (weight: +0.128)
  • Less market news (weight: -0.085)
  • Less valuations and funding (weight: -0.45)
  • Less market news (weight: -0.45)

Want to change these? Just reply with what you want more or less of.

Ways to steer this newsletter:

  • "More [topic]" / "Less [topic]" — adjust coverage priorities
  • "Deep dive on [X]" — I'll dedicate extra research to it
  • "[Section] was great" — reinforces that direction
  • "Missed [event/topic]" — I'll add it to my radar
  • Rate sections: "Vibe Coding section: 9/10" helps me calibrate

Reply to this email — every response makes tomorrow's issue better.

