THE LEAD
Google DeepMind's AI agent autonomously solved 9 of 353 open Erdős problems in combinatorics and graph theory at a cost of roughly $100–300 per problem. Demis Hassabis nonetheless moved quickly to temper the hype, arguing that the result falls short of "true invention." Meanwhile, METR's AI time horizons benchmark, the source of one of the most-cited graphs in AI capability discourse, is under fire for severe methodological errors identified by researcher Nathan Witkin, casting doubt on a foundational data source used by labs, VCs, and policymakers. Two pillars of the capability narrative shifted in 48 hours.
TOP STORIES
Google DeepMind's AI agent worked through 353 open Erdős problems in combinatorics and graph theory and resolved 9 of them, problems that had stumped mathematicians for decades, at a cost of roughly $100–300 each. The results are verifiable, peer-checkable mathematical proofs, not approximations. Demis Hassabis publicly tempered expectations, noting that the work does not constitute genuine scientific invention.
Why it matters: This is the clearest demonstration yet that frontier AI agents can produce original, publishable mathematical results for a few hundred dollars per problem. That directly challenges the "stochastic parrot" framing and signals that math research workflows are changing now, not in five years.
Researcher Nathan Witkin published a detailed critique of METR's Long Tasks benchmark, identifying severe methodological flaws in the widely circulated AI time horizons graph, which shows exponential growth over time in the length of tasks AI systems can complete. The graph has been cited extensively by AI labs, investors, and safety researchers to argue that autonomous AI agents are advancing on a predictable trajectory.
Why it matters: If the benchmark is structurally broken, every capability roadmap and investment thesis built on it needs revisiting — this affects how labs communicate progress, how VCs price risk, and how policymakers calibrate urgency.
The Financial Times reports that a tool called Heretic removes safety guardrails from Meta's Llama 3.3 in fewer than 10 minutes, and that more than 3,500 decensored model variants have been created and downloaded a combined 13 million times. The story has now reached the mainstream financial press.
Why it matters: This is the clearest evidence yet that open-weight model releases create a permanent, un-patchable jailbreak surface — the safety fine-tuning on open models is cosmetic at scale, and Meta's policy decisions have global consequences it cannot recall.
Anthropic released 31 pre-built workflow Skills for small businesses, replacing manual Zapier/Notion/CRM integrations with one-click setups inside Claude. The Skills drew 382,000 downloads on day one, which suggests latent demand well beyond developers.
Why it matters: Anthropic is moving aggressively down-market into the SMB segment that Microsoft Copilot and Google Workspace have been targeting. This is a direct distribution play, not a research announcement, and the day-one download numbers are an early sign of product-market fit.
Figure AI livestreamed its humanoid robots sorting packages for eight days straight, uninterrupted and unscripted. This is not a controlled demo environment but sustained, observable production operation.
Why it matters: Eight days of continuous operation destroys the "staged demo" objection that has followed humanoid robotics for years; Figure AI is now running what amounts to a live reliability audit, and the data it's generating on failure rates and edge cases is worth more than any benchmark.
Researchers demonstrated that inaudible ultrasonic commands embedded in YouTube videos, podcasts, or music can silently trigger unauthorized actions on AI voice assistants without any indication to the user. The attack requires no user interaction: passively consuming the media is sufficient exposure.
Why it matters: Every AI voice assistant deployed at scale — Alexa, Google Assistant, ChatGPT voice mode — is potentially vulnerable to weaponized media content, and there is no obvious patch short of redesigning how these systems process audio input.
PATTERNS
- Mathematical AI results are being stress-tested in real time: DeepMind's Erdős result, Hassabis's own pushback, and the accompanying chart of AI math progress all appeared within 48 hours, suggesting the field is actively negotiating what "solving math" means before the narrative hardens.
- Embodied AI moved from demo to operations this week: Figure AI's 8-day package-sort livestream and LimX Dynamics' Luna launch both signal that humanoid robotics companies are now competing on operational uptime, not just capability showcases.
- The open-weight safety problem is hitting mainstream press: The Heretic/Llama story moving from LocalLLaMA to the Financial Times marks an inflection where open-model jailbreaking is no longer a niche technical discussion — regulators read the FT.
SIGNAL vs NOISE
- Signal: The METR benchmark critique is underreported relative to its importance. If the Long Tasks methodology is flawed, the industry's primary quantitative tool for tracking autonomous agent progress is unreliable, which affects safety timelines, lab roadmaps, and policy arguments simultaneously. This deserves a full independent replication effort, not a Reddit thread.
- Noise: The 99%-of-CEOs-expect-AI-layoffs survey is a recycled anxiety metric with no operational specificity — CEOs have been saying this for three years, the survey design rewards agreement, and it tells you nothing about which roles, which timelines, or which companies are actually cutting headcount.
WATCH
Track whether METR responds to Witkin's methodology critique with a replication or retraction — if the benchmark holds, ignore the noise; if it doesn't, every capability claim citing it needs to be repriced.