Executable World Models for ARC-AGI-3 in the Era of Coding Agents
Coding agent with executable Python world models, verification, and simplicity-bias refactoring solves 25 public ARC-AGI-3 games without task-specific logic.
Also, rate limits will double for Pro and Max users of tools like Claude Code.
Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Po...
Anthropic launches 10 finance-focused AI agents via Claude Cowork and Managed Agents for KYC screening, pitchbook generation, and month-end close workflows.
SAP plans to buy German AI startup Prior Labs and invest heavily in it. It is also restricting customers' agent use to a select few, such as Nvidia's NemoClaw.
OpenSeeker-v2: SFT on informative trajectories achieves frontier LLM search agent capabilities without full RL pipeline.
MOSAIC-Bench evaluates coding agents' vulnerability to multi-stage attack chains that decompose malicious goals into innocuous sequential tasks, exposing alignment gaps in deployed systems.
Anthropic releases ten Cowork and Claude Code plugins plus Microsoft 365 integrations and MCP app for financial services.
The automotive cockpit is undergoing a fundamental shift from rule-based interfaces to agentic, multimodal AI systems capable of reasoning, planning, and acting. In most vehicles on the road today, in-vehicle assistants still rely on fixed command-response patterns: interpret a phrase, trigger an action, reset. While effective for well-defined tasks, this approach doesn’t scale to modern…
Argues frontier AI failures in open-ended tasks (scientific assistance, agents, personalization) stem from objective ambiguity rather than capability gaps; proposes contextual multi-objective optimization.
Generative AI’s explosive first chapter was defined by humans sending requests and models responding. The agentic chapter is different. Agents don’t follow a pre-determined sequence of actions. They call tools, spawn sub-agents with different tasks and models, retain information in memory, manage their own context window, and decide for themselves when they’re finished. In doing so…
ProgramBench: 200-task evaluation showing agents struggle to rebuild large binaries from scratch without cheating vulnerabilities.
The Seattle-based startup's Series A round was led by Glilot Capital, NFX and SignalFire, TechCrunch has exclusively learned.
OpenAI and PwC partner to deploy AI agents for enterprise finance automation, forecasting, and CFO workflow modernization.
FlexSQL agent flexibly explores schemas and data during text-to-SQL generation, enabling recovery from early mistakes.
Opinion piece argues LLM agents require jointly formulated actions and plans with human actors rather than isolated architectures.
ORPilot: open-source agentic system translating ambiguous business problems into solver-ready optimization models with conversational and data collection agents.
Zero-trust authorization framework for LLM agents with hybrid inspection and task-based access control to mitigate tool-use and resource-access risks.
Novel framework for bandwidth-efficient remote control via minimal information transmission between controller and agents in continuous action spaces.
DataEvolver implements closed-loop agent-driven visual data generation and refinement for image editing, supporting masks, depth, poses, and trajectory artifacts.
Momentum integrates runtime procedural content generation and autonomous agent evaluation in endless-runner gameplay to assess generated terrain balance and solvability.
Iterated negotiation benchmark tests LLM agents' ability to repair grounding failures in dynamic multi-turn interaction.
GRAVITY module injects relational, temporal, and thematic structure into conversational memory retrieval for long-horizon agents.
AutoMat benchmark evaluates LLM agents on reproducing computational materials science findings, requiring domain knowledge and result interpretation beyond code quality.
MemCoE: cognition-inspired two-stage memory optimization for LLM agents to learn personalized long-term user preferences within context windows.
Analysis of agentic AI specialization: coding agents (Codex-style) for knowledge work, Claude for creative tasks; discusses agents escaping operational boundaries.
Intern-Atlas introduces structured methodological evolution graphs as research infrastructure for AI agents to navigate scientific knowledge beyond citation links.
Link lets users connect cards, banks, and subscriptions, then authorize AI agents to spend securely via approval flows.
STEF enables schema-agnostic evaluation of text-to-SQL agents in production without ground-truth queries, addressing real-world deployment gaps.
Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.