The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Coding agent with executable Python world models, verification, and simplicity-bias refactoring solves 25 public ARC-AGI-3 games without task-specific logic.

Sergey Rodionov·1 month ago

Ars Technica AI· PRESS

Anthropic's Claude Managed Agents can now "dream," sort of

Also, rate limits will double for Pro and Max users of tools like Claude Code.

Samuel Axon·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Po...

Senkang Hu·1 month ago

r/ClaudeAI· COMMUNITY

Anthropic’s new finance AI agents feel like a bigger move than just “better chat”

Anthropic launches 10 finance-focused AI agents via Claude Cowork and Managed Agents for KYC screening, pitchbook generation, and month-end close workflows.

u/Roaring_lion_·1 month ago·20 pts / 15 comm

TechCrunch AI· PRESS

SAP bets $1.16B on 18-month-old German AI lab and says yes to NemoClaw

SAP plans to buy German AI startup Prior Labs and invest heavily in it. It is also restricting customers' agent use to a select few, such as Nvidia's NemoClaw.

Anna Heim·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

OpenSeeker-v2: SFT on informative trajectories achieves frontier LLM search agent capabilities without full RL pipeline.

Yuwen Du·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

MOSAIC-Bench evaluates coding agents' vulnerability to multi-stage attack chains that decompose malicious goals into innocuous sequential tasks, exposing alignment gaps in deployed systems.

Jonathan Steinberg·1 month ago

Anthropic· FRONTIER

Agents for financial services

Anthropic releases ten Cowork and Claude Code plugins plus Microsoft 365 integrations and MCP app for financial services.

Anthropic·1 month ago

NVIDIA Dev Blog· INFRA

How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car

The automotive cockpit is undergoing a fundamental shift from rule-based interfaces to agentic, multimodal AI systems capable of reasoning, planning, and acting. In most vehicles on the road today, in-vehicle assistants still rely on fixed command-response patterns: interpret a phrase, trigger an action, reset. While effective for well-defined tasks, this approach doesn’t scale to modern…

Felix Friedmann·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

Argues frontier AI failures in open-ended tasks (scientific assistance, agents, personalization) stem from objective ambiguity rather than capability gaps; proposes contextual multi-objective optimization.

Jie Zhou·1 month ago

NVIDIA Dev Blog· INFRA

Building for the Rising Complexity of Agentic Systems with Extreme Co-Design

Generative AI’s explosive first chapter was defined by humans sending requests and models responding. The agentic chapter is different. Agents don’t follow a pre-determined sequence of actions. They call tools, spawn sub-agents with different tasks and models, retain information in memory, manage their own context window, and decide for themselves when they’re finished. In doing so…

Eduardo Alvarez·1 month ago

r/LocalLLaMA· COMMUNITY

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it)

ProgramBench: 200-task evaluation showing agents struggle to rebuild large binaries from scratch without cheating vulnerabilities.

u/klieret·1 month ago·41 pts / 18 comm

TechCrunch AI· PRESS

CopilotKit raises $27M to help devs deploy app-native AI agents

The Seattle-based startup's Series A round was led by Glilot Capital, NFX and SignalFire, TechCrunch has exclusively learned.

Ram Iyer·1 month ago

OpenAI· FRONTIER

OpenAI and PwC collaborate to reimagine the office of the CFO

OpenAI and PwC partner to deploy AI agents for enterprise finance automation, forecasting, and CFO workflow modernization.

OpenAI·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

FlexSQL agent flexibly explores schemas and data during text-to-SQL generation, enabling recovery from early mistakes.

Quang Hieu Pham·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AIs and Humans with Agency

Opinion piece argues LLM agents require jointly formulated actions and plans with human actors rather than isolated architectures.

David Mumford·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling

ORPilot: open-source agentic system translating ambiguous business problems into solver-ready optimization models with conversational and data collection agents.

Guangrui Xie·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

Zero-trust authorization framework for LLM agents with hybrid inspection and task-based access control to mitigate tool-use and resource-access risks.

Majed El Helou·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Remote Action Generation: Remote Control with Minimal Communication

Novel framework for bandwidth-efficient remote control via minimal information transmission between controller and agents in continuous action spaces.

Szymon Kobus·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

DataEvolver implements closed-loop agent-driven visual data generation and refinement for image editing, supporting masks, depth, poses, and trajectory artifacts.

Qisong Zhang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents

Momentum integrates runtime procedural content generation and autonomous agent evaluation in endless-runner gameplay to assess generated terrain balance and solvability.

Rishabh Kar·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Iterated negotiation benchmark tests LLM agents' ability to repair grounding failures in dynamic multi-turn interaction.

Yiheng Yao·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory

GRAVITY module injects relational, temporal, and thematic structure into conversational memory retrieval for long-horizon agents.

Yushi Sun·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Can Coding Agents Reproduce Findings in Computational Materials Science?

AutoMat benchmark evaluates LLM agents on reproducing computational materials science findings, requiring domain knowledge and result interpretation beyond code quality.

Ziyang Huang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

MemCoE: cognition-inspired two-stage memory optimization for LLM agents to learn personalized long-term user preferences within context windows.

Derong Xu·1 month ago

Latent Space· ANALYST

[AINews] Agents for Everything Else: Codex for Knowledge Work, Claude for Creative Work

Analysis of agentic AI specialization: coding agents (Codex-style) for knowledge work, Claude for creative tasks; discusses agents escaping operational boundaries.

Latent Space·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Intern-Atlas introduces structured methodological evolution graphs as research infrastructure for AI agents to navigate scientific knowledge beyond citation links.

Yujun Wu·1 month ago

TechCrunch AI· PRESS

Stripe introduces Link, a digital wallet that autonomous AI agents can use, too

Link lets users connect cards, banks, and subscriptions, then authorize AI agents to spend securely via approval flows.

Sarah Perez·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

STEF enables schema-agnostic evaluation of text-to-SQL agents in production without ground-truth queries, addressing real-world deployment gaps.

Taslim Jamal Arif·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.

Neemias B da Silva·1 month ago

30 matches