Topic

Agents

Every story matching this topic across titles and summaries, newest first.


TechCrunch AI· PRESS

OpenAI’s new voice mode makes it to the ChatGPT desktop app

ChatGPT Voice on desktop can work with both ChatGPT Work and Codex to complete tasks and control agents.

Ivan Mehta·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

OpenForgeRL: Train Harness-native Agents in Any Environment

OpenForgeRL enables end-to-end training of harness-native agents with open infrastructure, addressing the limitations of complex inference harnesses such as Claude Code.

Xiao Yu·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agentic coding without the cloud: evaluating open-weight large language models on longitudinal data preparation tasks

Open-source evaluation framework for open-weight LLM agents on longitudinal data tasks, addressing privacy constraints in research deployments.

Mack Nixon·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

VoLN: Vision-Only Long-Horizon Navigation---Paradigm, Benchmark, and Method

VoLN: vision-only navigation benchmark and method for embodied agents without language instructions in GPS-denied environments.

Jiabin Lou·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Regulating autonomous and agentic AI

Paper examines regulatory frameworks for autonomous AI agents, arguing that supply-chain governance and proactive risk management replace traditional retrospective oversight.

Chris Reed·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Toward cryptographically verifiable authorization for autonomous AI agents: A security hypothesis, preliminary formal model, and proof-of-concept implementation

Formalizes cryptographically verifiable authorization for autonomous agents, binding agent principal, request, and policy context with formal proof-of-concept.

M. Llambí-Morillas·3 days ago

Ars Technica AI· PRESS

OpenAI says its AI agent broke out of testing sandbox to hack Hugging Face

"This is day one for cybersecurity in the age of agents," Hugging Face CEO says.

Kyle Orland·4 days ago

NVIDIA Dev Blog· INFRA

Make Long-Running NVIDIA TensorRT Engine Builds Observable and Cancelable in Python or C++

A TensorRT engine build can take seconds to many minutes. Large strongly typed models, deep tactic search, and a cold timing cache on a brand-new GPU SKU can leave developers, end users, or AI agents staring at a frozen terminal with no idea whether to wait, retry, or kill the process. Most NVIDIA TensorRT integrations report nothing during a build or provide no way to abort early.

Michelle Horton·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PoTRE: Test-Time Reasoning inspired by Cognitive Heterogeneity

PoTRE framework deploys four heterogeneous agents (adversarial, hierarchical, spectrum search, direct) with task-adaptive aggregation for complex LLM reasoning.

Anmol Kankariya·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Ethics of Autonomous AI Agents for Offensive Security

Analysis of safety challenges in autonomous LLM-driven offensive security agents: non-deterministic policies resist ex-ante review and enable attribution evasion.

Andreas Happe·4 days ago

TechCrunch AI· PRESS

Glow emerges from stealth at $1.2B valuation to challenge endpoint security in the AI era

Glow is targeting a new class of endpoint risks created by the rapid adoption of AI agents and developer tools inside enterprises.

Jagmeet Singh·4 days ago

OpenAI· FRONTIER

Introducing OpenAI Presence

OpenAI launches Presence, an enterprise agent platform for deploying voice and chat agents in customer-facing and internal workflows.

OpenAI·4 days ago

The Verge AI· PRESS

OpenAI says it accidentally hacked Hugging Face with a new AI system

OpenAI says its AI models mistakenly breached open-source AI platform Hugging Face during internal testing. In a blog post on Tuesday, OpenAI writes that GPT-5.6 Sol and "an even more capable pre-release model" discovered vulnerabilities within their sandboxed testing environment, allowing them to gain access to the internet and target Hugging Face. On July 16th, Hugging Face disclosed a security incident that it says was driven by "an autonomous AI agent system." Hugging Face's AI agents detected and stopped the breach, which OpenAI has now...

Emma Roth·5 days ago

TechCrunch AI· PRESS

Jack Dorsey is taking on Slack with Buzz, a group chat platform for teams and their AI agents

Buzz is a group chat platform for the workplace that puts humans and their AI agents in the same conversation.

Amanda Silberling·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CodeRescue: Budget-Calibrated Recovery Routing for Coding Agents

CodeRescue optimizes cost-aware routing for coding agents, determining when to retry vs. escalate after execution failures.

Qijia He·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agents in the Wild: Where Research Meets Deployment

Survey/tutorial on agentic LLM systems in production, covering reasoning, planning, multi-agent coordination, robustness, and deployment challenges.

Grace Hui Yang·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ResearchArena: Evaluating Sabotage and Monitoring in Automated AI R&D

ResearchArena framework evaluates AI control and monitoring for detecting sabotage in automated AI R&D agents across safety/capability post-training and optimization tasks.

Lena Libon·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BioSecBench-Surveillance: A Verifiable Benchmark for AI Agents in Pathogen Genomic Surveillance

BioSecBench-Surveillance: 100-task verifiable benchmark for AI agents inferring pathogen genomic analysis pipelines from raw data.

Harmon Bhasin·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PathAgentBench: Benchmarking Evidence-Seeking Vision-Language Models on Whole-Slide Pathology Image

PathAgentBench: benchmark for vision-language agents on gigapixel whole-slide pathology images evaluating multi-scale evidence-seeking.

Dankai Liao·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

S3: Stable Subgoal Selection by Constraining Uncertainty of Coarse Dynamics in Hierarchical Reinforcement Learning

S3 improves hierarchical RL subgoal selection by constraining dynamics uncertainty in high-level agents.

Kshitij Kumar Srivastava·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agentic Real2Sim: Physics-based World Modeling with Vision-Language Agents

Agentic Real2Sim: VLM-based framework automating conversion of real robot videos to executable physics simulations for scene geometry, object state, and parameters.

Guanxiong Chen·5 days ago

NVIDIA Dev Blog· INFRA

NVIDIA Vera CPU: Olympus Cores Built for Maximum Single-Thread Performance in Agentic AI

Agentic AI shifts more of the critical execution path onto the CPU. Agents operate in sandboxes to execute code, invoke tools, retrieve context, interact with databases, and analyze results before returning information to the model. As these loops run concurrently across an AI factory, CPU performance increasingly shapes both per-agent responsiveness and overall factory throughput.

Michelle Horton·5 days ago

Simon Willison· ANALYST

Reverse-engineering is cheap now

Coding agents lower ROI threshold for reverse-engineering home automation, shifting economics of personal automation projects despite maintenance risk.

Simon Willison·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

FlashRT: agent harness guides coding agents to optimize real-time multimodal pipeline deployment with dynamic placement and streaming.

Krish Agarwal·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

WorldCupArena: Fine-Grained Evaluation of Language Models and Deep-Research Agents on Football Forecasting

WorldCupArena: dynamic benchmark for LLMs and research agents on real-time sports forecasting with 2026 FIFA World Cup.

Zhaokai Wang·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Autoresearch with Coding Agents: Generalizers and Metric-Maximizers on Quran Recitation Data

Study of autoresearch agents (Claude Code) on Quranic speech-recognition tasks reveals metric-gaming vs. intent-alignment tradeoffs.

Nursultan Askarbekuly·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Shared Discovery Paradox: How a One-Answer Rule Turns Better Information into Worse Search

Theoretical analysis of pooled information reducing search coverage via one-answer rule; solvable benchmark with 16 boxes and 8 agents.

Yohei Nakajima·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Code-Poisoning Property Inference Attacks

First code-level property inference attack (CPPIA) exploits coding agents and ML training data to leak private dataset attributes.

Xukun Luan·9 days ago

VentureBeat AI· PRESS

The agent security gap: 54% of enterprises have already had an AI agent incident, and most still let agents share credentials

Across 107 enterprises, AI agents are being given real access to systems and data while the controls meant to contain them lag behind. More than half have already had a confirmed agent security incident or a near-miss; only about a third give every agent its own scoped identity, and most agents still share credentials; and only three in ten isolate their highest-risk agents. The security stack is overwhelmingly borrowed from the model providers and hyperscalers rather than purpose-built for agents, spending remains a thin slice of the security budget, and enterprises are evenly split on wheth...

VentureBeat AI·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Success Rate: Cost-Aware Evaluation of Offensive and Defensive Security Agents

Cost-aware evaluation framework for security agents measures offensive/defensive capability under realistic inference budget constraints vs. peak performance.

Paul Kassianik·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Bridge Evidence: Static Retrieval Utility Does Not Predict Causal Utility in Multi-Step Agentic Search

ReAct-style agents show static document relevance scores fail to predict utility in multi-step reasoning; gap measured on HotpotQA.

Debayan Mukhopadhyay·10 days ago

VentureBeat AI· PRESS

The AI context gap: Enterprise AI organizations have a trust problem, not a retrieval problem — and most are still building the fix

Across 101 enterprises, the infrastructure that feeds AI agents their business context is being built faster than it can be trusted. Retrieval-augmented generation is already the default context source, and provider-native retrieval has quietly overtaken the dedicated vector databases that define the category — yet a majority of enterprises have already watched their agents produce confident, wrong answers traced to missing or inconsistent context. A governed semantic layer is emerging as the fix, but most are still building it; the field is converging on hybrid retrieval; and even as provide...

VentureBeat AI·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BadWAM: When World-Action Models Dream Right but Act Wrong

BadWAM introduces adversarial attacks exploiting visual-action drift in world-action models, revealing safety vulnerabilities in embodied AI agents.

Qi Li·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Plover: Steering GUI Agents through Plan-Centric Interaction

Plover externalizes planning in vision-based GUI agents, enabling user inspection and correction of task plans for autonomous interface automation.

Madhumitha Venkatesan·10 days ago

VentureBeat AI· PRESS

The agent evaluation gap: Enterprise AI organizations have a reality-alignment problem, not a coverage problem — and most are shipping to production anyway

Across 157 enterprises, organizations are granting AI agents more autonomy while trusting the evaluations meant to gate that autonomy less. Half have already shipped an agent that passed their internal evaluations and then failed a customer in production; only one in twenty fully trusts automated evaluation today; and the most-cited weakness is that evaluations do not align with real-world outcomes. Yet two-thirds already allow, or are actively engineering toward, deploying agent changes to production on automated evaluation alone — with no human in the loop. The result is an evaluation gap —...

VentureBeat AI·10 days ago

NVIDIA Dev Blog· INFRA

Integrating Context-Aware Video AI Agents Into Enterprise Workflows

A video analytics AI agent that can perceive, reason, and act based on massive amounts of video footage must be integrated with existing workflows and applications to be useful. These include content management systems, messaging platforms, databases, ticket queue, and escalation paths. This integration is challenging because video systems, enterprise knowledge bases…

Tanya Lenz·10 days ago

NVIDIA Dev Blog· INFRA

Scaling Agentic AI Factories Through Extreme Co-Design with NVIDIA BlueField

Agentic AI changes the infrastructure pattern for AI factories. One request can trigger many model calls, tool calls, memory lookups, policy checks, storage accesses, and network transfers before a final answer is produced. As more agents run at once and carry context across steps, users, tools, services, and sessions, infrastructure must move, protect, retrieve, and reuse data fast enough to keep…

Michelle Horton·10 days ago

TechCrunch AI· PRESS

Yes, you can now order DoorDash from the command line

DoorDash is opening a limited beta of dd-cli, a command-line tool that lets developers and AI agents search stores, build carts, and place orders from the terminal, marking another step toward software designed for AI agents instead of just humans.

Sarah Perez·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Digital Pantheon: Simulating and Auditing Coalition Formation with LLM Agents

Multi-agent LLM framework using SFT+DPO enables sustained partisan behavior in political coalition simulation, circumventing RLHF neutrality bias.

Dylan Van Mulders·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

OmniaBench: Benchmarking General AI Agents Across Diverse Scenarios

OmniaBench: unified benchmark evaluating LLM-based agents across diverse scenarios with explicit state spaces for systematic capability characterization.

Chengyu Shen·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

LongStraw: GPU-efficient execution stack for million-token RL post-training with GRPO, bridging inference context length vs. post-training gap for agents.

Changhai Zhou·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

StructureClaw: Traceable LLM Agents and an Executable Benchmark for Structural Engineering Workflows

StructureClaw: artifact-centered benchmark for evaluating LLM agents on complete structural engineering workflows with verifiable evidence chains.

Sizhong Qin·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Proof-or-Stop: Don't Trust the Agent, Trust the Evidence -- Loop Engineering for Verifiable Evidence-Gated Lifecycle Control

Proof-or-Stop: lifecycle control framework for autonomous coding agents using mechanically verifiable evidence gates for transitions between states.

Jek Huang·10 days ago

OpenAI· FRONTIER

How Cars24 scales conversations and builds faster with OpenAI

Cars24 deploys OpenAI voice and chat agents handling 1M+ monthly conversation minutes, recovering 12% lost leads via agentic workflows.

OpenAI·11 days ago

VentureBeat AI· PRESS

Agentic orchestration: Enterprise AI organizations have a deployment problem, not a platform problem — and most are calling chatbots agents

Across 101 enterprises, agent orchestration is consolidating onto model-provider platforms — Anthropic’s Claude leads by a wide margin — chosen for the gravity of the underlying model and judged on reliable multi-step execution. But the ambition runs well ahead of the reality: most deployed “agents” are still chatbot wrappers, the control plane enterprises expect is deliberately hybrid to avoid lock-in, and real-time fiscal control over token burn remains the exception. This wave of VentureBeat Pulse Research examines enterprise agent orchestration: which platforms enterprises run on, what dr...

VentureBeat AI·11 days ago

NVIDIA Dev Blog· INFRA

Develop Lightweight USD Runtimes Faster with AI Agents

OpenUSD is an open, extensible framework that provides a common scene description language for physical AI. It enables teams to bring CAD data, simulation assets, and real-world telemetry into a shared, physically accurate view of the world. Until now, building a USD implementation has typically required adapting a large existing codebase, even for teams that need a specific memory footprint…

Michelle Horton·11 days ago

Hugging Face· INFRA

What building Shippy taught us about building agents

Hugging Face·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Do Agent Optimizers Compound? A Continual-Learning Evaluation on Terminal-Bench 2.0

Study shows agent-optimization gains may not compound over time; proposes Terminal-Bench 2.0 to test continual learning on deployed agents.

Wenxiao Wang·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TRACE: Turn-level Reward Assignment via Credit Estimation for Long-Horizon Agents

TRACE assigns per-turn rewards for long-horizon multi-tool agents to improve credit assignment beyond sparse outcome rewards.

Leitian Tao·11 days ago

The Verge AI· PRESS

OpenAI finally launches hardware… for Codex

OpenAI is finally releasing some hardware. No, it isn't the mysterious AI-powered device the company is developing with former Apple designer Jony Ive, a project already tangled up in a messy lawsuit. Instead, it's a product designed to be used with its coding platform, Codex. The device, a square-shaped block of buttons called Codex Micro, is a collaboration between the AI company and keyboard maker Work Louder. OpenAI said it is a limited-run collaboration that will give users more ways to monitor and manage their agents. The pad closely resembles Work Louder's Creator Micro 2, and marketin...

Robert Hart·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

DeepStress: Stress-Testing Deep Search Agents

DeepStress: stress-testing framework for search agents' robustness to degraded retrieval quality and evidence reliability.

Ismael Rousseau·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Experience Memory Graph: One-Shot Error Correction for Agents

Experience Memory Graph enables one-shot error correction for LLM agents via structured trajectory memory, reducing API costs and improving generalization.

Wenjun Wang·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SPyCE: Skill-Policy Co-evolution for Multimodal Agents

SPyCE co-evolves reusable skills and policies for multimodal agents, distilling visual reasoning trajectories to improve generalization across tasks.

Ru Zhang·11 days ago

TechCrunch AI· PRESS

Vint Cerf is working on a plan to unleash AI agents on the open internet

The guy behind TCP/IP is working on a standard for identifying AI agents in the wild.

Tim Fernholz·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

How Agents Ask for Permission: User Permissions for AI Agents, from Interfaces to Enforcement

User permission frameworks for AI agents address prompt injection and unauthorized actions; bridges product-level design with enforcement mechanisms.

Alexandra E. Michael·11 days ago

Simon Willison· ANALYST

Quoting Armin Ronacher

Willison/Ronacher reflect on how AI agents may erode institutional knowledge and shared understanding embedded in code review friction.

Simon Willison·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Do AI Agents Know When a Task Is Simple? Toward Complexity-Aware Reasoning and Execution

LLM agents often waste computational budget through redundant context re-reading; proposes task-complexity estimation to optimize execution scope.

Junjie Yin·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

PalmClaw framework enables native on-device LLM agents on mobile phones via direct API calls rather than GUI automation.

Hongru Cai·12 days ago

NVIDIA Dev Blog· INFRA

How to Run an Autoresearch Workflow with RL Agent Skills and NVIDIA NeMo

Coding AI agents are becoming practical operators for long-running machine learning (ML) workflows. They can inspect repositories, set up runtimes, resolve build issues, launch experiments, monitor execution, analyze metrics, and summarize results. For reinforcement learning (RL) research, this matters because meaningful metrics often appear only after the essential experiment infrastructure…

Tanya Lenz·12 days ago

NVIDIA Dev Blog· INFRA

Post-Train NVIDIA Cosmos 3 in One Day Using Agent Skills

What if autonomous coding AI agents could push your vision reasoning models above 90% accuracy with almost no manual effort? When adapting vision reasoning models to production video tasks, developers often lose days to data formatting, container setup, training scripts, baseline evaluation, and hyperparameter sweeps before they even know whether post-training improves accuracy.

Tanya Lenz·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Knowledge- and Gradient-Guided Reinforcement Learning for Parametrized Action Markov Decision Processes

In this paper, we study Reinforcement Learning in Parametrized Action Markov Decision Processes (PAMDP), where each decision consists of a symbolic action and numerical parameters. In such settings Reinforcement Learning algorithms typically determine parameters with one-shot estimators, which makes their training sample inefficient. Though in most PAMDP environments explicit but incomplete knowledge (e.g., rules, safety constraints, or expert heuristics) is available, it is rarely directly used to increase the sample-efficiency of training Reinforcement Learning agents. We step into this gap...

Jonas Ehrhardt·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Who Grades the Grader? Co-Evolving Evaluation Metrics and Skills for Self-Improving LLM Agents

Framework evolves evaluation metrics alongside agent skills using evolutionary search over drawback detectors, enabling self-improving systems without pre-defined oracles.

Xing Zhang·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Internet of Agentic Things: Networked AI Agents for Closed-Loop IoT Orchestration

Internet of Agentic Things (IoAT) framework integrates autonomous AI agents with IoT, cyber-physical systems, and edge computing for closed-loop orchestration.

Quanyan Zhu·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Learning-Rate-Gated Failure of GRPO in a Small Language and Vision-Language Model Web Agent: A Controlled Null and Its Mechanism

Controlled study finds GRPO RL fails to improve 4B–8B scale web agents over supervised baselines across learning rate and hyperparameter grids, questioning RL value at small scale.

Chengguang Gan·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MM-ToolSandBox: A Unified Framework for Evaluating Visual Tool-Calling Agents

MM-ToolSandBox: benchmark with 500+ tools across 16 domains for evaluating visually grounded multi-turn tool-calling agents.

Kaixin Ma·13 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

When Local Monitors Miss Compositional Harm: Diagnosing Distributed Backdoors in Multi-Agent Systems

Safety vulnerability: distributed backdoors in multi-agent LLM systems bypass local monitors by splitting harmful payloads across agents.

Yibo Hu·13 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agent Hacks Agent: Autoresearch for Production-Agent Red-Teaming

Automated red-teaming system discovers reusable vulnerability patterns in production LLM agents (Claude Code, Codex) operating on untrusted content.

Xutao Mao·13 days ago

Ars Technica AI· PRESS

Now, defenders are embracing the prompt injection, too

"Context bombing" tricks hacking agents into shutting down before they can do harm.

Dan Goodin·13 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Auditing the Risk Claims of Distributional Reinforcement Learning

Empirical audit reveals trained distributional RL agents' risk estimates often violate first-order stochastic dominance.

Hari Prasad·13 days ago

Simon Willison· ANALYST

Directly Responsible Individuals (DRI)

Simon Willison examines DRI (Directly Responsible Individual) concept from Apple/GitLab in context of LLM agents, arguing humans must retain accountability.

Simon Willison·14 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

VEXAIoT: Autonomous IoT Vulnerability EXploitation using AI Agents

VEXAIoT: multi-agent LLM framework for autonomous IoT vulnerability discovery and exploitation testing.

Katherine Swinea·16 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Task-Specific Multimodal Question Answering Agents via Confidence Calibration and Incremental Reasoning for QANTA 2026

QANTA 2026 submission: multimodal QA agent with confidence calibration for incremental quizbowl under efficiency constraints.

Nirjhar Das·16 days ago

NVIDIA Dev Blog· INFRA

Accelerating End-to-End Co-Folding Performance with NVIDIA BioNeMo Agent Toolkit

Biomolecular structure prediction and co-folding with models like OpenFold3 are now mainstream, large-scale workloads powering drug discovery and protein design. Increasingly, they’re driven end-to-end by AI agents. For an agent to run that pipeline well, every step needs to be fast and scalable: Multiple Sequence Alignment (MSA) generation, co-folding inference, serving, and multi-GPU scale-out.

Elizabeth Goodman·16 days ago

TechCrunch AI· PRESS

An AI agent startup just let its agent run its $100 million fundraise

Lyzr, a startup that builds AI agents for enterprises, used its own AI agent to raise a $100 million round — proof, evidently, that the product actually works.

Connie Loizos·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

UniClawBench introduces a capability-factored benchmark for evaluating proactive agents on real-world tool-use tasks beyond sandboxed single-turn settings.

Zhekai Chen·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents

Proactive memory agent running alongside action agent mitigates behavioral state decay in long-horizon tasks via structured memory bank updates.

Yifan Wu·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SolarChain-Eval: A Physics-Constrained Benchmark for Trustworthy Economic Agents in Decentralized Energy Markets

SolarChain-Eval benchmarks autonomous agents in decentralized energy markets with physics constraints and trustworthiness metrics.

Shilin Ou·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Formal Mechanisms for Market Stability in Self-Interested Agent Societies: A Marketplace Simulation Study

Marketplace simulation with DeepSeek-V3 agents tests formal mechanisms for maintaining trade stability against adversarial defection.

Eugene Ng Yi Sheng·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SMetric: Rethink LLM Scheduling for Serving Agents with Balanced Session-centric Scheduling

SMetric proposes session-centric LLM scheduling for agentic workloads with 80%+ KV-cache reuse, shifting optimization from latency to tokens-per-second.

Jiahao Wang·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agon: Competitive Cross-Model RL with Implicit Rival Grading of Reasoning

Agon trains reasoning models via competitive RL where two agents grade each other's solutions, incentivizing better thinking vs. longer traces.

Vladislav Beliaev·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SkillCenter: A Large-Scale Source-Grounded Skill Library for Autonomous AI Agents

SkillCenter releases 216,938 structured skills across 24 domains from peer-reviewed sources and GitHub for autonomous agent execution.

Tianming Sha·18 days ago

Hugging Face· INFRA

Data for Agents

Hugging Face·18 days ago

TechCrunch AI· PRESS

Prime Intellect raises $130M Series A to help enterprises build their own AI agents

The round, led by Radical Ventures, values the two-year-old startup at $1 billion.

Marina Temkin·18 days ago

NVIDIA Dev Blog· INFRA

Running Low-Latency Analytical Workloads with GPU-Accelerated Presto on NVIDIA GB200 NVL72

Presto is an open source, distributed SQL engine for running fast, interactive queries on very large datasets. On NVIDIA GPUs, Presto delivers peak performance for analytical query workloads and provides low latency for users and agents. GPU-accelerated Presto brings low latency to your analytical workloads, keeping you and your agents unblocked and iterating as fast as possible.

Tanya Lenz·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Think Big, Search Small: Where Capacity Matters in Hierarchical Search Agents?

Empirical study on capacity allocation across hierarchical search agent roles (delegation, execution, generation) in multi-agent LLM systems.

Qinnan Cai·18 days ago

NVIDIA Dev Blog· INFRA

Create a LangChain Deep Agents Harness Profile for NVIDIA Nemotron 3 Ultra to Improve Performance

Agentic systems often face a trade-off between accuracy and cost. The highest-performing proprietary frontier models and harnesses provide top accuracy but are expensive. Fine-tuning offers one way to address this problem. Smaller or more efficient open models starting with lower accuracy are taught to perform better with specific agents. However, fine-tuning requires expertise and hardware for…

Michelle Horton·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Attack-Success Rate: Action-Graded Severity Scale for Tool-Using AI Agents

Action-graded severity scale for agent red-teaming replaces binary attack-success metrics with 7-level ordinal harm rubric, enabling nuanced risk assessment of tool-using AI compromise.

Harry Owiredu-Ashley·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Blind Curator: How a Biased Judge Silently Disables Skill Retirement in Self-Evolving Agents

Self-evolving LLM agents with biased reward signals fail to retire bad skills, disabling safety constraints in skill libraries.

Xing Zhang·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

RLVP: Penalize the Path, Reward the Outcome

RLVP: Reward function design for real-world agents requiring path constraints and outcome-neutral safety rules beyond reward maximization.

Bojie Li·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Reason Less, Verify More: Deterministic Gates Recover a Silent Policy-Violation Failure Mode in Tool-Using LLM Agents

Tool-using LLM agents silently violate deployed policies via well-formed tool calls that bypass domain constraints; 78% of failures undetected.

Vikas Reddy·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Danus: Orchestrating Mathematical Reasoning Agents with Fact-Graph Memory

Danus orchestration system coordinates parallel mathematical reasoning agents using shared fact-graph memory for research-level proof search.

Jihao Liu·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

An Experimental Design Approach to Evaluating Agentic AI's Autonomous Model Discovery

Experimental design framework quantifies variability and factors in LLM coding agents' autonomous model discovery via stochastic evaluation.

Hao He·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Responsible Personalisation: The Double-Edged Sword of Personalisation in Human-Robot Interaction

Framework for responsible personalisation in human-robot interaction examining ethical risks from embodied agents across lifecycle and interaction contexts.

Antonio Andriella·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Harnessing Code Agents for Automatic Software Verification

Code agent framework for automated software verification outperforms fixed proof strategies, proving larger fraction of Coq theorems than prior LLM approaches.

Shuangxiang Kan·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Information Gain-based Rollout Policy Optimization: An Adaptive Tree-Structured Rollout Approach for Multi-Turn LLM Agents

Information Gain-based Rollout Policy Optimization allocates LLM agent search budget adaptively across tree branches.

Yijun Zhang·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

LLM Agents for Deliberative Collaboration: A Study on Joint Decision Making Under Partial Observability

Benchmark study of deliberative LLM agents under partial observability, formalizing cooperative decision-making with asymmetric information and multi-domain evaluation.

Chenxu Wang·19 days ago

Berkeley BAIR· ACADEMIA

Intelligence is Free, Now What? Data Systems for, of, and by Agents

Berkeley BAIR analyzes commodity AI inference costs dropping 50-900x annually, arguing sufficient intelligence now enables agent-centric data systems.

Berkeley BAIR·19 days ago

Google AI (Gemma)· FRONTIER

Expanding Managed Agents in Gemini API: background tasks, remote MCP and more

Google expands Gemini API Managed Agents with background task execution and remote MCP support for production deployments.

Philipp Schmid·19 days ago

TechCrunch AI· PRESS

Vercel CEO Guillermo Rauch on the fight to split off models from agents

"The reality is, when you're optimizing for production, you start looking at a price/performance," Guillermo Rauch tells TechCrunch.

Russell Brandom·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CompactionRL: Reinforcement Learning with Context Compaction for Long-Horizon Agents

CompactionRL: RL training for long-horizon agents with context compaction, jointly optimizing task and summary generation.

Yujiang Li·20 days ago
