The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


Cohere· FRONTIER

Building AI agents that reshape financial services

Cohere showcases AI agents for financial services compliance, efficiency, and customer trust.

Cohere·28 days ago

Notion just turned its workspace into a hub for AI agents

Notion’s new developer platform lets teams connect AI agents, external data sources, and custom code directly into their workspace as the company pushes deeper into agentic productivity software.

Sarah Perez·28 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench provides an end-to-end evaluation framework for voice agents, generating realistic bot-to-bot conversations and measuring voice-specific failure modes.

Tara Bogavelli·28 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

ScioMind integrates anchoring-based belief dynamics with LLM agents for cognitively-grounded social opinion simulation.

Yitian Yang·29 days ago

NVIDIA Dev Blog· INFRA

Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills

In today’s data-driven world, organizations increasingly rely on video to capture critical information, yet extracting meaningful, real-time insights from massive amounts of footage remains a challenge. NVIDIA Metropolis Blueprint for video search and summarization (VSS) overcomes this hurdle by transforming millions of live video streams or hours of recorded video into instantly searchable…

Samuel Ochoa·29 days ago

r/ClaudeAI· COMMUNITY

Struggling to see how truly autonomous agents are the future????

Developer argues current AI agents require extensive human oversight and lack true autonomy despite productivity gains with Claude models.

u/Silverwolf90·29 days ago·20 pts / 45 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

LongMemEval-V2 benchmark evaluates whether agent memory systems enable agents to internalize environment-specific workflows and interface affordances in web tasks.

Di Wu·29 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

ToolCUA framework trains computer-use agents to optimally interleave GUI actions and tool API calls via trajectory-level supervision and synthetic data generation.

Xuhao Hu·29 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Text-tabular modeling predicts unfamiliar AI agent decisions in negotiation from limited interaction, tested on bargaining games.

Eilam Shapira·29 days ago

TechCrunch AI· PRESS

AI voice startup Vapi hits $500M valuation after winning Amazon Ring over 40 rivals

Vapi says its enterprise business has grown 10-fold since early 2025 as companies shift customer support and sales calls to AI agents.

Jagmeet Singh·30 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

RobustToolBench benchmark exposes tool-use agent failures from deployment noise; domain-randomized RL improves robustness.

Xiaolin Zhou·30 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

IPI-proxy toolkit enables red-teaming web-browsing AI agents against indirect prompt injection attacks embedded in whitelisted domain HTML.

Chia-Pei·30 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

GEAR enables fine-grained credit assignment in RL-trained LLM agents via adaptive-granularity advantage reweighting at token and segment levels.

Sijia Li·30 days ago

r/ClaudeAI· COMMUNITY

Converted Karpathy's coding skill from Pro to free plan. Here's the full thing:

The Karpathy coding skill is locked behind Pro. It doesn't use any Pro-only features, so I rewrote it for free plan chat workflows. Same philosophy, tuned for no terminal, no subagents, and a shorter context window where mistakes are expensive. Paste the whole thing into a Project's custom instructions or use it as a system prompt. It auto-triggers on any coding request. --- name: karpathy-coding description: Apply Karpathy-inspired coding discipline to any programming task. Use this skill whenever the user asks you to write, fix, refactor, extend, or review code — even casually...

u/flarenz·30 days ago·22 pts / 6 comm

r/ClaudeAI· COMMUNITY

Claude Code just shipped a "run until done" mode. Upgrade to v2.1.139 for /goal.

Morning Everyone! Big one today (**104 changes!**): Claude Code just went async. The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got running: working, blocked on you, or done. Translation: kick off a goal -> let claude cook -> come back later. First proper fire-and-forget loop CC has shipped. Pretty huge unlock if you've been juggling multiple sessions and losing track of which one needs you. Full notes: [https://www.luk...

u/oh-keh·30 days ago·29 pts / 8 comm

OpenAI· FRONTIER

What Parameter Golf taught us about AI-assisted research

OpenAI's Parameter Golf competition engaged 1,000+ researchers on AI-assisted ML workflows, coding agents, and model optimization under resource constraints.

OpenAI·1 month ago

Simon Willison· ANALYST

Quoting James Shore

James Shore argues AI coding agents must reduce maintenance costs inversely to productivity gains or risk long-term debt; doubling output without halving maintenance costs creates net negative ROI.

Simon Willison·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

SLIM: dynamic skill lifecycle management for LLM agents enabling non-monotonic skill activation based on task and stage.

Junhao Shen·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd: runtime substrate for meta-agents with formalized execution traces in Lean, enabling 5× faster forking and state replay.

Simon Yu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

WildClawBench: native-runtime benchmark of 60 real-world, long-horizon CLI agent tasks (8+ min each) for LLM/vision-language agents.

Shuangrui Ding·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Engineering Robustness into Personal Agents with the AI Workflow Store

AI agents need formal SE practices (testing, staging, adversarial eval) beyond on-the-fly synthesis for high-stakes deployment.

Roxana Geambasu·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

AssayBench: benchmark for LLMs and agents on virtual cell phenotypic screening combining textual inputs with diverse cellular outputs.

Edward De Brouwer·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pentesting agents evaluated on real-world targets show current benchmarks miss complexity and strategic decision-making required in practice.

Pedro Conde·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Framework enables visual-native multimodal search agents with on-policy data evolution and persistent visual evidence reuse.

Shijue Huang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MaD Physics: Evaluating information seeking under constraints in physical environments

MaD Physics benchmarks agents on resource-constrained scientific discovery with real measurement trade-offs and planning.

Moksh Jain·1 month ago

r/ClaudeAI· COMMUNITY

The Claude Platform on AWS is now generally available.

Anthropic's Claude Platform reaches general availability on AWS with managed agents, code execution, web search, batch processing, and same-day feature parity with the native API.

u/ClaudeOfficial·1 month ago·66 pts / 5 comm

r/Anthropic· COMMUNITY

Anthropic just dropped a financial services agent repo and it's worth a look

I checked out GitHub after Anthropic's announcement last week and came across their new financial services reference repo (github.com/anthropics/financial-services). It packages 10 pre-built workflow agents for financial services firms; you can run them through the Claude Cowork plugin or via the Managed Agents API, which is a useful bit of flexibility depending on how your stack is set up. The 10 agents cover a decent spread of the typical pain points: * **Pitch Agent** - builds fully branded pitch decks from comps, precedent transactions, and LBO analysis * **Meeting Prep Agent** - draft...

u/Efficient_Degree9569·1 month ago·10 pts / 3 comm

r/ClaudeAI· COMMUNITY

First MCPs, then Skills, now Memories are next

This was a really good talk, especially for anyone who's built things like the Karpathy wiki, Serena, or SQLite databases as memory for Claude. For any senior devs out there, are you spotting the solutions already implemented in distributed systems being reused? If many agents are working in parallel, how do you keep them from stepping on each other's toes? I can imagine logical clocks, consensus, deduplication, idempotency, and eventual vs causal consistency being applied. If you're on the Anthropic team, I'm curious how many different distributed systems algos were experimented with.

u/fsharpman·1 month ago·28 pts / 5 comm

r/OpenAI· COMMUNITY

Openclaw is trending down and will disappear soon

Reddit discussion questioning OpenAI's hiring of OpenClaw creator and value of agents initiative; post contains factual error later corrected.

u/CartographerFeisty66·1 month ago·55 pts / 26 comm

r/ClaudeAI· COMMUNITY

Claude just hallucinated again and changed the whole workflow of my app. Do not run them autonomously 24/7.

Poster reports that Claude Max still produces hallucinations that cause production failures, and argues that autonomous agents are unsuitable for unsupervised deployment without guardrails.

u/heysankalp·1 month ago·26 pts / 27 comm

30 matches