Building AI agents that reshape financial services
Cohere showcases AI agents for financial services compliance, efficiency, and customer trust.
Notion’s new developer platform lets teams connect AI agents, external data sources, and custom code directly into their workspace as the company pushes deeper into agentic productivity software.
EVA-Bench provides an end-to-end evaluation framework for voice agents, generating realistic bot-to-bot conversations and measuring voice-specific failure modes.
ScioMind integrates anchoring-based belief dynamics with LLM agents for cognitively-grounded social opinion simulation.
In today’s data-driven world, organizations increasingly rely on video to capture critical information, yet extracting meaningful, real-time insights from massive amounts of footage remains a challenge. NVIDIA Metropolis Blueprint for video search and summarization (VSS) overcomes this hurdle by transforming millions of live video streams or hours of recorded video into instantly searchable…
Developer argues current AI agents require extensive human oversight and lack true autonomy despite productivity gains with Claude models.
LongMemEval-V2 benchmark evaluates whether agent memory systems enable agents to internalize environment-specific workflows and interface affordances in web tasks.
ToolCUA framework trains computer-use agents to optimally interleave GUI actions and tool API calls via trajectory-level supervision and synthetic data generation.
Text-tabular modeling predicts unfamiliar AI agent decisions in negotiation from limited interaction, tested on bargaining games.
Vapi says its enterprise business has grown 10-fold since early 2025 as companies shift customer support and sales calls to AI agents.
RobustToolBench benchmark exposes tool-use agent failures from deployment noise; domain-randomized RL improves robustness.
IPI-proxy toolkit enables red-teaming web-browsing AI agents against indirect prompt injection attacks embedded in whitelisted domain HTML.
GEAR enables fine-grained credit assignment in RL-trained LLM agents via adaptive-granularity advantage reweighting at token and segment levels.
The Karpathy coding skill is locked behind Pro. It doesn't use any Pro-only features, so I rewrote it for free plan chat workflows. Same philosophy, tuned for no terminal, no subagents, and a shorter context window where mistakes are expensive. Paste the whole thing into a Project's custom instructions or use it as a system prompt. It auto-triggers on any coding request. --- name: karpathy-coding description: Apply Karpathy-inspired coding discipline to any programming task. Use this skill whenever the user asks you to write, fix, refactor, extend, or review code — even casually...
Morning Everyone! Big one today (**104 changes!**): Claude Code just went async. The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got running: working, blocked on you, or done. Translation: kick off a goal -> let claude cook -> come back later. First proper fire-and-forget loop CC has shipped. Pretty huge unlock if you've been juggling multiple sessions and losing track of which one needs you. Full notes: [https://www.luk...
OpenAI's Parameter Golf competition engaged 1,000+ researchers on AI-assisted ML workflows, coding agents, and model optimization under resource constraints.
James Shore argues AI coding agents must cut maintenance costs in inverse proportion to productivity gains or risk long-term debt; doubling output without halving maintenance costs yields a net-negative ROI.
SLIM: dynamic skill lifecycle management for LLM agents enabling non-monotonic skill activation based on task and stage.
Shepherd: runtime substrate for meta-agents with formalized execution traces in Lean, enabling 5× faster forking and state replay.
WildClawBench: native-runtime benchmark of 60 real-world, long-horizon CLI agent tasks (8+ min each) for LLM/vision-language agents.
AI agents need formal software engineering practices (testing, staging, adversarial eval) beyond on-the-fly synthesis for high-stakes deployment.
AssayBench: benchmark for LLMs and agents on virtual cell phenotypic screening combining textual inputs with diverse cellular outputs.
Pentesting agents evaluated on real-world targets show current benchmarks miss complexity and strategic decision-making required in practice.
Framework enables visual-native multimodal search agents with on-policy data evolution and persistent visual evidence reuse.
MaD Physics benchmarks agents on resource-constrained scientific discovery with real measurement trade-offs and planning.
Anthropic's Claude Platform reaches general availability on AWS with managed agents, code execution, web search, batch processing, and same-day feature parity with native API.
I checked out GitHub after Anthropic's announcement last week and came across their new financial services reference repo (github.com/anthropics/financial-services). It packages 10 pre-built workflow agents for financial services firms. You can run them through the Claude Cowork plugin or via the Managed Agents API, which is a useful bit of flexibility depending on how your stack is set up. The 10 agents cover a decent spread of the typical pain points: * **Pitch Agent** - builds fully branded pitch decks from comps, precedent transactions, and LBO analysis * **Meeting Prep Agent** - draft...
This was a really good talk, especially for anyone who's built things like the Karpathy wiki, Serena, or SQLite databases as memory for Claude. For any senior devs out there, are you spotting solutions already implemented in distributed systems being reused? If many agents are working in parallel, how do you keep them from stepping on each other's toes? I can imagine logical clocks, consensus, deduplication, idempotency, and eventual vs causal consistency being applied. If you're on the Anthropic team, I'm curious how many different distributed systems algorithms were experimented with.
Reddit discussion questioning OpenAI's hiring of OpenClaw creator and value of agents initiative; post contains factual error later corrected.
Claude Max still produces hallucinations causing production failures; autonomous agents unsuitable for unsupervised deployment without guardrails.