The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse dom...

Zhangchen Xu · 2 months ago

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Fast & Faithful Function Vectors

Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness

RIDE: An Open Dataset and Benchmark for Train Delay Prediction

FLAGG: Flexible Autoregressive Graph Generation

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Boosting Self-Consistency with Ranking

Amazon’s search bar will invent AI-generated products you can’t buy

Graph Cascades: Contagion-Based Mesoscopic Rewiring for Structure-Aware Graph Machine Learning

Learning Control-Affine Reduced-Order Models via Autoencoders

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

In-Context Graphical Inference

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

Validity Threats for Foundation Model Research

Amazon will show AI product images when you search for some reason

Invariant Gradient Alignment for Robust Reasoning Distillation

Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

Depth-Attention: Cross-Layer Value Mixing for Language Models

DAR: Deontic Reasoning with Agentic Harnesses

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked
