The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, wherea...

Anshun Asher Zheng · 16 days ago

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach

SimSD: Simple Speculative Decoding in Diffusion Language Models

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

From 15 hours to one minute: How AI/ML is speeding up GM's development

Tracking the Behavioral Trajectories of Adapting Agents

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Drifting Preference Optimization for One-Step Generative Models

A Biconvex Formulation for Stable Transport of Mixture Models with a Unique Solution

When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

Allegedly trashing Airbnbs to test robots puts startup in legal trouble

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Bridging the Last Mile of Time Series Forecasting with LLM Agents

Monitoring Agentic Systems Before They're Reliable

Not What, But How: A Communicative Audit of LLM Response Framing

Expressivity of congruence-based architectures for DNNs on positive-definite matrices

Our views on AI policy and political advocacy

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Iteris: Agentic Research Loops for Computational Mathematics

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

Physics-Informed Residuals for Adaptive Mesh Refinement in Finite-Difference PDE Solvers

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation