The famous METR AI time horizons graph contains numerous severe errors [D]
Nathan Witkin critiques METR's Long Tasks benchmark methodology, identifying severe flaws in the widely-cited AI time horizons graph.
Every story tagged with this topic, ordered by date.
DiscoverPhysics benchmark tests LLM reasoning by having agents discover physics laws in simulated worlds with non-standard dynamics.
Claw-Anything benchmark evaluates LLM agents as always-on assistants with access to long-horizon histories and interdependent backend services.
Auto Benchmark Audit uses an agentic framework to systematically audit AI benchmarks for hidden dependencies, specification gaps, and grading flaws.
StakeBench evaluates financial language understanding using real market commitments from 560K+ Polymarket and Manifold comments instead of human labels.
WSADBench unifies weakly supervised anomaly detection evaluation across incomplete, inexact, and inaccurate supervision for 36 algorithms and 4 modalities.
CityRep benchmark evaluates urban representation learning across cities and modalities using spatially-structured splits to prevent data leakage.
CausaLab environment benchmarks LLM agents on interactive causal discovery with validation of both solutions and underlying causal mechanisms.
NeoBERT evaluated on dementia detection from Filipino-English code-switched speech, the first systematic study in this low-resource clinical NLP setting.
Framework for systematizing GenAI evaluation concepts (reasoning, fairness, creativity) into measurable definitions using AI assistance.
Deployment-complete benchmarking: framework ensuring benchmark evidence resolves deployment decisions via conformal coverage.
68-cell empirical study: LLM agents are 19.69 pp more sensitive to semantic noise than to surface noise across reasoning tasks.
QUIET benchmark for evaluating LLM creative generation (not discriminative ability) via multi-blank cascaded story cloze with objective scoring.
Step-TP: step-level dataset with CoT reasoning for LLM-guided tensor program optimization, enabling composable transformation decisions.
MiniCPM5-1B released on Hugging Face: a 1B-parameter model from the CPM team, likely a competitive efficiency reference point for edge deployment.
Hugging Face revives PapersWithCode.co with new SOTA tracking features for agents, vision, and time-series; week 1 update.
Community-built open-source TTS benchmark suite with Windows/Mac results; Linux results pending, covers known local TTS tools as of May 2026.
Benchmark of vision LLMs vs. OCR pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc shows LlamaCloud + Azure premium achieving 59.6%–58.5% accuracy; agentic RAG and native PDF vision approaches compared on cost and accuracy.
Reddit user reports informal reasoning test comparison between Claude Sonnet 4.6 and Opus 4.7; incomplete results, anecdotal evidence.
Shannon Scaling Law models LLM training as noisy-channel information transmission, explaining non-monotonic phenomena like catastrophic overtraining.
SpaceNum benchmark tests whether VLMs genuinely ground numerical outputs in spatial perception via dynamic and static reasoning tasks.
PGT generates procedural geometric tasks to improve MLLM fine-grained visual grounding and diagnose perception failure sources.
Claude Code achieves 98.8% specification validity and 87.5% implementation certification on CLEVER program verification benchmark via agentic proving.
Historical review of how NLG evaluation evolved from 1990 to 2026, highlighting LLM-as-Judge methods and emerging safety evaluation needs.
Metadata Prior Dominance Score and evidence-intervention auditing protocol detect weak-label benchmark shortcut learning.
ChartFI benchmark evaluates faithfulness and insightfulness of MLLM-generated chart descriptions beyond fact enumeration.
OnePred benchmark and method for next-query prediction in multi-turn LLM conversations using recursive intent memory to avoid linear token growth.
OpenSkillEval automated framework audits open-source skill-augmented LLM agent systems for quality, compatibility, and cost-performance trade-offs.
PCSP shared RL policy for persona-consistent NPC control in life sims achieves 22x faster inference than LLM baselines on 300-persona benchmark.
Corpus-linguistic evaluation framework measures linguistic humanlikeness of LLM outputs via register-aware patterns, addressing an underexplored dimension of text quality.
Social choice analysis reveals multi-task benchmark vulnerability to gaming; benchmark-specific training treated as election manipulation.
Google Embeddings 2 outperforms five open-source dense retrieval models on BEIR and RAG benchmarks but faces a latency tradeoff.
MetaEvaluator: meta-learning framework for label-free, cost-effective evaluation of unseen models across architectures.
PushBench evaluates quantitative goal persistence in long-horizon LLM agents via work-unit completion.
SupraLabs released Supra-50M, a 50M-parameter Llama-style language model trained on 20B educational tokens with competitive benchmark performance.
MARS improves model evaluation by weighting ranks with performance margins instead of discarding magnitude differences in Critical Difference diagrams.
GPT-5.2 performs comparably to top human peer reviewers in 82-paper Nature study, though with identified limitations.
ChronoMedKG adds temporal reasoning to biomedical knowledge graphs for age-dependent clinical diagnosis; 460K evidence-linked triples.
CUSP benchmark evaluates AI's ability to forecast scientific progress across 4,760 events via feasibility, mechanistic reasoning, and temporal prediction.
Inverse scaling detected: more capable LLMs forecast worse on superlinear/regime-change time series; ForecastBench-Sim benchmark released.
WorkstreamBench evaluates LLM agents on end-to-end spreadsheet construction in finance workflows, filling a gap in agent evaluation.
Interactive guide matching 50+ LLMs to 60+ hardware builds with throughput metrics and YouTube reviewer citations.
Gemini 3.5 Flash achieves the top score on the APEX-Agents-AA benchmark, exceeding the performance of larger models on agent tasks.
Comparative evaluation of coding agents (GitHub Copilot, Pi, Claude Code, OpenCode) using Qwen 3.6 27B isolates model vs. harness performance.
Google Gemini 3.5 Flash tops Zapier Automation Bench, outperforming competitors at lower cost.
OpenAI model disproves Erdős geometry conjecture; researcher claims breakthrough significance will compound through 2025.
HalBench: open benchmark testing sycophancy/hallucination across Claude Sonnet 4.6, Grok 4.3, GPT-5.4, Gemini 3.1 Pro on 3,200 false-premise prompts.
OpenAI's reasoning model discovered a counterexample to Erdős's unit-distance conjecture in discrete geometry, disproving a decades-old bound.
OpenAI's general-purpose model autonomously solved the planar unit distance problem, a famous 80-year-old Erdős conjecture, discovering new constructions outperforming grid-based solutions.
DeepWeb-Bench introduces harder evaluation for frontier LLMs on deep research requiring massive cross-source evidence and reasoning.