The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


GitHub's plan for Agents — Kyle Daigle, GitHub

GitHub pioneered the modern AI coding era with Copilot, and the resulting explosion in agentic coding has led to notable strains on the most popular developer platform in the world. Here's the plan.

Latent Space·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First...

Wei Ding·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Privacy-Robust Incrementality Measurement for Advertising Systems under Signal Loss

Advertising platforms use randomized lift tests to measure incrementality, but privacy-preserving reporting systems degrade the observed signal through match-rate loss, linkability loss, attribution-window loss, aggregation-threshold suppression, randomized reporting noise, and segment-heterogeneous signal loss. This paper formulates privacy-constrained advertising measurement as a robust causal decision problem under the mentioned signal losses. Given a randomized experiment and an ambiguity set for privacy-induced degradation, the framework projects the observation-compatible fiber of clean...

Prashant Shekhar·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accou...

Jiachen Li·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Visual Instruction Tuning Aligns Modalities through Abstraction

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate lay...

Luis Palacios·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent ta...

Cuong Vuong Tuan·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a n...

Yuecheng Li·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparati...

Thomas Maillart·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation...

Zetian Ouyang·15 days ago

The Verge AI· PRESS

Microsoft created the mini Surface dev box that Qualcomm couldn’t

Microsoft only just announced a new Surface Laptop Ultra at the weekend, and it's now revealing a miniature Surface PC aimed at developers. The new Surface RTX Spark Dev Box is powered by Nvidia's new Arm-based RTX Spark chips, just like the Surface Laptop Ultra, and is optimized for sustained workloads and local AI tasks. The Surface RTX Spark Dev Box looks a little like the top of an Xbox Series X console, with an aluminum chassis that also doubles as a heatsink. It has a 100 watt thermal envelope, slightly more than the 45 watt to 80 watt thermal envelopes for Nvidia's RTX Spark laptops. T...

Tom Warren·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level, which is not sufficient to inform the model where to fix the bug. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and s...

Yinsheng Yao·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Two-Action Apple Tasting with Switching Costs

We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward $0$ and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switch actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde O(T^{2/3})$ regret gu...

Tommaso Cesari·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet eff...

Qi Cao·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Re-Evaluating Continual Learning with Few-Shot Adaptation

Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive asses...

Amogh Inamdar·15 days ago

TechCrunch AI· PRESS

Trump signs narrower executive order on AI oversight after industry objections

After industry objections, President Trump signed a revised AI executive order requiring only voluntary prerelease government reviews of advanced models.

Rebecca Bellan·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement lear...

Zherui Yang·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Text-attributed Graph Condensation via Text Selection and Attribute Matching

Text-Attributed Graph (TAG) is an important type of graph structured data, where each node has a text description. TAG models usually train a Graph Neural Network (GNN) and language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, i.e., subgraph text Selection and Attribute similarity Matching, which compress the text description and graph topology of TAG, respectively. For the texts, subgraph text se...

Haowei Han·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Online Learning with Gradient-Variation Interval Regret

This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves an interval regret bound scaling with gradient variation, a fundamental measure of the cumulative change in online function gradients, which relates to various problem-dependent quantities and is closely connected to stochastic optimization and other problems. Our method employs a simple and efficient two-layer online ensemble structure that achieves strong theoretic...

Yan-Feng Xie·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes t...

Alex Wang·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distr...

Shaokun Lan·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Dynamic Short Convolutions Improve Transformers

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations impro...

Oliver Sieberling·15 days ago

TechCrunch AI· PRESS

OpenAI launches new Codex tools for white-collar work

OpenAI is getting serious about courting enterprise users. On Tuesday, the AI lab released a new set of capabilities for Codex, meant to expand the agentic tool’s uses in the workplace. Together with the new tools, the company released an internal report on how Codex is being used for knowledge work, finding its uses go […]

Russell Brandom·15 days ago

NVIDIA Dev Blog· INFRA

Deploy Self-Evolving Agents for Faster, More Secure Research with a Hermes Agent and NVIDIA NemoClaw

AI agents are a powerful tool for synthesizing data to accelerate research, summarize information, and help teams make decisions faster. But combining internal data with public sources poses security challenges. This post shares an open source example using Hermes Agent with NVIDIA NemoClaw for product research across Outlook, Slack, and GitHub. NVIDIA OpenShell enforces a security-approved…

Sam Pastoriza·15 days ago

The Verge AI· PRESS

Microsoft Build 2026: All the news about Windows, AI, RTX Spark and more

Microsoft’s annual developer conference is kicking off on June 2nd in San Francisco with the keynote presentation streaming live at 12:30PM ET / 9:30AM PT, and we will be following along here with everything as it’s announced. The Verge’s Tom Warren reports that we can expect to hear about new AI models and agentic OpenClaw-like tools, plus a Copilot “super app” to go along with some of the major changes to Windows 11 that have already started appearing. Microsoft just announced the new Surface Laptop Ultra, powered by Nvidia’s RTX Spark, so there could be more Windows on ARM news in store. F...

Stevie Bonifield·15 days ago

TechCrunch AI· PRESS

Anthropic scales Claude Mythos to critical infrastructure in 15+ countries

Anthropic is expanding Project Glasswing, its security vulnerability program, and access to Mythos to 150 organizations across 15 countries — targeting critical infrastructure in power, water, healthcare, and communications where a cyberattack could affect 100 million people.

Rebecca Bellan·15 days ago

Hugging Face· INFRA

Holo3.1: Fast & Local Computer Use Agents

Hugging Face·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasonin...

Hongyu Guo·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, ...

Jinnuo Liu·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking, and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logic, where modalities can express the views of different viewpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). In this paper, we propose a means of lifting the cla...

Nicholas Leisegang·15 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-source, reusable framework that closes this gap end to end: from only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels, contamination-free because items are generated anew on each run, and a cross-family judge ensemble ranks cand...

Alexander Apartsin·15 days ago

