The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Claude OpenAI Anthropic Gemini Mistral Cursor

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstr...

Jixuan He·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi-agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content-based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance-sensitive environments gov...

Hassan Touheed·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision--language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These pr...

Wanlong Fang·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee

This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, they often rely on computationally intractable approximations, and their regret guarantees remain underdeveloped. Thus, this paper proposes a one-step lookahead BO method called optimal-point variance reduction (OVR), which requires only posterior sampling and Monte Carlo approximations. We obtain a uniform error bound over an input domain for the Monte Carlo estimat...

Shion Takeno·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CryoProt: A Protein Pretraining Framework with Cross-Box Interactions on Cryo-EM Density Maps

Despite the growing availability of cryo-electron microscopy (cryo-EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general-purpose protein pretraining framework tailored for cryo-EM density maps, designed for protein-related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes which are essential for capturing global structural context in cryo-EM density map. To address these challenges, we propose C...

Dan Luo·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level so...

Xu Yang·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space

Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in task-irrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models, and rely on pre-defined rules or expert demonstration, which can be ineffective under sparse, online-collected human feedback. To overcome this, we propose COLLIE, a GSD framework that leverages dense unsuperv...

Yao Luan·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the...

Federica Tonti·19 days ago

Simon Willison· ANALYST

Quoting Karen Kwok for Reuters Breakingviews

Anthropic defines “run-rate revenue” in two parts. Use the last 28 days of sales ⁠from customers charged on a consumption basis and multiply it by 13. Then, multiply the monthly subscription take by 12, and add the two together. — Karen Kwok for Reuters Breakingviews , citing "a person familiar with the matter" Tags: anthropic , ai

Simon Willison·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Silent Failures in Federated Personalization of Foundation Models

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term "Silent Failures." These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning's privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural ...

YongKyung Oh·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Lodestar: An Online-Learning LLM Inference Router

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inferen...

Gangmuk Lim·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA's low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix $Z = AB^\top$, but this factorization is inherently non-identifiable: many factor pairs $(A,B)$ represent the same update $Z$. As a result, applying DP-SGD directly to the factors induces gauge-dependent perturbations on $Z$, and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose PRISM, an...

Shihao Wang·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Machine Learning Surrogate Modeling for Homogenization of Hyperelastic Materials with Boolean Microstructures

Data-driven surrogate models are an alternative to numerical homogenization of heterogeneous materials. In this contribution, a supervised learning approach is presented for predicting effective Lamé parameters of hyperelastic composites from low-dimensional microstructural descriptors. The data set is based on previously published numerical homogenization results for ensembles of two-phase stochastic microstructures generated by planar Boolean models, covering variations of inclusion shape, phase contrast, and area fraction; see Brändel, Brands, Maike, Rheinbach, Schröder, Schwarz and Stoyan...

Matthias Brändel·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Cellular Sheaf Neural Operators for Structure-Preserving Surrogate Modeling of Constrained PDEs

Neural operators provide fast surrogate models for PDE simulations, but standard architectures often treat geometry and discretization as secondary to field data. Physical states are usually represented as grid-channel stacks, even when different quantities naturally belong on vertices, edges, faces, cells, boundaries, or interfaces and must satisfy compatibility constraints. We propose Cellular Sheaf Neural Operators, a discretization-aware framework for structure-preserving neural PDE surrogates. The method represents PDE states on oriented cell complexes, couples local feature spaces throu...

Lennon J. Shikhman·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a ...

Franco Santana·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Efficient Synthetic Network Generation via Latent Embedding Reconstruction

Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synt...

Feifan Jiang·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, a...

Fangzhou Lin·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialis...

Yuhang Jiang·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation...

Sakib Mohammad·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Task Structure Reverses Layerwise State Encoding in Sequence Models

Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are co...

Yuhang Jiang·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-langu...

Ismail Hossain·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeat...

Yongxi Zhou·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train...

S M Tahmid Siddiqui·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on...

Rana Muhammad Usman·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Bandit Simulation for Average Reward Inference

Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical infere...

Samya Praharaj·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the \emph{Reason-Aware} CoVR (CoVR-R) challenge at the CVPR~2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present \textbf{R3-CoVR} (\emph{Reason, Retrieve, Re-rank}), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the \emph{after-effects} an edit implies -- state transitions, action phases, scene, camera and tempo -- and verbalises a concise po...

Ali Alavi·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behaviors across transformer layers. However, LLaVA-NeXT's image tokens reveal a slight decline in linearity, whereas O...

Ravil Mussabayev·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring pa...

Yeqi Huang·20 days ago

TechCrunch AI· PRESS

SoftBank says it will invest up to €75 billion to build French data centers

The goal, the firm said, is to develop and operate up to 5 gigawatts of additional data center capacity.

Anthony Ha·20 days ago

Simon Willison· ANALYST

How we contain Claude across products

How we contain Claude across products A complaint I often have about sandboxing products is that they are rarely thoroughly documented , and in the absence of detailed documentation it's hard to know how much I can trust them. Anthropic just published a fantastic overview of how their various sandbox techniques work across Claude.ai , Claude Code, and Cowork. We constrain where and how an agent can act with process sandboxes, VMs, filesystem boundaries, and egress controls. The goal is to set a hard boundary on what an agent can reach. For example, if credentials never enter the sandbox, they...

Simon Willison·20 days ago

← Front Page30 stories

← Newer Older →