The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.

Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization

People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and evaluate it both computationally and through a study with 16 participants who regularly use AI for creative tasks. Our computational assessment reveals that SRT increases semantic diversity by 85–167% while reducing consensus phrases by 43–95% across task modes. In the user study, SRT outputs rece...

Muhammad Haris Khan·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiat...

Yutong Bian·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

On Choosing the $μ$ Parameter in Gaussian Differential Privacy

Recent work argues for using Gaussian differential privacy (GDP) to report the privacy guarantees in privacy-preserving machine learning. We provide principled mappings from pure-DP $\varepsilon$ to GDP $μ$ by matching the worst-case success of a strong-adversary membership inference attack in terms of three metrics: multiplicative advantage at fixed FPR, precision at fixed recall, and the standard privacy profile. We tabulate $μ$ values across a useful range of parameters and recommend $μ\approx \varepsilon/5$ as a conservative general-purpose conversion.

Bogdan Kulynych·2 months ago
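
The rule of thumb quoted in the abstract above ($μ\approx \varepsilon/5$) can be tried out with a few lines of Python. This is only a sketch: the function names are hypothetical, and the `gdp_delta` helper uses the standard closed-form privacy profile of $μ$-GDP, $\delta(\varepsilon) = \Phi(-\varepsilon/μ + μ/2) - e^{\varepsilon}\,\Phi(-\varepsilon/μ - μ/2)$, which is assumed here to be the profile the abstract refers to. It does not reproduce the paper's tabulated values or its attack-based matching.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mu_from_epsilon(epsilon: float, divisor: float = 5.0) -> float:
    """Rule-of-thumb conversion from the abstract: mu is roughly epsilon / 5.

    The function name and the `divisor` parameter are illustrative assumptions.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return epsilon / divisor


def gdp_delta(mu: float, epsilon: float) -> float:
    """Privacy profile of mu-GDP: the delta for which (epsilon, delta)-DP holds.

    Standard closed form (an assumption about what the abstract calls the
    'standard privacy profile'):
        delta = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    return (normal_cdf(-epsilon / mu + mu / 2.0)
            - math.exp(epsilon) * normal_cdf(-epsilon / mu - mu / 2.0))


if __name__ == "__main__":
    for eps in (1.0, 2.0, 5.0, 10.0):
        mu = mu_from_epsilon(eps)
        print(f"epsilon={eps:>4}: mu={mu:.2f}, delta at epsilon=1 is {gdp_delta(mu, 1.0):.4g}")
```

The divisor is kept as a parameter so that a less conservative conversion can be compared against the recommended one.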

arXiv (cs.AI/CL/LG)· ACADEMIA

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and...

Momina Ahsan·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Code Is More Than Text: Uncertainty Estimation for Code Generation

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree indepen...

Yuling Shi·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which hig...

Jiacheng Li·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

UXBench: Benchmarking User Experience in AI Assistants

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse fail...

Mengze Hong·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves, an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of...

Tom Beyer·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PRISM: Recovering Instruction Sets from Language Model Activations

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set...

Gilad Gressel·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Safe-RULE: Safe Reinforcement UnLEarning

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the origina...

Shixiong Jiang·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis

Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledg...

Mikele Milia·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial ...

Yinan Wang·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-r...

David Guzmán·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact sp...

Yuhan Ma·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SecureClaw: Clawing Back Control of LLM Agents

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handle...

Yuhan Ma·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Model Poisoning Against Federated Model Adaptation with Chain of Bit-Flips

Federated Learning (FL) allows a set of clients to collectively train a global model without sharing local training data. Giving responsibility for training to decentralized actors may lead to poisoning attacks: clients controlled by a malicious third party can poison the training dataset to install a backdoor in neural networks. In FL, these backdoor attacks rely solely on algorithmic approaches; however, recent advances in hardware fault threats (e.g., Rowhammer) have widened the overall attack surface. In the context of federated model adaptation, we introduce a novel category of ...

Bastien Vuillod·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios...

Apratim Bhattacharyya·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.

Dmitry Pronin·2 months ago
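
The multiple-comparison step mentioned in the abstract above can be illustrated with a short sketch. The abstract does not say which correction the authors use, so the Benjamini-Hochberg false-discovery-rate procedure below is an assumption chosen for illustration, and the token names and p-values are made up. In the described pipeline, each p-value would come from a per-token logistic regression against authorship, which is not reproduced here.

```python
from typing import Dict, List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is rejected under BH at level alpha.

    Benjamini-Hochberg is an assumed choice; the source only says
    'multiple-comparison correction'.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def significant_tokens(token_p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Keep the tokens ('genes') whose association with authorship survives correction."""
    tokens = list(token_p_values)
    flags = benjamini_hochberg([token_p_values[t] for t in tokens], alpha)
    return [t for t, keep in zip(tokens, flags) if keep]


if __name__ == "__main__":
    # Hypothetical per-token p-values, for illustration only.
    example = {"whilst": 0.001, "upon": 0.04, "indeed": 0.03, "the": 0.5}
    print(significant_tokens(example))
```

With many thousands of candidate tokens, a correction of this kind keeps the list of reported markers from being dominated by chance associations.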

arXiv (cs.AI/CL/LG)· ACADEMIA

Automating the Expert Eye: A System-Agnostic Deep Learning Framework for Rare Event Discovery in Imbalanced Force Spectroscopy

Single-Molecule Force Spectroscopy (SMFS) provides unprecedented insights into biomolecular mechanics, yet the high-throughput generation of force-extension trajectories creates a severe data curation bottleneck. Identifying rare molecular unbinding events within thousands of noise-dominated curves traditionally relies on tedious, non-scalable manual auditing. Here, we present a system-agnostic, interpretable deep learning framework tailored to overcome extreme class imbalance in automated SMFS triage. Utilizing 1D-to-2D rasterized geometric matrices, we deployed a modified ResNet18 architect...

Jorge Rodriguez-Ramos·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth

Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic dataset...

Soban Nasir Lone·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition expe...

Chowdam Venkata Kumar·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from Ma...

Muhammad Hamza Arshad Majeed·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Emergence of Context Characteristics Sensitivity in Large Language Models

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experim...

Nadya Yuki Wangsajaya·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration

Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer this question with scalar feedback loops (generate, score, reject) that amount to informed trial-and-error. Here we show that replacing a single number with the full physicochemical rationale from first-principles calculations transforms the LLM from a stochastic sampler into a causal reasoner. Our system couples retrieval-augmented generation with a self-reflection module that feeds orbital energies, atomic charges, and electron densities, rather than compres...

Junyi Gong·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current proper scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibra...

Jan Niklas Lettner·2 months ago

OpenAI· FRONTIER

Confidential submission of draft S-1 to the SEC

OpenAI recently submitted a confidential S-1 to the SEC and has not decided on timing.

OpenAI·2 months ago

The Verge AI· PRESS

Microsoft’s AI chief says superintelligence is near, but won’t take your job

Today I’m talking with Mustafa Suleyman, the CEO of Microsoft AI. And I’m actually going to keep today’s intro short — I’m working from my wife’s family farm this week, as you’ll see in the video, but also this is a real burner of an episode. We covered everything from Mustafa’s approach to training new models to his criticisms of Anthropic talking about Claude as though it is conscious. Of course, we also talked about Microsoft’s relationship with OpenAI, how Mustafa is thinking about all the negative polling and political pushback around AI right now, and whether any of the consumer product...

Nilay Patel·2 months ago

Ars Technica AI· PRESS

"Chat is dead": OpenAI preps overhaul of ChatGPT

OpenAI to recast hit chatbot as a route to higher-margin products before a potential IPO.

Cristina Criddle, Financial Times·2 months ago

Hugging Face· INFRA

The crash that vanished: control and emergence in a five-model economy

Hugging Face·2 months ago

Google DeepMind· FRONTIER

Measuring the impact of learning with AI in Sierra Leone and beyond

Results from a randomized controlled trial show the potential of Gemini’s Guided Learning feature to boost engagement and accelerate learning.

Google DeepMind·2 months ago
