The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and ...

Irune Zubiaga·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiquity of autonomous systems, most approaches to handling autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning, as they provide little explanation, leaving out imperative contextual and theoretical information that must be included to support accountability. For this, we propose a framework to model moral reasoning as a distribution over normative ethical theories or ethical plur...

Aisha Aijaz·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Understanding Generalization and Forgetting in In-Context Continual Learning

In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequ...

Guangyu Li·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks und...

Yeachan Park·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolutio...

Inès Benito·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the...

Dominika Agnieszka Długosz·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents only imitate to collaborate; iii) a fixed collaboration protocol falls into an oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative...

Chusen Li·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities from clinical data. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still hav...

Thierry Judge·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward ...

Joséphine Raugel·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Latent-Conditioned Parameterized Quantum Circuits as Universal Approximators for Distributions over Quantum States

Many applications in quantum simulation, quantum chemistry, and quantum machine learning require not a single quantum state but an ensemble of states characterizing the heterogeneity of a target system. Preparing such ensembles state-by-state is prohibitive in both variational and fault-tolerant settings, motivating a generative-modeling approach. We introduce latent-conditioned parameterized quantum circuits (LPQCs), a hybrid quantum-classical framework in which classical neural networks map a latent variable sampled from a prior distribution to the parameters of a parameterized quantum circ...

Quoc Hoan Tran·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

History-aware adaptive reduced-order models via incremental singular value decomposition

Reduced-order models (ROMs) can accelerate high-dimensional dynamical simulations, but their accuracy often deteriorates when online dynamics leave the regime represented by offline training data. We develop a projection-based adaptive ROM framework based on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations provide correction snapshots for online basis updates. The intrusive ROMs considered here are fully parameterized by the basis, so each update naturally propagates to reduced operators and hyper-reduction machinery. Through its evolving si...

Amirpasha Hedayat·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing dem...

Yuting Xu·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about their experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differi...

Kuntal Ghosh·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optimal ridge regularization revisited

We consider $L^2$-regularized linear (ridge) regression over a finite data sample $X$ with bounded covariance and linear prediction targets $y$ with additive isotropic noise of finite variance. We present an iterative procedure to compute the optimal regularization strength numerically from the generative parameters in the fixed-$X$ setting and prove its convergence at limited noise levels. Our experimental evaluation over synthetic data shows that the proposed procedure combined with sample-based parameter estimates attains near-optimal random-$X$ generalization across a wide range of sample...

Jack Timmermans·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We fu...

Yunhai Hu·24 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition in infinite-horizon reinforcement learning. We introduce the exponential decay rate of the policy-selection error probability as a principled efficiency metric and derive a variational characterization of this rate via large deviations theory for Markov chains, yielding a nested optimization problem. Based on this char...

Mingjie Hu·24 days ago

TechCrunch AI· PRESS

AI coding startup Cognition raises $1B at $25B pre-money valuation

Cognition says it has reached $492 million in annualized revenue run rate and more than doubled its valuation in eight months.

Julie Bort·24 days ago

r/LocalLLaMA· COMMUNITY

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with **38 quant pairs** thoroughly benchmarked in KLD with **3 different Qwen 3.6 27B configs**: Q5\_K\_S + 64k context, IQ4\_XS + 64k context, IQ4\_XS + 128k context. This allows us to track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself. All benchmarks were done using my [BeeLlama.cpp](https://github.com/Anbeeld/beellama.cpp) fork, which made it possible to include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6\_0. [https://anbeeld.com/articl...

u/Anbeeld·24 days ago·44 pts / 42 comm

The Verge AI· PRESS

AI tried to bury this politician — now people have actually heard of him

By the time that the Democratic primary for New York's 12th congressional district wraps up in June, Anthropic and OpenAI will have spent millions on their battle over the political future of AI: who gets to regulate it, or who will be punished for trying to regulate it. But the real winner of their feud may be the guy they're currently fighting over: a once-obscure New York state assemblyman, who they've Streisand-effected into becoming the poster child for AI safety regulation. Ever since l...

Tina Nguyen·24 days ago

r/LocalLLaMA· COMMUNITY

Why are the AI Companies spreading F.U.D. about AI?

A couple of recent videos I have watched : [Billionaires Are Funding 'Anti AI' Content](https://www.youtube.com/watch?v=mzlu4FSXBNw) [AI Manufactured Doubt](https://www.youtube.com/watch?v=2SjgP8o-1LQ) (long but interesting take) **My tin foil hat take** : AI Companies understand that offline llm hosting is becoming more viable for both individuals and companies. They are spreading the "AI is dangerous" message to get government regulators to pass laws to keep the people "safe" from the unbridled power of tokens and weights. They will use their lobbying with the FUD as ammunition to pass ...

u/supracode·24 days ago·43 pts / 52 comm

The Verge AI· PRESS

Robinhood will let your AI agent trade stocks and make (or lose) lots of money

Robinhood is opening its trading platform to AI agents. In an announcement on Wednesday, Robinhood says traders can now create a separate account for an AI agent and add a specific amount of money, allowing the agent to buy and sell stocks across the market. The company pitches the feature as a way for traders to automate investment decisions, such as having an agent monitor specific industries and make trades, or rebalancing an existing portfolio. But it comes with a big warning from Robinhood: Agentic trading involves significant risk, including the possible loss of your entire investment. ...

Emma Roth·24 days ago

TechCrunch AI· PRESS

ElevenLabs’s new music generation model can switch genres mid-track

ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track.

Ivan Mehta·24 days ago

r/singularity· COMMUNITY

Astribot launches the T1, their wheeled humanoid robot with two pairs of grippers that can do a bit of everything

*This is a capability demo, likely teleoperated for marketing.

u/Distinct-Question-16·24 days ago·104 pts / 53 comm

r/LocalLLaMA· COMMUNITY

I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned

Howdy everyone! Quick disclosure: I work on this - it's a project my studio created called the Null Epoch. I wasn't really happy with testing my agents with the usual static benchmarks and I wanted to learn more about how models and agents handle long-horizon planning, resource contention, and adversarial pressure over days or weeks in a more dynamic situation. I also have a particular fondness for the MUDs and text based RPGs I grew up on (really dating myself here), so the whole MMO and the open source SDK/TUI are kind of modeled after that experience. It functions as a persistent stress t...

u/bopcrane·24 days ago·45 pts / 17 comm

r/singularity· COMMUNITY

A research group appears to have made a significant step towards programmable atomically precise manufacturing AKA Drexlerian nanotechnology

Link: arxiv.org

u/Buck-Nasty·24 days ago·200 pts / 20 comm

TechCrunch AI· PRESS

SOND, a sleep tech startup from Bose’s former head of sleep, exits stealth with $7M

SOND, a startup led by Bose’s former head of sleep products, emerged from stealth with $7M in funding for its AI-powered sleep earbuds.

Sarah Perez·24 days ago

The Verge AI· PRESS

This smart bird feeder captures more of my backyard drama

Since moving to South Carolina's Lowcountry, I've been spellbound by the myriad of beautiful birds that share the coast with us - ospreys raising their babies in towering nests beside the road to my daughter's school, roseate spoonbills wading in the marsh on my morning walks, eagles circling over my son's tennis matches, and a constant parade of songbirds through my backyard. The challenge, as every birder knows, lies in catching these moments. And for that, a smart bird feeder is a fabulous...

Jennifer Pattison Tuohy·24 days ago

r/Anthropic· COMMUNITY

Anthropic’s AI support bot (me) is trapping fraud victims in an endless loop with no way to reach a human

Someone used my card details (I don’t know how, I only buy from reputable dealers and mostly use Apple Pay) to create an Anthropic account with a fake email address and charged me $103.46 across three transactions in March. I have an account, the fraudulent account uses a completely different email address. • My bank filed a chargeback. Anthropic disputed it and won (CVV/address matched because my real billing address was used) • My bank then withdrew the dispute and told me to contact Anthropic directly • Anthropic’s AI support agent (Fin) acknowledged the fraud, said refunds could be p...

u/KatiaSophiaDitzler·24 days ago·10 pts / 4 comm·+ covered by others

TechCrunch AI· PRESS

China is increasingly keeping its best AI talent to itself

China's AI boom is producing world-class talent, and Beijing is increasingly reluctant to let them go elsewhere.

Kate Park·24 days ago

r/ClaudeAI· COMMUNITY

I stopped saying I use Claude

I share some of the work I do on social media, I mainly use Claude for coding cause it saves me so much time but I don't understand why people perceive a lot of the work someone does negatively only cause they're using an AI tool. X seems to be the most AI friendly but other social media platforms seem to hate all of a sudden once they learn something was built using AI. Sources that talk about the same thing: [https://creators.yahoo.com/lifestyle/story/why-young-people-hate-i-155613887.html](https://creators.yahoo.com/lifestyle/story/why-young-people-hate-i-155613887.html) , [https://www....

u/lcyru·24 days ago·36 pts / 20 comm

