The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


CLaC@FinMMEval 2026 Task 3: Sentiment-Augmented Deep Reinforcement Learning for Active Trading -- An Alpha-Reward Approach

DRL trading system for Bitcoin/Tesla using policy gradient and Q-learning with LLaMA 3.2 sentiment analysis and technical indicators.

Andrei Neagu·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Production and Perception in LLMs: A Token Probability Approach

Study compares production vs. perception asymmetry in Llama-3.1-8B via token probability analysis, finding LLMs lack functional production-perception distinction.

Anna Marklová·13 days ago

TechCrunch AI· PRESS

Popular open source AI developer tool Ollama raises $65M, grows to nearly 9M users

Benchmark-backed Ollama has amassed 176,000 stars and nearly 17,000 forks on GitHub by helping developers easily run AI on their PCs.

Julie Bort·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PALS: Percentile-Aware Layerwise Sparsity for LLM Pruning

PALS adjusts per-layer sparsity in LLM pruning via activation percentiles, improving LLaMA-2-7B perplexity by 15% at 50% sparsity over uniform Wanda.

Yazdan Jamshidi·18 days ago

The Verge AI· PRESS

Meta’s new Muse Image model can pull other Instagram users into AI photos

Meta is launching the first AI image generation model made by its Superintelligence Labs division. The Muse Image model now powers the image-making tools across the Meta AI app, Instagram, and WhatsApp, and it's coming soon to Facebook and Messenger, according to an announcement on Tuesday. It's part of the growing Muse family of AI models that replace Meta's Llama lineup. Alexandr Wang, who Meta hired to head up its Superintelligence Labs last year, says on Threads that Muse Image is "agentic," meaning it works with its Muse Spark large language model "to reason through your prompt, search t...

Emma Roth·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design

Systematic study of reward function design for RL-based BPMN process model generation using Llama 3.1 and Qwen 2.5 across 48 configurations.

Alexander Rombach·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP

Production-scale clinical NLP study of inference-time gating with Llama-3.3 70B generator and MMed-Llama-3.1 70B verifier over 167K narratives shows pattern-memory filtering limitations.

Ali H. Lazem·25 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring

Three-method study across Qwen2.5-Coder-32B, Llama-3.1-8B, and Gemma-3-27B shows internal probes read situation not pre-action intent, limiting misalignment monitoring efficacy.

Max Fomin·27 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs

Extractive-abstractive hybrid summarization for legal case judgements using tree-of-thoughts with DeepSeek and Llama.

Aniket Deroy·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a sec...

Abdul Rafay Syed·1 month ago
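
For readers unfamiliar with the technique named in this abstract: a difference-in-means direction is the mean activation of one group minus the mean activation of the other, and steering subtracts a multiple of that direction from an activation. The sketch below illustrates the idea in plain Python. The 2-D activations are toy values invented for illustration, and the function names and the `scale` parameter are hypothetical; none of this is taken from the paper's code.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(group_a, group_b):
    """Direction pointing from the mean of group_b to the mean of group_a."""
    mu_a = mean_vector(group_a)
    mu_b = mean_vector(group_b)
    return [a - b for a, b in zip(mu_a, mu_b)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def steer(activation, direction, scale=1.0):
    """Subtract scale * direction from an activation vector."""
    return [a - scale * d for a, d in zip(activation, direction)]


# Toy activations (made up for this example, not real model data).
aligned = [[0.0, 1.0], [0.5, 0.5]]
misaligned = [[2.0, 1.0], [2.5, 0.5]]

direction = difference_in_means(misaligned, aligned)

# Separate the groups by projecting onto the direction and thresholding
# at the midpoint between the two projected group means.
threshold = (dot(mean_vector(aligned), direction)
             + dot(mean_vector(misaligned), direction)) / 2


def looks_misaligned(activation):
    """True if the activation projects past the midpoint threshold."""
    return dot(activation, direction) > threshold


steered = steer(misaligned[0], direction)
```

In this toy setup the steered vector falls back on the aligned side of the threshold. Real activations are high-dimensional and noisy, so how well a single direction separates or steers them is an empirical question.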

arXiv (cs.AI/CL/LG)· ACADEMIA

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adapt...

Miso Choi·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when ...

Qingkai Fang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision trainin...

Joe Dwyer·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language o...

Hakan Mehmetcik·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_1...

Jaber Jaber·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control pr...

Senmiao Wang·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and ...

Mehmet Utku Colak·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and ...

Kexin Chu·2 months ago

r/LocalLLaMA· COMMUNITY

Qwen3.6 huge quality gain from Q4 to Q6 for coding agent

So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and DeepSeek was too cheap. First, I stopped using Ollama, and now I only use llama.cpp's built-in server, which works really well. The quality improvement from Q4 to Q6 is outstanding, and finally a local LLM server can work very similarly to paid APIs. That's great! And MTP gives a big performance gain: on a dual 3090 (downvolted and limited to 65°C) it generates 20 to 50 tokens per second with minimal heat generation. So yes, that time has finally arrived! Local coding age...

u/Yes-Scale-9723·2 months ago·42 pts / 32 comm

r/LocalLLaMA· COMMUNITY

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself. All benchmarks were done using my [BeeLlama.cpp](https://github.com/Anbeeld/beellama.cpp) fork, which makes it possible to include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0. [https://anbeeld.com/articl...

u/Anbeeld·2 months ago·44 pts / 42 comm

r/LocalLLaMA· COMMUNITY

Info: Nvidia Cuda 13.3 landed

[Cuda 13.3 Downloads](https://developer.nvidia.com/cuda-downloads) [Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) Anybody already tried llama.cpp with 13.3?

u/parrot42·2 months ago·40 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 Its merge request has been denied, so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. ...

u/fallingdowndizzyvr·2 months ago·40 pts / 33 comm

r/LocalLLaMA· COMMUNITY

AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a [Chrome extension](https://chromewebstore.google.com/detail/slop-hammer/gfjdmhfokmhedlgfggmmgchpppmhkdgg); you can just click selected text and it's going to give you the probability distribution of how likely it is AI-generated. It takes under 1s on my M1 MacBook Pro. Pangram did release Llama 3.2 3B trained on their dataset, but I found this model slightly too legacy (too big for the capabilities). Qwen 0.8B (base) ended up being as good after roughly 20h of fine-tuning on a sin...

u/jslominski·2 months ago·48 pts / 55 comm

r/LocalLLaMA· COMMUNITY

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

llama.cpp adds fast Walsh-Hadamard transform (FWHT) for CUDA, yielding 1–2% prompt-processing and 7–9% token-generation speedups with quantized KV-cache.

u/pmttyji·2 months ago·40 pts / 10 comm
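
As background for the entry above, the fast Walsh-Hadamard transform computes a length-n Hadamard transform in O(n log n) additions and subtractions using a butterfly pattern. The following is a plain-Python reference sketch of the textbook algorithm, not the CUDA kernel from the pull request; the function name `fwht` and the choice to leave the result unnormalized are assumptions made for this illustration.

```python
def fwht(values):
    """Unnormalized Walsh-Hadamard transform of a power-of-two-length sequence.

    Returns a new list; the input is not modified.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = list(values)
    h = 1
    while h < n:
        # Each pass combines pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
transformed = fwht(signal)
# Applying the unnormalized transform twice scales the input by n.
round_trip = [v // len(signal) for v in fwht(transformed)]
```

Because the unnormalized transform is its own inverse up to a factor of n, `round_trip` recovers `signal`. Implementations that need an orthonormal transform typically scale by 1/sqrt(n) instead.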

r/LocalLLaMA· COMMUNITY

The Financial Times has published an article about Heretic

Financial Times reports Heretic tool removes guardrails from Meta's Llama 3.3 in <10 minutes; 3,500+ decensored variants downloaded 13M times.

u/-p-e-w-·2 months ago·81 pts / 10 comm

r/LocalLLaMA· COMMUNITY

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

llama.cpp PR addresses checkpoint creation inefficiency when context optimization tools modify conversation history in agentic workflows.

u/jacek2023·2 months ago·79 pts / 19 comm

r/LocalLLaMA· COMMUNITY

What frontend do you guys use?

Community discussion on local LLM frontends and limitations of llama-server as default interface.

u/Borkato·2 months ago·40 pts / 65 comm

r/LocalLLaMA· COMMUNITY

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

Benchmark of vision LLMs vs. OCR pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc shows LlamaCloud + Azure premium achieving 59.6%–58.5% accuracy; agentic RAG and native PDF vision approaches compared on cost and accuracy.

u/Uiqueblhats·2 months ago·40 pts / 12 comm

r/LocalLLaMA· COMMUNITY

llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

llama.cpp server adds native tool support (shell execution, file ops) via experimental --tools flag.

u/srigi·2 months ago·57 pts / 17 comm

r/ClaudeAI· COMMUNITY

My experience using Claude code with Local Llm, and full guide on how to set it up

Wanted to share a workflow I tested on a real flight, in case anyone else is trying to set up offline Claude Code. The core idea: use Ollama to pull the model you need, then use it to run Claude Code. The setup, in order:

1. Pull a model on home wifi the night before. `ollama pull <model>`: ~9 GB for a 14B, ~17 GB for a 26B. Don't try this at the gate.
2. In Claude Code, point at Ollama. The cleanest path I found is wrapping it in two aliases:
   alias claude-local='ollama launch claude --model gemma4:26b'
   alias claude-cloud='claude'
3. Verify on the ground with...

u/MaterialAppearance21·2 months ago·20 pts / 7 comm

