Vol. I · No. 52 · WED, JUN 10, 2026
Topic

Llama

Every story matching this topic across titles and summaries, newest first.

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language o...

·

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_1...

·

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control pr...
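As a rough illustration of what reshaping a singular-value spectrum with a low-degree polynomial can look like, the sketch below maps each singular value s to 1.5s - 0.5s^3 and compares condition numbers before and after. The cubic is the Newton-Schulz step, chosen here purely as an assumption; the summary does not say which polynomial the PC layer uses, and both function names are hypothetical.

```python
def precondition_spectrum(singular_values, coeffs=(1.5, -0.5)):
    """Map each singular value s to a*s + b*s**3 (an odd cubic polynomial).

    The default coefficients are an assumed Newton-Schulz-style step,
    not the polynomial from the paper.
    """
    a, b = coeffs
    return [a * s + b * s ** 3 for s in singular_values]


def condition_number(singular_values):
    """Ratio of the largest to the smallest singular value."""
    return max(singular_values) / min(singular_values)


# Toy spectrum, already scaled so the largest singular value is 1.
spectrum = [1.0, 0.5, 0.1]
reshaped = precondition_spectrum(spectrum)

print("condition number before:", condition_number(spectrum))
print("condition number after: ", condition_number(reshaped))
```

On this toy spectrum the largest singular value stays at 1 while the smaller ones are pushed upward, so the condition number drops. Applying the same polynomial to a full weight matrix would need an SVD or matrix products, which this sketch leaves out.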

·

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and ...

·

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and ...
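One way to picture a margin trigger is the sketch below. It flags a decode step for re-verification only when the gap between the top two logits is small, on the assumption that near-ties are where batch-dependent numerics can flip the argmax. The gating rule, the threshold value, and the function names are all hypothetical; the summary does not spell out how MarginGate defines or tunes its margin.

```python
def top2_margin(logits):
    """Gap between the largest and second-largest logit at one decode step."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def steps_to_verify(logits_per_step, threshold):
    """Indices of decode steps whose top-2 margin falls below the threshold."""
    return [
        index
        for index, logits in enumerate(logits_per_step)
        if top2_margin(logits) < threshold
    ]


# Toy logits for three decode steps over a three-token vocabulary.
steps = [
    [5.0, 1.0, 0.2],   # confident step: large margin
    [2.01, 2.0, 0.5],  # near-tie: a plausible flip candidate
    [3.0, 0.0, -1.0],  # confident step: large margin
]
flagged = steps_to_verify(steps, threshold=0.1)  # threshold is an assumed value
print(flagged)
```

Only the flagged steps would then be re-run under a batch-invariant path, so the verification cost scales with the number of near-ties rather than with sequence length.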

·

Qwen3.6 huge quality gain from Q4 to Q6 for coding agent

So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and DeepSeek was too cheap. First thing, I stopped using Ollama and now I only use llama.cpp's built-in server, which works really great. The quality improvement from Q4 to Q6 is outstanding and finally a local LLM server can work very similarly to paid APIs. That's great! And MTP makes a big performance gain: on a dual 3090 (undervolted and limited to 65°C) it generates from 20 to 50 tokens per second with minimal heat generation. So yes, that time has finally arrived! Local coding age...

··

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with **38 quant pairs** thoroughly benchmarked in KLD with **3 different Qwen 3.6 27B configs**: Q5\_K\_S + 64k context, IQ4\_XS + 64k context, IQ4\_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself. All benchmarks were done using my [BeeLlama.cpp](https://github.com/Anbeeld/beellama.cpp) fork, which makes it possible to include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6\_0. [https://anbeeld.com/articl...
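For readers unfamiliar with the metric, KLD here means the Kullback-Leibler divergence between the token distributions of a reference run and a quantized run. The sketch below computes the textbook mean KL(reference || quantized) over token positions from raw logits. It is not the code used in the post or in llama.cpp, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_kld(reference_logits, quantized_logits):
    """Mean KL(reference || quantized) over token positions, in nats.

    Assumes no quantized probability underflows to zero, which holds for
    the small toy logits used here.
    """
    total = 0.0
    for ref, quant in zip(reference_logits, quantized_logits):
        p, q = softmax(ref), softmax(quant)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(reference_logits)


# Two token positions over a three-token vocabulary.
reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
perturbed = [[1.8, 1.1, 0.2], [0.4, 0.7, 2.9]]  # stand-in for a quantized run

print(mean_kld(reference, reference))
print(mean_kld(reference, perturbed))
```

A value of zero means the two runs produce identical distributions; larger values mean the quantized run drifts further from the reference.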

··

Info: Nvidia Cuda 13.3 landed

[Cuda 13.3 Downloads](https://developer.nvidia.com/cuda-downloads) [Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) Anybody already tried llama.cpp with 13.3?

··

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 Its merge request has been denied, so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. ...

··

AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a [Chrome extension](https://chromewebstore.google.com/detail/slop-hammer/gfjdmhfokmhedlgfggmmgchpppmhkdgg); you can just click selected text and it's going to give you the probability distribution of how likely it is AI-generated. It takes under 1s on my M1 MacBook Pro. Pangram did release Llama 3.2 3B trained on their dataset, but I found this model slightly too legacy (too big for the capabilities). Qwen 0.8B (base) ended up being as good after roughly 20h of fine-tuning on a sin...

··

My experience using Claude code with Local Llm, and full guide on how to set it up

Wanted to share a workflow I tested on a real flight, in case anyone else is trying to set up offline Claude Code. The core idea: use Ollama to pull the model you need, then run Claude Code against it. The setup, in order: 1. Pull a model on home wifi the night before. `ollama pull <model>`: ~9 GB for a 14B, ~17 GB for a 26B. Don't try this at the gate. 2. In Claude Code, point at Ollama. The cleanest path I found is wrapping it in two aliases: `alias claude-local='ollama launch claude --model gemma4:26b'` and `alias claude-cloud='claude'`. 3. Verify on the ground with...

··

Experts first llama.cpp

Community fork of llama.cpp optimizes MoE inference on 12GB VRAM by loading only active experts rather than full layers.

··

[NEW] Supra-50M Released!

SupraLabs released Supra-50M, a 50M-parameter Llama-style language model trained on 20B educational tokens with competitive benchmark performance.

··

[WIP] Gemma 4 MTP

Early-stage Gemma 4 MTP compilation work-in-progress shared on LocalLLaMA.

··

Qwen 35b a3b surprises me

Just wanted to share that I'm pretty happy about Qwen 35b a3b agentic coding performance. I'm running the model in q8\_0 quant, with both KV caches at q8\_0 as well, with 262144 context on a 4090 + 5060 Ti, via the llama.cpp backend with Claude Code pointing to localhost. For demo/data analytics purposes, it works pretty well. I haven't used it for large codebases, but it definitely is better than gemma4 26b in my use case. One thing that surprises me is that it seems to get better outcomes in agentic coding than in chat. When using it with just a chat UI, I found the code qwen35b provides a bit too clunky. I wonder o...

··

NEW BITNET MODELS!

OpenBMB releases BitCPM4-CANN family (1B–8B params) with BitNet quantization; awaiting llama.cpp support.

··

Dual GPU llama.cpp speedup

llama.cpp fork adds quantized KV cache support for tensor parallelism across dual GPUs, addressing long-standing inference bottleneck.

··

b9180 llama.cpp MTP landed

llama.cpp release b9180 ships MTP support, enabling improved inference optimization for local LLM deployment.

··

MTP PR Merged!!!

MTP PR merged into llama.cpp; technical details absent.

··

That's a good news...

Looks like it finally happens... MTP getting approved for llama.cpp. Time to prepare for the update.

··

PSA: If your project has an ANTHROPIC_API_KEY in any .env file, Claude Code will silently bill your API account instead of your Max plan — Anthropic calls it "intentional functionality"

r/ClaudeAI • also crosspost to r/LocalLLaMA and r/artificial I lost $187 to this and want to save others the same headache. **What happened** I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set — legitimately, for a separate Express server doing AI-based transaction classification. Nothing to do with Claude Code itself. Claude Code reads environment variables from the `.env` in its working directory on launch. When it finds `ANTHROPIC_API_KEY` there, it silently uses that key for billing instead of your OAuth ...
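A quick way to check whether a project is exposed to this is to scan it for `.env` files that assign `ANTHROPIC_API_KEY`. The sketch below does only that. It makes no claim about how Claude Code actually resolves credentials, and the function name and matching rule (file names starting with `.env`, uncommented assignments only) are assumptions made for illustration.

```python
import os
import re

# Matches an uncommented assignment such as "ANTHROPIC_API_KEY=..." or
# "export ANTHROPIC_API_KEY=..." at the start of a line.
KEY_PATTERN = re.compile(
    r"^\s*(?:export\s+)?ANTHROPIC_API_KEY\s*=", re.MULTILINE
)


def find_env_files_with_key(root):
    """Return sorted paths of .env* files under root that assign the key."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.startswith(".env"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    if KEY_PATTERN.search(handle.read()):
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip rather than fail the scan
    return sorted(hits)
```

Calling `find_env_files_with_key(".")` from the project root lists every matching file, which can then be renamed or moved out of the directory where Claude Code is launched.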

··

Stop wasting electricity

RTX 4090 power optimization for llama.cpp: reduce consumption 40% via power limits without performance loss.

··

ExLlamaV3 Major Updates!

ExLlamaV3 adds Gemma 4 support, improved caching, and DFlash optimization for faster LLM inference on consumer hardware.

··

I have DeepSeek V4 Pro at home

User successfully quantized and ran DeepSeek V4 Pro locally on AMD EPYC + RTX PRO hardware using modified llama.cpp with Q4_K_M compression.

··

I am overwhelmed by Harnesses

Reddit user seeks advice on LLaMA inference harnesses; discusses fragmentation and compatibility issues with local LLM tooling.

··

Meta sued by major book publishers over copyright infringement

Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...

·

OllamaXClaude

Unexpected email to wake up to but I am here for it! Model agnostic tools are the way! This is huge!

··

Llama.cpp MTP support now in beta!

llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.

··

Qwen3.6-27B-NVFP4 - images

User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.

··

New rules 1 week check-in

r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.

··
100 stories