The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


Stop wasting electricity

RTX 4090 power optimization for llama.cpp: power limits reduce consumption by 40% without performance loss.

u/OkFly3388·29 days ago·68 pts / 26 comm

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Tuning llama.cpp's ubatch and n-cpu-moe parameters raises gpt-oss-120b prompt processing above a 240 tok/s baseline on an RTX 3090.

u/coder543·30 days ago·42 pts / 16 comm

r/LocalLLaMA· COMMUNITY

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls

Empirical study across 288 model calls identifying JSON output failures in Llama 3, Mistral, Command R, DeepSeek, Qwen; failure modes consistent across open and closed models but vary by rate.

u/kexxty·30 days ago·61 pts / 12 comm

r/LocalLLaMA· COMMUNITY

PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server

A JSON parsing bug in llama-server's chat-template-kwargs handling means the Qwen3.6 preserve_thinking parameter must be passed without extra whitespace.

u/CaptBrick·1 month ago·40 pts / 11 comm

r/LocalLLaMA· COMMUNITY

ExLlamaV3 Major Updates!

ExLlamaV3 adds Gemma 4 support, improved caching, and DFlash optimization for faster LLM inference on consumer hardware.

u/Unstable_Llama·1 month ago·40 pts / 19 comm

r/LocalLLaMA· COMMUNITY

I have DeepSeek V4 Pro at home

User successfully quantized and ran DeepSeek V4 Pro locally on AMD EPYC + RTX PRO hardware using modified llama.cpp with Q4_K_M compression.

u/fairydreaming·1 month ago·48 pts / 33 comm

r/LocalLLaMA· COMMUNITY

Running Minimax 2.7 at 100k context on strix halo

User shares llama-server configuration for running Minimax 2.7 at 100k context on Strix Halo hardware with detailed tuning notes.

u/Zc5Gwu·1 month ago·43 pts / 18 comm

r/LocalLLaMA· COMMUNITY

I am overwhelmed by Harnesses

Reddit user seeks advice on LLaMA inference harnesses; discusses fragmentation and compatibility issues with local LLM tooling.

u/Available_Hornet3538·1 month ago·40 pts / 107 comm

r/LocalLLaMA· COMMUNITY

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)

BeeLlama.cpp fork adds DFlash, TurboQuant, and vision support; runs Qwen 3.6 27B Q5 on RTX 3090 with 200k context at 135 tps.

u/Anbeeld·1 month ago·43 pts / 34 comm

r/LocalLLaMA· COMMUNITY

More Qwen3.6-27B MTP success but on dual Mi50s

User reports 1.5–2x speedup running Qwen3.6-27B with MTP on dual AMD MI50 GPUs via llama.cpp.

u/legit_split_·1 month ago·40 pts / 15 comm

r/LocalLLaMA· COMMUNITY

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

User achieves 80 tok/sec with 128K context on an RTX 4070 Super (12GB VRAM) using a quantized Qwen3.6 35B A3B and llama.cpp's MTP implementation.

u/janvitos·1 month ago·46 pts / 18 comm

r/LocalLLaMA· COMMUNITY

How long for llama.cpp official support of MTP?

Reddit user asks about llama.cpp timeline for Vulkan/HIP MTP support on Strix Halo Windows 11.

u/Manaberryio·1 month ago·42 pts / 35 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

Linear probing and activation patching reveal latent planning representations in Qwen3, Gemma-3, Llama-3; future constraints encoded at layer boundaries.

Nicole Ma·1 month ago

r/LocalLLaMA· COMMUNITY

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%

Multi-Token Prediction for llama.cpp achieves a 40% speedup on quantized Gemma 4 models via parallel token drafting.

u/gladkos·1 month ago·67 pts / 15 comm

r/LocalLLaMA· COMMUNITY

guess what? if you are a chrome user, technically you are localllama member!

Chrome allegedly downloads 4GB LLM checkpoint without user consent, raising privacy and transparency concerns for browser-embedded AI.

u/LambdaHominem·1 month ago·53 pts / 16 comm

r/LocalLLaMA· COMMUNITY

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp

llama.cpp pull request #22493 adds support for Xiaomi's Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.

u/jacek2023·1 month ago·40 pts / 14 comm

r/LocalLLaMA· COMMUNITY

Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide

Qwen3.5/3.6 inference optimization guide: NextN MTP speculative decoding achieves 2.9× speedup on RTX 3090 Ti via llama.cpp with zero quality loss.

u/yes_i_tried_google·1 month ago·40 pts / 19 comm

r/LocalLLaMA· COMMUNITY

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot

User reports successful local inference with Qwen3.6-35B on an AMD R9700 GPU, generating functional code and tests via llama.cpp with VS Code and Copilot.

u/supracode·1 month ago·41 pts / 15 comm

r/LocalLLaMA· COMMUNITY

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

Qwen3.6-27B with Multi-Token Prediction grafted onto an Unsloth UD XL quant achieves 2.5x throughput using an unmerged llama.cpp PR.

u/havenoammo·1 month ago·48 pts / 28 comm

r/LocalLLaMA· COMMUNITY

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.

u/ex-arman68·1 month ago·85 pts / 23 comm

r/LocalLLaMA· COMMUNITY

Qwen 3.6 27B MTP on v100 32GB: 54 t/s

Qwen 3.6 27B achieves 54 t/s on a V100 32GB GPU with MTP in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.

u/m94301·1 month ago·41 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama

Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.

u/exintrovert420·1 month ago·41 pts / 10 comm

r/LocalLLaMA· COMMUNITY

MTP on strix halo with llama.cpp (PR #22673)

User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp, achieving 60-80 tok/s on a Qwen 3.6 GGUF.

u/Edenar·1 month ago·42 pts / 20 comm

The Verge AI· PRESS

Meta sued by major book publishers over copyright infringement

Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...

Emma Roth·1 month ago

r/ClaudeAI· COMMUNITY

OllamaXClaude

Unexpected email to wake up to but I am here for it! Model agnostic tools are the way! This is huge!

u/No-Butterscotch-218·1 month ago·27 pts / 5 comm

r/LocalLLaMA· COMMUNITY

As MTP prepares to land in llama.cpp, Models that support MTP

MTP support is coming to llama.cpp; DeepSeek V3, Qwen3.5, GLM 4.5, and other models are compatible, pending native weights.

u/segmond·1 month ago·46 pts / 28 comm

r/LocalLLaMA· COMMUNITY

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8

FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, running faster than vLLM BF16/FP8 with lower memory overhead.

u/randomfoo2·1 month ago·51 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Llama.cpp MTP support now in beta!

llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.

u/ilintar·1 month ago·49 pts / 24 comm

r/LocalLLaMA· COMMUNITY

What a time to be alive from 1tk/sec to 20-100tk/sec for huge models

Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.

u/segmond·1 month ago·45 pts / 32 comm

r/LocalLLaMA· COMMUNITY

If you've been waiting to try local AI development, please try it

Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, driven by cloud provider cost increases.

u/Imaginary_Belt4976·1 month ago·48 pts / 30 comm

30 matches
