The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


r/LocalLLaMA· COMMUNITY

LM Studio finally added support for MTP Speculative Decoding

LM Studio 0.4.14 adds MTP Speculative Decoding support via llama.cpp 2.15.0 for faster inference.

u/pigeon57434·21 days ago·77 pts / 13 comm

r/LocalLLaMA· COMMUNITY

Time to update llama.cpp to get some MTP improvements!

llama.cpp PR #23269 introduces MTP (Multi-Token Prediction) improvements for faster local LLM inference.

u/PixelatedCaffeine·22 days ago·41 pts / 23 comm

r/LocalLLaMA· COMMUNITY

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

llama.cpp MTP speculative decoding merged; Qwen 27B/35B inference speedups 1.24–2.44× on consumer GPUs/APUs with quantization.

u/C_Coffie·23 days ago·40 pts / 28 comm

r/LocalLLaMA· COMMUNITY

Qwen 35b a3b surprises me

Just wanted to share that I'm pretty happy with Qwen 35b a3b's agentic coding performance. I'm running the model in a q8_0 quant, with both KV caches at q8_0 as well, at 262144 context on a 4090 + 5060 Ti, via the llama.cpp backend with Claude Code pointing to localhost. For demo/data analytics purposes, it works pretty well. I haven't used it for large codebases, but it is definitely better than gemma4 26b in my use case. One thing that surprises me is that it seems to get better outcomes in agentic coding than in chat. When using it with just a chat UI, I found the code qwen35b provides a bit too clunky. I wonder o...

u/siegevjorn·23 days ago·40 pts / 33 comm

r/LocalLLaMA· COMMUNITY

PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp.

Llama.cpp performance improved 1.5–1.8x in recent updates; MTP and prompt processing issues partially resolved.

u/Borkato·23 days ago·40 pts / 30 comm

r/LocalLLaMA· COMMUNITY

Quantizing MTP KV Cache = free lunch?

MTP KV cache quantization in llama.cpp Qwen models reduces VRAM overhead without apparent inference degradation.

u/legit_split_·23 days ago·44 pts / 28 comm

r/LocalLLaMA· COMMUNITY

NEW BITNET MODELS!

OpenBMB releases BitCPM4-CANN family (1B–8B params) with BitNet quantization; awaiting llama.cpp support.

u/Silver-Champion-4846·23 days ago·40 pts / 18 comm

r/LocalLLaMA· COMMUNITY

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Qwen 3.6 27B performance benchmarks across llama.cpp backends on RTX 3090: ik_llama.cpp achieved 1261 tok/s prefill, 72.9 tok/s decode with 156k context.

u/VolandBerlioz·23 days ago·43 pts / 16 comm

r/LocalLLaMA· COMMUNITY

llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

llama.cpp PR #23198 optimizes prompt decode by avoiding logits copying in MTP, improving inference speed.

u/jacek2023·24 days ago·40 pts / 13 comm

r/LocalLLaMA· COMMUNITY

Dual GPU llama.cpp speedup

llama.cpp fork adds quantized KV cache support for tensor parallelism across dual GPUs, addressing long-standing inference bottleneck.

u/Legitimate-Dog5690·24 days ago·44 pts / 27 comm

r/LocalLLaMA· COMMUNITY

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Llama.cpp multi-token prediction on Qwen 3.6 27B shows 42% prefill slowdown but 85% token generation speedup on RTX 3090.

u/cleversmoke·24 days ago·40 pts / 34 comm
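
Whether a slower prefill paired with faster generation is a net win depends on how long the prompt is relative to the output. The sketch below works through that arithmetic. It assumes the 42% and 85% figures from the summary above are changes in throughput (tokens per second); the baseline rates and token counts are hypothetical values chosen for illustration, not numbers from the post.

```python
def total_time(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Wall-clock seconds for one request: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


def mtp_time_ratio(prompt_tokens, gen_tokens, prefill_tps, decode_tps,
                   prefill_slowdown=0.42, decode_speedup=0.85):
    """Ratio of MTP wall-clock time to baseline wall-clock time.

    A value below 1.0 means MTP is faster end to end. The default
    percentages come from the summary; treating them as throughput
    changes is an assumption.
    """
    base = total_time(prompt_tokens, gen_tokens, prefill_tps, decode_tps)
    mtp = total_time(
        prompt_tokens,
        gen_tokens,
        prefill_tps * (1 - prefill_slowdown),  # prefill throughput drops
        decode_tps * (1 + decode_speedup),     # decode throughput rises
    )
    return mtp / base


# Hypothetical baseline rates (not from the post): 1000 tok/s prefill, 30 tok/s decode.
short_prompt = mtp_time_ratio(1_000, 500, prefill_tps=1000, decode_tps=30)
long_prompt = mtp_time_ratio(100_000, 200, prefill_tps=1000, decode_tps=30)

print(f"short prompt, long answer: {short_prompt:.2f}x baseline time")
print(f"long prompt, short answer: {long_prompt:.2f}x baseline time")
```

Under these assumed rates, generation-heavy requests come out ahead while prompt-heavy requests come out behind, so the result depends on the workload.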

r/LocalLLaMA· COMMUNITY

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Benchmarking llama.cpp MTP (multi-token prediction) on Qwen 3.6 with RTX 5090, comparing inference speed with draft-mtp flag toggled.

u/3VITAERC·24 days ago·63 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Looking to migrate off of Ollama and LMStudio

User seeks faster inference alternatives to Ollama/LM Studio for local model serving (Gemma, Qwen, OpenBioLLM) on 64GB RAM.

u/letsbefrds·24 days ago·40 pts / 77 comm

r/LocalLLaMA· COMMUNITY

b9180 llama.cpp MTP landed

llama.cpp release b9180 ships MTP support, enabling improved inference optimization for local LLM deployment.

u/Bulky-Priority6824·25 days ago·40 pts / 26 comm

r/LocalLLaMA· COMMUNITY

Qwen 27b MTP Config, Llama.cpp Single 3090

Community discussion of Qwen 27B quantization and inference optimization on single RTX 3090 GPU.

u/GotHereLateNameTaken·25 days ago·41 pts / 31 comm

r/LocalLLaMA· COMMUNITY

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

Qwen3.6 27B-MTP shows 11.5% faster wall-clock time vs base on single-turn; 35B-MTP regresses 11.2%, with generation speedups offset by prompt processing slowdowns.

u/xjE4644Eyc·25 days ago·49 pts / 21 comm

r/LocalLLaMA· COMMUNITY

MTP support merged into llama.cpp

MTP support merged into llama.cpp master branch (PR #22673); no context on feature impact.

u/tacticaltweaker·25 days ago·43 pts / 14 comm

r/LocalLLaMA· COMMUNITY

MTP PR Merged!!!

MTP PR merged into llama.cpp; technical details absent.

u/Valuable_Touch5670·25 days ago·98 pts / 27 comm

r/LocalLLaMA· COMMUNITY

That's a good news...

Looks like it finally happens... MTP getting approved for llama.cpp. Time to prepare for the update.

u/Pjotrs·25 days ago·125 pts / 28 comm

r/LocalLLaMA· COMMUNITY

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

Engineer deployed Gemma 4 E4B on Jetson Orin NX with 30+ sensors, 200ms TTFT, offline multimodal robotics stack using llama.cpp and Piper TTS.

u/CreativelyBankrupt·26 days ago·106 pts / 17 comm

r/LocalLLaMA· COMMUNITY

club-5060ti: practical RTX 5060 Ti local LLM notes and configs

RTX 5060 Ti local LLM configuration guide covering vLLM and llama.cpp serving of Qwen 27B/35B models with quantization and long-context presets.

u/do_u_think_im_spooky·26 days ago·40 pts / 11 comm

r/LocalLLaMA· COMMUNITY

Llama-Studio, WebUI for llama-server Management

Llama-Studio: open-source WebUI for managing multiple local llama-server instances and configurations.

u/m94301·27 days ago·41 pts / 27 comm

r/LocalLLaMA· COMMUNITY

Automated AI researcher running locally with llama.cpp

Hugging Face releases ml-intern, an agent framework for local LLM research automation supporting llama.cpp/ollama with Qwen and Claude models.

u/lewtun·27 days ago·48 pts / 13 comm

r/LocalLLaMA· COMMUNITY

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant

Multi-token prediction + TurboQuant quantization achieves 40% throughput gain (21→34 tokens/s) on Qwen 27B/35B via LLaMA.cpp on M-series Mac.

u/gladkos·27 days ago·47 pts / 28 comm

r/LocalLLaMA· COMMUNITY

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Qwen 3.6 35B and Gemma 4 26B MoE models achieve 20–24.5 tok/s on GTX 1080 with 128k context via llama.cpp quantization.

u/mdda·28 days ago·46 pts / 10 comm

r/LocalLLaMA· COMMUNITY

llama.cpp docker images to run MTP models

Community Docker images for llama.cpp with MTP support to simplify local model inference setup across CUDA versions.

u/havenoammo·28 days ago·40 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Is using vLLM actually worth it if you aren't serving the model to other people?

Reddit discussion comparing vLLM vs llama.cpp for single-user local inference on AMD GPUs.

u/ayylmaonade·29 days ago·40 pts / 44 comm

r/ClaudeAI· COMMUNITY

PSA: If your project has an ANTHROPIC_API_KEY in any .env file, Claude Code will silently bill your API account instead of your Max plan — Anthropic calls it "intentional functionality"

I lost $187 to this and want to save others the same headache. What happened: I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set, legitimately, for a separate Express server doing AI-based transaction classification. Nothing to do with Claude Code itself. Claude Code reads environment variables from the `.env` in its working directory on launch. When it finds `ANTHROPIC_API_KEY` there, it silently uses that key for billing instead of your OAuth ...

u/35yearstrading·29 days ago·36 pts / 16 comm
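
One way to check a project for the situation this post describes is to list every `.env`-style file that defines `ANTHROPIC_API_KEY`. The sketch below is a minimal, hypothetical helper: the function name and the recursive scan are illustrative choices, and the billing behavior itself is as reported by the post, not verified here.

```python
import sys
from pathlib import Path

KEY_NAME = "ANTHROPIC_API_KEY"


def env_files_defining_key(root, key=KEY_NAME):
    """Return .env-style files under `root` that assign `key`.

    Hypothetical helper for illustration. It matches files whose names
    start with ".env" (for example ".env" or ".env.local") and ignores
    commented-out lines.
    """
    hits = []
    for path in sorted(Path(root).rglob(".env*")):
        if not path.is_file():
            continue  # skip directories such as a ".env" virtualenv
        for line in path.read_text(errors="ignore").splitlines():
            stripped = line.strip()
            if stripped.startswith("#"):
                continue
            if stripped.startswith("export "):
                stripped = stripped[len("export "):].lstrip()
            name, sep, _value = stripped.partition("=")
            if sep and name.strip() == key:
                hits.append(path)
                break
    return hits


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for hit in env_files_defining_key(root):
        print(hit)
```

Run from a project root, it prints each matching file's path and never the key's value, so the output is safe to paste into an issue or chat.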

r/LocalLLaMA· COMMUNITY

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Luce ships DFlash+PFlash optimizations for AMD Ryzen AI MAX+ 395, achieving 2.23x decode speedup on Qwen 3.6-27B vs llama.cpp HIP.

u/sandropuppo·29 days ago·41 pts / 16 comm

r/LocalLLaMA· COMMUNITY

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp

llama.cpp adds llama-eval benchmarking tool supporting AIME, GSM8K, GPQA for local quantized model evaluation.

u/jacek2023·29 days ago·41 pts / 13 comm

30 matches
