LM Studio finally added support for MTP Speculative Decoding
LM Studio 0.4.14 adds MTP Speculative Decoding support via llama.cpp 2.15.0 for faster inference.
llama.cpp PR #23269 introduces MTP (Multi-Token Prediction) improvements for faster local LLM inference.
llama.cpp MTP speculative decoding merged; Qwen 27B/35B inference speedups 1.24–2.44× on consumer GPUs/APUs with quantization.
Just wanted to share that I'm pretty happy with Qwen 35B A3B agentic coding performance. I'm running the model in a q8_0 quant, with the KV cache both q8_0 as well, at 262144 context on a 4090 + 5060 Ti, via the llama.cpp backend with Claude Code pointing to localhost. For demo/data analytics purposes, it works pretty well. I haven't used it for large codebases, but it is definitely better than Gemma 4 26B in my use case. One thing that surprises me is that it seems to get better outcomes in agentic coding than in chat. When using it with just a chat UI, I found the code Qwen 35B provides a bit too clunky. I wonder o...
llama.cpp performance improved 1.5–1.8x in recent updates; MTP and prompt processing issues partially resolved.
MTP KV cache quantization in llama.cpp Qwen models reduces VRAM overhead without apparent inference degradation.
OpenBMB releases BitCPM4-CANN family (1B–8B params) with BitNet quantization; awaiting llama.cpp support.
Qwen 3.6 27B performance benchmarks across llama.cpp backends on RTX 3090: ik_llama.cpp achieved 1261 tok/s prefill, 72.9 tok/s decode with 156k context.
llama.cpp PR #23198 optimizes prompt decode by avoiding logits copying in MTP, improving inference speed.
llama.cpp fork adds quantized KV cache support for tensor parallelism across dual GPUs, addressing long-standing inference bottleneck.
llama.cpp multi-token prediction on Qwen 3.6 27B shows 42% prefill slowdown but 85% token generation speedup on RTX 3090.
Benchmarking llama.cpp MTP (multi-token prediction) on Qwen 3.6 with RTX 5090, comparing inference speed with draft-mtp flag toggled.
User seeks faster inference alternatives to Ollama/LM Studio for local model serving (Gemma, Qwen, OpenBioLLM) on 64GB RAM.
llama.cpp release b9180 ships MTP support, enabling improved inference optimization for local LLM deployment.
Community discussion of Qwen 27B quantization and inference optimization on single RTX 3090 GPU.
Qwen3.6 27B-MTP shows 11.5% faster wall-clock time vs base on single-turn; 35B-MTP regresses 11.2%, with generation speedups offset by prompt processing slowdowns.
MTP support merged into llama.cpp master branch (PR #22673); no context on feature impact.
Looks like it finally happens... MTP getting approved for llama.cpp. Time to prepare for the update.
Engineer deployed Gemma 4 E4B on Jetson Orin NX with 30+ sensors, 200ms TTFT, offline multimodal robotics stack using llama.cpp and Piper TTS.
RTX 5060 Ti local LLM configuration guide covering vLLM and llama.cpp serving of Qwen 27B/35B models with quantization and long-context presets.
Llama-Studio: open-source WebUI for managing multiple local llama-server instances and configurations.
Hugging Face releases ml-intern, an agent framework for local LLM research automation supporting llama.cpp/ollama with Qwen and Claude models.
Multi-token prediction + TurboQuant quantization achieves 40% throughput gain (21→34 tokens/s) on Qwen 27B/35B via llama.cpp on M-series Mac.
Qwen 3.6 35B and Gemma 4 26B MoE models achieve 20–24.5 tok/s on GTX 1080 with 128k context via llama.cpp quantization.
Community Docker images for llama.cpp with MTP support to simplify local model inference setup across CUDA versions.
Reddit discussion comparing vLLM vs llama.cpp for single-user local inference on AMD GPUs.
r/ClaudeAI • also crossposted to r/LocalLLaMA and r/artificial. I lost $187 to this and want to save others the same headache. **What happened** I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set, legitimately, for a separate Express server doing AI-based transaction classification. Nothing to do with Claude Code itself. Claude Code reads environment variables from the `.env` in its working directory on launch. When it finds `ANTHROPIC_API_KEY` there, it silently uses that key for billing instead of your OAuth ...
Luce ships DFlash+PFlash optimizations for AMD Ryzen AI MAX+ 395, achieving 2.23x decode speedup on Qwen 3.6-27B vs llama.cpp HIP.
llama.cpp adds llama-eval benchmarking tool supporting AIME, GSM8K, GPQA for local quantized model evaluation.
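Several items above report MTP slowing prompt processing while speeding up token generation, with the net wall-clock effect depending on the workload. A rough way to see why is sketched below in Python. It assumes a simple two-phase model (request time = prompt tokens / prefill rate + generated tokens / decode rate) and reads the reported "42% prefill slowdown" as prefill throughput falling to 0.58x and the "85% token generation speedup" as decode throughput rising to 1.85x. The baseline rates and token counts are hypothetical placeholders, not measurements from any of the stories, and the function names are made up for illustration.

```python
def wall_clock_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Total request time under a simple two-phase model:
    prompt processing time plus token generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


def mtp_speedup(prompt_tokens, gen_tokens, prefill_tps, decode_tps,
                prefill_factor=0.58, decode_factor=1.85):
    """Ratio of baseline time to MTP time (above 1.0 means MTP is faster).

    prefill_factor and decode_factor scale the baseline throughputs.
    The defaults are one interpretation of the reported 42% prefill
    slowdown and 85% generation speedup; they are assumptions.
    """
    base = wall_clock_seconds(prompt_tokens, gen_tokens,
                              prefill_tps, decode_tps)
    mtp = wall_clock_seconds(prompt_tokens, gen_tokens,
                             prefill_tps * prefill_factor,
                             decode_tps * decode_factor)
    return base / mtp


if __name__ == "__main__":
    # Hypothetical baseline rates in tokens per second.
    PREFILL_TPS = 1000.0
    DECODE_TPS = 40.0

    # Short prompt, long answer: generation dominates the total time.
    chat = mtp_speedup(100, 1000, PREFILL_TPS, DECODE_TPS)
    # Very long prompt, short answer: prompt processing dominates.
    long_context = mtp_speedup(100_000, 50, PREFILL_TPS, DECODE_TPS)

    print(f"generation-heavy request: {chat:.2f}x")
    print(f"prompt-heavy request:     {long_context:.2f}x")
```

Under this model, generation-heavy requests come out ahead with MTP and prompt-heavy requests come out behind, which is consistent with the mixed single-turn results reported above. Real behavior also depends on draft acceptance rates, batching, and cache effects that this sketch ignores.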