Ban phrases on llama.cpp with this script.
Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.
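The summary above does not say how the script filters phrases, so the following is only a minimal sketch of the general idea, not the tool's actual method. It assumes generated text arrives as a stream of string chunks, and it cuts the output just before the first banned phrase. The function name `stream_until_banned` and the sample phrase are hypothetical, and no llama.cpp API is used.

```python
from typing import Iterable, Iterator, List


def stream_until_banned(chunks: Iterable[str], banned: List[str]) -> Iterator[str]:
    """Yield text from chunks, stopping just before the first banned phrase.

    Hypothetical helper for illustration only. Matching is case-sensitive.
    A tail of (longest phrase length - 1) characters is held back so that a
    phrase split across two chunks is caught before any of it is emitted.
    """
    hold = max((len(p) for p in banned), default=1) - 1
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Positions of every banned phrase currently visible in the buffer.
        hits = [i for i in (buffer.find(p) for p in banned) if i != -1]
        if hits:
            # Emit only the text before the earliest match, then stop.
            yield buffer[: min(hits)]
            return
        # Emit everything except the held-back tail.
        safe = len(buffer) - hold
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    # No banned phrase appeared: flush the remaining tail.
    yield buffer


if __name__ == "__main__":
    pieces = ["The air was th", "ick with ten", "sion and more"]
    print("".join(stream_until_banned(pieces, ["thick with tension"])))
```

This sketch only truncates the output. A filter inside the inference engine would more plausibly rewind and resample instead of stopping, but that behaviour is an assumption and is not shown here.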
User shares a quantized Qwen3.6-27B setup on an RTX 5090, with llama.cpp configuration parameters.
r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.
PFlash: speculative prefill technique achieves 10x speedup on 128K context with quantized 27B models on RTX 3090, open-source C++/CUDA implementation.
gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.
User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.
llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.
Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.
PS5 Linux exploit proposed as potential hardware for local LLM inference via llama.cpp.
ik_llama.cpp adds Qwen3.5 MTP support with 50% throughput gain (18-20→30 tok/s) via pipeline parallelism on 27B model.
llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.
llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.
Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.
Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.
Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.
Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.
User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.
Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.
Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.
llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, reaching 17 tokens/sec on an M3 Max; 128GB of RAM is required.
llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.
User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.
ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.
llama.cpp and ik_llama.cpp now support FP4 inference in two formats, NVFP4 (Nvidia, E4M3 block scales) and MXFP4 (MX standard), with support varying by hardware backend.
Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.
Reddit post celebrating current state of local LLM deployment without specific technical claims or data.
Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.
Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.
Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.
Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.