The Archive

Search the full wire by company, model, lab, or keyword. Every story we have ever aggregated.


r/LocalLLaMA· COMMUNITY

Ban phrases on llama.cpp with this script.

Community tool adds phrase-filtering capability to llama.cpp inference engine via GitHub script.

u/Total-Resort-3120·1 month ago·41 pts / 25 comm

r/LocalLLaMA· COMMUNITY

Qwen3.6-27B-NVFP4 - images

User shares Qwen3.6-27B quantized setup with RTX 5090 and llama.cpp configuration parameters.

u/Usual-Carrot6352·1 month ago·41 pts / 18 comm

r/LocalLLaMA· COMMUNITY

New rules 1 week check-in

r/LocalLLaMA moderators report positive community response to new rules reducing spam after one week.

u/rm-rf-rm·1 month ago·50 pts / 28 comm

r/LocalLLaMA· COMMUNITY

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

PFlash, an open-source C++/CUDA speculative prefill technique, reports a 10x prefill speedup over llama.cpp at 128K context with quantized 27B models on an RTX 3090.

u/sandropuppo·1 month ago·68 pts / 17 comm

r/LocalLLaMA· COMMUNITY

gemma-4-31B-it-DFlash has been released

gemma-4-31B-it-DFlash open-weights model released on Hugging Face, pending llama.cpp integration.

u/Total-Resort-3120·1 month ago·41 pts / 10 comm

r/LocalLLaMA· COMMUNITY

Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

User demonstrates DFlash speculative decoding in llama.cpp with Qwen3.5-35B-A3B on RTX 2080 SUPER 8GB, achieving inference on VRAM-constrained hardware.

u/jwestra·1 month ago·45 pts / 11 comm

r/LocalLLaMA· COMMUNITY

PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together

llama-swap adds matrix grouping feature for multi-model orchestration and intelligent VRAM swap scheduling.

u/walden42·1 month ago·41 pts / 14 comm

r/LocalLLaMA· COMMUNITY

Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp

Developer built local PDF-to-audiobook app using Kokoro 82M TTS, Qwen, and llama.cpp with Tauri 2.0 on M1 Mac.

u/purellmagents·1 month ago·40 pts / 17 comm

r/LocalLLaMA· COMMUNITY

PS5’s can now be hacked to run Linux - perhaps some potential for local inference?

PS5 Linux exploit proposed as potential hardware for local LLM inference via llama.cpp.

u/Thrumpwart·1 month ago·62 pts / 37 comm

r/LocalLLaMA· COMMUNITY

IK_LLAMA now supports Qwen3.5 MTP Support :O

IK_LLAMA.cpp adds Qwen3.5 MTP support with 50% throughput gain (18-20→30 tok/s) via pipeline parallelism on 27B model.

u/fragment_me·1 month ago·40 pts / 16 comm

r/LocalLLaMA· COMMUNITY

llama.cpp - NVFP4 native support on Blackwell from now - b8967

llama.cpp adds native NVFP4 quantization support for Blackwell GPUs with benchmark results on RTX 5090.

u/mossy_troll_84·1 month ago·40 pts / 32 comm

r/LocalLLaMA· COMMUNITY

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged

llama.cpp merged SM120 native NVFP4 quantization support; community released GGUFs for Gemma-4-31B and Nemotron-Cascade models.

u/ggonavyy·1 month ago·42 pts / 19 comm

r/LocalLLaMA· COMMUNITY

Lemonade OmniRouter: unifying the best local AI engines for omni-modality

Lemonade OmniRouter unifies local AI inference across text, image, audio, and vision modalities via single OpenAI-compatible endpoint using llama.cpp, sd.cpp, and Whisper.

u/jfowers_amd·1 month ago·41 pts / 18 comm

r/LocalLLaMA· COMMUNITY

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context

Qwen3.6-27B IQ4_XS quantization bloat analysis; reverting llama.cpp commit reduces VRAM from 15.1GB to 14.7GB with 110k context.

u/Pablo_the_brave·1 month ago·43 pts / 16 comm

r/LocalLLaMA· COMMUNITY

Duality of r/LocalLLaMA

Reddit discussion analyzing tensions within r/LocalLLaMA community between open-weights advocates and commercial interests.

u/HornyGooner4402·1 month ago·72 pts / 22 comm

r/LocalLLaMA· COMMUNITY

Built myself a bit of a local llm workhorse. What's a good model to try out with llamacpp that will put my 56G of VRAM to good use? Any other fun suggestions?

Reddit user seeks recommendations for large open-weight models to run locally on 56GB VRAM using llama.cpp.

u/SBoots·1 month ago·41 pts / 33 comm

r/LocalLLaMA· COMMUNITY

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

User reports GBNF grammar optimizations for Qwen 3.6 35B and 27B models improving coding task performance in llama.cpp.

u/Holiday_Purpose_3166·1 month ago·51 pts / 10 comm

r/LocalLLaMA· COMMUNITY

How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber

Tutorial: running a local coding agent with Gemma 4 and Pi using llama.cpp for on-device inference.

u/jacek2023·1 month ago·42 pts / 12 comm

r/LocalLLaMA· COMMUNITY

What is the best coding agent (CLI) like Claude Code for Local Development

Reddit user seeks advice on setting up local coding agents like Claude Code with open-weight models via llama.cpp.

u/exaknight21·1 month ago·43 pts / 82 comm

r/LocalLLaMA· COMMUNITY

llama.cpp DeepSeek v4 Flash experimental inference

llama.cpp adds experimental DeepSeek v4 Flash support with aggressive 2-bit quantization, achieving 17 tokens/sec on M3 Max with 128GB RAM requirement.

u/antirez·2 months ago·42 pts / 37 comm

r/LocalLLaMA· COMMUNITY

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.

llama.cpp benchmarks on Windows 11 vs Lubuntu 26.04 with RTX 5080 show significant OS-level performance variance in local inference.

u/Ok_Mine189·2 months ago·40 pts / 39 comm

r/LocalLLaMA· COMMUNITY

Using PaddleOCR-VL-1.5 with llama-server for book OCR

User demonstrates PaddleOCR-VL-1.5 multimodal inference via llama.cpp server for end-to-end document digitization with layout and table handling.

u/Final-Frosting7742·2 months ago·40 pts / 16 comm

r/LocalLLaMA· COMMUNITY

Experts-Volunteers needed for Vulkan on ik_llama.cpp

ik_llama.cpp maintainer seeks volunteers to develop Vulkan backend support for CPU/GPU inference optimization.

u/pmttyji·2 months ago·69 pts / 10 comm

r/LocalLLaMA· COMMUNITY

FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally

llama.cpp and ik_llama.cpp now support FP4 inference in different formats: NVFP4 in llama.cpp (Nvidia's format, with FP8 E4M3 block scales) and MXFP4 in ik_llama.cpp (the OCP MX standard), across varying hardware backends.

u/Usual-Carrot6352·2 months ago·40 pts / 39 comm

r/LocalLLaMA· COMMUNITY

Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM

Developer documents practical setup for running Qwen 3.6 35B on M2 MacBook Pro 32GB via llama.cpp, with performance notes and optimization tips.

u/boutell·2 months ago·40 pts / 33 comm

r/LocalLLaMA· COMMUNITY

This is where we are right now, LocalLLaMA

Reddit post celebrating current state of local LLM deployment without specific technical claims or data.

u/jacek2023·2 months ago·61 pts / 32 comm

r/LocalLLaMA· COMMUNITY

Turboquant on llama.cpp?

Reddit discussion asking about TurboQuant KV cache optimization implementation in llama.cpp.

u/StupidScaredSquirrel·2 months ago·41 pts / 26 comm

r/LocalLLaMA· COMMUNITY

Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan)

Qwen 3.6 35B-A3B MoE model achieves 250+ tok/s on AMD Radeon 780M iGPU via llama.cpp Vulkan.

u/itroot·2 months ago·44 pts / 26 comm

Simon Willison· ANALYST

Extract PDF text in your browser with LiteParse for the web

Simon Willison ports LlamaIndex's LiteParse PDF text extraction tool to run in-browser, using spatial parsing and Tesseract OCR without ML models.

Simon Willison·2 months ago

r/LocalLLaMA· COMMUNITY

An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026

Technical walkthrough: Qwen 3.6 27B achieves 85 TPS, 125K context on single RTX 3090 using llama.cpp.

u/AmazingDrivers4u·2 months ago·216 pts / 70 comm

30 matches
