Stop wasting electricity
RTX 4090 power optimization for llama.cpp: reduce consumption 40% via power limits without performance loss.
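The claim above rests on simple arithmetic: energy per generated token is board power divided by throughput, so if a lower power limit leaves tokens per second unchanged, the energy saved equals the power cut. Below is a minimal sketch of that calculation. The function names are hypothetical, and the wattage and throughput figures are illustrative assumptions, not measurements from the story. On NVIDIA cards the limit itself is typically set with `nvidia-smi -pl <watts>`.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (W / (tok/s) = J/tok)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


def energy_savings(base_watts: float, limited_watts: float,
                   base_tps: float, limited_tps: float) -> float:
    """Fractional reduction in energy per token after applying a power limit."""
    base = joules_per_token(base_watts, base_tps)
    limited = joules_per_token(limited_watts, limited_tps)
    return 1.0 - limited / base


if __name__ == "__main__":
    # Assumed numbers for illustration only: a 450 W card limited to 270 W,
    # with generation speed holding at a hypothetical 40 tok/s.
    saving = energy_savings(450.0, 270.0, 40.0, 40.0)
    print(f"Energy per token reduced by {saving:.0%}")
```

If throughput does drop under the limit, pass the measured lower value as `limited_tps`; the saving shrinks accordingly, which is why it is worth benchmarking tokens per second at each power setting rather than assuming no loss.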
Tuning llama.cpp's ubatch and n-cpu-moe parameters raises gpt-oss-120b prompt processing above a 240 tok/s baseline on an RTX 3090.
Empirical study across 288 model calls identifying JSON output failures in Llama 3, Mistral, Command R, DeepSeek, Qwen; failure modes consistent across open and closed models but vary by rate.
JSON parsing bug in llama-server's chat-template-kwargs for Qwen3.6 preserve_thinking parameter requires whitespace-free formatting.
ExLlamaV3 adds Gemma 4 support, improved caching, and DFlash optimization for faster LLM inference on consumer hardware.
User successfully quantized and ran DeepSeek V4 Pro locally on AMD EPYC + RTX PRO hardware using modified llama.cpp with Q4_K_M compression.
User shares llama-server configuration for running Minimax 2.7 at 100k context on Strix Halo hardware with detailed tuning notes.
Reddit user seeks advice on LLaMA inference harnesses; discusses fragmentation and compatibility issues with local LLM tooling.
BeeLlama.cpp fork adds DFlash, TurboQuant, and vision support; runs Qwen 3.6 27B Q5 on RTX 3090 with 200k context at 135 tps.
User reports 1.5–2x speedup running Qwen 27B with MTP optimization on dual AMD MI50 GPUs via llama.cpp.
User achieves 80 tok/sec with 128K context on RTX 4070 Super using Qwen3.6 35B quantization and llama.cpp MTP implementation.
Reddit user asks about llama.cpp timeline for Vulkan/HIP MTP support on Strix Halo Windows 11.
Linear probing and activation patching reveal latent planning representations in Qwen3, Gemma-3, Llama-3; future constraints encoded at layer boundaries.
Multi-Token Prediction optimization for llama.cpp achieves a 40% speedup on quantized Gemma 4 models via parallel token drafting.
Chrome allegedly downloads 4GB LLM checkpoint without user consent, raising privacy and transparency concerns for browser-embedded AI.
Xiaomi releases Mimo v2.5, a 310B sparse MoE multimodal model with 1M token context supporting text, image, video, and audio.
Qwen3.5/3.6 inference optimization guide: NextN MTP speculative decoding achieves 2.9× speedup on RTX 3090 Ti via llama.cpp with zero quality loss.
User reports successful local inference with Qwen3.6-35B on an AMD R9700 GPU, generating functional code and tests via llama.cpp.
Qwen3.6-27B with Multi-Token Prediction achieves 2.5x throughput via Unsloth quantization and llama.cpp integration.
Qwen 3.6 27B achieves 2.5x inference speedup via MTP speculative decoding in llama.cpp; 262k context on 48GB with fixed chat templates.
Qwen 27B achieves 54 t/s on V100 GPU with MTP optimization in llama.cpp, nearly 2x baseline speed for code review and tool use tasks.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
User reports successful MTP speculative decoding on AMD Strix Halo (AI Max 395) with llama.cpp achieving 60-80 tok/s on Qwen 3.6B GGUF.
Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company "engaged in one of the most massive infringements of copyrighted materials in history" when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta "repeatedly copied" their books and journal articles without permission. The lawsuit accuses Meta of knowingly ripping copyrighted work from "notorious pirate sites," such as LibGen, Anna's Archive, Sci-Hub, Sci-M...
Unexpected email to wake up to but I am here for it! Model agnostic tools are the way! This is huge!
MTP format support is coming to llama.cpp; DeepSeek V3, Qwen3.5, GLM-4.5, and other models are compatible pending native weights.
FastDMS achieves 6.4× KV-cache compression on Llama 3.2 1B via learned token eviction, matching vLLM performance with lower memory overhead.
llama.cpp adds beta MTP (Multi-Token Prediction) support, starting with Qwen3.5, closing performance gap with vLLM on token generation.
Quantized Llama 405B and DeepSeek models now achieve 20-100 tokens/sec on consumer hardware, up from 1 token/sec two years ago.
Developer reports local Qwen 27B setup with llama-server now competitive with Claude Code and Cursor for coding tasks, driven by cloud provider cost increases.