Topic

Llama

Every story matching this topic across titles and summaries, newest first.

CLaC@FinMMEval 2026 Task 3: Sentiment-Augmented Deep Reinforcement Learning for Active Trading -- An Alpha-Reward Approach

DRL trading system for Bitcoin/Tesla using policy gradient and Q-learning with LLaMA 3.2 sentiment analysis and technical indicators.

Andrei Neagu·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Production and Perception in LLMs: A Token Probability Approach

Study compares production vs. perception asymmetry in Llama-3.1-8B via token probability analysis, finding LLMs lack functional production-perception distinction.

Anna Marklová·13 days ago

TechCrunch AI· PRESS

Popular open source AI developer tool Ollama raises $65M, grows to nearly 9M users

Benchmark-backed Ollama has amassed 176,000 stars and nearly 17,000 forks on GitHub by helping developers easily run AI on their PCs.

Julie Bort·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PALS: Percentile-Aware Layerwise Sparsity for LLM Pruning

PALS adjusts per-layer sparsity in LLM pruning via activation percentiles, improving LLaMA-2-7B perplexity by 15% at 50% sparsity over uniform Wanda.

Yazdan Jamshidi·18 days ago

The Verge AI· PRESS

Meta’s new Muse Image model can pull other Instagram users into AI photos

Meta is launching the first AI image generation model made by its Superintelligence Labs division. The Muse Image model now powers the image-making tools across the Meta AI app, Instagram, and WhatsApp, and it's coming soon to Facebook and Messenger, according to an announcement on Tuesday. It's part of the growing Muse family of AI models that replace Meta's Llama lineup. Alexandr Wang, who Meta hired to head up its Superintelligence Labs last year, says on Threads that Muse Image is "agentic," meaning it works with its Muse Spark large language model "to reason through your prompt, search t...

Emma Roth·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design

Systematic study of reward function design for RL-based BPMN process model generation using Llama 3.1 and Qwen 2.5 across 48 configurations.

Alexander Rombach·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP

Production-scale clinical NLP study of inference-time gating with Llama-3.3 70B generator and MMed-Llama-3.1 70B verifier over 167K narratives shows pattern-memory filtering limitations.

Ali H. Lazem·25 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring

Three-method study across Qwen2.5-Coder-32B, Llama-3.1-8B, and Gemma-3-27B shows internal probes read situation not pre-action intent, limiting misalignment monitoring efficacy.

Max Fomin·27 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs

Extractive-abstractive hybrid summarization for legal case judgements using tree-of-thoughts with DeepSeek and Llama.

Aniket Deroy·30 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a sec...

Abdul Rafay Syed·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adapt...

Miso Choi·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when ...

Qingkai Fang·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision trainin...

Joe Dwyer·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language o...

Hakan Mehmetcik·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_1...

Jaber Jaber·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control pr...

Senmiao Wang·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and ...

Mehmet Utku Colak·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and ...

Kexin Chu·2 months ago

Qwen3.6 huge quality gain from Q4 to Q6 for coding agent

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Info: Nvidia Cuda 13.3 landed

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

The Financial Times has published an article about Heretic

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

What frontend do you guys use?

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

My experience using Claude code with Local Llm, and full guide on how to set it up

Gemma4 26b a4b Apex quant is quite good

Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

Experts first llama.cpp

Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

[NEW] Supra-50M Released!

Latest b9274 Addresses MTP VRAM leak

Waiting for Qwen 3.7 open weight... The new King has arrived...

For everyone that uses OpenCode / Pi - Heres your promptprocessing fix!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Qwen3.6 27B and llama.cpp appreciation post

Move to backend sampling for MTP draft path by gaugarg-nv · Pull Request #23287 · ggml-org/llama.cpp

[WIP] Gemma 4 MTP

LM Studio finally added support for MTP Speculative Decoding

Time to update llama.cpp to get som MTP improvements!

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

Qwen 35b a3b surprises me

PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp.

Quantizing MTP KV Cache = free lunch?

NEW BITNET MODELS!

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

Dual GPU llama.cpp speedup

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Looking to migrate off of Ollama and LMStudio

b9180 llama.ccp MTP landed

Qwen 27b MTP Config, Llama.cpp Single 3090

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

MTP support merged into llama.cpp

MTP PR Merged!!!

That's a good news...

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

club-5060ti: practical RTX 5060 Ti local LLM notes and configs

Llama-Studio, WebUI for llama-server Management

Automated AI researcher running locally with llama.cpp

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

llama.cpp docker images to run MTP models

Is using vLLM actually worth it if you aren't serving the model to other people?

PSA: If your project has an ANTHROPIC_API_KEY in any .env file, Claude Code will silently bill your API account instead of your Max plan — Anthropic calls it "intentional functionality"

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp

Stop wasting electricity

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls

PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server

ExLlamaV3 Major Updates!

I have DeepSeek V4 Pro at home

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)