Topic

§ Coding

Every story tagged with this topic, ordered by date.

Ruff v0.16.0

Ruff v0.16.0 enables 413 default linting rules (up from 59), breaking existing CI pipelines and catching syntax/runtime errors previously uncaught.

Simon Willison·14 hours ago

Anthropic· FRONTIER

Introducing Claude Opus 5

Anthropic releases Claude Opus 5 with improvements in agent execution, coding, and professional tasks.

Anthropic·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

From Resource Flow to Executable Tests: Petri-Net-Guided LLM Test Generation for Concurrent Stateful Rust APIs

Petri-net-guided LLM test generation for concurrent Rust APIs addresses shallow test synthesis by integrating formal models with executable test concretization.

Kaiwen Zhang·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

What, Where, and How: Disentangling the Roles of Task, Language, and Model in Code Model Representations

Analysis of code model representations shows Qwen2.5-Coder and DeepSeek-Coder align on grammatical concepts across Python/Rust, with task-driven specialization.

Piotr Wilam·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agentic coding without the cloud: evaluating open-weight large language models on longitudinal data preparation tasks

Open-source evaluation framework for open-weight LLM agents on longitudinal data tasks, addressing privacy constraints in research deployments.

Mack Nixon·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Test-Time Scaling via Error Localization

TTEL: inference-time algorithm using token-level error localization and environment feedback for efficient test-time scaling.

Rajiv Shailesh Chitale·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models

Linear probes on hidden states detect early non-convergence in chain-of-thought reasoning; DeepSeek-R1-Distill-Qwen-7B reaches 90.3% AIME accuracy on converged traces versus 6.6% on non-converged ones.

Renuka Oladri·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Notes to Self: Can LLMs Benefit from Experiential Abstractions?

LLMs extract natural-language abstractions from solution traces to improve problem-solving via retrieval and RL-augmented training on math tasks.

Chang Liu·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PyroDash: Cost-Efficient Token-Level Small-Large Language Model Collaborative Inference

PyroDash enables cost-efficient SLM-LLM collaborative inference by training the SLM to emit control tokens that request handoff to a frozen LLM during generation.

Niqi Lyu·4 days ago

OpenAI· FRONTIER

NTT DATA Group cuts incident analysis to 30 minutes with Codex

NTT DATA deploys ChatGPT Enterprise and Codex across 9,000 employees to reduce incident analysis time to 30 minutes.

OpenAI·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

CodeRescue: Budget-Calibrated Recovery Routing for Coding Agents

CodeRescue optimizes cost-aware routing for coding agents, determining when to retry vs. escalate after execution failures.

Qijia He·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Selective State-Space Adaptation and Retrieval for Language Model Reasoning

MaLoRA and MaLOR introduce dynamic, recurrent state-space adapters for task- and token-level LLM adaptation, improving reasoning via selective modulation.

Atahan Dokme·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Off-Context GRPO: Learning to Reason on Hard Problems using Privileged Information

Off-Context GRPO uses privileged training information (solution prefixes) to enable RL with verifiable rewards on hard reasoning problems, avoiding zero-gradient plateaus.

Priyank Agrawal·5 days ago

Simon Willison· ANALYST

A Fireside Chat with Cat and Thariq from the Claude Code team

Anthropic engineers discuss Claude Code, Claude Tag Slack integration, coding agent security, and internal tool usage in fireside chat.

Simon Willison·5 days ago

Simon Willison· ANALYST

Reverse-engineering is cheap now

Coding agents lower the ROI threshold for reverse-engineering home automation, shifting the economics of personal automation projects despite the maintenance risk.

Simon Willison·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SWE-Pruner Pro: The Coder LLM Already Knows What to Prune

SWE-Pruner Pro prunes long context in coding LLMs by extracting internal relevance signals, improving efficiency over external classifiers.

Yuhang Wang·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PPL-Factory: Task-Aware and Budget-Aware Data Selection from Language Modeling to Reasoning

PPL-Factory: task-aware perplexity-based data selection for efficient LLM fine-tuning across reasoning and language tasks.

Hang Zhang·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

OR Else: A Differentiable Trust Region for Policy Optimization

OR Else: smooth one-sided saturation rule replaces PPO clipping for stable LLM post-training with reduced gradient discontinuities.

Chinmay Rane·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TRIM: Reducing AI-Generated CodeSlop via Agent Trajectory Minimization

TRIM reduces verbose AI-generated code by minimizing agent trajectory artifacts through search-process cleanup.

Alex Mathai·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SGA: Plug&Play Geometric Verification for Educational Video Synthesis

SGA module detects and fixes geometric errors in LLM-generated pedagogical animation code via symbolic scene graphs.

Lopez Jhon·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

BayesPO: Bayesian Prompt Optimization via Parallel-Tempered Gradient-Guided Discrete MCMC

BayesPO formulates prompt optimization as Bayesian posterior sampling over discrete tokens using parallel-tempered MCMC.

Junjie Zhou·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Code-Poisoning Property Inference Attacks

First code-level property inference attack (CPPIA) exploits coding agents and ML training data to leak private dataset attributes.

Xukun Luan·9 days ago

Simon Willison· ANALYST

Quoting Thibault Sottiaux

A GPT-5.6 Codex bug causes unintended file deletions when full access mode is on and both sandboxing and auto-review are off; the model confuses $HOME with a temp directory.

Simon Willison·10 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MM-IssueLoc: A Controlled Benchmark for Evaluating Visual Evidence in Multimodal Repository-Level Issue Localization

MM-IssueLoc benchmark isolates visual evidence impact in multimodal software repository issue localization across 23 languages and 652 instances.

Shaoxiong Zhan·10 days ago

OpenAI· FRONTIER

How Codex became a collaborator for OpenAI’s creative team

OpenAI documents internal use of Codex for custom tool development, prototyping, and creative ideation workflows.

OpenAI·10 days ago

Simon Willison· ANALYST

Mermaid to Unicode box art (grok-mermaid)

Simon Willison ports Grok's Rust Mermaid-to-Unicode renderer to WebAssembly for browser use via Claude Code.

Simon Willison·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Generative Compilation: On-the-Fly Compiler Feedback as AI Generates Code

Generative Compilation: decoder-level compiler feedback for LLM code generation without white-box model access.

Niels Mündler-Sasahara·11 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models

Placebo-controlled methodology (PoPE) to measure whether small code LLMs can actually use execution error feedback to repair code.

Mehmet Iscan·12 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Line-Anchored Feedback Cuts Token Costs and Improves Correctness in AI Code Editing

FileMark VSCode extension uses line-anchored feedback to reduce token generation in Claude Opus (22%) and Sonnet (58%), cutting code-editing latency and cost.

William Franz Lamberti·12 days ago

Stratechery· ANALYST

The OpenAI Super App, ChatGPT = Codex, Whither Chat

OpenAI integrates Codex capabilities into ChatGPT, shifting toward a multi-purpose platform rather than pure chat interface.

Ben Thompson·12 days ago

Latent Space· ANALYST

[AINews] Codex usage up >10x in 6 months to 7M users, +1M in the past ~day; did Codex overtake Claude Code??

Codex usage grew more than 10x to 7M users in 6 months; the article questions whether it has outpaced Claude Code amid sparse adoption metrics.

Latent Space·12 days ago

Simon Willison· ANALYST

datasette code-frequency chart on GitHub

Simon Willison documents productivity spike in Datasette project correlating with Opus 4.8 and GPT-5.5 releases via GitHub commit frequency analysis.

Simon Willison·13 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

LLM for EDA in Front-End Design: Challenges and Opportunities

LLMs applied to EDA front-end design via HDL generation and agentic AI frameworks like OpenClaw for chip design automation.

Kangwei Xu·16 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Tokenizer Transplantation: Mitigating Autoregressive Collapse in Edge-Efficient Bengali ASR

Tokenizer vocabulary transplantation fixes Bengali ASR failure in Moonshine by replacing English-centric byte tokenizer with BanglaBERT.

Sanjid Hasan·16 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Failure as a Process: An Anatomy of CLI Coding Agent Trajectories

Large-scale study of LLM CLI coding-agent failure trajectories reveals onset patterns and recovery mechanisms, treating failure as a temporal process rather than a final outcome.

Xiangxin Zhao·16 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Practical Source Code Recovery from Binary Functions Using Anchor-Based Retrieval and LLM Reasoning

Pipeline combines Ghidra reverse engineering, anchor-based retrieval, and LLM reasoning to recover source code from stripped binaries.

Charles Edward Gagnon·16 days ago

Cohere· FRONTIER

Hardware-Aware, Dynamic Speculative Decoding (DSD)

Cohere introduces Dynamic Speculative Decoding (DSD) that optimizes K parameter selection based on hardware constraints to improve inference efficiency.

Cohere·17 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ProjAgent: Procedural Similarity Retrieval for Repository-Level Code Generation

ProjAgent uses procedural similarity retrieval to improve repo-level code generation by matching functional logic across codebases.

QiHong Chen·17 days ago

Simon Willison· ANALYST

llm-meta-ai 0.1

llm-meta-ai 0.1 released, enabling CLI access to Meta's muse-spark-1.1 model via Simon Willison's llm tool.

Simon Willison·17 days ago

Simon Willison· ANALYST

llm 0.31.1

llm 0.31.1 fixes a JSON error for tool calls with empty arguments on OpenAI Chat Completion endpoints.

Simon Willison·17 days ago

Simon Willison· ANALYST

Quoting Kenton Varda

Kenton Varda banned AI-generated PR/commit messages after finding them missed high-level context while restating obvious code details.

Simon Willison·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Search, Fail, Recover: A Training Framework for Correction-Aware Reasoning

Pyligent training framework treats reasoning as validated search with backtracking, enabling models to correct mid-inference and recover from failed branches.

Dmitry Beresnev·18 days ago

OpenAI· FRONTIER

Separating signal from noise in coding evaluations

OpenAI identifies validity issues in SWE-Bench Pro coding benchmark, questioning reliability of popular AI model evaluation metric.

OpenAI·18 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

RuBench: A Repository-Level Agentic Coding Benchmark with Natively Authored Russian Task Specifications

RuBench 1.0 is a repository-level coding benchmark with 25 natively authored Russian task specifications across Python, PHP, TypeScript, and JavaScript projects.

Evgeny Shilov·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Harnessing Code Agents for Automatic Software Verification

Code agent framework for automated software verification outperforms fixed proof strategies, proving a larger fraction of Coq theorems than prior LLM approaches.

Shuangxiang Kan·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

UI2App: Benchmarking Visual Interaction Inference in Executable Web Application Generation

UI2App benchmark evaluates LLM-generated web applications on interaction fidelity, not just visual rendering, using image-driven inputs.

Grace Man Chen·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Spider 2.0-AIFunc: Extending Real-World Text-to-SQL to AI-Native SQL Workflows

Spider 2.0-AIFunc benchmark of 465 text-to-SQL instances covering AI-native SQL functions on Snowflake platform.

Tianyang Liu·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Evaluating Fine-Tuning and Metrics for Neural Decompilation of Dart AOT Binaries

Empirical study on fine-tuning and metrics for neural decompilation of Dart AOT binaries with 154-task HumanEval-Dart benchmark.

Raafat Abualazm·19 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TREK: Distill to Explore, Reinforce to Refine

TREK uses distillation to expand exploration support in GRPO, enabling reasoning on hard prompts via black-box or white-box teacher trajectories.

Yuanda Xu·20 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Progressive Refinement: An Iterative Pseudo-Labeling Approach for Mandarin-English Code-Switching ASR

Iterative pseudo-labeling improves Mandarin-English code-switching ASR by leveraging unlabeled data via semi-supervised training.

Qu Yang·20 days ago

50 stories