The Archive

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammat...

Mahmoud Abujadallah·1 month ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a s...

Mohammad Beigi·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs (the emergence and intensification of links between research concepts) by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparati...

Thomas Maillart·2 months ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Lightweight CNN-Based Anomaly Detection for High Voltage Converter Modulators in the Spallation Neutron Source

Unscheduled trips of high-power pulsed converters are a leading source of downtime at large accelerator facilities. At the Spallation Neutron Source (SNS), the High Voltage Converter Modulators (HVCMs) are consistently the second-largest contributor to lost beam time. Each HVCM pulse is recorded across sensor channels spanning currents, voltages, and magnetic fluxes, whose mutual interactions encode the operating state of the system. Fault precursors do not manifest uniformly across these channels: depending on fault type, they may alter the temporal structure of individual signals, change th...

Alberto D. Cencillo·2 months ago

r/ClaudeAI· COMMUNITY

I spent $340 on AI subscriptions last month. Wrote down what I actually used each one for. It was depressing.

Going through the credit card statement, here's what I had active: Claude Pro (40), ChatGPT Plus (20), Cursor (20), Perplexity Pro (20), Notion AI (10), Granola (20), ElevenLabs Starter (5), Midjourney Basic (10), Gamma Pro (10), Beautiful.ai (12), Otter Pro (17), Loom Business (15), Zapier Pro (30), Make Core (10), Tactiq Pro (8), Descript Creator (15), Reclaim.ai Pro (8), Motion (19), Superhuman (30), one I can't remember the name of (10), some ai-something for instagram captions (11). Then I sat down and wrote next to each one the last time I'd actually used it. Not opened it, used it for...

u/OneSeaworthiness2676·2 months ago·20 pts / 31 comm

r/LocalLLaMA· COMMUNITY

SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More

Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months. We’ve updated the **SWE-rebench leaderboard** with **110 fresh Python tasks** from GitHub PRs created in **March, April, and part of May**. The setup follows the standard SWE-bench format: models read real PR issues, edit code, run tests, and must make the full test suite pass. This time, instead of our usual monthly updates with a smaller number of tasks, we collected a larger batch so we could evalua...

u/CuriousPlatypus1881·2 months ago·41 pts / 26 comm

r/ClaudeAI· COMMUNITY

Claude records demo videos for me now

I hate recording demo videos, so I made an open source skill for it (https://github.com/MobAI-App/desktop-recorder-skill). Now I can give Claude a prompt like "Record a short demo of this app flow," and it handles the annoying parts for me: preparing the app state, clicking through the flow, recording, adding cursor/click effects and captions, then exporting the video. So instead of spending time setting everything up and recording the same demo manually, I can let Claude do it while I work on something else. It also has Remotion integr...

u/interlap·2 months ago·21 pts / 6 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

HarnessAPI unifies LLM tool and HTTP API definitions from a single Python source, eliminating duplication across Claude and Cursor agent runtimes.

Edwin Jose·2 months ago

r/singularity· COMMUNITY

Gemini 3.5 flash is not that great at coding

Cursor's evals show that Gemini 3.5 Flash underperforms competing models on coding tasks.

u/NoFaithlessness951·2 months ago·105 pts / 52 comm

VentureBeat AI· PRESS

Google just redesigned the search box for the first time in 25 years — here’s why it matters more than you think.

For a quarter century, the Google search box has been one of the most recognizable interfaces in computing: a thin white rectangle, a blinking cursor, a few typed words, and a list of blue links. On Tuesday, Google will formally retire that paradigm. At its annual I/O developer conference, Google announced a sweeping redesign of the search box itself — the literal text field where billions of queries begin every day — transforming it from a simple keyword input into a dynamic, AI-driven conversation starter that can accept text, images, PDFs, videos, and even open Chrome tabs as inputs. The c...

michael.nunez@venturebeat.com (Michael Nuñez)·2 months ago

r/LocalLLaMA· COMMUNITY

Public Repository "Codegraph" claims to reduce Claude, Cursor, Codex, and OpenCode API tool calls by 94% locally, an innovation that could directly offset the most recent Claude API pricing model.

The Codegraph tool uses pre-indexed knowledge graphs and claims to reduce Claude API tool calls by 94% and latency by 82% for code analysis tasks.

u/NetTechMan·2 months ago·49 pts / 16 comm

r/ClaudeAI· COMMUNITY

I tracked every dollar I spent on AI coding tools for 60 days and math is uglier than I thought but probably not in the way you'd guess.

A solo freelancer tracked 60 days of AI coding-tool spending and productivity ROI across Cursor, Claude, and other services.

u/thewritingwallah·2 months ago·20 pts / 14 comm

r/ClaudeAI· COMMUNITY

Multi-repo orchestration

Anyone know of a solution for tying in multiple IDE sessions with a multi-repo project so that they work cooperatively with a single shared inbox/memory? Here is my use case (whether it’s with or without the use of Storybloq): - all sessions are running Storybloq which saves root level /.story tickets and issues or if I have multiple projects I store each of them in /projects/<project_name>/.story - have three repos open in Cursor with 1-2 sessions each - have a master Cursor session open at the root level with /Sites/.story I use the master session for any multi-repo or...

u/achilleshightops·3 months ago·28 pts / 5 comm

r/ClaudeAI· COMMUNITY

Three browser games built with Claude (25M plays). Two of them are 8,000-line HTML files.

A developer with no prior coding experience built three browser games with Claude and Cursor in three months, reaching 25M+ plays; the post documents rapid prototyping and user adoption.

u/gteehan·3 months ago·25 pts / 29 comm

r/LocalLLaMA· COMMUNITY

Open source models are going to be the future on Cursor, OpenCode etc.

A user reports high API costs for Claude Opus and GPT-5.5 on Cursor and predicts that open-source models will displace proprietary ones on tools like Cursor and OpenCode.

u/_maverick98·3 months ago·42 pts / 43 comm

r/LocalLLaMA· COMMUNITY

If you've been waiting to try local AI development, please try it

A developer reports that a local Qwen 27B setup running on llama-server is now competitive with Claude Code and Cursor for coding tasks, and recommends trying it as cloud-provider costs rise.

u/Imaginary_Belt4976·3 months ago·48 pts / 30 comm

TechCrunch AI· PRESS

Replit’s Amjad Masad on the Cursor deal, fighting Apple, and why he’d rather not sell

At TechCrunch's sold-out StrictlyVC event in San Francisco on Thursday night, we covered a lot of ground in a short time, beginning with the question everyone in the industry is asking right now: in a world where rival Cursor is reportedly in talks to be acquired by SpaceX for $60 billion, is Replit also bound to sell?

Connie Loizos·3 months ago

r/Anthropic· COMMUNITY

Claude-powered AI coding agent deletes entire company database in 9 seconds — backups zapped, after Cursor tool powered by Anthropic's Claude goes rogue

This is pretty funny tho ngl

u/69420lmaokek·3 months ago·12 pts / 4 comm·+ covered by others

r/ClaudeAI· COMMUNITY

What Claude Design does really well (and not so well)

I did a deep dive on Claude Design and below are my thoughts. What it does extremely well: * **Improves your prompt** - similar to "ask me questions" when chatting to an LLM. Can make the difference between slop and actually useful. * **Invokes agent skills for you** - a game changer for people who don't live in the terminal * **Claude Code handoff** - easily get Claude Code to build it for real with a simple link share. Genius. * **Comment feature** - spatial editing (similar to Cursor and a few others), but selection is very accurate and I like how you can queue up edits and select wh...

u/the-design-engineer·3 months ago·21 pts / 7 comm

TechCrunch AI· PRESS

Apple’s new CEO, and why Elon Musk wants to buy Cursor for $60B

A new era is on the way for Apple as Tim Cook plans to step down from his CEO role in September, handing the reins to hardware chief John Ternus. Ternus may be inheriting one of the most durable businesses in tech, but he’s also stepping into a very different ecosystem than the one Cook spent decades shaping. The App […]

Theresa Loconsolo, Kirsten Korosec, Anthony Ha, Sean O'Kane·3 months ago

Stratechery· ANALYST

2026.17: He Came, He Saw, He Cooked

Stratechery weekly digest covering Tim Cook's Apple departure, Cursor IDE, SpaceX developments, and geopolitical competition.

Ben Thompson·3 months ago

TechCrunch AI· PRESS

How SpaceX preempted a $2B fundraise with a $60B buyout offer

Cursor was on track to close a $2 billion funding round this week but chose to halt discussions after SpaceX offered a $10 billion "collaboration fee" and a path to a $60 billion acquisition.

Marina Temkin·3 months ago

Stratechery· ANALYST

John Ternus and Apple’s Hardware-Defined Future, SpaceXAI and Cursor

Commentary on Apple's appointment of John Ternus and its implications for Apple's hardware-AI strategy, with a tangential reference to the SpaceX-Cursor partnership.

Ben Thompson·3 months ago

Latent Space· ANALYST

[AINews] OpenAI launches GPT-Image-2

OpenAI launches GPT-Image-2; Cursor secures $10B contract with xAI and $60B acquisition option.

Latent Space·3 months ago
