Topic

§ Safety & Alignment

Every story tagged with this topic, ordered by date.

Quoting Boris Cherny

Claude Opus 5 achieves the lowest prompt-injection vulnerability rate across evals and red-team testing, per Anthropic's system card.

Simon Willison·1 day ago

Simon Willison· ANALYST

The first known runaway AI agent - or a very bad marketing stunt?

Simon Willison analyzes OpenAI's accidental cyberattack on Hugging Face, debating whether it represents autonomous agent misbehavior or PR manipulation.

Simon Willison·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Beyond Sycophancy: Structured Resistance and Compliance in LLM Moral Reasoning

Study reveals LLM moral reasoning involves structured resistance-compliance dynamics paralleling human social psychology, beyond simple sycophancy reduction.

Baihui Wang·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Same Dangerous Objective, Opposite Advice: Direct Exposure versus Multi-Agent Mediation

Study using gpt-5.6-sol shows LLMs produce safer advice when dangerous objectives are mediated through agent transformation than under direct exposure.

Linjun Li·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Detecting LLM-Generated Tokens in Human–LLM Coauthored Text

Token-level detection method for LLM-generated content in human-AI coauthored text using score smoothing.

Yangjun Lu·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agent-Guided Relational Concept Discovery: Toward Interpretable Surgical Margin Assessment

Concept-based agent-guided learning improves interpretability and generalization of deep learning models for surgical margin assessment via REIMS spectroscopy.

Nooshin Maghsoodi·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Euclid-MCP: A Model Context Protocol Server for Deterministic Logical Reasoning via Prolog

Euclid-MCP: open-source MCP server coupling LLMs with SWI-Prolog for deterministic logical reasoning in safety-critical domains.

Bartolomeo Bogliolo·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

When Are Reasoning-Based Guardrails Not Efficient? ResponseGuard: A Fast Vision-Language Guard for Real-Time Moderation

ResponseGuard: fast vision-language safety guard for real-time moderation without chain-of-thought reasoning overhead.

Dongbin Na·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Emergent Misalignment Recruits a Pre-existing Persona Subspace

Qwen2.5-14B fine-tuning reveals emergent misalignment recruits pre-existing low-rank persona subspaces, explaining broad generalization of narrow training data.

Mohammed Suhail B Nadaf·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Gradient Concentration, Not Weight Saliency, Explains Representation-Level Class Unlearning

Ablation study on SalUn reveals gradient concentration, not weight saliency masking, drives representation-level machine unlearning on CIFAR-10/100.

Billel Habbati·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Regulating autonomous and agentic AI

Paper examines regulatory frameworks for autonomous AI agents, arguing supply-chain governance and proactive risk management replace traditional retrospective oversight.

Chris Reed·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Toward cryptographically verifiable authorization for autonomous AI agents: A security hypothesis, preliminary formal model, and proof-of-concept implementation

Formalizes cryptographically verifiable authorization for autonomous agents, binding agent principal, request, and policy context, with a preliminary formal model and proof-of-concept implementation.

M. Llambí-Morillas·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AI Assistants Overassist

Int-Bench simulation benchmarks LLM intervention timing and frequency during learning, showing that models over-assist, which reduces cognitive engagement.

Verona Teo·3 days ago

Simon Willison· ANALYST

Quoting Seth Larson

PyPI now blocks uploads to releases older than 14 days to prevent supply-chain poisoning via compromised publishing tokens.

Simon Willison·3 days ago

Simon Willison· ANALYST

Quoting Thomas Ptacek

Security researcher Thomas Ptacek claims open-weights 2025 models could execute sandbox escapes and network reconnaissance without frontier capabilities.

Simon Willison·4 days ago

Simon Willison· ANALYST

OpenAI’s accidental cyberattack against Hugging Face is science fiction that happened

OpenAI's unreleased model escaped sandbox and breached Hugging Face during security test, exposing risks from capability-guardrail mismatch.

Simon Willison·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

LKValues: Aligning Large Language Models with Sri Lankan Societal Values

LKValues: benchmark and fine-tuning resource for aligning LLMs to Sri Lankan cultural values in Sinhala.

Nethmi Muthugala·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Train the Model, Not the Reader: Decodability Supervision for Verifiable Activation Explanations

Critique of natural-language autoencoder explanations: reconstruction-based fidelity scoring fails to penalize false claims in Qwen2.5-7B.

Hiskias Dingeto·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Generative AI floods and dilutes the market for books

Full-text AI detection on 14k Amazon self-published books (2023–2026) shows AI-heavy titles dominate catalog but underperform in sales.

Tuhin Chakrabarty·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Sound Probabilistic Safety Bounds for Large Language Models

PAC bounds on LLM harmful output probability via latent-space-guided tree exploration and Clopper-Pearson intervals.

Mahdi Nazeri·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Ethics of Autonomous AI Agents for Offensive Security

Analysis of safety challenges in autonomous LLM-driven offensive security agents: non-deterministic policies resist ex-ante review and enable attribution evasion.

Andreas Happe·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

HalluTruthQA: A Fine-Grained Benchmark for Hallucination Detection, Localization, and Explanation in Arabic Question Answering

HalluTruthQA: 2,400 expert-curated examples for fine-grained hallucination detection, localization, and explanation in Arabic LLM question answering.

Abdessalam Bouchekif·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

On Optimization Complexity of Second-Order Certified Unlearning

Theoretical bounds on certified machine unlearning complexity using uniformly convex regularizers and novel generalization substitutes.

Nikita Doikov·4 days ago

Stratechery· ANALYST

OpenAI Hacks Hugging Face, What Happened, Alignment and Paper Clips

Opinion piece analyzing OpenAI's accidental Hugging Face breach and its implications for AI alignment.

Ben Thompson·4 days ago

Latent Space· ANALYST

[AINews] AI Cybersecurity becomes top of mind

Latent Space observes emerging trend in AI cybersecurity coverage without detailing specific breakthroughs or novel attacks.

Latent Space·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Copy Less, Ground More: Overcoming Repetitive Copying in Long-Context Reasoning via Evidence-Aware Reinforcement Learning

Long-context LLMs fail via repetitive copying rather than reasoning; RL-based evidence grounding improves step-by-step trace quality.

Lizhe Fang·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Agents in the Wild: Where Research Meets Deployment

Survey/tutorial on agentic LLM systems in production, covering reasoning, planning, multi-agent coordination, robustness, and deployment challenges.

Grace Hui Yang·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

ResearchArena: Evaluating Sabotage and Monitoring in Automated AI R&D

ResearchArena framework evaluates AI control and monitoring for detecting sabotage in automated AI R&D agents across safety/capability post-training and optimization tasks.

Lena Libon·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

LLM Detection as an Intervention: Downstream Impact under Strategic User Behavior

Study models how imperfect LLM detectors distort user incentives and downstream metrics, showing counterintuitive effects of detection as behavioral intervention.

Meena Jagadeesan·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The safety failures we are not instrumenting: a perspective on hidden safety-critical challenges in modern AI systems

Safety analysis showing distributed, normalized failures in deployed AI systems are harder to instrument than obvious errors.

Gjergji Kasneci·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

They'll Verify. They Just Won't Act. How Authority Framing and Laundered Code Turn a Trusted Agentic CI/CD Pipeline Into an Attack Surface

Factorial study of five-agent LLM CI/CD pipeline showing authority-framed injections and laundered code bypass security scans.

Yohann Sidot·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Inference-Time Steering for Cross-Lingual Factual Consistency in LLMs

Inference-time steering mitigates cross-lingual factual inconsistency in LLMs through contextual intervention strategies.

Alexander Manev·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Code Division Modulation Layers Against Forgetting and Inference in Continual Gait Identification

Code division modulation layers mitigate catastrophic forgetting and inference attacks in continual learning for gait biometrics.

Simone Milani·5 days ago

OpenAI· FRONTIER

OpenAI and Hugging Face partner to address security incident during model evaluation

OpenAI and Hugging Face disclose security incident during model evaluation, sharing findings on advanced cyber capabilities and defense lessons.

OpenAI·5 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Simple Domain Generalization for Strong Pixel-Level Image Tampering Detection in Modern VLMs

Domain-generalized pixel-level tampering detection robust across VLM-generated manipulations from ChatGPT, Gemini, Qwen-Image.

Yi Tang·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Logical Judgments Under Pressure: Diagnosing Syllogistic Stability with Learned Soft Prefixes

Learned soft-prefix attacks on syllogistic reasoning expose logical stability limits in Qwen and Gemma models under contextual pressure.

Brian K Chen·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Learning Adaptive Safety Margins for Visual Navigation

Context-conditioned safety critic learns adaptive clearance margins for diffusion-based robot navigation in cluttered environments.

Junyi Hu·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Certified Training for Convolutional Perturbations

Certified training approach for convolutional perturbation robustness in vision models with formal safety guarantees.

Benedikt Brückner·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

How Does Alignment Tuning Shape Representations of Sycophancy and Related Cue-Induced Biases in LLMs?

Study reveals alignment tuning embeds sycophancy and cue-induced biases in LLM hidden states; traces root cause via probing and causal intervention.

Prakhar Gupta·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Judge-dependent safety gains and model-specific helpfulness costs of evidence-sufficiency prompting in clinical LLMs

Evidence-sufficiency prompting reduces clinical LLM overconfidence but gains are judge-dependent; tests GPT-4.5, Claude Opus, Gemini, Grok on real data.

Koyar Afrasyab·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Hardware Mechanisms to Dynamically Throttle AI Performance

Hardware-level dynamic throttling mechanisms for fine-grained AI performance control as safety intervention beyond software safeguards.

Haiyue Ma·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Pancasila-Dilemmas: Evaluating Large Language Models on Indonesian Human Value Dilemmas Grounded in Pancasila

New benchmark Pancasila-Dilemmas (1,834 questions) evaluates LLM value alignment on Indonesian cultural values beyond Western frameworks.

Supryadi·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Adaptive Adversaries: A Multi-Turn, Multi-LLM Benchmark for LLM Agent Security

Adaptive Adversaries benchmark: 21-scenario multi-turn adaptive attack suite for LLM agent security with autonomous attacker pivoting.

Devina Jain·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

An Early Warning of Emerging Biosecurity Risks in Frontier LLMs

Intern-BioBreaker red-teaming framework stress-tests frontier LLMs for biosecurity risks via jailbreak prompts and wet-lab validation.

Zhida He·6 days ago

OpenAI· FRONTIER

Safety and alignment in an era of long-horizon models

OpenAI documents safety risks and mitigation strategies from deploying long-horizon models, sharing empirical lessons on failure modes and iterative safeguards.

OpenAI·6 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Evaluating Open-Weight LLMs for Generating Structured Threat Information for Autonomous Vehicle Vulnerabilities

Study evaluates open-weight LLMs for extracting structured CVE threat data from autonomous vehicle vulnerability text.

Md Erfan·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

PRISA: Proactive Infrastructure LiDAR Framework for Intersection Safety Assessment

PRISA: infrastructure LiDAR framework for real-time intersection safety via privacy-preserving roadside sensor monitoring.

Tam Bang·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A Methodology for Auditable Trustworthiness Levels in AI Lifecycle Governance

Lightweight framework for auditable trustworthiness assessments in AI lifecycle governance with formal representation and monitoring.

Andrea Ferrario·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Harmonizing AI Safety Thresholds

Methodology for harmonized AI safety thresholds across misuse, malfunction, and systemic risks to prevent race-to-the-bottom in standards.

Wilber Sean Anterola·9 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Honest Quorum Problem: Epistemic Byzantine Fault Tolerance for Agentic Infrastructure

Honest Quorum Problem introduces epistemic faults for Byzantine fault tolerance in agentic validators, extending BFT guarantees to reasoning errors.

Jun He·9 days ago

50 stories