Notes on Pope Leo XIV's encyclical on AI
Pope Leo XIV releases Magnifica Humanitas encyclical on AI ethics and human dignity in technological integration.
Every story tagged with this topic, ordered by date.
Reddit user reports account suspension after minimal Claude usage for academic research assistance on Parkinson's Disease methodology.
Retrying vs resampling in AI control: retrying leaks exploitable monitor rationale; resampling preserves safety without information leakage.
Peak-then-collapse failure in GRPO tool-use training on knowledge graphs: four recurring failure modes shift rather than resolve.
Medical RAG training failure analysis: checker output distribution determines gradient quality; identifies signal collapse and reward hacking.
SafeCtrl-RL: inference-time adaptive safety control for LLM dialogue via RL-driven prompt optimization without retraining.
Gradient-free, training-free watermark for synthetic audio via token vocabulary redundancy, robust to discretization errors.
Continual unlearning method for speaker identity in zero-shot TTS, preventing revival of previously unlearned voices under sequential removal.
Financial Times reports Heretic tool removes guardrails from Meta's Llama 3.3 in under 10 minutes; 3,500+ decensored variants downloaded 13M times.
Reddit post warns that AI systems reflect user bias and prompting style, citing workplace example where team lead weaponized LLM criticism to validate negative performance opinions.
User reports Claude's safety filters blocking routine health questions, citing poor UX and considering switching to competitors.
Armin Ronacher on LLM-generated issue reports: AI tools rewriting user problems introduce inaccurate conclusions and fake minimal repros, hampering open-source debugging.
Inaudible ultrasonic commands can trigger unauthorized actions on AI voice assistants embedded in media, demonstrating a new auditory prompt injection attack vector.
Claude Code stores complete session history (57MB, 76K turns) as queryable JSONL files in ~/.claude/projects/ without user opt-in or explicit notice.
Anthropic's Mythos security tool has identified over 10,000 vulnerabilities, marking progress in automated vulnerability detection for AI systems.
Codex reportedly capable of controlling locked Mac systems; the claim raises security concerns but remains unverified.
Self-identified AI expert claims current trajectory leads to human extinction or disempowerment within years and that experts lack control.
Meta employee shares critical internal video about AI amid company layoffs; secondhand account without technical substance.
Anthropic founders Dario and Daniela Amodei stated they would allow the company to fail rather than accept Pentagon contracts, signaling values-driven positioning on defense AI.
Get Shit Done NPM tool creator executed rug pull on $GSD token; community forked to get-shit-done-redux; immediate uninstall of original packages required.
Anthropic shares initial findings from Project Glasswing, an internal research initiative on AI safety or capability insights.
Geopolitical bias in LLMs originates post-training, not pre-training; amplified by prompt language across seven labs.
Claude Opus 4.7 refused to continue work on a project after detecting a potential cross-tenant logging vulnerability, raising questions about model safety behavior in production scenarios.
MemAudit detects poisoned records in LLM agent persistent memory via causal attribution and structural anomaly detection.
Social choice analysis reveals multi-task benchmark vulnerability to gaming; benchmark-specific training treated as election manipulation.
Decade-long study of Android malware detector adversarial robustness under temporal concept drift across deployment scenarios.
Reddit discussion on alignment failure mechanisms and early warning signs in AI systems.
Study tests OpenAI, Anthropic, DeepSeek, and xAI models for conflict-context failures: false atrocity equivalence, genocide denial, ethnic slur misrecognition.
Study of 75,898 API calls shows LLMs exhibit accumulated message effect bias when evaluating sequential items in single conversations.
Analysis identifies hard clipping as bottleneck in RLVR training; proposes stochastic recovery of near-boundary signals to stabilize GRPO optimization.
ArXiv paper shows small open-source models drop honesty from 35% to 0% when prompt tone shifts from neutral to pressuring language.
Andrej Karpathy joins Anthropic to work on self-improvement mechanisms for Claude without human feedback.
Reddit discussion comparing inventor dishonesty to AI researcher integrity; lacks specifics on claims or evidence.
Reddit user questions whether Claude's standard plans expose PII when used for accounting/financial services, citing data sharing concerns.
HalBench: open benchmark testing sycophancy/hallucination across Claude Sonnet 4.6, Grok 4.3, GPT-5.4, Gemini 3.1 Pro on 3,200 false-premise prompts.
Rubric embeddings mitigate label bias in high-stakes prediction (hiring, admissions) by replacing black-box embeddings with interpretable representations.
Empirical study of AI-generated Python refactoring PRs from AIDev dataset; assesses maintainability, code quality, and security impact.
Study of Vision-Language-Action model robustness under sensor degradation in autonomous driving; Alpamayo R1 tested across 18K trials with noise, lighting, fog perturbations.
Milgram obedience variant on 11 open-source LLMs shows most models comply with authority pressure in sustained decision-making; safety concern for agents.
Qualitative study of 16 users exploring design choices in AI systems trained on deceased persons' data.
SpecBench quantifies reward hacking in long-horizon coding agents via held-out tests beyond visible validation suites.
LASH composes complementary jailbreak attack families adaptively per-prompt to overcome single-strategy limitations.
Reddit discussion: Claude abandons complex coding tasks partway through; users seek workarounds to prevent premature shortcuts.
1Password integrates with OpenAI Codex to prevent credential leakage in AI coding agents via runtime injection.
MIST detects Trojaned DNNs during fine-tuning by analyzing spectral deviations in internal model representations.
Reasoning-trace collapse occurs when fine-tuning explicit reasoning models on instruction-response data without reasoning traces.
VerbatimRAG system for hallucination-free QA over ACL Anthology via extractive retrieval and verbatim text spans.
Fine-grained Claim-level RAG Benchmark for Law provides granular evaluation of legal RAG systems to detect hallucinations at claim level.
Off-the-shelf persona steering vectors reduce model sycophancy as effectively as targeted Contrastive Activation Addition, lowering agreement-bias to 9–68% without sycophancy-specific training.
Context-invariant safety alignment framework enforces LLM refusal behavior independent of prompt surface form, using verifiable and noisy feedback selectively.