EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory scales tool-use agents via synthetic executable environments and robust RL, addressing data and execution bottlenecks.
Knowledge distillation transfers tabular foundation models to lightweight models on healthcare data via stratified out-of-fold labeling.
Personalized blood biomarker interpretation via learned representations accounts for intra-patient variability and baseline deviation.
This story originally appeared in The Algorithm, our weekly newsletter on AI. When Google opens its doors tomorrow for its annual developer conference, I/O, it will do so as a clear third place in the foundation model race. A year ago, at Google I/O…
Elon Musk's claim that he was mistreated by his OpenAI cofounders failed after nine California jurors decided in a unanimous verdict that his lawsuits had been filed too late.
PopPy auto-discovers parallelization opportunities in Python compound AI applications to reduce end-to-end latency.
Ensemble study of six tabular foundation models reveals high redundancy (Q=0.961) limiting ensemble gains to +0.18% at 253× cost.
AdaGrad and adaptive gradient methods converge under heavy-tailed noise without explicit clipping via implicit noise robustness.
SkillGenBench isolates skill generation as a benchmark task for LLM agents, measuring ability to create reusable executable skills from repositories.
Framework uses LLMs as OR experts to dynamically re-optimize operational models via natural-language interaction, addressing real-world constraint evolution.
Explores explainability of ML methods for quantum-gas physics experiments, focusing on image denoising and solid identification in cold-atom systems.
Reversa framework converts legacy software into operational specs via multi-agent pipelines, enabling AI agents to safely modify existing systems.
Proposes metric for evaluating XAI methods via continuous input perturbation, assessing sufficiency and necessity without ground-truth labels.
Lance: lightweight unified multimodal model using dual-stream MoE architecture for image/video understanding, generation, and editing via multi-task training.
COOPO framework cycles between offline RL training and online fine-tuning to mitigate distributional shift and catastrophic forgetting in hybrid settings.
Improves GNN-based generalized planning policies using efficient lookahead encoding and abstracted width for classical planning domains.
Argues generative AI advertising enables undetected commercial intervention in LLM outputs, framing it as a trustworthy-intervention problem rather than a content-placement problem.
Position paper: safe LLM agent deployment requires three-layer probabilistic assume-guarantee architecture covering semantic, environmental, and dynamical constraints.
Proposes complementarity-based evaluation framework for Earth observation embedding models through fusion rather than isolation.
Empirical study showing shallow DNNs with ReLU activation provide adversarial robustness in ML-based network intrusion detection without explicit defenses.
GIM benchmark of 820 problems evaluates LLMs on integrated multi-domain reasoning grounded in practical contexts, addressing saturation of existing benchmarks.
Theoretical PAC learning result: efficient algorithm for learning multiclass linear classifiers under malicious noise with marginal distribution assumptions.
Comprehensive analysis of AI-assisted research across full lifecycle (Apr 2026): automated systems generate papers cheaply but fabricate results and miss errors.
KairosHope time-series foundation model replaces attention with dual-memory HOPE block for specialized classification, integrating statistical knowledge.
FedHybrid algorithm balances accuracy, privacy, and communication in differentially private federated learning via improved FedAvg initialization.
Distills tabular foundation models into CPU-executable XGBoost/CatBoost via stratified OOF labeling, achieving <2ms latency for fraud detection.
Rep. Schiff proposes legislation requiring data centers to cover full electricity costs, targeting AI infrastructure economics and energy demand.
Controlled study on MNIST decouples human soft-label benefits from label correction, showing that captured uncertainty improves calibration independently.
Mechanistic analysis of language-switching backdoor in 8B LM: three-phase circuit where Latin trigger redirects English→French via attention and subspace propagation.
ZEDA enables post-trained MoE models to skip ~50% of experts via self-distillation, reducing inference cost without retraining.