The Strength of Gemini Omni is in video manipulation
Google's Gemini Omni demonstrates strong video manipulation capabilities, a key technical strength of the multimodal model.
Every story tagged with this topic, ordered by date.
Nathan Witkin critiques METR's Long Tasks benchmark methodology, identifying severe flaws in the widely-cited AI time horizons graph.
MobileGym enables parallel RL training for mobile GUI agents via deterministic JSON-based state representation, supporting hundreds of concurrent instances.
Method conditions diffusion models on MLLMs for subject-driven image generation, improving identity preservation and instruction following via joint text-image encoding.
LoopMDM: selective layer looping in masked diffusion models improves training efficiency and performance, enabling depth scaling without parameter increase.
LLM-based framework for structured code change labeling (renames, moves, logic edits) to improve code review efficiency beyond summarization.
Sleep-like consolidation mechanism converts context to fast weights in SSM blocks, mitigating attention's poor scaling with long contexts in LLMs.
Self-generated replay data from LMs nearly eliminates catastrophic forgetting on new tasks; forgetting persists only when model capacity is exhausted.
GoBOED: goal-driven Bayesian experimental design framework optimizing information gathering for decision-critical objectives under model uncertainty.
OrpQuant: geometric orthogonal residual projection for power-of-two quantization enabling multiplier-free LLM/ViT deployment on edge devices.
Channel-wise Vector Quantization replaces patch-wise image tokenization with channel-wise quantization for autoregressive visual models.
DiscoverPhysics benchmark tests LLM reasoning by having agents discover physics laws in simulated worlds with non-standard dynamics.
VeriTrace proposes explicit feedback mechanisms for research agents to evolve mental models and prevent error propagation when information is uncertain.
Wasserstein Policy Gradient for entropy-regularized RL proves global convergence using optimal-transport geometry for continuous control.
Active Query Synthesis for preference learning uses a confidence-aware response model to reduce labeling cost for pairwise comparisons.
WhoSaidIt applies human-LLM collaborative re-annotation to stabilize multilingual speaker-attribute labels via disagreement-focused sampling.
Conditional kernel ridge regression with unpenalized features via conditionally positive definite kernels; primarily theoretical ML contribution.
Neuronal Stochastic Attention Circuit for uncertainty in continuous-time representation learning using C. elegans-inspired circuits.
Neural operator surrogates for Bayesian inverse design in CFD to accelerate MCMC sampling for aerodynamic geometry inference.
Multi-objective textual gradient optimization for LLM judges: gradient conflicts cause failure modes; six decomposition modes tested.
Uncertainty quantification for activation oracles interpreting LLM internals; bootstrap frequency is best-calibrated confidence method.
L2IR: graph fraud detection combining GNNs and LLMs to infer fraudster intent behind suspicious connections.
DRScaffold: training lightweight vision-language models for dense-scene reasoning with explicit grounding between inference steps.
Transformer paper co-author argues for moving beyond the transformer architecture; Pathway's post-transformer research is gaining attention from the technical community.
MLP-LDRU architecture improves length generalization in neural networks via log-depth recurrent units with associativity-biased operators.
SKILD unifies image generation and super-resolution via scale-invariant diffusion in K-space, leveraging scale invariance in natural and physical systems.
3D foundation model for light sheet fluorescence microscopy enables few-shot segmentation and classification of volumetric biological imaging data.
RAG framework detects potentially abusive clauses in Chilean Terms of Service using retrieval-augmented generation for legal document analysis.
STORM internalizes spatial-temporal reasoning in video-language models via implicit visual memory instead of externalizing it to textual chain-of-thought.
AdvantageFlow applies advantage-weighted RL to forward-process diffusion optimization in Stable Diffusion, outperforming reverse-process baselines.
Orthogonal bottleneck representation prior constrains RL encoder features to low-dimensional subspaces without auxiliary objectives or pretraining.
MAGIC: training-free coreset selection for vision-language model instruction tuning via multimodal alignment signals.
Framework for systematizing GenAI evaluation concepts (reasoning, fairness, creativity) into measurable definitions using AI assistance.
Statistical inference methodology for SGD trajectories in infinite-variance regimes via weak convergence theory.
Causal inference methods for LLM development decisions: data mixtures, reward models, routing, and evaluation.
Deployment-complete benchmarking: framework ensuring benchmark evidence resolves deployment decisions via conformal coverage.
Fuzzy PyTorch: framework for evaluating numerical variability in deep learning via stochastic arithmetic integration.
Medical RAG training failure analysis: checker output distribution determines gradient quality; identifies signal collapse and reward hacking.
Neural-symbolic framework for complex query answering over knowledge graphs with multiple free variables.
68-cell empirical study: LLM agents show 19.69 pp higher sensitivity to semantic noise than to surface noise across reasoning tasks.
Empirical validation of creative quality alignment via chain-of-thought fine-tuning on small models with ~100 expert annotations.
ProAct: proactive agent architecture using idle-time compute to predict and prepare for future user requests via dialogue history analysis.
B³D-RWKV unifies causal RWKV with discrete diffusion via triplet-block layout, achieving O(L) inference with parallel bidirectional decoding.
Gradient-free, training-free watermark for synthetic audio via token vocabulary redundancy, robust to discretization errors.
Large factorial grid study (720 runs) shows optimal learning-rate schedule for sub-100M QAT is invariant across FP16/INT8/INT6 bit-widths.
LECTOR grounds scientific introduction generation via reasoning graphs and structured content to reduce hallucinated citations.
Continual unlearning method for speaker identity in zero-shot TTS, preventing revival of previously unlearned voices under sequential removal.
PolyGnosis 2.0 multi-agent system detects predictive trading signals via Polymarket-GDELT narrative mismatches and harness engineering.
QUIET benchmark for evaluating LLM creative generation (not discriminative ability) via multi-blank cascaded story cloze with objective scoring.
Step-TP: step-level dataset with CoT reasoning for LLM-guided tensor program optimization, enabling composable transformation decisions.