The Strength of Gemini Omni is in video manipulation
Google Gemini Omni demonstrates strong video manipulation capabilities, highlighting a key technical strength of the multimodal model.
Every story tagged with this topic, ordered by date.
Method conditions diffusion models on MLLMs for subject-driven image generation, improving identity preservation and instruction following via joint text-image encoding.
Prism: plug-in infrastructure for multimodal continual instruction tuning of MLLMs, addressing engineering bottlenecks via modular method-agnostic design.
Channel-wise Vector Quantization replaces patch-wise image tokenization with channel-wise quantization for autoregressive visual models.
DRScaffold: training lightweight vision-language models for dense-scene reasoning with explicit grounding between inference steps.
3D foundation model for light sheet fluorescence microscopy enables few-shot segmentation and classification of volumetric biological imaging data.
STORM internalizes spatial-temporal reasoning in video-language models via implicit visual memory instead of externalizing to textual chain-of-thought.
MAGIC: training-free coreset selection for vision-language model instruction tuning via multimodal alignment signals.
Numind releases NuExtract3, open-weight 4B multimodal VLM for document extraction and Markdown conversion under Apache-2.0.
Benchmark of vision LLMs vs. OCR pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc shows LlamaCloud + Azure premium achieving 59.6%–58.5% accuracy; agentic RAG and native PDF vision approaches compared on cost and accuracy.
OpenAI researcher demonstrates AI-based multi-character motion capture in a creative collaboration with a choreographer.
Meituan releases LongCat-Video-Avatar 1.5, open-source audio-driven human video generation framework with AT2V, ATI2V, and video continuation tasks.
SpaceNum benchmark tests whether VLMs genuinely ground numerical outputs in spatial perception via dynamic and static reasoning tasks.
ETCHR decouples image editing from understanding in MLLMs to improve visual reasoning without predefined toolkits or noisy generation.
PGT generates procedural geometric tasks to improve MLLM fine-grained visual grounding and diagnose perception failure sources.
FM-CGM modular framework leverages foundation models for visual causal reasoning and counterfactual generation.
ToolMerge: LLM-based query decomposition for keyframe retrieval in long-video QA with learned ranking merging.
Debiased negative mining technique improves out-of-distribution detection in vision-language models via semantic label selection.
Adversarial subspace alignment enables robust multimodal knowledge editing in MLLMs with improved generalization across visual and linguistic variations.
PhotoFlow Director-Reviewer-Reflector agent framework for 3D virtual photography combines spatial reasoning and aesthetic judgment through closed-loop search.
ChartFI benchmark evaluates faithfulness and insightfulness of MLLM-generated chart descriptions beyond fact enumeration.
CVSearch training-free framework for multimodal LLMs adaptively balances visual search strategies to improve high-resolution image perception efficiency.
Entity-centric latent patch memory for multi-shot video generation maintains character consistency without full-frame overhead.
PathNavigate applies training-free agentic VQA to whole-slide pathology images using surprise-guided navigation and memory caching.
DrawVideo generates long-form video from storyboard sketches, decomposing sequences into independently controllable shots via sketch/appearance/motion prompts.
Compares acoustic emotion recognition vs. LLM analysis (Gemini 2.5 Flash) for detecting pathos in political speech; single-speech case study.
ChronoVAE-HOPE time series foundation model replaces attention with a VAE for specialized classification, addressing quadratic complexity.
CEDAR method disentangles vision-language model embeddings via invertible transformation without expanding dimensionality, enabling sparse interpretability.
SegCompass uses sparse autoencoders to create interpretable alignment between LLM reasoning and visual segmentation.
Image-semantic detection method improves MLLM performance at detecting AI-generated modern Chinese poetry.
Discussion of whether production VLMs use fixed-patch ViTs or adaptive tokenization; explores engineering vs. scaling trade-offs.
Google I/O 2026: Gemini Omni and 99 other announcements; focus on multimodal AI and platform expansions.
WikiVQABench benchmark combines Wikipedia images with Wikidata for knowledge-grounded visual question answering evaluation.
Benchmark distinguishing temporal vs. spatial glitch detection in VLMs for game quality assurance; finds temporal glitches substantially harder than frame-level anomalies.
Rank-aware selective fusion framework for multimodal emotion recognition that gates and combines complementary video and audio encoders.
LoCar introduces evaluation framework for in-vehicle LLM assistants with focus on Korean honorific stability and localization.
SpectralEarth-FM integrates hyperspectral imagery with multisensor Earth observation data via hierarchical transformer with spectral tokenization.
Driving VLA redesigned via inverse kinematics framework to improve trajectory prediction by grounding visual tokens in dual boundary conditions.
Theoretical framework for training multimodal LLMs using only pairwise modality alignments instead of full joint multimodal datasets.
ArPoMeme introduces a dataset of 7,300 Arabic political memes labeled by ideology (Leftist, Islamist, Pan-Arabist, Satirical) for multimodal polarization analysis.
Inter-layer visual attention discrepancy method reduces hallucinations in LVLMs by detecting insufficient attention to correct visual evidence during generation.
SPpruner applies focus-then-context visual token reduction in VLMs, reducing inference cost while preserving subject-centric and contextual relationships.
Google I/O 2026: Gemini 3.5 Flash, multimodal Omni, Spark background agents, Antigravity 2.0.
Decoupled post-training of vision-language models shows visual perception, not reasoning, is primary bottleneck in VLM performance.
ClinSeekAgent: agentic framework automating active multimodal evidence seeking across heterogeneous clinical data sources.
Causal evaluation framework assesses faithfulness of vision-language model visual attribution in chest X-ray diagnosis.
VL-DPO aligns autonomous driving motion forecasting with human preferences using vision-language model guidance and DPO finetuning.
GeoX uses self-play with verifiable program-based rewards to learn geospatial reasoning from satellite/aerial imagery without human annotations.
Reddit discussion of Gemini Omni's inability to generate real-world physical actions, highlighting gap between multimodal capability claims and embodied task execution.
LLM method for generating multimodal agent behaviors (verbal, vocal, gestural, facial) calibrated to trustworthiness dimensions.