-
Solaris: Building a Multiplayer Video World Model in Minecraft
by Georgy Savva et al.
-
Soft Contamination Means Benchmarks Test Shallow Generalization
by Ari Spiesberger et al.
-
Visually Prompted Benchmarks Are Surprisingly Fragile
by Haiwen Feng et al.
-
BabyVision: Visual Reasoning Beyond Language
by Liang Chen et al.
-
Vision Encoders in Vision-Language Models: A Survey
by Han Xiao
-
Next-Embedding Prediction Makes Strong Vision Learners
by Sihan Xu et al.
-
What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models
by Luciano Floridi et al.
-
Olmo 3
by Team Olmo
-
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
by Charlie Zhang et al.
-
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
by Weihao Tan et al.
-
Questioning the Stability of Visual Question Answering
by Amir Rosenfeld et al.
-
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
by Yanqing Liu et al.
-
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
by Alex Cloud et al.
-
The Term 'Agent' Has Been Diluted Beyond Utility and Requires Redefinition
by Brinnae Bent
-
Vision Language Models are Biased
by An Vo et al.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
by GLM-4.5 Team et al.
-
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
by Han Lin et al.
-
Seed1.5-VL Technical Report
by Dong Guo et al.
-
Emerging Properties in Unified Multimodal Pretraining
by Chaorui Deng et al.
-
Harnessing the Universal Geometry of Embeddings
by Rishi Jha et al.
-
Transfer between Modalities with MetaQueries
by Xichen Pan et al.
-
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
by Shiqi Chen et al.
-
Rethinking Visual Layer Selection in Multimodal LLMs
by Haoran Chen et al.
-
Perception Encoder: The best visual embeddings are not at the output of the network
by Daniel Bolya et al.
-
Scaling Laws for Native Multimodal Models
by Mustafa Shukor et al.
-
Science-T2I: Addressing Scientific Illusions in Image Synthesis
by Jialuo Li et al.
-
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
by Xu Ma et al.
-
Scaling Language-Free Visual Representation Learning
by David Fan et al.
-
A Decade's Battle on Dataset Bias: Are We There Yet?
by Zhuang Liu et al.
-
Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
by Samuel Stevens et al.
-
Pretrained Transformers as Universal Computation Engines
by Kevin Lu et al.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
by Jonas Geiping et al.
-
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
by Paul Gavrikov et al.
-
Open Problems in Mechanistic Interpretability
by Lee Sharkey et al.
-
Why Do We Need Weight Decay in Modern Deep Learning?
by Francesco D'Angelo et al.
-
Vision-Language Models Do Not Understand Negation
by Kumail Alhamoud et al.
-
ICONS: Influence Consensus for Vision-Language Data Selection
by Xindi Wu et al.
-
The GAN is dead; long live the GAN! A Modern GAN Baseline
by Yiwen Huang et al.
-
The Unbearable Slowness of Being: Why do we live at 10 bits/s?
by Jieyu Zheng et al.
-
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
by Yohan Mathew et al.
-
Analyzing (In)Abilities of SAEs via Formal Languages
by Abhinav Menon et al.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
by Shengbang Tong et al.