Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection.- Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction.- CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images.- Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation.- Uni3DL: A Unified Model for 3D Vision-Language Understanding.- Object-Aware NIR-to-Visible Translation.- PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference.- GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator.-
BLINK: Multimodal Large Language Models Can See but Not Perceive.- AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation.- PreLAR: World Model Pre-training with Learnable Action Representation.- Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot.- De-confounded Gaze Estimation.- Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions.- FreestyleRet: Retrieving Images from Style-Diversified Queries.- ReGround: Improving Textual and Spatial Grounding at No Cost.-
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos.- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction.- Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement.- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders.- VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network.- Dataset Enhancement with Instance-Level Augmentations.- FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models.- Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild.-
Reliability in Semantic Segmentation: Can We Use Synthetic Data?.- SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning.- SCAPE: A Simple and Strong Category-Agnostic Pose Estimator.