📝 Publications

Research Direction 1: Referring Video Object Segmentation (RefVOS)


arXiv 2025 MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

Paper | Code | Project

Highlights: MomentSeg is an MLLM-based method that unifies temporal grounding and segmentation, enabling key-frame extraction without relying on any external models. It also introduces a novel [FIND] token that lets the model perform temporal grounding without requiring any additional timestamp encoding.
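A minimal, hypothetical sketch of the general idea behind such a grounding token (assuming a PyTorch-style decoder): the hidden state at the [FIND] position is fed to a small regression head that outputs a normalized (start, end) moment, so no explicit timestamp encoding is needed, and the predicted span can then drive key-frame sampling. All module names, shapes, and sampling choices below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FindSpanHead(nn.Module):
    """Illustrative head: regress a normalized (start, end) moment
    from the decoder hidden state at the [FIND] token position."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 2),  # (start, end) in [0, 1]
        )

    def forward(self, hidden_states: torch.Tensor, find_index: int) -> torch.Tensor:
        find_state = hidden_states[:, find_index]    # [B, hidden_dim]
        span = torch.sigmoid(self.mlp(find_state))   # [B, 2], normalized times
        return torch.sort(span, dim=-1).values       # ensure start <= end

# Toy usage: sample key frames around the predicted moment.
hidden = torch.randn(1, 128, 4096)                   # fake decoder states
span = FindSpanHead()(hidden, find_index=42)         # e.g. [[0.31, 0.58]]
num_frames, k = 300, 8
start, end = (span[0] * num_frames).long().tolist()
key_frames = torch.linspace(start, max(start + 1, end), k).long()
```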

Research Direction 2: Visual Grounding (REC, RES, GREC, GRES)


TPAMI 2025 Improving Generalized Visual Grounding with Instance-aware Joint Learning
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang

Paper | Code | Chinese Blog

Highlights: InstanceVG supports instance-level referring segmentation across general scenarios with no, single, or multiple targets, and produces consistent predictions across point, box, and mask outputs.


ICCV 2025 PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang

Paper | Code | Chinese Blog

Highlights: PropVG builds an end-to-end, proposal-driven (two-stage) visual grounding framework, avoiding the drawbacks of earlier two-stage approaches, which relied on external detectors and typically suffered from slow inference and limited performance.


ICCV 2025 DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang

Paper | Code

Highlights: DeRIS identifies cognition as a key bottleneck in visual grounding. It decouples the task into perception and cognition components and integrates them effectively through a loopback synergy mechanism.


AAAI 2025 (Oral) Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

Paper | Code

Highlights: C3VG investigates the prediction-consistency problem between REC and RIS, introducing a coarse-to-fine architecture that enforces consistency through both implicit and explicit constraints.


NeurIPS 2024 SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

Paper | Code | Chinese Blog

Highlights: SimVG explores the importance of multi-modal understanding for visual grounding and proposes a simple yet effective framework. It also adopts a synchronized distillation strategy between the teacher and student branches, boosting the performance of the lightweight student branch.
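A hedged sketch of what synchronized teacher-student distillation can look like in this setting: both branches are optimized on the grounding objective in the same step, while the lighter student is additionally pulled toward the teacher's (detached) box predictions. The function name, loss choices, and box format are assumptions for illustration, not SimVG's actual code.

```python
import torch
import torch.nn.functional as F

def synchronized_distill_step(teacher_boxes, student_boxes, gt_boxes, distill_weight=1.0):
    """Toy joint objective: task loss for both branches plus a
    distillation term aligning the student with the detached teacher."""
    task_loss = F.l1_loss(teacher_boxes, gt_boxes) + F.l1_loss(student_boxes, gt_boxes)
    distill_loss = F.l1_loss(student_boxes, teacher_boxes.detach())
    return task_loss + distill_weight * distill_loss

# Toy usage with normalized cxcywh boxes.
gt = torch.rand(4, 4)
teacher_pred = torch.rand(4, 4, requires_grad=True)
student_pred = torch.rand(4, 4, requires_grad=True)
loss = synchronized_distill_step(teacher_pred, student_pred, gt)
loss.backward()
```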

Research Direction 3: Cross-View Geo-Localization


arXiv 2024 Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang

Paper | Code

Highlights: DRL adopts an end-to-end training and inference paradigm to address common issues in image-retrieval-based UAV self-localization, including complex preprocessing, inherent localization errors, and slow inference.


TIP 2023 Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments
Ming Dai, Enhui Zheng, Zhenhua Feng, Jiedong Zhuang, Wankou Yang

Paper | Code | Chinese Blog

Highlights: DenseUAV introduces a dataset sampled from real-world low-altitude urban scenes for vision-based UAV self-positioning and provides a comprehensive benchmark for the task.


TCSVT 2021 A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization
Ming Dai, Jianhong Hu, Jiedong Zhuang, Enhui Zheng

Paper | Code

Highlights: FSRA is the first successful application of Transformer models to cross-view geo-localization. It introduces an attention-map-based region partitioning and alignment strategy that alleviates performance degradation caused by viewpoint shifts.
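A rough sketch of attention-guided region partitioning under assumed inputs: patch tokens are ranked by a per-patch heat score (here the feature L2 norm is used as a stand-in for an attention map), split into k contiguous rank segments, and average-pooled into region descriptors that can be aligned across views. This mirrors the idea only loosely and is not the FSRA implementation.

```python
import torch

def partition_by_heat(patch_tokens: torch.Tensor, k: int = 3) -> torch.Tensor:
    """patch_tokens: [B, N, C] -> region descriptors [B, k, C].
    Patches are ranked by a heat score and grouped into k segments."""
    b, n, c = patch_tokens.shape
    heat = patch_tokens.norm(dim=-1)               # [B, N], stand-in for attention
    order = heat.argsort(dim=1, descending=True)   # hottest patches first
    sorted_tokens = torch.gather(
        patch_tokens, 1, order.unsqueeze(-1).expand(-1, -1, c)
    )
    regions = [seg.mean(dim=1) for seg in sorted_tokens.chunk(k, dim=1)]
    return torch.stack(regions, dim=1)             # [B, k, C]

# Toy usage: k region descriptors per view, ready for cross-view alignment.
uav_regions = partition_by_heat(torch.randn(2, 196, 768), k=3)
sat_regions = partition_by_heat(torch.randn(2, 196, 768), k=3)
```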