I am currently a third-year Ph.D. student in the School of Automation, Southeast University, under the supervision of Prof. Wankou Yang.
My research interests include MLLMs, Visual Grounding, Video Understanding, and RL.
I have published 10+ papers in top international AI conferences and journals such as TPAMI, TIP, TCSVT, PR, ICCV, NeurIPS, and AAAI.
🔥 News
- 2025.11: 🎉 One first-author paper has been accepted to TCSVT 2025.
- 2025.09: 🎉 One first-author paper has been accepted to TPAMI 2025.
- 2025.07: 🎉 Two first-author papers have been accepted to ICCV 2025.
- 2024.12: 🎉 One first-author paper has been accepted to AAAI 2025.
- 2024.09: 🎉 One first-author paper has been accepted to NeurIPS 2024.
- 2023.12: 🎉 One first-author paper has been accepted to TIP 2023.
- 2021.09: 🎉 One first-author paper has been accepted to TCSVT 2021.
📝 Publications
Research Direction 1: Referring Video Object Segmentation (RefVOS)

arXiv 2025 MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Highlights: MomentSeg is an MLLM-based method that unifies temporal grounding and segmentation, enabling key-frame extraction without relying on any external models. In addition, we introduce a novel [FIND] token, which allows the model to perform temporal grounding without requiring any additional timestamp encoding.
Research Direction 2: Visual Grounding (REC, RES, GREC, GRES)

TPAMI 2025 Improving Generalized Visual Grounding with Instance-aware Joint Learning
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang
Highlights: InstanceVG supports instance-level referring segmentation across general scenarios (no/single/multiple targets). It also provides consistent predictions across point, box, and mask.

ICCV 2025 PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang
Highlights: PropVG achieves end-to-end two-stage visual grounding, overcoming the drawbacks of earlier two-stage approaches, which relied on external detectors and often suffered from slow inference and limited performance.

ICCV 2025 DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang
Highlights: DeRIS analyzes a key bottleneck in visual grounding: cognition. It decouples the VG task into perception and cognition components, and integrates them effectively through a loopback synergy mechanism.

AAAI 2025 Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Highlights: C3VG investigates the prediction consistency problem in REC and RIS, introducing a coarse-to-fine architecture that enforces consistency through both implicit and explicit constraints.

NeurIPS 2024 SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang
Highlights: SimVG explores the importance of multi-modal understanding for the VG task, proposing a simple yet effective framework. It also adopts a synchronized distillation learning strategy between the teacher and student branches, enhancing the performance of the student branch.
Research Direction 3: Cross-View Geo-Localization

arXiv 2024 Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang
Highlights: DRL adopts an end-to-end training and inference paradigm to address common issues in image-retrieval-based UAV self-localization, including complex preprocessing, inherent localization errors, and slow inference.

TIP 2023 Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments
Ming Dai, Enhui Zheng, Zhenhua Feng, Jiedong Zhuang, Wankou Yang
Highlights: DenseUAV introduces a real-world sampled dataset for vision-based UAV self-localization and provides a comprehensive benchmark for the task.

TCSVT 2021 A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization
Ming Dai, Jianhong Hu, Jiedong Zhuang, Enhui Zheng
Highlights: FSRA is the first successful application of Transformer models to cross-view geo-localization. It introduces an attention-map-based region partitioning and alignment strategy that alleviates performance degradation caused by viewpoint shifts.
🎖 Honors and Awards
Competition
- 2023.12 National First Prize, 5th Global Campus AI Algorithm Elite Competition (Zero-Shot Referring Expression Understanding)
- 2023.10 National First Prize (Champion), 4th “Space Cup” National Innovation and Creativity Competition (Multispectral Object Detection), Team Leader
- 2022.08 National Second Prize (Runner-up), China Postgraduate Smart City Technology and Creative Design Competition (Object Detection), Team Leader
- 2018.09 Zhejiang Provincial Robotics Competition: 2nd Prize (Shopping Track), 2nd Prize (Tourism Track), 3rd Prize (Transportation Track)
- 2017.09 1st Prize (East China Division) and 2nd Prize (National Division), Siemens Cup China Intelligent Manufacturing Challenge, Team Leader
Scholarships and Honors
- 2025 National Scholarship for Doctoral Students, Advanced Academic Individual, Southeast University
- 2022 National Scholarship for Graduate Students
- 2020 Outstanding Graduate of Zhejiang Province, Outstanding Undergraduate Graduate of China Jiliang University
- 2018 Zhejiang Provincial Government Scholarship
📖 Education
- 2023.09 – present Ph.D. Student, School of Automation, Southeast University, Nanjing, China.
- 2020.09 – 2023.06 Master’s Student, China Jiliang University, Hangzhou, China.
- 2016.09 – 2020.06 Undergraduate Student, China Jiliang University, Hangzhou, China.
💻 Internships
- 2024.12 – present Baidu, LMMs Research, Shanghai, China
- 2022.11 – 2023.05 NIO, Autonomous Driving – Algorithm, Beijing, China
- 2022.03 – 2022.08 ByteDance, E-commerce – Algorithm, Hangzhou, China
💬 Services
Reviewer
- TNNLS, TCSVT, ISPRS, PR
- NeurIPS 2025, CVPR 2025, ICCV 2025, AAAI 2026, ICLR 2026, CVPR 2026
Leadership
- 2018–2019 President, 1st AI and Robotics Association, China Jiliang University