I am currently a third-year Ph.D. student in the School of Automation, Southeast University, under the supervision of Prof. Wankou Yang.
My research interests include Deep Search Agents, Agentic RL, Visual Grounding, and Video Understanding.
I have published 10+ papers in top international AI conferences and journals, including TPAMI, TIP, ICML, ICCV, CVPR, and NeurIPS.
🔥 News
- 2026.06: 🎉 One first-author paper has been accepted to ICML 2026.
- 2026.02: 🎉 One co-first-author paper has been accepted to CVPR 2026.
- 2026.01: 🎉 One first-author paper has been accepted to PR 2026.
- 2025.11: 🎉 One first-author paper has been accepted to TCSVT 2025.
- 2025.09: 🎉 One first-author paper has been accepted to TPAMI 2025.
- 2025.07: 🎉 Two first-author papers have been accepted to ICCV 2025.
- 2024.12: 🎉 One first-author paper has been accepted to AAAI 2025.
- 2024.09: 🎉 One first-author paper has been accepted to NeurIPS 2024.
- 2023.12: 🎉 One first-author paper has been accepted to TIP 2023.
- 2021.09: 🎉 One first-author paper has been accepted to TCSVT 2021.
📝 Publications
Research Direction 1: Referring/Reasoning Video Object Segmentation (RVOS)

ICML 2026 VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation
Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang
Highlights: VideoSEG-O3 is a multi-turn RL framework for RVOS, actively exploring temporal intervals and keyframes through temporal-spatial CoT instead of relying on fixed sampled frames. It further introduces SEG-aware logit calibration and a decoupled thinking trace, aligning token-level policy optimization with pixel-level mask quality.

CVPR 2026 DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
Wenxuan Cheng*, Ming Dai*, Huimin Lu, Wankou Yang
Highlights: DeRVOS decouples trajectory generation and multimodal understanding, with TAIS aligning and selecting instance trajectories for robust RVOS.

arXiv 2025 MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Highlights: MomentSeg is an MLLM-based method that unifies temporal grounding and segmentation, enabling key-frame extraction without relying on any external models. In addition, we introduce a novel [FIND] token, which allows the model to perform temporal grounding without requiring any additional timestamp encoding.
Research Direction 2: Visual Grounding (REC, RES, GREC, GRES)

TPAMI 2025 Improving Generalized Visual Grounding with Instance-aware Joint Learning
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang
Highlights: InstanceVG supports instance-level referring segmentation across general scenarios (no/single/multiple targets). It also provides consistent prediction across point, box, and mask.

ICCV 2025 PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang
Highlights: PropVG achieves end-to-end two-stage visual grounding, overcoming the drawbacks of previous two-stage approaches, which relied on external detectors and often suffered from slow inference and limited performance.

ICCV 2025 DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang
Highlights: DeRIS identifies cognition as a key bottleneck in visual grounding. It decouples the VG task into perception and cognition components, and integrates them effectively through a loopback synergy mechanism.

TCSVT 2025 GC3VG: Generalized Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Kai Chen, Wenxuan Cheng, Jiedong Zhuang, Zhenhua Feng, Pengfei Zhu, Wankou Yang
Highlights: GC3VG generalizes the C3VG architecture and incorporates UCRM, which implicitly captures region/instance features and explicitly aligns them via an IoU-based relational constraint. The GHA strategy ensures feature-space consistency and boosts the discriminative strength of multi-modal representations.

AAAI 2025 Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Highlights: C3VG investigates the consistency prediction problem in REC and RIS, introducing a coarse-to-fine architecture that enforces consistency through both implicit and explicit constraints.

NeurIPS 2024 SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang
Highlights: SimVG explores the importance of multi-modal understanding for the VG task, proposing a simple yet effective framework. It also adopts a synchronized distillation learning strategy between the teacher and student branches, enhancing the performance of the student branch.
Research Direction 3: Cross-View Geo-Localization

PR 2026 Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang
Highlights: DRL adopts an end-to-end training and inference paradigm to address common issues in image-retrieval-based UAV self-localization, including complex preprocessing, inherent localization errors, and slow inference.

TIP 2023 Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments
Ming Dai, Enhui Zheng, Zhenhua Feng, Jiedong Zhuang, Wankou Yang
Highlights: DenseUAV introduces a real-world sampled dataset for vision-based UAV self-localization and provides a comprehensive benchmark for the task.

TCSVT 2021 A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization
Ming Dai, Jianhong Hu, Jiedong Zhuang, Enhui Zheng
Highlights: FSRA is the first successful application of Transformer models to cross-view geo-localization. It introduces an attention-map-based region partitioning and alignment strategy that alleviates performance degradation caused by viewpoint shifts.
🎖 Honors and Awards
Competition
- 2023.12 National First Prize, 5th Global Campus AI Algorithm Elite Competition (Zero-Shot Referring Expression Understanding)
- 2023.10 National First Prize (Champion), 4th “Space Cup” National Innovation and Creativity Competition (Multispectral Object Detection), Team Leader
- 2022.08 National Second Prize (Runner-up), China Postgraduate Smart City Technology and Creative Design Competition (Object Detection), Team Leader
- 2018.09 Zhejiang Provincial Robotics Competition: 2nd Prize (Shopping Track), 2nd Prize (Tourism Track), 3rd Prize (Transportation Track)
- 2017.09 1st Prize (East China Division) and 2nd Prize (National Division), Siemens Cup China Intelligent Manufacturing Challenge, Team Leader
Scholarships and Honors
- 2025 National Scholarship for Doctoral Students, Advanced Academic Individual, Southeast University
- 2022 National Scholarship for Graduate Students
- 2020 Outstanding Graduate of Zhejiang Province, Outstanding Undergraduate Graduate of China Jiliang University
- 2018 Zhejiang Provincial Government Scholarship
📖 Education
- 2023.09 – present Ph.D. Student, School of Automation, Southeast University, Nanjing, China.
- 2020.09 – 2023.06 Master’s Student, China Jiliang University, Hangzhou, China.
- 2016.09 – 2020.06 Undergraduate Student, China Jiliang University, Hangzhou, China.
💻 Internships
- 2026.03 – present Ant Group, Agent Research, Hangzhou, China
- 2024.12 – 2026.02 Baidu, LMMs Research, Shanghai, China
- 2022.11 – 2023.05 NIO, Autonomous Driving – Algorithm, Beijing, China
- 2022.03 – 2022.08 ByteDance, E-commerce – Algorithm, Hangzhou, China
💬 Services
Reviewer
- TIP, TNNLS, TCSVT, ISPRS, PR
- NeurIPS 2025, CVPR 2025, ICCV 2025, AAAI 2026, ICLR 2026, CVPR 2026, ICML 2026, ECCV 2026
Leadership
- 2018–2019 President, 1st AI and Robotics Association, China Jiliang University