I am currently a fourth-year Ph.D. student in the School of Automation at Southeast University, advised by Prof. Wankou Yang.

I have gained algorithm and research internship experience at ByteDance (2022), NIO (2023), Baidu (2024-2025), and Ant Group (2026). During my internship at Baidu, I was fortunate to work under the guidance of Dr. Jingdong Wang.

I have published 10+ first-authored papers in top international AI conferences and journals, including TPAMI, TIP, ICML, ICCV, CVPR, and NeurIPS. My work has received 1,000+ citations on Google Scholar. My research interests include MLLMs, Agentic RL, Visual Grounding, Video/Image Referring Segmentation, and Agentic Search.

Please feel free to contact me at 869906992@qq.com for questions, discussions, or potential collaborations.

🔥 News

2026.07: 🎉 Our co-first-authored paper ScanFocus has been accepted to ECCV 2026. The code is publicly available.
2026.07: 🚀 Our technical report SimpleSearch-VL has been released, exploring efficient, reliable, and practical multimodal agentic search. Project | Repo
2026.06: 🎉 MomentSeg has been accepted to ECCV 2026. The project page and code are publicly available.
2026.03: 🎉 VideoSEG-O3 has been accepted to ICML 2026. The code is publicly available.
2026.02: 🎉 DeRVOS has been accepted to CVPR 2026.
2026.01: 🎉 DRL has been accepted to Pattern Recognition. The code is publicly available.
2025.11: 🎉 GC3VG, an extension of C3VG, has been accepted to TCSVT 2025.
2025.09: 🎉 InstanceVG has been accepted to TPAMI 2025. The code is publicly available.
2025.07: 🎉 Two papers, PropVG and DeRIS, have been accepted to ICCV 2025. The code for both is publicly available.
2024.12: 🎉 C3VG has been accepted to AAAI 2025 as an oral presentation. The code is publicly available.
2024.09: 🎉 SimVG has been accepted to NeurIPS 2024. The code is publicly available.
2023.12: 🎉 DenseUAV has been accepted to TIP 2023. The code is publicly available.
2021.09: 🎉 FSRA has been accepted to TCSVT 2021. The code is publicly available.

📝 Publications

Research Direction 1: Multimodal Agentic Search

arXiv 2026

arXiv 2026 SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search

Ming Dai, Zhihong Lu, Jinjie Gu, Jiedong Zhuang, Yefeng Liu, Wankou Yang, Jian Wang, Chunhua Shen

Paper | Code

Highlights: SimpleSearch-VL is an efficient, reliable, and practical framework for multimodal agentic deep search. It improves the agent’s search-and-verification process with Factorized Adaptive Rollout, evidence-verified reasoning, and self-summarized visits, achieving strong search behavior from only 5K SFT trajectories and 2K RL prompts.

Research Direction 2: Referring/Reasoning Video Object Segmentation (RVOS)

ECCV 2026

ECCV 2026 ScanFocus: A Coarse-to-Fine Framework for Spatio-Temporal Video Grounding

Kai Chen^*, Ming Dai^*, Wenxuan Cheng, Wankou Yang

Paper | Code

Highlights: ScanFocus is a coarse-to-fine framework for spatio-temporal video grounding, decoupling global spatio-temporal scanning from local boundary refinement. It introduces Deformable Semantic-Motion Fusion for coarse proposal generation and SGTA for dense boundary-focused temporal modeling, improving precise timestamp regression.

ICML 2026

ICML 2026 VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang

Paper | Code

Highlights: VideoSEG-O3 is a multi-turn RL framework for RVOS that actively explores temporal intervals and keyframes through temporal-spatial CoT instead of relying on fixed sampled frames. It further introduces SEG-aware logit calibration and a decoupled thinking trace, aligning token-level policy optimization with pixel-level mask quality.

CVPR 2026

CVPR 2026 DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation

Wenxuan Cheng^*, Ming Dai^*, Huimin Lu, Wankou Yang

Paper

Highlights: DeRVOS decouples trajectory generation and multimodal understanding, with TAIS aligning and selecting instance trajectories for robust RVOS.

ECCV 2026

ECCV 2026 MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

Paper | Code | Project

Highlights: MomentSeg is an MLLM-based method that unifies temporal grounding and segmentation, enabling key-frame extraction without relying on any external models. It also introduces a novel [FIND] token that lets the model perform temporal grounding without any additional timestamp encoding.

Research Direction 3: Visual Grounding (REC, RES, GREC, GRES)

TPAMI 2025

TPAMI 2025 Improving Generalized Visual Grounding with Instance-aware Joint Learning

Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang

Paper | Code | 中文解读

Highlights: InstanceVG supports instance-level referring segmentation in generalized scenarios (no target, a single target, or multiple targets). It also produces consistent predictions across points, boxes, and masks.

ICCV 2025

ICCV 2025 PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang

Paper | Code | 中文解读

Highlights: PropVG achieves end-to-end two-stage visual grounding, overcoming the drawbacks of previous two-stage approaches, which relied on external detectors and often suffered from slow inference and limited performance.

ICCV 2025

ICCV 2025 DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang

Paper | Code

Highlights: DeRIS identifies cognition as a key bottleneck in visual grounding. It decouples the VG task into perception and cognition components and integrates them effectively through a loopback synergy mechanism.

TCSVT 2025

TCSVT 2025 GC3VG: Generalized Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Kai Chen, Wenxuan Cheng, Jiedong Zhuang, Zhenhua Feng, Pengfei Zhu, Wankou Yang

Paper

Highlights: GC3VG generalizes the C3VG architecture and incorporates UCRM, which implicitly captures region/instance features and explicitly aligns them via an IoU-based relational constraint. The GHA strategy ensures feature-space consistency and boosts the discriminative strength of multi-modal representations.

AAAI 2025 (Selected as Oral)

AAAI 2025 Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

Paper | Code

Highlights: C3VG investigates the problem of prediction consistency in REC and RIS, introducing a coarse-to-fine architecture that enforces consistency through both implicit and explicit constraints.

NeurIPS 2024

NeurIPS 2024 SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

Paper | Code | 中文解读

Highlights: SimVG explores the importance of multi-modal understanding for the VG task, proposing a simple yet effective framework. It also adopts a synchronized distillation learning strategy between the teacher and student branches, enhancing the performance of the student branch.

Research Direction 4: Cross-View Geo-Localization

PR 2026

PR 2026 Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization

Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang

Paper | Code

Highlights: DRL adopts an end-to-end training and inference paradigm to address common issues in image-retrieval-based UAV self-localization, including complex preprocessing, inherent localization errors, and slow inference.

TIP 2023

TIP 2023 Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments

Ming Dai, Enhui Zheng, Zhenhua Feng, Jiedong Zhuang, Wankou Yang

Paper | Code | 中文解读

Highlights: DenseUAV introduces a dataset sampled from real-world scenes for vision-based UAV self-localization and provides a comprehensive benchmark for the task.

TCSVT 2021

TCSVT 2021 A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

Ming Dai, Jianhong Hu, Jiedong Zhuang, Enhui Zheng

Paper | Code

Highlights: FSRA is the first successful application of Transformer models to cross-view geo-localization. It introduces an attention-map-based region partitioning and alignment strategy that alleviates performance degradation caused by viewpoint shifts.

🎖 Honors and Awards

Competition

2023.12 National First Prize, 5th Global Campus AI Algorithm Elite Competition (Zero-Shot Referring Expression Understanding)
2023.10 National First Prize (Champion), 4th “Space Cup” National Innovation and Creativity Competition (Multispectral Object Detection), Team Leader
2022.08 National Second Prize (Runner-up), China Postgraduate Smart City Technology and Creative Design Competition (Object Detection), Team Leader
2018.09 Zhejiang Provincial Robotics Competition: 2nd Prize (Shopping Track), 2nd Prize (Tourism Track), 3rd Prize (Transportation Track)
2017.09 1st Prize (East China Division) and 2nd Prize (National Division), Siemens Cup China Intelligent Manufacturing Challenge, Team Leader

Scholarships and Honors

2025 National Scholarship for Doctoral Students, Advanced Academic Individual, Southeast University
2022 National Scholarship for Graduate Students
2020 Outstanding Graduate of Zhejiang Province, Outstanding Undergraduate Graduate of China Jiliang University
2018 Zhejiang Provincial Government Scholarship

📖 Education

2023.09 – present Ph.D. Student, School of Automation, Southeast University, Nanjing, China.
2020.09 – 2023.06 Master’s Student, China Jiliang University, Hangzhou, China.
2016.09 – 2020.06 Undergraduate Student, China Jiliang University, Hangzhou, China.