Biography

I am a third-year Ph.D. candidate at Shanghai Jiao Tong University and Shanghai AI Laboratory, supervised by Prof. Jifeng Dai. I obtained my bachelor’s degree from Beihang University in 2022, where I worked with Prof. Si Liu. I also hold a second bachelor’s degree in economics (double major) from Peking University. Currently, I am an intern at OpenGVLab of Shanghai AI Laboratory. Previously, I was an intern at SenseTime and Sea AI Lab.

Interests
  • Artificial Intelligence
  • Computer Vision
  • Music Generation
Education
  • Ph.D. (Joint Program with Shanghai AI Lab), 2022-2027 (expected)

    Department of EE, Shanghai Jiao Tong University

  • B.A. in Economics (Double Major), 2019-2022

    National School of Development, Peking University

  • B.Eng. in Computer Science, 2018-2022

    Shenyuan Honors College, Beihang University

News

  • 2025.8: 🚀 We release InternVL3.5, a leading multimodal large language model with advanced versatility, reasoning, and efficiency.

  • 2025.8: 🏆 Our paper Sparkle on VLM spatial reasoning is accepted by EMNLP 2025 Findings and awarded the Best Paper Award at IJCAI MKLM Workshop 2025.

  • 2025.7: ⭐️ Our paper PIIP on efficient multimodal understanding is accepted by TPAMI.

  • 2025.7: 🏆 Our paper Limit of RLVR on reinforcement learning for LLMs is awarded the Best Paper Award (2/172) of ICML AI4MATH Workshop 2025!

  • 2025.6: ⭐️ Our paper V2M Survey on vision-to-music generation is accepted by ISMIR 2025.

  • 2025.4: 🎤 Talk on Mono-InternVL at the Open Multimodal Gathering Workshop hosted by NUS ShowLab. [Slides]

  • 2025.2: ⭐️ Our papers Mono-InternVL and SynerGen-VL on encoder-free MLLMs are accepted by CVPR 2025.

  • 2024.10: ⭐️ Our paper ITINERA on LLMs for urban itinerary planning is accepted by EMNLP 2024. It is also awarded the Best Paper Award of KDD Urban Computing Workshop (UrbComp) 2024.

  • 2024.9: ⭐️ Our paper PIIP on efficient vision backbones is accepted by NeurIPS 2024 as a Spotlight, ranking Top 10 in NeurIPS 2024 (among 15671 submissions) and Top 2 in the computer vision area.

Publications



mono_v1.5 thumbnail

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai

Preprint

[Paper] [Code]

InternVL3.5 thumbnail

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang*, Zhangwei Gao*, Lixin Gu*, Hengjun Pu*, Long Cui*, Xingguang Wei*, Zhaoyang Liu*, Linglin Jing*, Shenglong Ye*, Jie Shao*, Zhaokai Wang*, Zhe Chen*, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Weijie Su, Bowen Zhou, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo

Technical Report

[Paper] [Code]

VMB thumbnail

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Chengjing Wu, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu

Preprint

[Paper] [Code] [Demo]

PIIP_v2 thumbnail

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

TPAMI 2025

[Paper] [Code]

RLVR thumbnail

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang

ICML 2025 AI4MATH Workshop Best Paper Award (2/172)

[Paper] [Project Page] [Code]

Sparkle thumbnail

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Yihong Tang*, Ao Qu*, Zhaokai Wang*, Dingyi Zhuang*, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao

EMNLP 2025 Findings & IJCAI 2025 MKLM Workshop Best Paper Award

[Paper]

V2M survey thumbnail

Vision-to-Music Generation: A Survey

Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao

ISMIR 2025

[Paper] [Repo] [Video]

Mono-InternVL thumbnail

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo*, Xue Yang*, Wenhan Dou*, Zhaokai Wang*, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu

CVPR 2025

[Paper] [Project Page] [Code] [Model] [Post] [Slides]

SynerGen-VL thumbnail

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

CVPR 2025

[Paper]

PIIP thumbnail

Parameter-Inverted Image Pyramid Networks

Xizhou Zhu*, Xue Yang*, Zhaokai Wang*, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

NeurIPS 2024 Spotlight - Ranked Top 10 in NeurIPS 2024 (among 15671 submissions), Top 2 in Computer Vision Area

[Paper] [Code] [Post] [Slides] [Video]

ITINERA thumbnail

ITINERA: Integrating Spatial Optimization with Large Language Models for Open-domain Urban Itinerary Planning

Yihong Tang*, Zhaokai Wang*, Ao Qu*, Yihao Yan*, Zhaofeng Wu, Dingyi Zhuang, Jushi Kai, Kebing Hou, Xiaotong Guo, Han Zheng, Tiange Luo, Jinhua Zhao, Zhan Zhao, Wei Ma

EMNLP 2024 Industry Track & KDD 2024 UrbComp Workshop Best Paper Award

[Paper] [Code] [Post] [Slides] [Video]

Auto MC-Reward thumbnail

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Hao Li*, Xue Yang*, Zhaokai Wang*, Xizhou Zhu, Jie Zhou, Yu Qiao, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

CVPR 2024

[Paper] [Project Page] [Post]

Musprod thumbnail

Video Background Music Generation: Dataset, Method and Evaluation

Le Zhuo*, Zhaokai Wang*, Baisen Wang*, Yue Liao, Chenxi Bao, Stanley Peng, Songhao Han, Aixi Zhang, Fei Fang, Si Liu

ICCV 2023

[Paper] [Demo]

CMT thumbnail

Video Background Music Generation with Controllable Music Transformer

Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan

ACM MM 2021 Best Paper Award (1/1942)

[Paper] [Project Page] [Code] [Demo] [Post] [News]

CNMT thumbnail

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

AAAI 2021

[Paper] [Code]

Awards and Honors

  • Best Paper Award [See certificate]
  • Best Paper Award [See certificate]
  • Best Paper Award [See certificate]
  • Best Zero to One Award [See certificate]
  • Outstanding Graduate
  • Best Paper Award [See certificate]
  • Best Video Award [See certificate]
  • First Place [See certificate]

Activities

Talks:

Conference Reviewer:

  • ICCV 2023 & 2025, ECCV 2024, CVPR 2024 & 2025, EMNLP 2024, NeurIPS 2024, ICLR 2025, ICML 2025, AAAI 2025

Teaching Assistant:

  • Fundamentals of Computers (Spring 2021)
  • Software Engineering (Spring 2022)