Zhaokai Wang

(王肇凯)

Shanghai Jiao Tong University

Biography

I am a fourth-year Ph.D. candidate at Shanghai Jiao Tong University, supervised by Prof. Jifeng Dai. I obtained my bachelor’s degree from Beihang University in 2022, where I was supervised by Prof. Si Liu. I also hold a double-major bachelor’s degree in economics from Peking University. Currently, I am a visiting student at UCL under the supervision of Prof. Jun Wang. I was an intern at OpenGVLab (InternVL team) of Shanghai AI Laboratory from 2023 to 2025. Previously, I interned at SenseTime and Sea AI Lab.

I am actively seeking opportunities for a self-funded Ph.D. visit or a company internship worldwide (e.g., Europe, China, Japan), starting in mid-2026. Feel free to contact me by email (wangzhaokai [at] sjtu [dot] edu [dot] cn) or WeChat (ID: wzk_1015). Please mention that you are reaching out via my homepage to help me filter out spam.

Interests

Artificial Intelligence
Computer Vision
Music Generation

Education

Ph.D. Student, 2022-2027 (Expected)
Shanghai Jiao Tong University

Visiting Student, 2026
Centre for Artificial Intelligence, University College London

B.A. in Economics (Double Major), 2019-2022
National School of Development, Peking University

B.Eng. in Computer Science, 2018-2022
Shenyuan Honors College, Beihang University

News

2025.11: 🏆 Our paper Limit of RLVR on reinforcement learning for LLMs was awarded the Best Paper Runner Up Award at NeurIPS 2025!
2025.8: 🚀 We release InternVL3.5, a leading multimodal large language model with advanced versatility, reasoning, and efficiency.
2025.8: 🏆 Our paper Sparkle on VLM spatial reasoning is accepted by EMNLP 2025 Findings and awarded the Best Paper Award at IJCAI MKLM Workshop 2025.
2025.7: ⭐️ Our paper PIIP on efficient multimodal understanding is accepted by TPAMI.
2025.6: ⭐️ Our paper V2M Survey on vision-to-music generation is accepted by ISMIR 2025.
2025.2: ⭐️ Our papers Mono-InternVL and SynerGen-VL on encoder-free MLLMs are accepted by CVPR 2025.

Publications

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang*, Penghao Yin*, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

Preprint

[Paper] [Code] [Dataset] [Post]

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai

Preprint

[Paper] [Code]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang*, Zhangwei Gao*, Lixin Gu*, Hengjun Pu*, Long Cui*, Xingguang Wei*, Zhaoyang Liu*, Linglin Jing*, Shenglong Ye*, Jie Shao*, Zhaokai Wang*, Zhe Chen*, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou, Haoran Hao, Tianyi Zhang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Songyang Zhang, Maosong Cao, Junyao Lin, Kexian Tang, Jianfei Gao, Haian Huang, Yuzhe Gu, Chengqi Lyu, Huanze Tang, Rui Wang, Haijun Lv, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Weijie Su, Bowen Zhou, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo

Technical Report

[Paper] [Code] [Demo]

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Chengjing Wu, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu

ISMIR 2025 LLM4Music Workshop

[Paper] [Code] [Demo]

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Zhaokai Wang*, Xizhou Zhu*, Xue Yang*, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

TPAMI 2025

[Paper] [Code]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang

NeurIPS 2025 Best Paper Runner Up (Top 1 score among 21575 submissions) & ICML 2025 AI4MATH Workshop Best Paper Award (2/172)

[Paper] [Project Page] [Code]

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Yihong Tang*, Ao Qu*, Zhaokai Wang*, Dingyi Zhuang*, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao

EMNLP 2025 Findings & IJCAI 2025 MKLM Workshop Best Paper Award

[Paper]

Vision-to-Music Generation: A Survey

Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao

ISMIR 2025

[Paper] [Repo] [Video]

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo*, Xue Yang*, Wenhan Dou*, Zhaokai Wang*, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu

CVPR 2025

[Paper] [Project Page] [Code] [Model] [Post] [Slides]

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

CVPR 2025

[Paper]

Parameter-Inverted Image Pyramid Networks

Xizhou Zhu*, Xue Yang*, Zhaokai Wang*, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

NeurIPS 2024 Spotlight - Ranked Top 10 in NeurIPS 2024 (among 15671 submissions), Top 2 in Computer Vision Area

[Paper] [Code] [Post] [Slides] [Video]

ITINERA: Integrating Spatial Optimization with Large Language Models for Open-domain Urban Itinerary Planning

Yihong Tang*, Zhaokai Wang*, Ao Qu*, Yihao Yan*, Zhaofeng Wu, Dingyi Zhuang, Jushi Kai, Kebing Hou, Xiaotong Guo, Han Zheng, Tiange Luo, Jinhua Zhao, Zhan Zhao, Wei Ma

EMNLP 2024 Industry Track & KDD 2024 UrbComp Workshop Best Paper Award

[Paper] [Code] [Post] [Slides] [Video]