I am a second-year Ph.D. student in Computer Science at Northwestern University, fortunately advised by Prof. Manling Li. I collaborate closely with the Stanford Vision and Learning Lab (SVL), working with Prof. Li Fei-Fei and Prof. Jiajun Wu on spatial intelligence and embodied agents. Before Northwestern, I received my bachelor's degree from Zhejiang University.
I am looking for 2026 summer internships focused on foundation models (MLLMs) for embodied agents — feel free to reach out!
Research vision: I study how foundation models develop spatial understanding and decision-making skills, enabling embodied agents to act over long horizons and learn from diverse embodied experiences in complex environments.
Research topics: Embodied World Modeling / Embodied Decision Making / Spatial Intelligence / Reasoning Agents
(* indicates equal contribution; † indicates co-advising.)
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents
Spatial Mental Modeling from Limited Views
ICCV 2025 (SP4V Workshop) Best Paper Award · The Best of ICCV (featured by Voxel51)
Reinforcing Visual State Reasoning for Multi-Turn VLM Agents
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-turn Reinforcement Learning
Best Poster Award
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
ICML Oral Presentation
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
NeurIPS Oral Presentation · SoCal NLP 2024 Best Paper Award
Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key?
Lens: A Foundation Model for Network Traffic in Cybersecurity