Workshop on
Advancing Multimodal Document Understanding: Challenges and Opportunities
ICDAR 2025 Workshop
September 21, 2025 @ Wuhan, Hubei, China
Introduction
This workshop aims to bring together experts from document analysis, NLP, computer vision, multimodal machine learning, and industry applications. It will serve as a platform to discuss solutions in multimodal document understanding and to explore future research directions. The focus will be on integrating modalities, designing novel models and frameworks, and developing practical tools and evaluation benchmarks. By encouraging collaboration across academia and industry, we aim to unlock the potential of multimodal understanding to tackle real-world document processing challenges and enable impactful applications across both industry and society.
Schedule
All times in Beijing Time (UTC+08:00)
| Time | Event |
|---|---|
| 13:20-13:30 | Opening Remarks |
| 13:30-14:10 | Invited Talk 1: Prof. Wei Chen, "PDF-WuKong: Efficient and General Multi-modal Long Document Understanding" |
| 14:10-14:50 | Invited Talk 2: Prof. Zhineng Chen (title to be announced) |
| 14:50-15:30 | Invited Talk 3: Dr. Jingqun Tang, "Towards a Unified Large Multimodal Model for Document Understanding and Generation" |
| 15:30-15:50 | Coffee Break |
| 15:50-16:30 | Invited Talk 4: Dr. Hao Wang, "Advancements and Applications of Vision-Language Multimodal Large Models" |
| 16:30-17:10 | Invited Talk 5: Dr. Minghui Liao (title to be announced) |
Invited Speakers
Prof. Wei Chen
Dr. Wei Chen is an Assistant Professor at the School of Software, Huazhong University of Science and Technology (HUST), and a young faculty member of the VLR Lab. He received his Ph.D. from the Natural Language Processing Lab at Fudan University. His research focuses on Natural Language Processing, Document Intelligence, and Multimodal AI. Dr. Chen has published over 20 papers in top-tier conferences and journals, including ICLR, NeurIPS, ACL, and AAAI. He led the development of the open-source DISC-X series of LLMs for vertical domains such as finance, law, and medicine, as well as the PDF-WuKong series of multimodal long-document LLMs.
Title: PDF-WuKong: Efficient and General Multi-modal Long Document Understanding
Abstract: Understanding and interacting with long, multi-modal documents, such as research papers and financial reports, remains a significant challenge for AI systems. Existing methods often face limitations in efficiency, scalability, and generalization across diverse document structures. This talk will introduce the PDF-WuKong research series, detailing our efforts to develop more effective and robust solutions for multi-modal long document understanding. We will begin by discussing PDF-WuKong 1.0, which introduced a sparse sampling mechanism to efficiently extract relevant information from lengthy PDF documents. Building on this, we will present PDF-WuKong 2.0. This iteration adopts a pure-visual input paradigm for multi-page documents, incorporating advanced architectural designs and training methodologies together with a structured, page-aware reasoning process. These advancements aim to improve processing efficiency for long visual sequences, optimize resource utilization, and enhance generalization and explainability across various document types.
Prof. Zhineng Chen
Title:
Abstract:
Dr. Jingqun Tang
Jingqun Tang is a senior algorithm researcher specializing in multimodal fields at ByteDance. He received his bachelor's degree from Zhejiang University in 2016, after which he embarked on research related to computer vision, multimodality, and document intelligence. He has been engaged in document intelligence research for over seven years. Previously, he worked at Tencent Youtu Lab and the NetEase Hangzhou Research Institute. He has published more than 20 papers in top-tier conferences such as CVPR, NeurIPS, ACL, and ECCV, with 10 of them as first or corresponding author, accumulating over 1,000 citations. Additionally, he has won championships in international competitions, such as the ICDAR competitions, on seven occasions.
Title: Towards a Unified Large Multimodal Model for Document Understanding and Generation
Abstract: The advancement of large multimodal models (LMMs) has significantly propelled document perception and understanding. Typical document-related LMMs feature multimodal inputs paired with purely text outputs. This restricts their application in numerous tasks requiring image outputs, such as document quality enhancement, text editing, and poster generation. Recently, many general-purpose large multimodal models have attempted to unify generation and understanding, including GPT-4o, Bagel, BLIP-3o, and Janus-Pro. However, such efforts remain insufficient in the field of document intelligence. This talk will explore how to construct large models that unify generation and understanding within document intelligence. It will introduce current progress, outline existing challenges, and discuss whether generation and understanding can mutually reinforce each other, as well as the potential applications enabled by their unification.
Dr. Hao Wang
Hao Wang is a senior engineer at Huawei Consumer Business Group. He received his Ph.D. degree from Huazhong University of Science and Technology in 2023. His research interests include OCR, multimodal large language models, and visual reinforcement learning. His publications have been accepted by journals and conferences such as AAAI, CVPR, ICCV, ICDAR, TIP, and TPAMI.
Title: Advancements and Applications of Vision-Language Multimodal Large Models
Abstract: As deep learning continues to evolve and large language models achieve groundbreaking success, artificial intelligence is entering a new era—one defined by generalization and multimodality. This talk will highlight Huawei's exploration and innovation in this frontier, presenting a comprehensive overview of our research progress in multimodal large models for vision-language understanding.
The presentation will cover the following key areas:
- Foundation Model Development: An in-depth look at the design philosophy and technical innovations behind our TextHawk series of foundational multimodal models.
- Domain-Specific Advancements: Introduction of UI-Hawk, Huawei’s specialized model designed for intelligent understanding of graphical user interfaces (GUIs).
- Frontier Research Exploration: Insights into our latest work on VisuRiddles, which addresses the complex task of abstract visual reasoning.
- Real-World Deployment: Demonstration of how these cutting-edge technologies are being integrated into Huawei’s Xiaoyi smart assistant, translating research breakthroughs into meaningful user experiences.
Qi Zheng
Qi Zheng received his B.S. and M.S. degrees from Shanghai Jiao Tong University, China. He now works at Alibaba Tongyi Lab, leading document intelligence technology and products. His current research interests include multimodal large language models and document intelligence. He has published more than ten papers in CCF A/B-tier conferences and journals.
Title: Multi-page Document Parsing and Understanding
Abstract: