Workshop on
Advancing Multimodal Document Understanding: Challenges and Opportunities
ICDAR 2025 Workshop
September 21, 2025 @ Wuhan, Hubei, China
Introduction
This workshop aims to bring together experts from document analysis, NLP, computer vision, multimodal machine learning, and industry applications. It will serve as a platform to discuss solutions in multimodal document understanding and explore future research directions. The focus will be on integrating modalities, designing novel models and frameworks, and developing practical tools and evaluation benchmarks. By encouraging collaboration across academia and industry, we aim to unlock the potential of multimodal understanding to tackle real-world document processing challenges. This research can lead to impactful applications across both industry and society.
Schedule
All times in Beijing Time (UTC+08:00)
| Time | Events |
|---|---|
| 13:20-13:30 | Opening Remarks |
| 13:30-14:10 | Invited Talk 1: Prof. Wei Chen. Title: PDF-WuKong: Efficient and General Multi-modal Long Document Understanding |
| 14:10-14:50 | Invited Talk 2: Prof. Zhineng Chen. Title: Lightweight Document Parsing: Recent Advances and Challenges |
| 14:50-15:30 | Invited Talk 3: Dr. Jingqun Tang. Title: Towards a Unified Large Multimodal Model for Document Understanding and Generation |
| 15:30-15:50 | Coffee Break |
| 15:50-16:30 | Invited Talk 4: Dr. Hao Wang. Title: Advancements and Applications of Vision-Language Multimodal Large Models |
| 16:30-17:10 | Invited Talk 5: Qi Zheng. Title: Multi-Page Document Parsing and Understanding |
Invited Speakers
Prof. Wei Chen
Wei Chen is an Assistant Professor at the School of Software, Huazhong University of Science and Technology (HUST) and a young faculty member at the VLR Lab. He received his Ph.D. from Fudan University's Natural Language Processing Lab. His research focuses on Natural Language Processing, Document Intelligence, and Multimodal AI. Dr. Chen has published over 20 papers in top-tier conferences and journals including ICLR, NeurIPS, ACL, and AAAI. He led the development of the open-source DISC-X series of LLMs for vertical domains such as finance, law, and medicine, as well as the PDF-WuKong series of multimodal long-document LLMs.
Title: PDF-WuKong: Efficient and General Multi-modal Long Document Understanding
Abstract: Understanding and interacting with long, multi-modal documents, such as research papers and financial reports, remains a significant challenge for AI systems. Existing methods often face limitations in efficiency, scalability, and generalization across diverse document structures. This talk will introduce the PDF-WuKong research series, detailing our efforts to develop more effective and robust solutions for multi-modal long document understanding. We will begin by discussing PDF-WuKong 1.0, which introduced a sparse sampling mechanism to efficiently extract relevant information from lengthy PDF documents. Building on this, we will present PDF-WuKong 2.0. This iteration adopts a pure-visual input paradigm for multi-page documents, incorporating advanced architectural designs and training methodologies, along with a structured, page-aware reasoning process. These advancements aim to improve processing efficiency for long visual sequences, optimize resource utilization, and enhance generalization and explainability across various document types.
Prof. Zhineng Chen
Zhineng Chen received the Ph.D. degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2012. He is currently a Professor at the Institute of Trustworthy Embodied AI, College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China. His research interests include computer vision and multimedia analysis, especially lightweight OCR, where models he has developed have been adopted by several well-known companies. He has published over 90 academic papers in prestigious journals and conferences such as IEEE TPAMI, IJCV, CVPR, ICCV, and ECCV.
Title: Lightweight Document Parsing: Recent Advances and Challenges
Abstract: Document parsing is emerging as a critical objective in the field of OCR. Historically, this task has been accomplished mainly by integrating the results of a series of related single-point tasks, such as text detection, text recognition, and layout analysis. Prior research has focused on these single-point tasks, where lightweight implementations are recognized as a key enabler for developing more practical solutions. With advances in artificial intelligence, especially large language and multimodal models, holistic document parsing has gained increasing traction, and two prevailing paradigms have become popular: (i) modular pipelines that chain specialized modules for layout analysis, structural parsing (e.g., tables and formulas), and text detection and recognition, exemplified by systems such as MinerU; and (ii) OCR-free, end-to-end approaches that predict structured semantics directly from pixels, including Donut, mPLUG-DocOwl, UReader, and MonkeyOCR. Lightweight implementation remains crucial for both paradigms to ensure real-world applicability. This talk traces the evolution of document parsing through the lens of lightweight design. First, it reviews existing studies on lightweight single-point tasks, including text detection, text recognition, and text spotting. Second, it summarizes recent advances under the two aforementioned paradigms, with an emphasis on their effectiveness and efficiency. Finally, it outlines key challenges in lightweight document parsing and proposes several promising research directions to advance the field.
Dr. Jingqun Tang
Jingqun Tang is a senior algorithm researcher specializing in multimodal fields at ByteDance. He received his bachelor's degree from Zhejiang University in 2016, after which he embarked on research in computer vision, multimodality, and document intelligence. He has been engaged in document intelligence research for over seven years. Previously, he worked at Tencent Youtu Lab and the NetEase Hangzhou Research Institute. He has published more than 20 papers in top-tier conferences such as CVPR, NeurIPS, ACL, and ECCV, 10 of them as first or corresponding author, accumulating over 1,000 citations. He has also won championships in international competitions such as the ICDAR competitions on seven occasions.
Title: Towards a Unified Large Multimodal Model for Document Understanding and Generation
Abstract: The advancement of large multimodal models (LMMs) has significantly propelled document perception and understanding. Typical document-related LMMs feature multimodal inputs paired with purely text outputs. This restricts their application in numerous tasks requiring image outputs, such as document quality enhancement, text editing, and poster generation. Recently, many general-purpose large multimodal models have attempted to unify generation and understanding, including GPT-4o, Bagel, BLIP-3o, and Janus-Pro. However, such efforts remain insufficient in the field of document intelligence. This talk will explore how to construct large models that unify generation and understanding within document intelligence. It will introduce current progress, outline existing challenges, and discuss whether generation and understanding can mutually reinforce each other, as well as the potential applications enabled by their unification.
Dr. Hao Wang
Hao Wang is a senior engineer at Huawei Consumer Business Group. He received his Ph.D. degree from Huazhong University of Science and Technology in 2023. His research interests include OCR, multimodal large language models, and visual reinforcement learning. His publications have been accepted by journals and conferences such as AAAI, CVPR, ICCV, ICDAR, TIP, and TPAMI.
Title: Advancements and Applications of Vision-Language Multimodal Large Models
Abstract: As deep learning continues to evolve and large language models achieve groundbreaking success, artificial intelligence is entering a new era—one defined by generalization and multimodality. This talk will highlight Huawei’s exploration and innovation in this frontier, presenting a comprehensive overview of our research progress in multimodal large models for vision-language understanding.
The presentation will cover the following key areas:
- Foundation Model Development: An in-depth look at the design philosophy and technical innovations behind our TextHawk series of foundational multimodal models.
- Domain-Specific Advancements: Introduction of UI-Hawk, Huawei’s specialized model designed for intelligent understanding of graphical user interfaces (GUIs).
- Frontier Research Exploration: Insights into our latest work on VisuRiddles, which addresses the complex task of abstract visual reasoning.
- Real-World Deployment: Demonstration of how these cutting-edge technologies are being integrated into Huawei’s Xiaoyi smart assistant, translating research breakthroughs into meaningful user experiences.
Qi Zheng
Qi Zheng received his B.S. and M.S. degrees from Shanghai Jiao Tong University, China. He now works at Alibaba Tongyi Lab, leading document intelligence technology and products. His current research interests include multimodal large language models and document intelligence. He has published more than ten papers in CCF A/B tier conferences and journals.
Title: Multi-Page Document Parsing and Understanding
Abstract: Multi-page multimodal document understanding is a practical and challenging task. There are two main frameworks: the first uses document parsing models to convert page images into text, followed by an LLM for comprehension; the second directly employs OCR-free multimodal models for understanding. I will introduce two works from our team. The first paper presents a new dataset, DocHieNet, and a novel method, DHFormer, which introduces cross-page, multi-level hierarchical parsing tasks. The second paper presents mPLUG-DocOwl2, which incorporates a high-resolution DocCompressor module and a three-stage training framework to enhance multi-page comprehension while balancing token efficiency and QA performance.