Workshop on
Advancing Multimodal Document Understanding: Challenges and Opportunities

ICDAR 2025 Workshop

September 21, 2025 @ Wuhan, Hubei, China

Introduction

This workshop aims to bring together experts from document analysis, NLP, computer vision, multimodal machine learning, and industry applications. It will serve as a platform to discuss solutions in multimodal document understanding and explore future research directions. The focus will be on integrating modalities, designing novel models and frameworks, and developing practical tools and evaluation benchmarks. By encouraging collaboration across academia and industry, we aim to unlock the potential of multimodal understanding to tackle real-world document processing challenges and deliver impactful applications across both industry and society.

Schedule

All times in Beijing Time (UTC+08:00)

Time Events
13:20-13:30 Opening Remarks
13:30-14:10 Invited Talk 1: Prof. Wei Chen
Title: PDF-WuKong: Efficient and General Multi-modal Long Document Understanding
14:10-14:50 Invited Talk 2: Prof. Zhineng Chen
Title:
14:50-15:30 Invited Talk 3: Dr. Jingqun Tang
Title: Towards a Unified Large Multimodal Model for Document Understanding and Generation
15:30-15:50 Coffee Break
15:50-16:30 Invited Talk 4: Dr. Hao Wang
Title: Advancements and Applications of Vision-Language Multimodal Large Models
16:30-17:10 Invited Talk 5: Dr. Minghui Liao
Title:



Invited Speakers

Prof. Wei Chen

Dr. Wei Chen is an Assistant Professor at the School of Software, Huazhong University of Science and Technology (HUST) and a young faculty member at the VLR Lab. He received his Ph.D. from Fudan University's Natural Language Processing Lab. His research focuses on Natural Language Processing, Document Intelligence, and Multimodal AI. Dr. Chen has published over 20 papers in top-tier conferences and journals including ICLR, NeurIPS, ACL, and AAAI. He led the development of the open-source DISC-X series of LLMs for vertical domains such as finance, law, and medicine, as well as the PDF-WuKong series of multimodal long-document LLMs.

Title: PDF-WuKong: Efficient and General Multi-modal Long Document Understanding

Abstract: Understanding and interacting with long, multi-modal documents, such as research papers and financial reports, remains a significant challenge for AI systems. Existing methods often face limitations in efficiency, scalability, and generalization across diverse document structures. This talk will introduce the PDF-WuKong research series, detailing our efforts to develop more effective and robust solutions for multi-modal long document understanding. We will begin by discussing PDF-WuKong 1.0, which introduced a sparse sampling mechanism to efficiently extract relevant information from lengthy PDF documents. Building on this, we will present PDF-WuKong 2.0. This iteration adopts a pure-visual input paradigm for multi-page documents, incorporating advanced architectural designs and training methodologies along with a structured, page-aware reasoning process. These advancements aim to improve processing efficiency for long visual sequences, optimize resource utilization, and enhance generalization and explainability across various document types.

Prof. Zhineng Chen

Title:

Abstract:

Dr. Jingqun Tang

Jingqun Tang is a senior algorithm researcher specializing in multimodal fields at ByteDance. He received his bachelor's degree from Zhejiang University in 2016, after which he embarked on research related to computer vision, multimodality, and document intelligence. He has been engaged in document intelligence research for over seven years. Previously, he worked at Tencent Youtu Lab and the NetEase Hangzhou Research Institute. He has published more than 20 papers in top-tier conferences such as CVPR, NeurIPS, ACL, and ECCV, with 10 of them as first or corresponding author, accumulating over 1,000 citations. Additionally, he has won championships in international competitions such as the ICDAR competitions on seven occasions.

Title: Towards a Unified Large Multimodal Model for Document Understanding and Generation

Abstract: The advancement of large multimodal models (LMMs) has significantly propelled document perception and understanding. Typical document-related LMMs feature multimodal inputs paired with purely text outputs. This restricts their application in numerous tasks requiring image outputs, such as document quality enhancement, text editing, and poster generation. Recently, many general-purpose large multimodal models have attempted to unify generation and understanding, including GPT-4o, Bagel, BLIP-3o, and Janus-Pro. However, such efforts remain insufficient in the field of document intelligence. This talk will explore how to construct large models that unify generation and understanding within document intelligence. It will introduce current progress, outline existing challenges, and discuss whether generation and understanding can mutually reinforce each other, as well as the potential applications enabled by their unification.

Dr. Hao Wang

Hao Wang is a senior engineer at Huawei Consumer Business Group. He received his Ph.D. degree from Huazhong University of Science and Technology in 2023. His research interests include OCR, multimodal large language models, and visual reinforcement learning. His publications have been accepted by journals and conferences such as AAAI, CVPR, ICCV, ICDAR, TIP, and TPAMI.

Title: Advancements and Applications of Vision-Language Multimodal Large Models

Abstract: As deep learning continues to evolve and large language models achieve groundbreaking success, artificial intelligence is entering a new era—one defined by generalization and multimodality. This talk will highlight Huawei's exploration and innovation in this frontier, presenting a comprehensive overview of our research progress in multimodal large models for vision-language understanding.

The presentation will cover the following key areas:

  • Foundation Model Development: An in-depth look at the design philosophy and technical innovations behind our TextHawk series of foundational multimodal models.
  • Domain-Specific Advancements: Introduction of UI-Hawk, Huawei’s specialized model designed for intelligent understanding of graphical user interfaces (GUIs).
  • Frontier Research Exploration: Insights into our latest work on VisuRiddles, which addresses the complex task of abstract visual reasoning.
  • Real-World Deployment: Demonstration of how these cutting-edge technologies are being integrated into Huawei’s Xiaoyi smart assistant, translating research breakthroughs into meaningful user experiences.

Qi Zheng

Qi Zheng received his B.S. and M.S. degrees from Shanghai Jiao Tong University, China. He now works at Alibaba Tongyi Lab, leading document intelligence technology and products. His current research interests include multimodal large language models and document intelligence. He has published more than ten papers in CCF A/B-tier conferences and journals.

Title: Multi-page Document Parsing and Understanding

Abstract: