Digital Document Processing

Aiming to meet the needs of digital publishing, mobile reading, online education and other applications, the lab of digital document processing has focused its research in recent years on document processing, document image analysis, layout understanding and other key technologies. In these research areas, the lab has completed 13 research projects funded by the Chinese government, including projects of the National Science & Technology Supporting Program, the National High Technology R&D Program of China (863 Program), and the National Natural Science Foundation of China (NSFC). The lab won the "National Science and Technology Progress Award (the second prize)" and the "Major Technological Inventions in Information Industry Award (the first prize)". Some of the projects were selected for the "Beijing Nova Program". Our researchers have published more than 100 research papers in prestigious academic journals and conferences and hold more than 30 patents (three patents won the "Chinese Excellence Patent Award" and one won the second prize of "Beijing patent of invention"). Some of these technologies have been successfully used in practical applications such as digital publishing and electronic official document systems, producing good economic and social impact.

The team of digital document processing studies the problems of digital document description, processing, display and protection on the Internet (especially the mobile Internet), and focuses on complex document image understanding, digital rights management (DRM) and document information retrieval.

 

The major research topics include:

·           Document Analysis and Recognition

-          Layout Analysis and Understanding: page segmentation, layout structure extraction, reading order detection, reconstruction of logical hierarchy structure, metadata extraction.

-          Recognition of Page Objects in Complicated Layouts: detection and structure analysis of tables, mathematical formulae, chemical formulae, charts, etc.

·           Understanding of Complex Document Images

-          Understanding and Reuse of Graphic Documents: segmentation of text and graph regions, extraction and description of graphic features, similarity measurement of features, analysis of complex graph layouts, and retrieval of the relationships between text and graphs.

-          Comic Image Understanding: extraction of visual patterns & page elements, page layout analysis, and content recognition & classification.

 

Fig 2.6 Workflow of Comic Image Analysis

-          Reconstruction of Solid Geometry Objects: sketch extraction from line-drawing images, solid geometry object recognition, and reconstruction.

·           Document-based Information Retrieval

-          Retrieval of Mathematical Formulae: parsing, indexing, sorting and recommendation of formulae in various formats such as LaTeX, MathML, PDF, pictures (including camera-captured ones), etc.

-          Retrieval of Chemical Formulae: graph structure-based understanding, recognition, matching and indexing of chemical formulae.

·           Digital Rights Management

-          Technologies of digital rights management for multi-mode applications, including rights expression and enforcement technologies, content key management technologies for a variety of hardware environments, fine-grained multi-level security management technologies, and usage control methods for digital content.
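
As a rough illustration of the indexing and sorting steps listed above for the retrieval of mathematical formulae, the following minimal Python sketch builds an inverted index over LaTeX tokens and ranks candidates by Jaccard similarity. It is not the lab's method: practical formula retrieval usually matches structure (for example expression trees) rather than a bag of tokens, and the names `tokenize`, `FormulaIndex` and the sample formula identifiers are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a very coarse LaTeX tokenizer (commands, letters, numbers, symbols).
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")


def tokenize(latex):
    """Split a LaTeX string into a list of tokens."""
    return TOKEN.findall(latex)


class FormulaIndex:
    """Hypothetical inverted index: token -> set of formula ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.tokens = {}

    def add(self, formula_id, latex):
        toks = set(tokenize(latex))
        self.tokens[formula_id] = toks
        for tok in toks:
            self.postings[tok].add(formula_id)

    def search(self, latex, top_k=3):
        """Return up to top_k (formula_id, score) pairs, best match first."""
        query = set(tokenize(latex))
        # Only formulae sharing at least one token with the query are scored.
        candidates = set()
        for tok in query:
            candidates |= self.postings.get(tok, set())
        scored = []
        for cid in candidates:
            other = self.tokens[cid]
            jaccard = len(query & other) / len(query | other)
            scored.append((jaccard, cid))
        # Sort by descending score, then by id for a stable order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(cid, round(score, 3)) for score, cid in scored[:top_k]]


index = FormulaIndex()
index.add("pythagoras", r"a^2 + b^2 = c^2")
index.add("einstein", r"E = mc^2")
index.add("fraction", r"\frac{a}{b}")

results = index.search(r"x^2 + y^2 = z^2")
for formula_id, score in results:
    print(formula_id, score)
```

The inverted index keeps the candidate set small, since only formulae that share a token with the query are scored at all. A structure-aware similarity could replace the Jaccard score without changing the rest of the sketch.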

Major research achievements include:

·           A DRM-based e-book publishing and application system was developed from our research. The system won the second prize of the "National Science and Technology Progress Award" in 2009. "CHINA DIGITAL LIBRARY", which is built on this system, was presented to foreign research institutes and universities as an official gift by the Prime Minister and other state leaders during their diplomatic visits.

·           Our layout analysis results were adopted by an open document structure standard (CEBX), as well as by a series of software products for data transformation, cross-platform CEBX reading, and real-time typesetting of complex layouts on small-screen mobile devices. The technique is also used in a tool that converts documents from PDF to XML.

·           Our formula retrieval results are used in the Founder Huiyun online learning platform. In addition, a computer-aided reading and learning system for academic texts was developed, which directly recommends relevant learning resources for the formula a user clicks on. This achievement was also adopted by Indiana University, Hong Kong University, Peking University, etc. for their courses.

Fig 2.7 A Typical Usage Example of Mathematical Formula Retrieval

·           A deep learning based layout analysis engine supports automatic localization and structure analysis of complex layout objects, such as formulae and tables, in scanned and camera-captured documents. A camera-based app built on formula recognition was also developed.

·           A framework for chemical formula retrieval was established. To obtain a compact yet efficient hypergraph representation, the cyclic and acyclic subgraphs are optimized. Comparative experiments on the open Wiki dataset show that the proposed method outperforms existing methods in accuracy. For sketched chemical structural formula recognition, we proposed a new method that overcomes the difficulty of segmenting text and graph regions: it separates input strokes into characters and graphs using interactive gestures on mobile devices, achieving accurate recognition together with a good user experience.

Fig 2.8 A Framework of Chemical Structural Formula Retrieval

·           By integrating the proposed algorithms for comic layout understanding, we developed software that produces digital comic books suitable for mobile reading. The software automatically extracts comic panels and speech balloons and identifies the reading sequence with high precision, requiring only a small amount of manual correction. Finally, it converts the comic page images into CEBX documents that can be displayed adaptively on devices with different screen sizes.
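
The reading-sequence step for comic panels mentioned above can be illustrated with a minimal Python sketch. It assumes that panels are already available as axis-aligned bounding boxes and uses a simple row-grouping heuristic; this is not the lab's algorithm, and the names `Panel`, `reading_order`, `right_to_left` and `row_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    """Hypothetical panel bounding box in page coordinates (origin at top-left)."""
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def reading_order(panels, right_to_left=False, row_overlap=0.5):
    """Group panels into rows by vertical overlap, then sort each row horizontally."""
    rows = []
    for panel in sorted(panels, key=lambda p: p.y):
        for row in rows:
            ref = row[0]  # the first panel placed in the row represents it
            overlap = min(ref.y + ref.h, panel.y + panel.h) - max(ref.y, panel.y)
            if overlap >= row_overlap * min(ref.h, panel.h):
                row.append(panel)
                break
        else:
            # No existing row overlaps enough, so this panel starts a new row.
            rows.append([panel])

    ordered = []
    for row in rows:
        row.sort(key=lambda p: p.x, reverse=right_to_left)
        ordered.extend(row)
    return ordered


panels = [
    Panel("C", 0, 110, 210, 100),   # wide panel on the second row
    Panel("B", 110, 5, 100, 95),    # top right
    Panel("A", 0, 0, 100, 100),     # top left
]

print([p.name for p in reading_order(panels)])
print([p.name for p in reading_order(panels, right_to_left=True)])
```

The `right_to_left` flag covers comics that are read from right to left within a row. Irregular layouts, such as panels spanning several rows, would defeat this heuristic, which is consistent with the remark above that some manual correction remains necessary.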

© Copyright 2017 All Rights Reserved
Wangxuan Institute of Computer Technology, Peking University