Marmot Dataset
This page distributes ground-truthed datasets for use in
document analysis and recognition experiments.
Dataset for table recognition
Description
In total, 2000 pages in PDF format were collected, and the corresponding
ground truths were extracted with our semi-automatic ground-truthing
tool "Marmot".
The dataset is composed of Chinese and English pages in a proportion of
about 1:1.
- The Chinese pages were selected from over 120 e-Books covering diverse
  subject areas, provided by the Founder Apabi library; no more than 15
  pages were selected from each book.
- The English pages were crawled from the Citeseer website.
The pages show great variety in language type, page layout, and table
styles. Among them, over 1500 conference and journal papers were crawled,
covering various fields and spanning from 1970 to the latest publications
of 2011.
The e-Book pages are mostly in one-column layout, while the English pages
mix one-column and two-column layouts.
Download
Marmot Dataset v1.0
Table Detection Evaluator v1.0
Dataset for math formula recognition
Description
This is a ground-truth dataset and evaluation tool for mathematical formula
identification. The documents were collected by crawling PDF files from
CiteSeerX.
In total, the dataset contains 400 document pages with 1575 isolated
formulas and 7907 embedded formulas, selected from 194
digitally originated PDF documents.
The dataset includes not only the digitally originated PDF files but also
their corresponding document images. Document metadata is included as
well.
The ground truth of mathematical formulas in each document page includes
the precise bounding boxes of the isolated/embedded formulas.
It also includes the objects (characters, graphics, and images) in each
isolated/embedded formula. For each object, a bounding box is provided.
For character objects, the character's Unicode and font size are provided,
too.
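To make the annotation structure above concrete, the following is a minimal Python sketch of how the ground truth of one formula could be held in memory. It is not the dataset's file format, which this page does not describe: the class names `PageObject` and `Formula`, the field names, the (left, top, right, bottom) box order, and the sample values are all hypothetical. The sketch also assumes that a formula's bounding box is the tight union of its objects' boxes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (left, top, right, bottom); box order and units are assumptions.
BBox = Tuple[float, float, float, float]


@dataclass
class PageObject:
    kind: str                            # "character", "graphic", or "image"
    bbox: BBox
    unicode_char: Optional[str] = None   # set for character objects only
    font_size: Optional[float] = None    # set for character objects only


@dataclass
class Formula:
    kind: str                            # "isolated" or "embedded"
    bbox: BBox
    objects: List[PageObject] = field(default_factory=list)


def union_bbox(boxes: List[BBox]) -> BBox:
    """Return the smallest box that encloses all given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Hypothetical embedded formula made of two character objects.
x = PageObject("character", (10.0, 20.0, 16.0, 30.0), "x", 10.0)
two = PageObject("character", (16.0, 18.0, 20.0, 24.0), "2", 7.0)
formula = Formula("embedded", union_bbox([x.bbox, two.bbox]), [x, two])
```

Keeping the character code and font size optional mirrors the description above, where only character objects carry them.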
An evaluation tool based on the ground-truth dataset is provided. It
relies on the ground-truth format defined for this dataset.
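For readers who want a rough idea of what a bounding-box evaluation can look like, here is a minimal Python sketch. The matching criterion used by the evaluator is not stated on this page; intersection over union (IoU) with a 0.5 threshold and greedy one-to-one matching are assumptions, chosen only because they are a common way to compare detected and ground-truth boxes. The function names and sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not overlap
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_recall(detected, truth, threshold=0.5):
    """Greedily match each detection to at most one ground-truth box."""
    unmatched = list(truth)
    hits = 0
    for box in detected:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is used once
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical boxes: one good detection, one false alarm, one miss.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected = [(1, 1, 10, 10), (50, 50, 60, 60)]
p, r = precision_recall(detected, truth)
```

Results computed this way are not comparable with numbers from the provided evaluator unless its criterion happens to match.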
This dataset is a public database that is freely usable for research
purposes.
Download
Marmot Math Dataset v1.0
Math Formula Detection Evaluator v1.0
Dataset for math formula identification in Chinese documents
Description
This is a ground-truth dataset for mathematical formula
identification in Chinese documents.
In total, the dataset contains 200 document pages with 1166 isolated
formulas and 3022 embedded formulas, selected from 24
digitally originated CEB documents.
The ground truth of mathematical formulas in each document page includes
the precise bounding boxes of the isolated/embedded formulas.
It also includes the objects (characters, graphics, and images) in each
isolated/embedded formula. For each object, a bounding box is provided.
For character objects, the character's Unicode and font size are provided,
too.
This dataset is a public database that is freely usable for research
purposes.
Download
Marmot Chinese Math Dataset v1.0
Dataset for layout analysis of fixed layout documents
Description
This is a ground-truth dataset for layout analysis of
fixed-layout documents.
In total, the dataset contains 244 pages selected from 35
Portable Document Format (PDF) documents. Primitive objects of
page content include text, images and graphics. Primitives are
further grouped into "fragments", which contain proximate
primitives of the same type. For example, text fragments are
usually text lines. Currently, logical labels are assigned to
fragments. Labels include body text, title, figure, figure
annotation, figure caption, figure caption continuation, list
item, list item continuation, table cell, table caption,
equation, page number, footer, header,
footnote, and marginal note.
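The label set above can be written down as an enumeration, which is convenient when tallying fragments per label. This is a minimal Python sketch: the label strings come from the list above, while the `Label` class, the `label_counts` helper, the (primitive type, label text) tuple layout, and the sample page are hypothetical and do not reflect the dataset's actual file format.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    BODY_TEXT = "body text"
    TITLE = "title"
    FIGURE = "figure"
    FIGURE_ANNOTATION = "figure annotation"
    FIGURE_CAPTION = "figure caption"
    FIGURE_CAPTION_CONTINUATION = "figure caption continuation"
    LIST_ITEM = "list item"
    LIST_ITEM_CONTINUATION = "list item continuation"
    TABLE_CELL = "table cell"
    TABLE_CAPTION = "table caption"
    EQUATION = "equation"
    PAGE_NUMBER = "page number"
    FOOTER = "footer"
    HEADER = "header"
    FOOTNOTE = "footnote"
    MARGINAL_NOTE = "marginal note"


def label_counts(fragments):
    """Count fragments per label; each fragment is (primitive_type, label_text)."""
    return Counter(Label(text) for _, text in fragments)


# Hypothetical page with five labeled fragments.
page = [
    ("text", "title"),
    ("text", "body text"),
    ("text", "body text"),
    ("image", "figure"),
    ("text", "figure caption"),
]
counts = label_counts(page)
```

Looking a label up by its text, as in `Label("table cell")`, raises `ValueError` for any string outside the set, which helps catch misspelled labels early.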
This dataset is a public database that is freely usable for research
purposes.
Download
Layout Analysis Dataset v0.1
If you have results to report on this dataset, please send an email to
SACSupport@FOUNDER.COM.CN.
Please also cite the version number of the dataset you used, in order to
facilitate comparison of results. Many thanks for your cooperation!
Copyright (c) 2011 by Institute of Computer Science and Technology of Peking University
and Institute of Digital Publishing of Founder R&D Center, China.
Permission is granted, free of charge, to any person or group obtaining a copy of the dataset and evaluator source code with research motivation only, including without limitation the rights to use, copy, modify, and distribute all the files.
Any commercial use must be permitted by Founder R&D Center; please contact SACSupport@FOUNDER.COM.CN for details.