
The ENP image and ground truth dataset of historical newspapers

Clausner, C, Papadopoulos, C, Pletschacher, S and Antonacopoulos, Apostolos 2015, 'The ENP image and ground truth dataset of historical newspapers', in: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE-CPS, pp. 931-935.


Abstract

This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage from the digitisation projects of 12 national and major European libraries, created within the scope of the large-scale Europeana Newspapers Project (ENP). Every image is accompanied by comprehensive ground truth (Unicode-encoded full text, layout information with precise region outlines, type labels, and reading order) in PAGE format and by searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. In the second part, a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) is given with regard to both text recognition and segmentation/layout analysis performance.

Item Type: Book Section
Schools: Schools > School of Computing, Science and Engineering
Journal or Publication Title: Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR2015)
Publisher: IEEE-CPS
ISBN: 9781479918058
Funders: European Commission
Depositing User: Professor Apostolos Antonacopoulos
Date Deposited: 22 Mar 2016 16:03
Last Modified: 22 Mar 2016 16:03
URI: http://usir.salford.ac.uk/id/eprint/38462
