Aletheia - An advanced document layout and text ground-truthing system for production environments

Clausner, C, Pletschacher, S and Antonacopoulos, A 2011, 'Aletheia - An advanced document layout and text ground-truthing system for production environments', IEEE Xplore Digital Library, pp. 48-52.

    Abstract

    Large-scale digitisation has led to a number of new possibilities with regard to adaptive and learning-based methods in the field of Document Image Analysis and OCR. For ground truth production of large corpora, however, there is still a gap in terms of productivity. Ground truth is crucial not only for training and evaluation at the development stage of tools but also for quality assurance in the scope of production workflows for digital libraries. This paper describes Aletheia, an advanced system for accurate yet cost-effective ground truthing of large numbers of documents. It aids the user with a number of automated and semi-automated tools, which were partly developed and improved based on feedback from major libraries across Europe and from their digitisation service providers, which are using the tool in a production environment. Novel features include support for top-down ground truthing with sophisticated split and shrink tools, as well as bottom-up ground truthing that supports the aggregation of lower-level elements into more complex structures. Special features have been developed to support working with the complexities of historical documents. The integrated rules and guidelines validator, in combination with powerful correction tools, enables efficient production of highly accurate ground truth.

    Item Type: Article
    Additional Information: Proceedings of the 2011 International Conference on Document Analysis and Recognition (ICDAR), 18-21 September 2011, Beijing
    Uncontrolled Keywords: document layout analysis, ground truth, ground-truthing, digitization, layout evaluation, historical documents, document images
    Themes: Media, Digital Technology and the Creative Economy
    Schools: Colleges and Schools > College of Science & Technology
    Colleges and Schools > College of Science & Technology > School of Computing, Science and Engineering > Data Mining and Pattern Recognition Research Centre
    Colleges and Schools > College of Science & Technology > School of Computing, Science and Engineering
    Journal or Publication Title: IEEE Xplore Digital Library
    Publisher: IEEE Computer Society
    Refereed: Yes
    Series Name: ICDAR '11
    ISSN: 1520-5363
    Depositing User: Mr Christian Clausner
    Date Deposited: 05 Oct 2012 14:03
    Last Modified: 20 Aug 2013 18:31
    URI: http://usir.salford.ac.uk/id/eprint/26267
