Quality prediction system for large-scale digitisation workflows

Clausner, C ORCID: https://orcid.org/0000-0001-6041-1002, Pletschacher, S ORCID: https://orcid.org/0000-0003-0541-0968 and Antonacopoulos, Apostolos ORCID: https://orcid.org/0000-0001-9552-0233 2016, 'Quality prediction system for large-scale digitisation workflows', Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS2016), pp. 138-143.


The feasibility of large-scale OCR projects can so far only be assessed by running pilot studies on subsets of the target document collections and measuring the success of different workflows based on precise ground truth, which can be very costly to produce in the required volume. The premise of this paper is that, as an alternative, quality prediction may be used to approximate the success of a given OCR workflow. A new system is thus presented where a classifier is trained using metadata, image and layout features in combination with measured success rates (based on minimal ground truth). Subsequently, only document images are required as input for the numeric prediction of the quality score (no ground truth required). This way, the system can be applied to any number of similar (unseen) documents in order to assess their suitability for being processed using the particular workflow. The usefulness of the system has been validated using a realistic dataset of historical newspaper pages.
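The train-then-predict idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the record does not name the classifier or the exact features, so a k-nearest-neighbour regressor stands in for the trained model, and the three features (scan resolution, fraction of dark pixels, number of text regions), the function names and all numbers are hypothetical.

```python
from math import sqrt


def fit_scaler(rows):
    """Return per-feature (min, max) pairs taken from the training rows."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]


def scale(row, bounds):
    """Min-max scale one feature vector using the training range.

    Constant features map to 0.0 to avoid division by zero.
    """
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, (lo, hi) in zip(row, bounds)
    ]


def predict_quality(train_rows, train_scores, query_row, k=3):
    """Predict a quality score as the mean score of the k nearest training pages."""
    bounds = fit_scaler(train_rows)
    scaled_query = scale(query_row, bounds)
    distances = []
    for row, score in zip(train_rows, train_scores):
        scaled = scale(row, bounds)
        d = sqrt(sum((a - b) ** 2 for a, b in zip(scaled, scaled_query)))
        distances.append((d, score))
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical training pages: (resolution in dpi, fraction of dark pixels, text regions).
train_rows = [
    (300, 0.12, 40),
    (300, 0.15, 38),
    (200, 0.30, 25),
    (150, 0.45, 12),
    (400, 0.10, 45),
]
# Hypothetical OCR success rates, measured once against a small amount of ground truth.
train_scores = [0.92, 0.90, 0.71, 0.48, 0.95]

# An unseen page: only its features are needed, no ground truth.
predicted = predict_quality(train_rows, train_scores, (280, 0.14, 39), k=3)
print(f"predicted quality: {predicted:.2f}")
```

Ground truth is needed only to obtain `train_scores`. Every later page is scored from its features alone, which is what would make such a prediction cheap to apply across a large collection.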

Item Type: Article
Schools: Schools > School of Computing, Science and Engineering
Journal or Publication Title: Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS2016)
Publisher: IEEE
Funders: European Commission
Depositing User: Professor Apostolos Antonacopoulos
Date Deposited: 22 Mar 2016 16:14
Last Modified: 15 Feb 2022 20:32
URI: https://usir.salford.ac.uk/id/eprint/38466
