An open-source framework for integrating multi-source layout and text recognition tools into scalable OCR workflows

Kay-Michael Würzner
Konstantin Baierer

Bibliotheca Baltica 2018 Rostock, 2018-10-05

https://ocr-d.github.io/2018-10-05-baltica/
https://ocr-d.github.io/2018-10-05-baltica/slides.pdf

Overview

  1. Why OCR-D
  2. The OCR-D initiative
  3. Architecture
  4. State of OCR-D tools
  5. Scalability
  6. Open source

Users
want
text data

massive amounts

structured

easily accessible

Libraries
provide
text data

large amounts

unstructured

hard to access

Why the discrepancy?

Underspecified requirements on OCR by funders and users

OCR of historical data of little economic interest

Little competition

Academical solutions non-sustainable

Inflexible workflows

The OCR-D initiative

DFG-organized expert Workshop (2014)
Verfahren zur Verbesserung von OCR-Ergebnissen

Result: A concerted effort for improving OCR is seen as required.

OCR-D Coordination project

Phase 1: Exploring the domain (2015-2017)

Surveyed (open-source) ecosystem
around OCR and OLR

Identified Tasks

Prepared call for proposals for DFG

Phase 2: Module project stage (2018-2019)

Phase 3: Going productive (2018-2020)

Integrate with existing digitization workflow software, e.g. Kitodo

Make OCR-D-developed software uniformly deployable

Advise DFG on OCR requirements for "Praxisrichtlinien"

Not in this talk (but in OCR-D)

Ground Truth for OCR

OCR engine training

OCR research data repository

Workflow composition and provenance

Long-time preservation of OCR

Architecture

"Multi source"

existing tools by OCR-D partners
(tesseract, PoCoTo, LAREX...)

new developments within OCR-D
(font identification, post-correction...)

existing tools outside OCR-D
(ocropus, kraken, ScanTailor, OLENA...)

😍 Modular 😍

😭 Monolithic 😭

Specification

+

Implementation

OCR-D/spec

METS + PAGE-XML (+ ALTO)

Structured tool descriptions

Command line interface

HTTP interface

Actionable documentation

OCR-D/core

Validation and helper functions

Python library

Shell library

Why Python?

Python widely used in computer vision and machine learning
(keras, pytorch...)

Wrapping existing tools with minimal friction
(ocropus, kraken ...)

Bindings for low-level APIs
(opencv, tesserocr ...)

Why shell?

Lowest common denominator

Wrap arbitrary command line tools

Process callout possible in every
framework/workflow engine/programming environment

State of the OCR-D toolset

Preprocessing

Tool Developer Functionality Wrapper
anyOCR DFKI Kaiserslautern binarization, cropping, deskewing, dewarping (python)
OLENA OCR-D binarization shell
tesseract UB Mannheim,
ASV Leipzig
binarization python
OCRopus OCR-D binarization python
kraken OCR-D binarization python
ImageMagick OCR-D binarization, conversion shell

Layout recognition

Tool Developer Functionality Wrapper
anyOCR DFKI Kaiserslautern block+line seg8n, block class7n, document analysis (python)
LAREX Uni Würzburg block+line seg8n, block class7n (shell)
OCRopus OCR-D line seg8n python
kraken OCR-D line seg8n python
tesseract UB Mannheim,
ASV Leipzig
block+line seg8n python
dh_segment OCR-D block+line seg8n (shell)

Text recognition

Tool Developer Functionality Wrapper
OCRopus OCR-D text recognition python
kraken OCR-D text recognition python
tesseract UB Mannheim,
ASV Leipzig
text recognition python
calamari OCR-D text recognition (python)
ocrad OCR-D text recognition (shell)

Postprocessing

Tool Developer Functionality Wrapper
corASV ASV Leipzig post correction (python)
PoCoTo CIS München post correction python
keraslm ASV Leipzig post correction python
ocrevalUAtion OCR-D evaluation (shell)

Your tool?

Scalability

<IMPRESSIVE NUMBER HERE>

Geared towards real digitization scenarios

Cooperation with Kitodo and commercial providers

Frequent reality check with current practices ("Pilotbibliotheken")

Different deployment options

Modularity + uniform interfaces

Adaptive workflows

(Instantiation and composition up to users)

Open Source is more than "Open Source"

Step 1: Get funded!

Step 2: Develop!

Step 3: Publish code!

Sustainability and reuse!

Best Practices

Transparency from day one

Unit tests

Unified test assets

Continuous Integration

Semantic versioning

Docker base image

Releases to GitHub, PyPI, DockerHub

Community

Issues

Pull requests

Code review

Support chat

OCR-D/docs

Developer documentation

"Cookbook"

User guide

Documentation Documentation

🙇 Thank you 🙇

ocr-d.de

ocr-d.github.io

ocr-d.github.io/docs

github.com/OCR-D

gitter.im/OCR-D/Lobby