Developments of the OCR-D Coordination project

Konstantin Baierer

OCR-D MP Dev WS, 2019-02-27

https://kba.cloud/2019-02-27-ocrd-dev-ws/

Overview

  • OCR-D/spec
  • OCR-D/core
  • Side projects
  • Presentations/Papers

OCR-D/spec

Page handling

  • Assigning files to pages via GROUPID is not best practice
  • Standard way of assigning files to pages: <mets:structMap TYPE="PHYSICAL"/>

Page handling (old)


            <mets:fileGrp>
              <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001" GROUPID="page0001">
                <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
              </mets:file>
            </mets:fileGrp>
            

Page handling (new)


            <mets:fileGrp>
              <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001">
                <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
              </mets:file>
            </mets:fileGrp>
            

            <mets:structMap TYPE="PHYSICAL">
              <mets:div TYPE="physSequence">
                <mets:div TYPE="page" ID="page0001">
                  <mets:fptr FILEID="OCR-D-IMG-0001"/>
                </mets:div>
              </mets:div>
            </mets:structMap>
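
A minimal sketch of how a consumer could resolve pages to files from the physical structMap, using only the Python standard library. The helper name files_per_page is hypothetical and not part of OCR-D/core; the METS snippet repeats the example above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

METS_SNIPPET = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="page0001">
        <mets:fptr FILEID="OCR-D-IMG-0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

def files_per_page(mets_xml):
    """Map each page ID to the file IDs its mets:fptr elements point to."""
    root = ET.fromstring(mets_xml)
    xpath = "./mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    pages = {}
    for page in root.findall(xpath, NS):
        pages[page.get("ID")] = [
            fptr.get("FILEID") for fptr in page.findall("mets:fptr", NS)
        ]
    return pages

print(files_per_page(METS_SNIPPET))
```

Because the page is now a mets:div of its own, several files (image, binarized image, PAGE XML) can point to the same page without sharing an attribute value.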
            

Relative paths > file URL

  • Allow relative paths in mets:file

Unified Logging

  • Target: STDOUT
  • Format: "TIME LEVEL LOGGERNAME - MESSAGE"
  • Levels: TRACE, DEBUG, INFO, ERROR, FATAL
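
One way to produce this line format with Python's standard logging module, as a sketch only. The function name make_logger and the logger name are hypothetical. Python has no built-in TRACE level and prints FATAL as CRITICAL, so a mapping for those two levels would still be needed and is left out here.

```python
import logging
import sys

def make_logger(name, stream=None):
    """Return a logger that writes 'TIME LEVEL LOGGERNAME - MESSAGE' lines."""
    # Default target is STDOUT; a different stream can be passed for testing
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
    )
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger

log = make_logger("processor.demo")
log.info("binarization finished")
```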

PAGE-XML consistency

The text assigned to

  • all glyphs of a word
  • all words of a line
  • all lines of a block

should be consistent when concatenating.

PAGE-XML consistency levels

  • lax: Disregard whitespace
  • strict: Strict validation
  • fix: Automatic correction
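
The three levels can be sketched as a single comparison function. This is an illustration under assumptions, not the OCR-D validator: the function name, the choice of separator per level of the hierarchy, and the direction of the fix (the parent text is replaced by the concatenation of its children) are all assumed here.

```python
def check_consistency(parent_text, child_texts, separator, level="strict"):
    """Compare a parent's text with the concatenation of its children's text.

    Returns (is_consistent, text). Under 'fix' the returned text is the
    concatenation of the children; otherwise it is the unchanged parent text.
    """
    concatenated = separator.join(child_texts)
    if level == "fix":
        # Assumed direction of the correction: children win over the parent
        return True, concatenated
    if level == "lax":
        # Disregard whitespace: compare with all whitespace removed
        def squash(text):
            return "".join(text.split())
        return squash(parent_text) == squash(concatenated), parent_text
    if level == "strict":
        return parent_text == concatenated, parent_text
    raise ValueError("unknown consistency level: %s" % level)

# A line whose text has a doubled space compared to its words
line_text = "Hello  world"
words = ["Hello", "world"]
for level in ("strict", "lax", "fix"):
    print(level, check_consistency(line_text, words, " ", level))
```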

File Groups in ocrd-tool.json

The default expected input and output file groups can be provided in ocrd-tool.json


            {
              "tools": {
                "ocrd-kraken-binarize": {
                  "executable": "ocrd-kraken-binarize",
                  "input_file_grp": "OCR-D-IMG",
                  "output_file_grp": "OCR-D-IMG-BIN",
                  ...
                }
              }
            }
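
A short sketch of how a caller might look up these defaults. The helper default_file_groups is hypothetical, and the JSON below contains only the fields shown on the slide; the elided fields are omitted.

```python
import json

OCRD_TOOL = """
{
  "tools": {
    "ocrd-kraken-binarize": {
      "executable": "ocrd-kraken-binarize",
      "input_file_grp": "OCR-D-IMG",
      "output_file_grp": "OCR-D-IMG-BIN"
    }
  }
}
"""

def default_file_groups(ocrd_tool_json, tool_name):
    """Return the (input, output) default file groups declared for a tool."""
    tool = json.loads(ocrd_tool_json)["tools"][tool_name]
    return tool.get("input_file_grp"), tool.get("output_file_grp")

print(default_file_groups(OCRD_TOOL, "ocrd-kraken-binarize"))
```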
            

Basic process information

Processing with the OCR-D/core toolchain records each processing step as a mets:agent


            <mets:agent
              TYPE="OTHER"
              OTHERTYPE="SOFTWARE"
              ROLE="OTHER"
              OTHERROLE="preprocessing/optimization/binarization">
              <mets:name>ocrd_tesserocr v0.1.2</mets:name>
            </mets:agent>
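
For illustration, an element of this shape can be built with the Python standard library as follows. The function make_agent is hypothetical and does not reflect how OCR-D/core creates the entry internally.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def make_agent(name, step):
    """Build a mets:agent describing a software processing step."""
    agent = ET.Element("{%s}agent" % METS_NS, {
        "TYPE": "OTHER",
        "OTHERTYPE": "SOFTWARE",
        "ROLE": "OTHER",
        "OTHERROLE": step,
    })
    ET.SubElement(agent, "{%s}name" % METS_NS).text = name
    return agent

agent = make_agent(
    "ocrd_tesserocr v0.1.2",
    "preprocessing/optimization/binarization",
)
print(ET.tostring(agent, encoding="unicode"))
```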
            

OCRD-ZIP

Based on BagIt, an open standard for archiving data


            $ (cd OCR-D/assets/data/page_dewarp ; find)

            ./bagit.txt
            ./bag-info.txt
            ./manifest-sha512.txt
            ./tagmanifest-sha512.txt
            ./data
            ./data/mets.xml
            ./data/OCR-D-IMG
            ./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_a
            ./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_b
            ./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_a
            ./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_b
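
Since BagIt manifests list one "checksum path" pair per line, the payload of an unpacked OCRD-ZIP can be checked with a few lines of Python. This is a sketch with a hypothetical function name, not the validation done by OCR-D/core; the demo builds a throwaway bag with a single file.

```python
import hashlib
import os
import tempfile

def verify_manifest(bag_dir, manifest_name="manifest-sha512.txt"):
    """Return the payload paths whose SHA-512 digest does not match the manifest."""
    mismatches = []
    with open(os.path.join(bag_dir, manifest_name), encoding="utf-8") as manifest:
        for line in manifest:
            if not line.strip():
                continue
            # Each line is "<hex digest> <path relative to the bag root>"
            expected, rel_path = line.rstrip("\n").split(None, 1)
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                actual = hashlib.sha512(payload.read()).hexdigest()
            if actual != expected.lower():
                mismatches.append(rel_path)
    return mismatches

# Demo on a throwaway bag with a single payload file
with tempfile.TemporaryDirectory() as bag:
    os.makedirs(os.path.join(bag, "data"))
    payload_path = os.path.join(bag, "data", "mets.xml")
    with open(payload_path, "wb") as f:
        f.write(b"<mets/>")
    digest = hashlib.sha512(b"<mets/>").hexdigest()
    with open(os.path.join(bag, "manifest-sha512.txt"), "w", encoding="utf-8") as f:
        f.write("%s  data/mets.xml\n" % digest)
    intact = verify_manifest(bag)
    # Change the payload after the manifest was written
    with open(payload_path, "wb") as f:
        f.write(b"<mets>tampered</mets>")
    tampered = verify_manifest(bag)

print(intact, tampered)
```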
            

OCRD-ZIP

  • Metadata: bagit.txt, bag-info.txt
  • Checksums: manifest-sha512.txt, tagmanifest-sha512.txt
  • Workspace: data/ (mets.xml and the file groups it references)

Font Family

Use the PAGE XML <pg:TextStyle> element with the fontFamily attribute


            <Word>
              <TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>
              <!-- [...] -->
            </Word>
            

Clusters of typesets

  • textura
  • rotunda
  • bastarda
  • antiqua
  • greek
  • hebrew
  • italic
  • fraktur

Confidence of (multiple) font families

Separate multiple font families with commas

Append a colon and a float in the range `0..1` to a font family to express confidence


            <TextStyle fontFamily="Arial:0.8, Times:0.7, Courier:0.4"/>
            <TextStyle fontFamily="Arial:.8, Times:0.5"/>
            <TextStyle fontFamily="Arial:1"/>
            <TextStyle fontFamily="Arial"/>
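
A sketch of how a consumer might parse such a value. The function name is hypothetical, and returning None for a family without a confidence suffix is an assumption; the slide does not say how a missing confidence should be interpreted.

```python
def parse_font_families(value):
    """Split a fontFamily value into (name, confidence) pairs.

    The confidence is None when no ':<float>' suffix is present.
    """
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, conf = entry.rpartition(":")
        if not sep:
            # No colon at all: family without a confidence value
            result.append((entry, None))
            continue
        confidence = float(conf)
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range: %r" % entry)
        result.append((name.strip(), confidence))
    return result

for value in ("Arial:0.8, Times:0.7, Courier:0.4", "Arial:.8, Times:0.5", "Arial"):
    print(parse_font_families(value))
```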
            

Columns

  • Based on CSS grid layout conventions
  • x/y coordinates for columns/rows
  • "Layout": pg:OrderedGroup
  • "Cells": pg:OrderedGroupIndexed

<OrderedGroup caption="column_2_2"> <!-- two-column two-row layout -->
    <OrderedGroupIndexed caption="column_1_1">...</OrderedGroupIndexed> <!-- upper-left column -->
    <OrderedGroupIndexed caption="column_1_2">...</OrderedGroupIndexed> <!-- upper-right column -->
    <OrderedGroupIndexed caption="column_2_1">...</OrderedGroupIndexed> <!-- lower-left column -->
    <OrderedGroupIndexed caption="column_2_2">...</OrderedGroupIndexed> <!-- lower-right column -->
</OrderedGroup>
            

OCR-D/core

3 > 2

Python 2.x not supported anymore

Python 3.4 hardly supported

Python 3.5 okay for now

Use 3.6+ if you can

Refactoring

Untangle separate concerns into individual modules

All part of OCR-D/core (monorepo)

Published as separate modules to PyPI

Single set of tests

Striving for 100% coverage

API docs on ocr-d.github.io

Refactoring

  • ocrd_utils: Shared utility functions and constants
  • ocrd_models: APIs to METS, PAGE, EXIF...
  • ocrd_validators: Validate workspaces, ocrd-tool.json, parameters, OCRD-ZIP...
  • ocrd_modelfactory: PAGE from image, EXIF from filename etc.
  • ocrd: CLI, wrapper code, shell lib...

Improved ocrd process

  • Lightweight workflow executor
  • Chain multiple spec-compliant APIs

            ocrd process -m mets.xml \
              "kraken-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-KRAKEN" \
              "tesserocr-segment -I OCR-D-IMG-BIN-KRAKEN -O OCR-D-SEG-BLOCK -p params.json" \
              "calamari-ocr -I OCR-D-SEG-BLOCK -O OCR-D-OCR-CALA" \
              "cis-aio -I OCR-D-OCR-CALA -O OCR-D-OCR-CIS"
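
The idea of chaining can be illustrated with a small consistency check: every step's input file group must either exist already or be the output of an earlier step. The sketch below is not the implementation of ocrd process. The function names are hypothetical, and prefixing the task name with "ocrd-" to obtain the executable is an assumption based on the executable names used in ocrd-tool.json.

```python
import shlex

def parse_task(task):
    """Split a task string into executable name, input and output file group."""
    tokens = shlex.split(task)
    parsed = {"executable": "ocrd-" + tokens[0], "input": None, "output": None}
    options = tokens[1:]
    # Walk over consecutive (flag, value) pairs
    for flag, value in zip(options, options[1:]):
        if flag == "-I":
            parsed["input"] = value
        elif flag == "-O":
            parsed["output"] = value
    return parsed

def check_chain(tasks, existing_groups):
    """Return the tasks whose input group is neither pre-existing nor produced earlier."""
    available = set(existing_groups)
    broken = []
    for task in tasks:
        step = parse_task(task)
        if step["input"] not in available:
            broken.append(task)
        available.add(step["output"])
    return broken

TASKS = [
    "kraken-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-KRAKEN",
    "tesserocr-segment -I OCR-D-IMG-BIN-KRAKEN -O OCR-D-SEG-BLOCK -p params.json",
    "calamari-ocr -I OCR-D-SEG-BLOCK -O OCR-D-OCR-CALA",
    "cis-aio -I OCR-D-OCR-CALA -O OCR-D-OCR-CIS",
]
print(check_chain(TASKS, {"OCR-D-IMG"}))
```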
            

ocrd zip to work with OCRD-ZIP


Usage: ocrd zip [OPTIONS] COMMAND [ARGS]...

  Bag/Spill/Validate OCRD-ZIP bags

Options:
  --help  Show this message and exit.

Commands:
  bag       Bag workspace as OCRD-ZIP at DEST
  spill     Spill/unpack OCRD-ZIP bag at SRC to DEST SRC must exist an be
            an...
  validate  Validate OCRD-ZIP SRC must exist an be an OCRD-ZIP, either a
            ZIP...
            

Side Projects

kba/mollusc

  • Prototyping multi-engine training
  • Specs for training config, model description, engine parameters...

deutschestextarchiv/tocrify

  • Enrich OCR with structural information from METS

ocrd-train

  • Makefile-based approach to Tesseract training

page-xml-cropper

  • Manually crop to printspace for 1-2 pages of input
  • Automatically/heuristically crop the rest

kba/kitodo-ocrd

  • Kitodo plugin to run OCR-D tools as part of the digitization workflow

ocrd-fork-ocropy

  • Python 3 compatible version
  • Please try!

            pip install ocrd-fork-ocropy
            

Upcoming presentations/papers

DATeCH 2019 #1 - OCR-D in general

  • specs
  • software
  • module projects

DATeCH 2019 #2 - Ground Truth

  • METS + PAGE XML + BagIt + Repositories
  • Implementation of PRIMA ontology for extrinsic and intrinsic GT features

ICDAR 2019 - Multi-engine Training

  • Uniform interface to different OCR engines
  • Specifications of exchange formats
  • Training and application
  • Software Prototype

BID 2019

  • Presentation "Von der Vision zur Umsetzung: Der aktuelle Entwicklungsstand von OCR-D",
    18.03. 09:00
  • Workshop "OCR-D in der Praxis: Ein gemeinsamer Ausblick mit Dienstleistern und Anwendern",
    18.03. 16:00

DHd 2019

  • Workshop "Vom gedruckten Werk zu elektronischem Volltext als Forschungsgrundlage"
    25.03. 14:00

🙇 Thank you 🙇

ocr-d.de

ocr-d.github.io

ocr-d.github.io/docs

github.com/OCR-D

gitter.im/OCR-D/Lobby

And now: https://tinyurl.com/ocrd-2019-02-28