Developments of the
OCR-D Coordination project

Konstantin Baierer

OCR-D MP Dev WS, 2019-02-27

https://kba.cloud/2019-02-27-ocrd-dev-ws/

Overview

OCR-D/spec
OCR-D/core
Side projects
Presentations/Papers

OCR-D/spec

Page handling

Assigning files to pages via GROUPID is not best practice
Standard way of assigning files to pages: <mets:structMap TYPE="PHYSICAL"/>

Page handling (old)


            <mets:fileGrp>
              <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001" GROUPID="page0001">
                <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
              </mets:file>
            </mets:fileGrp>

Page handling (new)


            <mets:fileGrp>
              <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001">
                <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
              </mets:file>
            </mets:fileGrp>


            <mets:structMap TYPE="PHYSICAL">
              <mets:div TYPE="physSequence">
                <mets:div TYPE="page" ID="page0001">
                  <mets:fptr FILEID="OCR-D-IMG-0001"/>
                </mets:div>
              </mets:div>
            </mets:structMap>
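A minimal sketch of how a consumer might read the page assignment from the physical structMap, using only the Python standard library. The function name `files_by_page` and the wrapping `mets:mets` root element are assumptions made for illustration; this is not the OCR-D/core API.

```python
import xml.etree.ElementTree as ET

# METS namespace, needed to resolve the mets: prefix in the XPath below
NS = {"mets": "http://www.loc.gov/METS/"}

# The structMap from the slide, wrapped in a root element (assumed for the demo)
METS_SNIPPET = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="page0001">
        <mets:fptr FILEID="OCR-D-IMG-0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def files_by_page(mets_xml):
    """Map each page ID to the list of file IDs referenced by its mets:fptr children."""
    root = ET.fromstring(mets_xml)
    pages = {}
    xpath = "mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for page in root.findall(xpath, NS):
        pages[page.get("ID")] = [
            fptr.get("FILEID") for fptr in page.findall("mets:fptr", NS)
        ]
    return pages


print(files_by_page(METS_SNIPPET))
```

Because the page is a container of file pointers, one page can reference files from several file groups without any attribute on mets:file itself.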

Relative paths > file URL

Allow relative paths in mets:file
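A possible way to resolve such references, sketched under the assumption that a relative path is interpreted relative to the directory containing mets.xml and that anything with a URL scheme is left untouched. The helper name `resolve_href` and the example paths are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def resolve_href(href, mets_path):
    """Resolve an xlink:href against the location of the METS file.

    Assumption: relative paths are relative to the directory of mets.xml;
    references that carry a URL scheme (http, https, file, ...) are returned as-is.
    """
    if urlparse(href).scheme:
        return href
    return posixpath.join(posixpath.dirname(mets_path), href)


# Hypothetical workspace location, for illustration only
print(resolve_href("OCR-D-IMG/OCR-D-IMG-0001", "/data/ws/mets.xml"))
print(resolve_href("https://example.org/OCR-D-IMG-0001", "/data/ws/mets.xml"))
```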

Unified Logging

Target: STDOUT
Format: "TIME LEVEL LOGGERNAME - MESSAGE"
Levels: TRACE, DEBUG, INFO, ERROR, FATAL
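A sketch of how the format above could be approximated with Python's standard logging module. The helper name `make_logger` and the logger name are assumptions. Note that Python's logging has no built-in TRACE level and reports FATAL under the name CRITICAL, so a real implementation would need extra mapping that is not shown here.

```python
import logging
import sys


def make_logger(name):
    """Create a logger that writes 'TIME LEVEL LOGGERNAME - MESSAGE' lines to STDOUT."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    # Keep messages out of the root logger so they are not printed twice
    logger.propagate = False
    return logger


log = make_logger("processor.example")
log.info("binarizing page0001")
```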

PAGE-XML consistency

The text assigned to

all glyphs of a word
all words of a line
all lines of a block

should be consistent when concatenating.

PAGE-XML consistency levels

lax: Disregard whitespace
strict: Strict validation
fix: Automatic correction
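The three levels can be sketched as follows. This is an illustration, not the OCR-D/core validator: the function names are hypothetical, the separator per hierarchy level (for example empty string for glyphs, space for words, newline for lines) is an assumption, and so is the choice that "fix" overwrites the parent text with the concatenation of its children.

```python
def concatenate(child_texts, separator):
    """Join the texts of the child elements (glyphs, words or lines)."""
    return separator.join(child_texts)


def is_consistent(parent_text, child_texts, separator, level="strict"):
    """Compare the parent's text with the concatenation of its children.

    strict: texts must match exactly.
    lax:    all whitespace is disregarded before comparing.
    """
    joined = concatenate(child_texts, separator)
    if level == "lax":
        return "".join(parent_text.split()) == "".join(joined.split())
    return parent_text == joined


def fix(child_texts, separator):
    """Assumed 'fix' behaviour: derive the parent text from its children."""
    return concatenate(child_texts, separator)


words = ["OCR-D", "Coordination", "project"]
print(is_consistent("OCR-D Coordination project", words, " "))
print(is_consistent("OCR-D  Coordination project", words, " ", level="lax"))
print(fix(words, " "))
```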

File Groups in ocrd-tool.json

The default expected input and output file groups can be provided in ocrd-tool.json


            {
              "tools": {
                "ocrd-kraken-binarize": {
                  "executable": "ocrd-kraken-binarize",
                  "input_file_grp": "OCR-D-IMG",
                  "output_file_grp": "OCR-D-IMG-BIN",
                  ...
                }
              }
            }
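A short sketch of how a wrapper could look up these defaults. The function name `default_file_groups` is hypothetical, and the embedded JSON contains only the keys visible on the slide; the elided remainder is left out rather than guessed.

```python
import json

# Only the keys shown on the slide; the rest of the tool description is omitted
OCRD_TOOL = """
{
  "tools": {
    "ocrd-kraken-binarize": {
      "executable": "ocrd-kraken-binarize",
      "input_file_grp": "OCR-D-IMG",
      "output_file_grp": "OCR-D-IMG-BIN"
    }
  }
}
"""


def default_file_groups(ocrd_tool_json, tool_name):
    """Return (input_file_grp, output_file_grp) for a tool, or None where unset."""
    tool = json.loads(ocrd_tool_json)["tools"][tool_name]
    return tool.get("input_file_grp"), tool.get("output_file_grp")


print(default_file_groups(OCRD_TOOL, "ocrd-kraken-binarize"))
```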

Basic process information

Processing with the OCR-D/core toolchain records each change as a mets:agent


            <mets:agent
              TYPE="OTHER"
              OTHERTYPE="SOFTWARE"
              ROLE="OTHER"
              OTHERROLE="preprocessing/optimization/binarization">
              <mets:name>ocrd_tesserocr v0.1.2</mets:name>
            </mets:agent>
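For illustration, an element like the one above could be produced with the standard library as sketched here. The helper name `make_agent` is an assumption and this is not how OCR-D/core is confirmed to build it; the attribute values are copied from the slide.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
# Serialize with the conventional mets: prefix instead of ns0:
ET.register_namespace("mets", METS_NS)


def make_agent(name, otherrole):
    """Build a mets:agent describing a software processing step."""
    agent = ET.Element(
        "{%s}agent" % METS_NS,
        {
            "TYPE": "OTHER",
            "OTHERTYPE": "SOFTWARE",
            "ROLE": "OTHER",
            "OTHERROLE": otherrole,
        },
    )
    ET.SubElement(agent, "{%s}name" % METS_NS).text = name
    return agent


agent = make_agent(
    "ocrd_tesserocr v0.1.2", "preprocessing/optimization/binarization"
)
print(ET.tostring(agent, encoding="unicode"))
```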

OCRD-ZIP

Based on BagIt, an open standard for archiving data


            $ (cd OCR-D/assets/data/page_dewarp ; find)

            ./bagit.txt
            ./bag-info.txt
            ./manifest-sha512.txt
            ./tagmanifest-sha512.txt
            ./data
            ./data/mets.xml
            ./data/OCR-D-IMG
            ./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_a
            ./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_b
            ./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_a
            ./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_b
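In BagIt, manifest-sha512.txt lists one SHA-512 checksum per payload file, each line being a checksum followed by a path. A minimal verification sketch follows; the function name `verify_manifest` is hypothetical, and the demo builds a tiny throwaway bag in a temporary directory instead of using the asset shown above.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_dir, manifest_name="manifest-sha512.txt"):
    """Return the payload paths whose SHA-512 differs from the manifest entry."""
    bag_dir = Path(bag_dir)
    mismatches = []
    manifest = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each line: <hex checksum> <whitespace> <path relative to the bag root>
        expected, rel_path = line.split(None, 1)
        actual = hashlib.sha512((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected.lower():
            mismatches.append(rel_path)
    return mismatches


# Demo with a throwaway bag containing a single payload file
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    (bag / "data" / "mets.xml").write_bytes(b"<mets/>")
    digest = hashlib.sha512(b"<mets/>").hexdigest()
    (bag / "manifest-sha512.txt").write_text(
        digest + "  data/mets.xml\n", encoding="utf-8"
    )
    intact = verify_manifest(bag)
    # Changing the payload afterwards must be detected
    (bag / "data" / "mets.xml").write_bytes(b"<changed/>")
    tampered = verify_manifest(bag)

print(intact, tampered)
```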

OCRD-ZIP

Metadata: bagit.txt, bag-info.txt
Checksums: manifest-sha512.txt, tagmanifest-sha512.txt
Workspace: data/ (mets.xml and the file group directories)

Font Family

Use the PAGE XML <pg:TextStyle> element with the fontFamily attribute


            <Word>
              <TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>
                <!-- [...] -->
            </Word>

Clusters of typesets

textura
rotunda
bastarda
antiqua
greek
hebrew
italic
fraktur

Confidence of (multiple) font families

Separate multiple font families with comma

Append a colon and a float between 0 and 1 to each font family


            <TextStyle fontFamily="Arial:0.8, Times:0.7, Courier:0.4"/>
            <TextStyle fontFamily="Arial:.8, Times:0.5"/>
            <TextStyle fontFamily="Arial:1"/>
            <TextStyle fontFamily="Arial"/>
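A sketch of a parser for this attribute convention. The function name `parse_font_families` is hypothetical, and returning None for a family without a confidence value is an assumption, since the slide does not say what a missing confidence means.

```python
def parse_font_families(value):
    """Split a fontFamily attribute into (family, confidence) pairs.

    Families are separated by commas; an optional ':<float>' suffix gives the
    confidence. Families without a suffix get None (assumed, not specified).
    """
    result = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, conf = part.rpartition(":")
        if not sep:
            result.append((part, None))
            continue
        confidence = float(conf)
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range 0..1: %r" % part)
        result.append((name.strip(), confidence))
    return result


print(parse_font_families("Arial:0.8, Times:0.7, Courier:0.4"))
print(parse_font_families("Arial:.8, Times:0.5"))
print(parse_font_families("Arial"))
```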

Columns

Based on CSS grid layout conventions
x/y coordinates for columns/rows
"Layout": pg:OrderedGroup
"Cells": pg:OrderedGroupIndexed


<OrderedGroup caption="column_2_2"> <!-- two-column two-row layout -->
    <OrderedGroupIndexed caption="column_1_1">...</OrderedGroupIndexed> <!-- upper-left column -->
    <OrderedGroupIndexed caption="column_1_2">...</OrderedGroupIndexed> <!-- upper-right column -->
    <OrderedGroupIndexed caption="column_2_1">...</OrderedGroupIndexed> <!-- lower-left column -->
    <OrderedGroupIndexed caption="column_2_2">...</OrderedGroupIndexed> <!-- lower-right column -->
</OrderedGroup>
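The caption convention can be read back with a small parser. This sketch assumes, based on the comments in the example, that a cell caption is `column_<row>_<column>` (so column_1_2 is the upper-right cell). The function name `parse_cell_caption` is hypothetical.

```python
import re

# Assumed pattern: column_<row>_<column>, both counted from 1
CAPTION_RE = re.compile(r"^column_(\d+)_(\d+)$")


def parse_cell_caption(caption):
    """Return (row, column) for a cell caption such as 'column_1_2'."""
    match = CAPTION_RE.match(caption)
    if match is None:
        raise ValueError("not a column caption: %r" % caption)
    return int(match.group(1)), int(match.group(2))


for caption in ["column_1_1", "column_1_2", "column_2_1", "column_2_2"]:
    print(caption, parse_cell_caption(caption))
```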

OCR-D/core

3 > 2

Python 2.x not supported anymore

Python 3.4 hardly supported

Python 3.5 okay for now

Use 3.6+ if you can

Refactoring

Untangle separate concerns into individual modules

All part of OCR-D/core (monorepo)

Published as separate modules to PyPI

Single set of tests

Striving for 100% coverage

API docs on ocr-d.github.io

Refactoring

ocrd_utils: Shared utility functions and constants
ocrd_models: APIs to METS, PAGE, EXIF...
ocrd_validators: Validate workspaces, ocrd-tool.json, parameters, OCRD-ZIP...
ocrd_modelfactory: PAGE from image, EXIF from filename etc.
ocrd: CLI, wrapper code, shell lib...

Improved `ocrd process`

Lightweight workflow executor
Chain multiple spec-compliant APIs


            ocrd process -m mets.xml \
              "kraken-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-KRAKEN" \
              "tesserocr-segment -I OCR-D-IMG-BIN-KRAKEN -O OCR-D-SEG-BLOCK -p params.json" \
              "calamari-ocr -I OCR-D-SEG-BLOCK -O OCR-D-OCR-CALA" \
              "cis-aio -I OCR-D-OCR-CALA -O OCR-D-OCR-CIS"

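To illustrate how task strings of the form passed to `ocrd process` could be taken apart, here is a sketch using shlex. It is not the actual implementation: the function name `parse_task` is hypothetical, prefixing the name with "ocrd-" is an assumption inferred from the executable names in ocrd-tool.json, and only the -I, -O and -p flags seen in the example are handled.

```python
import shlex

# Flags seen in the example, mapped to descriptive keys
FLAGS = {"-I": "input", "-O": "output", "-p": "params"}


def parse_task(task):
    """Split one quoted task string into executable name and its options."""
    tokens = shlex.split(task)
    if not tokens:
        raise ValueError("empty task")
    parsed = {
        "executable": "ocrd-" + tokens[0],  # assumed naming convention
        "input": None,
        "output": None,
        "params": None,
    }
    args = iter(tokens[1:])
    for flag in args:
        value = next(args, None)
        if flag not in FLAGS or value is None:
            raise ValueError("cannot parse %r in task %r" % (flag, task))
        parsed[FLAGS[flag]] = value
    return parsed


print(parse_task("kraken-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-KRAKEN"))
```

Chaining then amounts to checking that each task's input file group was produced by an earlier task or already exists in the METS.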
`ocrd zip` to work with OCRD-ZIP


Usage: ocrd zip [OPTIONS] COMMAND [ARGS]...

  Bag/Spill/Validate OCRD-ZIP bags

Options:
  --help  Show this message and exit.

Commands:
  bag       Bag workspace as OCRD-ZIP at DEST
  spill     Spill/unpack OCRD-ZIP bag at SRC to DEST SRC must exist an be
            an...
  validate  Validate OCRD-ZIP SRC must exist an be an OCRD-ZIP, either a
            ZIP...

Side Projects

kba/mollusc

Prototyping multi-engine training
Specs for training config, model description, engine parameters...

deutschestextarchiv/tocrify

Enrich OCR with structural information from METS

ocrd-train

Makefile-based approach to Tesseract training

page-xml-cropper

Manually crop to printspace for 1-2 pages of input
Automatically/heuristically crop the rest

kba/kitodo-ocrd

Kitodo plugin to run OCR-D tools as part of the digitization workflow

ocrd-fork-ocropy

Python 3 compatible version
Please try!


            pip install ocrd-fork-ocropy

Upcoming presentations/papers

DATeCH 2019 #1 - OCR-D in general

specs
software
module projects

DATeCH 2019 #2 - Ground Truth

METS + PAGE XML + BagIt + Repositories
Implementation of PRIMA ontology for extrinsic and intrinsic GT features

ICDAR 2019 - Multi-engine Training

Uniform interface to different OCR engines
Specifications of exchange formats
Training and application
Software Prototype

BID 2019

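Presentation "Von der Vision zur Umsetzung: Der aktuelle Entwicklungsstand von OCR-D" ("From vision to implementation: the current state of development of OCR-D"),
18.03. 09:00
Workshop "OCR-D in der Praxis: Ein gemeinsamer Ausblick mit Dienstleistern und Anwendern" ("OCR-D in practice: a joint outlook with service providers and users"),
18.03. 16:00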

DHd 2019

Workshop "Vom gedruckten Werk zu elektronischem Volltext als Forschungsgrundlage" ("From printed work to electronic full text as a basis for research"),
25.03. 14:00

🙇 Thank you 🙇

gitter.im/OCR-D/Lobby

And now: https://tinyurl.com/ocrd-2019-02-28
