OCR-D MP Dev WS, 2019-02-27
GROUPID not best practice
<mets:structMap TYPE="PHYSICAL"/>
<mets:fileGrp>
  <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001" GROUPID="page0001">
    <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
  </mets:file>
</mets:fileGrp>
<mets:fileGrp>
  <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001">
    <mets:FLocat xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
  </mets:file>
</mets:fileGrp>
<mets:structMap TYPE="PHYSICAL">
  <mets:div TYPE="physSequence">
    <mets:div TYPE="page" ID="page0001">
      <mets:fptr FILEID="OCR-D-IMG-0001"/>
    </mets:div>
  </mets:div>
</mets:structMap>
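To illustrate how the physical structMap takes over the role of GROUPID, here is a minimal sketch using only Python's standard library. It maps each page div to the file IDs it points to. The sample document, the `USE` and `LOCTYPE` attributes in it, and the helper name `files_per_page` are illustrative assumptions, not part of OCR-D/core.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical minimal METS document mirroring the structMap example above.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG-0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/OCR-D-IMG-0001"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="page0001">
        <mets:fptr FILEID="OCR-D-IMG-0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def files_per_page(mets_xml):
    """Return {page ID: [file IDs]} read from the physical structMap."""
    root = ET.fromstring(mets_xml)
    pages = {}
    xpath = "mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.findall(xpath, NS):
        # Every mets:fptr below a page div references one mets:file by ID.
        pages[div.get("ID")] = [
            fptr.get("FILEID") for fptr in div.findall("mets:fptr", NS)
        ]
    return pages


print(files_per_page(SAMPLE_METS))
```

The page grouping lives in one place (the structMap), so files in several file groups can point to the same page without repeating an attribute on every mets:file.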
mets:file
STDOUT
"TIME LEVEL LOGGERNAME - MESSAGE"
TRACE, DEBUG, INFO, ERROR, FATAL
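A minimal sketch of the log line layout with Python's standard `logging` module follows. The logger name `ocrd.example`, the helper `make_logger` and the `%H:%M:%S` time format are assumptions for illustration. Note that Python's `logging` has no built-in TRACE level and treats FATAL as an alias of CRITICAL, so a real implementation would need to map those two levels itself.

```python
import io
import logging
import sys


def make_logger(name, stream=sys.stdout, level=logging.DEBUG):
    """Create a logger writing 'TIME LEVEL LOGGERNAME - MESSAGE' lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(name)s - %(message)s",
        datefmt="%H:%M:%S",  # assumed time format
    ))
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate lines
    logger.setLevel(level)
    logger.propagate = False
    return logger


# Write to an in-memory buffer here; the default stream is STDOUT.
buffer = io.StringIO()
log = make_logger("ocrd.example", stream=buffer)
log.info("binarization finished")
print(buffer.getvalue(), end="")
```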
The text assigned to
should be consistent when concatenating.
lax: Disregard whitespace
strict: Strict validation
fix: Automatic correction
The default expected input and output file groups can be provided in ocrd-tool.json
{
  "tools": {
    "ocrd-kraken-binarize": {
      "executable": "ocrd-kraken-binarize",
      "input_file_grp": "OCR-D-IMG",
      "output_file_grp": "OCR-D-IMG-BIN",
      ...
    }
  }
}
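How a processor might fall back to these defaults can be sketched as follows. The function `resolve_file_groups` is hypothetical, and the JSON is reduced to the keys shown above, with the file groups as plain strings as on the slide.

```python
import json

# Hypothetical ocrd-tool.json content, reduced to the keys shown above.
OCRD_TOOL = json.loads("""
{
  "tools": {
    "ocrd-kraken-binarize": {
      "executable": "ocrd-kraken-binarize",
      "input_file_grp": "OCR-D-IMG",
      "output_file_grp": "OCR-D-IMG-BIN"
    }
  }
}
""")


def resolve_file_groups(tool_json, executable, input_grp=None, output_grp=None):
    """Use explicitly given file groups, else the defaults from ocrd-tool.json."""
    tool = tool_json["tools"][executable]
    return (
        input_grp or tool["input_file_grp"],
        output_grp or tool["output_file_grp"],
    )


# No explicit groups given: both defaults apply.
print(resolve_file_groups(OCRD_TOOL, "ocrd-kraken-binarize"))
# Explicit output group overrides the default.
print(resolve_file_groups(OCRD_TOOL, "ocrd-kraken-binarize", output_grp="OCR-D-IMG-BIN-KRAKEN"))
```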
Processing with OCR-D/core toolchain will record changes as mets:agent
<mets:agent
TYPE="OTHER"
OTHERTYPE="SOFTWARE"
ROLE="OTHER"
OTHERROLE="preprocessing/optimization/binarization">
<mets:name>ocrd_tesserocr v0.1.2</mets:name>
</mets:agent>
OCRD-ZIP is based on BagIt, an open standard for archiving data
$ (cd OCR-D/assets/data/page_dewarp ; find)
./bagit.txt
./bag-info.txt
./manifest-sha512.txt
./tagmanifest-sha512.txt
./data
./data/mets.xml
./data/OCR-D-IMG
./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_a
./data/OCR-D-IMG/OCR-D-IMG-linguistics_thesis_b
./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_a
./data/OCR-D-IMG/OCR-D-IMG-boston_cooking_b
Metadata: ./bagit.txt, ./bag-info.txt
Checksums: ./manifest-sha512.txt, ./tagmanifest-sha512.txt
Workspace: ./data (mets.xml and the file group directories)
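The role of the payload manifest can be illustrated with a short sketch that recomputes checksums and compares them with `manifest-sha512.txt`. It relies only on the BagIt manifest line layout (checksum, whitespace, relative path). The helper `verify_manifest` and the throwaway bag built below are assumptions for illustration, not the validation code in OCR-D/core.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_dir, algorithm="sha512"):
    """Return the payload paths whose checksum differs from the manifest."""
    bag_dir = Path(bag_dir)
    mismatches = []
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.new(algorithm, (bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected.lower():
            mismatches.append(rel_path)
    return mismatches


# Build a throwaway bag with a single payload file and check it twice:
# once untouched, once after modifying the payload.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    payload = bag / "data" / "mets.xml"
    payload.write_bytes(b"<mets/>")
    digest = hashlib.sha512(payload.read_bytes()).hexdigest()
    (bag / "manifest-sha512.txt").write_text(f"{digest}  data/mets.xml\n", encoding="utf-8")
    ok_result = verify_manifest(bag)
    payload.write_bytes(b"<mets>tampered</mets>")
    bad_result = verify_manifest(bag)

print(ok_result, bad_result)
```

A file listed in the manifest but missing on disk raises `FileNotFoundError` in this sketch. A fuller validator would report it as an error and would check `tagmanifest-sha512.txt` the same way.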
Use PAGE XML <pg:TextStyle> element with fontFamily attribute
<Word>
<TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>
<!-- [...] -->
</Word>
Separate multiple font families with comma
Suffix colon and `0..1` float to font family
<TextStyle fontFamily="Arial:0.8, Times:0.7, Courier:0.4"/>
<TextStyle fontFamily="Arial:.8, Times:0.5"/>
<TextStyle fontFamily="Arial:1"/>
<TextStyle fontFamily="Arial"/>
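The four attribute forms above can be read back with a small parser. This is a sketch under assumptions: the function name `parse_font_families` is made up, and a family without a suffix is returned with `None` because the slides do not state a default value.

```python
def parse_font_families(value):
    """Split a fontFamily attribute into (name, score) pairs.

    score is a float in 0..1, or None when no ':score' suffix is given.
    """
    result = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, score = item.rpartition(":")
        if not sep:
            # No colon: plain family name without a score.
            result.append((item, None))
            continue
        number = float(score)
        if not 0.0 <= number <= 1.0:
            raise ValueError("score out of range 0..1: " + item)
        result.append((name.strip(), number))
    return result


for attr in ("Arial:0.8, Times:0.7, Courier:0.4", "Arial:.8, Times:0.5", "Arial:1", "Arial"):
    print(parse_font_families(attr))
```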
x/y coordinates for columns/rows
pg:OrderedGroup
pg:OrderedGroupIndexed
<OrderedGroup caption="column_2_2"> <!-- two-column two-row layout -->
<OrderedGroupIndexed caption="column_1_1">...</OrderedGroupIndexed> <!-- upper-left column -->
<OrderedGroupIndexed caption="column_1_2">...</OrderedGroupIndexed> <!-- upper-right column -->
<OrderedGroupIndexed caption="column_2_1">...</OrderedGroupIndexed> <!-- lower-left column -->
<OrderedGroupIndexed caption="column_2_2">...</OrderedGroupIndexed> <!-- lower-right column -->
</OrderedGroup>
Python 2.x not supported anymore
Python 3.4 hardly supported
Python 3.5 okay for now
Use 3.6+ if you can
Untangle separate concerns into individual modules
All part of OCR-D/core (monorepo)
Published as separate modules to PyPI
Single set of tests
Striving for 100% coverage
API docs on ocr-d.github.io
ocrd_utils: Shared utility functions and constants
ocrd_models: APIs to METS, PAGE, EXIF...
ocrd_validators: Validate workspaces, ocrd-tool.json, parameters, OCRD-ZIP...
ocrd_modelfactory: PAGE from image, EXIF from filename etc.
ocrd: CLI, wrapper code, shell lib...
ocrd process
ocrd process -m mets.xml \
"kraken-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-KRAKEN" \
"tesserocr-segment -I OCR-D-IMG-BIN-KRAKEN -O OCR-D-SEG-BLOCK -p params.json" \
"calamari-ocr -I OCR-D-SEG-BLOCK -O OCR-D-OCR-CALA" \
"cis-aio -I OCR-D-OCR-CALA -O OCR-D-OCR-CIS"
ocrd zip to work with OCRD-ZIP
Usage: ocrd zip [OPTIONS] COMMAND [ARGS]...
Bag/Spill/Validate OCRD-ZIP bags
Options:
--help Show this message and exit.
Commands:
bag Bag workspace as OCRD-ZIP at DEST
spill Spill/unpack OCRD-ZIP bag at SRC to DEST SRC must exist an be
an...
validate Validate OCRD-ZIP SRC must exist an be an OCRD-ZIP, either a
ZIP...
Makefile-based approach to Tesseract training
pip install ocrd-fork-ocropy
And now: https://tinyurl.com/ocrd-2019-02-28