Working with OCR-D-(Ground-Truth)-Repository

Upload BagIt Container from Scratch to OCR-D(-GT)-Repository

This example uploads a single scanned page to the OCR-D repository.

Preparation: Create Workspace

Requirements: ocrd (version 1.0.0). See Setup OCR-D Stack.

Activate virtualenv

user@hostname:~$ source ~/env-ocrd/bin/activate
(env-ocrd) user@hostname:~$
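
Before continuing, it can help to confirm that the virtualenv is really active and that the expected command is found on PATH. The following is a minimal sketch; `check_env_for` is a hypothetical helper name, not part of ocrd. It relies only on the `VIRTUAL_ENV` variable that a virtualenv `activate` script exports.

```shell
# Return 0 only if a virtualenv is active and the given command is on PATH.
# "check_env_for" is a hypothetical helper, not part of the ocrd tooling.
check_env_for() {
  cmd=$1
  # The virtualenv "activate" script exports VIRTUAL_ENV.
  if [ -z "${VIRTUAL_ENV:-}" ]; then
    echo "no virtualenv active" >&2
    return 1
  fi
  # command -v prints the resolved path and fails if nothing is found.
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd not found on PATH" >&2
    return 1
  fi
  echo "using $cmd from $(command -v "$cmd")"
}
```

Calling `check_env_for ocrd` after activation should then report the ocrd executable inside `~/env-ocrd`, assuming the stack was installed there.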

Initialize Workspace

(env-ocrd) user@hostname:~$ ocrd workspace init communist_manifesto
(env-ocrd) user@hostname:~$ cd communist_manifesto

Create Folder for Scanned Page

(env-ocrd) user@hostname:~/communist_manifesto$ mkdir OCR-D-IMG

Download Image (Google)

(env-ocrd) user@hostname:~/communist_manifesto$ wget -O OCR-D-IMG/OCR-D-IMG_0015.jpg

Add Image to Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace add -g P0015 -G OCR-D-IMG -i OCR-D-IMG_0015 -m image/jpeg OCR-D-IMG/OCR-D-IMG_0015.jpg
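
For a document with more than one page, the same command has to be repeated per image. The sketch below only prints the commands (a dry run) so they can be reviewed before execution. It assumes that all images follow the naming scheme of the single example above (`OCR-D-IMG_NNNN.jpg` mapping to page ID `PNNNN`) and uses `image/jpeg`, the registered media type for JPEG files. `print_add_commands` is a hypothetical helper name.

```shell
# Print (do not run) one "ocrd workspace add" command per JPEG in a
# file group directory. Only flags shown in the example above are used.
# Assumption: files are named <GROUP>_<NNNN>.jpg and the page ID is P<NNNN>.
print_add_commands() {
  group_dir=$1                              # e.g. OCR-D-IMG
  group=$(basename -- "$group_dir")         # file group name
  for img in "$group_dir"/*.jpg; do
    [ -e "$img" ] || continue               # unmatched glob stays literal
    file_id=$(basename -- "$img" .jpg)      # e.g. OCR-D-IMG_0015
    page_id="P${file_id##*_}"               # e.g. P0015
    printf 'ocrd workspace add -g %s -G %s -i %s -m image/jpeg %s\n' \
      "$page_id" "$group" "$file_id" "$img"
  done
}
```

Running `print_add_commands OCR-D-IMG` inside the workspace lists one command per image; piping the output to `sh` would execute them.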

Set Unique ID for Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace set-id 'communist_manifesto'

Validate Workspace

Some images do not have their resolution set. To avoid validation errors, the resolution check is skipped here. For further details see 'ocrd workspace validate --help'.

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace validate --skip pixel_density mets.xml

Create BagIt Container

(env-ocrd) user@hostname:~/communist_manifesto$ cd ..
(env-ocrd) user@hostname:~/$ ocrd zip bag -i communist_manifesto -d communist_manifesto/

Validate BagIt Container

(env-ocrd) user@hostname:~/$ ocrd zip validate

Upload BagIt Container

user@hostname:~/$ curl -u ingest:GENERATED_PASSWORD -v -F "" http://localhost:8080/api/v1/metastore/bagit 

Download all BagIt Containers

user@hostname:~/Download$ wget -O listOfContainers.json

user@hostname:~/Download$ ocrdzips=$(cat listOfContainers.json | tr ",[]\"" "\n")

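user@hostname:~/Download$ for addr in $ocrdzips
do
  # Download the container and derive a directory name from the file name
  wget "$addr"
  filename=$(basename -- "$addr")
  directory="${filename%.*}"

  # Unpack the container into its own directory
  mkdir "$directory"
  cd "$directory"
  unzip "../$filename"
  cd ..
done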
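
The `tr` call above turns the JSON list into whitespace-separated URLs by replacing commas, brackets and quotes with newlines. A minimal sketch of that step in isolation, assuming the list is a flat JSON array of URL strings; the helper names `extract_urls` and `unpack_dir_name` are hypothetical.

```shell
# Turn a flat JSON array of URL strings (read from stdin) into one URL
# per line. Assumes the URLs themselves contain no commas, brackets or quotes.
extract_urls() {
  tr ',[]"' '\n\n\n\n' | sed '/^[[:space:]]*$/d'
}

# Derive the directory name used for unpacking: the file name of the URL
# without its last extension.
unpack_dir_name() {
  filename=$(basename -- "$1")
  printf '%s\n' "${filename%.*}"
}
```

Used as `extract_urls < listOfContainers.json`, this prints the addresses the loop iterates over. Note that `${filename%.*}` strips only the last extension, so a file with a double extension keeps the first one in its directory name.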

List all Documents (in Browser)

The list shows all ingested documents with its

Download Document

Download the complete document as a BagIt container.

List all Files inside Document

Lists all files of the given resourceID that are referenced inside the mets.xml.

Download Single File

Download or view a single file (TIFF) for a given resourceID, file group and fileID.

List Metadata

Lists the metadata of the document (e.g. title, author, year, identifier, languages, classifications) for the given resourceID.

List Ground Truth Metadata

Lists all semantic labels of the given resourceID.

Search Inside Repository

All searches return a list of matching resourceIDs. To investigate the found resources further, use the listings above.

Search via browser

Search on command line

Search for Semantic Label

Search for documents with e.g. uneven illumination.

Search for Documents Containing Multiple Semantic Labels at Once

,condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

Search for Documents with Classification ‘Fachtext’

Search for Documents with Language ‘deu’

Search for Documents with Identifier ‘16488’

Search for Documents with Specific Identifier and Type

Search for a document with a specific identifier of a specific type. Possible types are:

user@hostname:~/Download$ allocrdzips=$(cat listOfAllContainers.json | tr ",[]\"" "\n")

Get IDs of fitting containers

user@hostname:~/Download$ wget -O filteredList.json

user@hostname:~/Download$ filteredIds=$(cat filteredList.json | tr ",[]\"" "\n")

user@hostname:~/Download$ for bagitid in $filteredIds
do
  for addr in $allocrdzips
  do
    # Only fetch containers whose address contains a matching ID
    if echo "$addr" | grep -q "$bagitid"; then
      wget "$addr"
      filename=$(basename -- "$addr")
      directory="${filename%.*}"

      mkdir "$directory"
      cd "$directory"
      unzip "../$filename"
      cd ..
    fi
  done
done
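
The matching step of the loop above can be tried without downloading anything. The following sketch prints the container addresses that would be fetched. It assumes, as the loop does, that a container belongs to a search hit when its address contains the ID as a substring; `filter_urls_by_id` is a hypothetical helper name and the addresses in the usage note are made up.

```shell
# Print every address from the second list that contains one of the IDs
# from the first list. Both arguments are whitespace-separated lists.
# The body runs in a subshell so that "set -f" (no globbing during word
# splitting) does not leak into the calling shell.
filter_urls_by_id() (
  set -f
  for id in $1; do
    for url in $2; do
      case $url in
        *"$id"*) printf '%s\n' "$url" ;;   # literal substring match
      esac
    done
  done
)
```

For example, `filter_urls_by_id "$filteredIds" "$allocrdzips"` lists the matches. Because this is a substring match, a short ID can also match unrelated addresses, so the IDs should be specific enough to be unambiguous.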