This document describes an exchange format to bundle a workspace described by a METS file following OCR-D’s conventions.
METS is OCR-D's exchange format of choice for describing relations between files, such as images, and metadata about those images, such as PAGE or ALTO files. METS is a textual format, not suitable for embedding arbitrary, potentially binary, data. For various use cases (such as transfer via network, long-term preservation and reproducible tests) it is desirable to have a self-contained representation of a workspace.
With such a representation, data producers are not forced to provide dereferenceable HTTP URLs for the files they produce and data consumers are not forced to dereference every HTTP URL.
While METS does have mechanisms for embedding XML data and even base64-encoded binary data, the tradeoffs in file size, parsing speed and readability are too great to make this a viable solution for a mass digitization scenario.
Instead, we propose an exchange format (“OCRD-ZIP”) based on the BagIt specification, which the web archiving community has adopted for data ingestion.
As a baseline, an OCRD-ZIP must adhere to v0.97+ of the BagIt specs, i.e.
- all files in
- a file
- a file
In accordance with the BagIt standard,
bagit.txt MUST consist of exactly
these two lines:
BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8
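The requirement on bagit.txt can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name `is_valid_bagit_txt` is hypothetical, and tolerating CRLF line endings and a missing trailing newline is an assumption.

```python
REQUIRED_LINES = [
    "BagIt-Version: 1.0",
    "Tag-File-Character-Encoding: UTF-8",
]


def is_valid_bagit_txt(content: str) -> bool:
    """Return True if content consists of exactly the two required lines.

    Assumption: CRLF line endings and a missing or repeated trailing
    newline are tolerated; the text above does not address either case.
    """
    lines = content.replace("\r\n", "\n").rstrip("\n").split("\n")
    return lines == REQUIRED_LINES
```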
bag-info.txt MUST additionally contain these tags:
bag-info.txt MAY additionally contain these tags:
Ocrd-Manifestation-Depth: Whether all URLs are dereferenced as files or only some
BagIt-Profile-Identifier must be the string
Ocrd-Mets can be provided to declare that the METS file will not be the
mets.xml but another path relative to
Implementations MUST check for the
Ocrd-Mets tag: If it has a value, look for the
METS file at that location, relative to
/data. Otherwise, assume the default
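The lookup rule for the METS file can be sketched as follows. This is an illustrative sketch only: `parse_bag_info` and `resolve_mets_path` are hypothetical names, the parser handles just simple `Label: value` lines plus indented continuation lines, and the default `mets.xml` is taken from the profile in Appendix A.

```python
from pathlib import Path


def parse_bag_info(text: str) -> dict:
    """Parse 'Label: value' lines into a dict (a repeated label overwrites)."""
    tags = {}
    last_label = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last_label is not None:
            # Indented line: continuation of the previous value
            tags[last_label] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            last_label = label.strip()
            tags[last_label] = value.strip()
    return tags


def resolve_mets_path(bag_root: Path, bag_info_text: str) -> Path:
    """Return the METS location: Ocrd-Mets if it has a value, else mets.xml."""
    tags = parse_bag_info(bag_info_text)
    # An absent or empty Ocrd-Mets tag falls back to the default name
    mets_name = tags.get("Ocrd-Mets") or "mets.xml"
    return bag_root / "data" / mets_name
```

For example, `resolve_mets_path(Path("bag"), "Ocrd-Mets: sub/other.xml\n")` yields `bag/data/sub/other.xml`.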
Specify whether the bag contains the full manifestation of the data referenced in the METS (
or only those files that were
file:// URLs before (
A globally unique identifier identifying the work/works/parts of works this bundle of files represents.
This is to be used for repositories to identify new ingestions of existing works.
To ensure global uniqueness, the identifier should be prefixed with an identifier of the organization, e.g. an ISIL or domain name.
The SHA512 checksum of the
manifest-sha512.txt file of the version this bag
was based on, if any.
An OCRD-ZIP MUST be serialized as a ZIP file.
Checksums for the files in
/data must be calculated with the
algorithm only and provided as
Since the checksum of this manifest file can be relevant (see
Ocrd-Base-Version-Checksum), in addition to the requirements
of the BagIt spec, the entries MUST be sorted.
NOTE: These checksums can be generated with
find data -type f | sort -sf | xargs sha512sum > manifest-sha512.txt.
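Where a shell is not available, the manifest could be produced along the following lines. This is a sketch under assumptions: `sha512_of` and `build_manifest` are hypothetical names, and entries are sorted by plain string comparison of the relative path, which can differ from the case-folding order of `sort -sf` when file names mix upper and lower case.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bag_root: Path) -> str:
    """Return the text of manifest-sha512.txt for all files below data/."""
    data_dir = bag_root / "data"
    # Assumption: sort by relative POSIX path using plain string comparison
    rel_paths = sorted(
        path.relative_to(bag_root).as_posix()
        for path in data_dir.rglob("*")
        if path.is_file()
    )
    return "".join(
        f"{sha512_of(bag_root / rel)}  {rel}\n" for rel in rel_paths
    )
```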
File names must be relative to METS
Within an OCRD-ZIP, all local file resources referenced in the METS (and consequently all those referenced in other files within the workspace – see rule “If in PAGE then in METS”) must be relative to the location of the METS file.
/tmp/foo/ws1/data
├── mets.xml
├── foo.tif
└── foo.xml
file:///tmp/foo/ws1/data/foo.tif (file URL scheme with absolute path)
file:///foo.tif (relative path written as absolute path)
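A validator might screen references with a check like the one below. It is a deliberately strict sketch, not the normative rule: the name `is_relative_local_ref` is hypothetical, and the function assumes that any reference carrying a URL scheme or a leading slash is not a path relative to the METS file.

```python
from urllib.parse import urlparse


def is_relative_local_ref(href: str) -> bool:
    """Return True if href looks like a path relative to the METS file.

    Assumption: anything with a URL scheme (file://, http://, ...) or a
    leading slash is treated as not relative.
    """
    if urlparse(href).scheme:
        return False
    return not href.startswith("/")
```

With the directory layout above, `foo.tif` passes the check, while `file:///tmp/foo/ws1/data/foo.tif` and `file:///foo.tif` do not.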
When in data then in METS
All files except
mets.xml itself that are contained in
data directory must
be referenced in a
mets:file/mets:FLocat in the
When in METS and not in data
In a partial OCRD-ZIP, not all files referenced in the METS need to be part of the payload. Files that are not part of the payload have to be listed in fetch.txt and in all payload manifest files.
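The two rules "When in data then in METS" and "When in METS and not in data" can be checked together once the relevant paths have been collected. The sketch below assumes the caller has already extracted three collections of paths relative to the data directory; the function name and its parameters are hypothetical.

```python
def check_payload_consistency(mets_refs, data_files, fetch_files,
                              mets_name="mets.xml"):
    """Return (unreferenced, unresolved) as sorted lists of relative paths.

    mets_refs:   local paths referenced in the METS
    data_files:  files actually present in the data directory
    fetch_files: files listed in fetch.txt
    All three are assumed to be relative to the data directory.
    """
    mets_refs = set(mets_refs)
    data_files = set(data_files)
    fetch_files = set(fetch_files)
    # When in data then in METS: every payload file except the METS itself
    unreferenced = sorted(data_files - mets_refs - {mets_name})
    # When in METS and not in data: the file must be listed in fetch.txt
    unresolved = sorted(mets_refs - data_files - fetch_files)
    return unreferenced, unresolved
```

Two empty lists indicate that both rules hold for the given inputs.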
Optional metadata about the payload
In addition to the actual data files in
/data, the following metadata files
are allowed to be present in the root of the bag:
README.md: An extended, human-readable description of the dataset in the Markdown syntax
Makefile: A GNU make build file to reproduce the data in
build.sh: A bash script to reproduce the data in
sources.csv: A comma-separated values list to be used in the scripts. For straightforward HTTP downloads, prefer fetch.txt.
These files are purely for documentation and should not be used by processors in any way.
Packing a workspace as OCRD-ZIP
To pack a workspace to OCRD-ZIP:
- Create a temporary folder
f in the source METS:
file:// from the beginning of the
- If it is not a file path (begins with
- Download/Copy the file to a location within
TMP/data. The structure SHOULD be
USE attribute of the parent
ID attribute of the
- Replace the URL of
f with the path relative to
mets:FLocat of the METS
- all other files in the workspace, esp. PAGE-XML
- Write out the changed METS to
TMP as a BagIt bag
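The final serialization step can be sketched with Python's standard `zipfile` module. The function name `zip_bag` is hypothetical. Placing the bag's files directly at the root of the archive is an assumption inferred from the unpacking steps, which expect the data folder directly below the unzipped location.

```python
import zipfile
from pathlib import Path


def zip_bag(bag_dir: Path, target: Path) -> None:
    """Write every file below bag_dir into the ZIP file target.

    Archive member names are relative to bag_dir. The target should lie
    outside bag_dir so that the archive does not include itself.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bag_dir).as_posix())
```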
Unpacking OCRD-ZIP to a workspace
- Unzip OCRD-ZIP
z to a folder
- If the value
Ocrd-Mets is different from
TMP/data to an appropriate location to use as a workspace
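The unzip and move steps can be sketched as below. The function name `unpack_ocrd_zip` is hypothetical, the handling of a non-default Ocrd-Mets value is left out, and the target workspace directory is assumed not to exist yet.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unpack_ocrd_zip(ocrd_zip: Path, workspace: Path) -> Path:
    """Unzip ocrd_zip to a temporary folder and move its data/ to workspace.

    Assumption: workspace does not exist yet, so the data directory is
    renamed to it rather than moved inside an existing directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(ocrd_zip) as archive:
            archive.extractall(tmp)
        shutil.move(str(Path(tmp) / "data"), str(workspace))
    return workspace
```

After the call, the METS file is expected directly inside the returned workspace directory.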
Appendix A - BagIt profile definition
BagIt-Profile-Info:
  BagIt-Profile-Identifier: https://ocr-d.github.io/bagit-profile.json
  BagIt-Profile-Version: '1.2.0'
  Source-Organization: OCR-D
  External-Description: BagIt profile for OCR data
  Contact-Name: Konstantin Baierer
  Contact-Email: email@example.com
  Version: 0.1
Bag-Info:
  Bagging-Date:
    required: false
  Source-Organization:
    required: false
  Ocrd-Mets:
    required: false
    default: 'mets.xml'
  Ocrd-Manifestation-Depth:
    required: false
    default: partial
    values: ["partial", "full"]
  Ocrd-Identifier:
    required: true
  Ocrd-Checksum:
    required: false
    # echo -n | sha512sum
    default: 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e'
Manifests-Required: ['sha512']
Tag-Manifests-Required:
Tag-Files-Required:
Tag-Files-Allowed:
  - README.md
  - Makefile
  - build.sh
  - sources.csv
  - metadata/*.xml
  - metadata/*.txt
Allow-Fetch.txt: true
Serialization: required
Accept-Serialization: application/zip
Accept-BagIt-Version:
  - '1.0'
Appendix B - IANA considerations
Proposed media type of OCRD-ZIP: