ocrd.processor.base module

Processor base class and helper functions.

class ocrd.processor.base.Processor(workspace, ocrd_tool=None, parameter=None, input_file_grp='INPUT', output_file_grp='OUTPUT', page_id=None, show_resource=None, list_resources=False, show_help=False, subcommand=None, show_version=False, dump_json=False, dump_module_dir=False, version=None)[source]

Bases: object

A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing. That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or requested physical pages of the input fileGrp(s), and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.

Instantiate, but do not process. Unless one of list_resources, show_resource, show_help, show_version, dump_json or dump_module_dir is set, set up for processing (parsing and validating parameters, entering the workspace directory).

Parameters:

workspace (Workspace) – The workspace to process. Can be None even for processing (esp. on multiple workspaces), but then needs to be set before running.

Keyword Arguments:
  • ocrd_tool (string) – JSON of the ocrd-tool description for that processor. Can be None for processing, but needs to be set before running.

  • parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.

  • input_file_grp (string) – comma-separated list of METS fileGrps used for input.

  • output_file_grp (string) – comma-separated list of METS fileGrps used for output.

  • page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages).

  • show_resource (string) – If not None, then instead of processing, resolve given resource by name and print its contents to stdout.

  • list_resources (boolean) – If true, then instead of processing, find all installed resource files in the search paths and print their path names.

  • show_help (boolean) – If true, then instead of processing, print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.

  • subcommand (string) – ‘worker’ or ‘server’; only used here to select the right --help output.

  • show_version (boolean) – If true, then instead of processing, print information on this processor’s version and OCR-D version. Exit afterwards.

  • dump_json (boolean) – If true, then instead of processing, print ocrd_tool on stdout.

  • dump_module_dir (boolean) – If true, then instead of processing, print moduledir on stdout.

show_help(subcommand=None)[source]
show_version()[source]
verify()[source]

Verify that the input_file_grp fulfills the processor’s requirements.

process()[source]

Process the workspace from the given input_file_grp to the given output_file_grp for the given page_id under the given parameter.

(This contains the main functionality and needs to be overridden by subclasses.)
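The override pattern can be sketched without OCR-D installed. Everything here is an assumption for illustration: StubProcessor is a hypothetical stand-in that mirrors only a few constructor argument names from the signature above (real code would derive from ocrd.processor.base.Processor), the workspace is modelled as a plain dict, and UppercaseStep is an invented subclass.

```python
class StubProcessor:
    """Hypothetical stand-in for ocrd.processor.base.Processor (not the real class)."""

    def __init__(self, workspace, parameter=None,
                 input_file_grp='INPUT', output_file_grp='OUTPUT', page_id=None):
        self.workspace = workspace
        self.parameter = parameter or {}
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.page_id = page_id

    def process(self):
        # The stand-in base class carries no processing logic of its own.
        raise NotImplementedError("subclasses must override process()")


class UppercaseStep(StubProcessor):
    """Invented subclass: 'processes' each page text by upper-casing it."""

    def process(self):
        # Workspace modelled as a dict: fileGrp name -> {page ID: text}.
        pages = self.workspace[self.input_file_grp]
        # Empty page_id means "all pages"; otherwise a comma-separated list.
        wanted = self.page_id.split(',') if self.page_id else list(pages)
        out = self.workspace.setdefault(self.output_file_grp, {})
        for page in wanted:
            out[page] = pages[page].upper()


workspace = {'INPUT': {'P1': 'abc', 'P2': 'def'}}
UppercaseStep(workspace, page_id='P1').process()
```

Only the requested page ends up in the output group; instantiating the stand-in base class and calling process() directly raises instead.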

add_metadata(pcgts)[source]

Add PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to PcGtsType pcgts.

resolve_resource(val)[source]

Resolve a resource name to an absolute file path with the algorithm in https://ocr-d.de/en/spec/ocrd_tool#file-parameters

Parameters:

val (string) – resource value to resolve

list_all_resources()[source]

List all resources found in the filesystem whose content-type matches, as determined by filename suffix.

property module

The top-level module this processor belongs to.

property moduledir

The filesystem path of the module directory.

property input_files

List the input files (for single-valued input_file_grp).

For each physical page:

  • If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)

  • Else if there is a single image file, take it (and forget about all other files for that page)

  • Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)

Algorithm: https://github.com/cisocrgroup/ocrd_cis/pull/57#issuecomment-656336593

Returns:

A list of ocrd_models.ocrd_file.OcrdFile objects.
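The per-page selection rule above can be sketched on plain tuples. This is not the library's implementation: select_input_files is a hypothetical helper, files are modelled as (page ID, MIME type, file ID) tuples instead of OcrdFile objects, and the sketch follows the three bullets literally.

```python
# MIME type conventionally used for PAGE-XML files in OCR-D.
PAGE_MIMETYPE = 'application/vnd.prima.page+xml'


def select_input_files(files):
    """Pick one file ID per page from (page_id, mimetype, file_id) tuples."""
    by_page = {}
    for page_id, mimetype, file_id in files:
        by_page.setdefault(page_id, []).append((mimetype, file_id))
    selected = []
    for page_id, candidates in by_page.items():
        page_xml = [f for m, f in candidates if m == PAGE_MIMETYPE]
        images = [f for m, f in candidates if m.startswith('image/')]
        if len(page_xml) == 1:
            # A single PAGE-XML wins; other files for the page are ignored.
            selected.append(page_xml[0])
        elif len(images) == 1:
            # Otherwise fall back to a single image file.
            selected.append(images[0])
        else:
            raise ValueError(f"cannot choose a single input file for page {page_id}")
    return selected


files = [
    ('P1', PAGE_MIMETYPE, 'seg_1'),
    ('P1', 'image/png', 'img_1a'),
    ('P1', 'image/png', 'img_1b'),
    ('P2', 'image/tiff', 'img_2'),
]
chosen = select_input_files(files)
```

Page P1 has two images, which is acceptable only because a single PAGE-XML is present for it; the same two images without the PAGE-XML would raise.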

zip_input_files(require_first=True, mimetype=None, on_error='skip')[source]

List tuples of input files (for multi-valued input_file_grp).

Processors that expect or need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.

Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.

Single-page multiple-file errors are handled according to on_error:

  • if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)

  • if first, then the first matching file for the page will be silently selected (as if the first was the only match)

  • if last, then the last matching file for the page will be silently selected (as if the last was the only match)

  • if abort, then an exception will be raised.

Multiple matches for PAGE-XML will always raise an exception.
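The on_error policy for one page in one fileGrp can be sketched as a small function. resolve_multiple and its is_page_xml flag are hypothetical names introduced for illustration; the real method applies this logic internally and may differ in detail, for example in the exception type.

```python
def resolve_multiple(matches, on_error='skip', is_page_xml=False):
    """Reduce the matches for one page and fileGrp to a single file or None."""
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    # From here on there are several matches for the same page.
    if is_page_xml or on_error == 'abort':
        # Multiple PAGE-XML matches are always an error, whatever on_error says.
        raise ValueError(f"multiple matches for one page: {matches}")
    if on_error == 'skip':
        return None          # as if there was no match at all
    if on_error == 'first':
        return matches[0]    # as if the first was the only match
    if on_error == 'last':
        return matches[-1]   # as if the last was the only match
    raise ValueError(f"unknown on_error value: {on_error!r}")
```

For example, resolve_multiple(['a', 'b'], on_error='last') selects 'b', while the same call with is_page_xml=True raises.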

Keyword Arguments:
  • require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.

  • mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.

Returns:

A list of ocrd_models.ocrd_file.OcrdFile tuples.
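The alignment across file groups, including the effect of require_first, can be sketched as follows. zip_pages is a hypothetical helper, each fileGrp is modelled as a dict from page ID to file ID (after any per-page selection), and filling gaps with None is an assumption of this sketch rather than something stated above.

```python
def zip_pages(groups, require_first=True):
    """Align per-page files across fileGrps.

    groups: list of dicts, one per input fileGrp in order, each mapping
    page ID -> file ID. Returns one tuple per page, with None for gaps.
    """
    # Collect page IDs in order of first appearance across all groups.
    page_ids = []
    for group in groups:
        for page_id in group:
            if page_id not in page_ids:
                page_ids.append(page_id)
    tuples = []
    for page_id in page_ids:
        row = tuple(group.get(page_id) for group in groups)
        if require_first and row[0] is None:
            continue  # page missing from the first fileGrp: skip it entirely
        tuples.append(row)
    return tuples


groups = [
    {'P1': 'a1', 'P2': 'a2'},   # first input fileGrp
    {'P2': 'b2', 'P3': 'b3'},   # second input fileGrp
]
strict = zip_pages(groups)
lenient = zip_pages(groups, require_first=False)
```

With require_first left at its default, page P3 is dropped because the first group has no file for it; with require_first=False it is kept with a gap in the first position.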

ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None, subcommand=None)[source]

Generate a string describing the full CLI of this processor including params.

Parameters:
  • ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json

  • processor_instance – the processor implementation (for adding any module/class/function docstrings)

ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, log_level=None, log_filename=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None)[source]

Open a workspace and run a processor on the command line.

If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).

Run the processor CLI executable on the workspace, passing:

  • the workspace

  • page_id

  • input_file_grp

  • output_file_grp

  • parameter (after applying any parameter_override settings)

(Will create output files and update the METS in the filesystem.)

Parameters:

executable (string) – Executable name of the module processor.
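How such a call might translate into a command line can be sketched as below. build_cli_args is a hypothetical helper, not part of the module; the long option names are taken from the OCR-D command-line interface, while the way run_cli itself assembles and launches the command is not described above and is assumed here. The executable and fileGrp names are example values only.

```python
import shlex


def build_cli_args(executable, mets_path, working_dir=None, page_id=None,
                   input_file_grp=None, output_file_grp=None,
                   parameter=None, overwrite=False):
    """Assemble an argv list for a processor executable (illustrative helper)."""
    args = [executable, '--mets', mets_path]
    if working_dir:
        args += ['--working-dir', working_dir]
    if page_id:
        args += ['--page-id', page_id]
    if input_file_grp:
        args += ['--input-file-grp', input_file_grp]
    if output_file_grp:
        args += ['--output-file-grp', output_file_grp]
    if parameter:
        args += ['--parameter', parameter]
    if overwrite:
        args.append('--overwrite')
    return args


args = build_cli_args('ocrd-dummy', 'mets.xml',
                      input_file_grp='OCR-D-IMG',
                      output_file_grp='OCR-D-COPY',
                      parameter='{}')
command_line = shlex.join(args)  # shell-quoted form, e.g. for logging
```

The resulting list could be handed to subprocess.run; keeping it as a list rather than a single string avoids shell quoting problems with JSON parameters.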

ocrd.processor.base.run_processor(processorClass, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, show_resource=None, list_resources=False, parameter=None, parameter_override=None, working_dir=None, mets_server_url=None, instance_caching=False)[source]

Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.

If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).

Instantiate a Python object for processorClass, passing:

  • the workspace

  • page_id

  • input_file_grp

  • output_file_grp

  • parameter (after applying any parameter_override settings)

Warning: Avoid setting the instance_caching flag to True. It may have unexpected side effects. This flag is used for an experimental feature we would like to adopt in the future.

Run the processor on the workspace (creating output files in the filesystem).

Finally, write back the workspace (updating the METS in the filesystem).

Parameters:

processorClass (object) – Python class of the module processor.
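Applying parameter_override on top of parameter can be sketched as a dictionary merge. The format of parameter_override is not specified above; this sketch assumes a list of (key, value) string pairs whose values are parsed as JSON where possible, and apply_overrides is a hypothetical helper. The parameter names in the example are invented.

```python
import json


def apply_overrides(parameter, parameter_override=None):
    """Return a copy of parameter with (key, value) overrides applied."""
    merged = dict(parameter or {})
    for key, value in parameter_override or []:
        try:
            # Interpret values such as '600', 'true' or '[1, 2]' as JSON.
            merged[key] = json.loads(value)
        except json.JSONDecodeError:
            # Anything that is not valid JSON is kept as a plain string.
            merged[key] = value
    return merged


base = {'dpi': 300, 'model': 'a'}
effective = apply_overrides(base, [('dpi', '600'), ('model', 'b')])
```

The base dictionary is left untouched, so the same parameter set can be reused with different overrides.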