ocrd_validators.page_validator module

API for validating OcrdPage.

exception ocrd_validators.page_validator.ConsistencyError(tag, ID, file_id, actual, expected)[source]

Bases: Exception

Exception representing a consistency error in textual transcription across levels of a PAGE-XML. (Element text strings must be the concatenation of their children’s text strings, joined by whitespace.)

Construct a new ConsistencyError.

Parameters:
  • tag (string) – Level of the inconsistent element (parent)

  • ID (string) – ID of the inconsistent element (parent)

  • file_id (string) – mets:id of the PAGE file

  • actual (string) – Value of parent’s TextEquiv[0]/Unicode

  • expected (string) – Concatenated values of children’s TextEquiv[0]/Unicode, joined by whitespace

exception ocrd_validators.page_validator.CoordinateConsistencyError(tag, ID, file_id, outer, inner)[source]

Bases: Exception

Exception representing a consistency error in coordinate confinement across levels of a PAGE-XML. (Element coordinate polygons must be properly contained in their parents’ coordinate polygons.)

Construct a new CoordinateConsistencyError.

Parameters:
  • tag (string) – Level of the offending element (child)

  • ID (string) – ID of the offending element (child)

  • file_id (string) – mets:id of the PAGE file

  • outer (string) – Coordinate points of the parent

  • inner (string) – Coordinate points of the child

exception ocrd_validators.page_validator.CoordinateValidityError(tag, ID, file_id, points, reason='unknown')[source]

Bases: Exception

Exception representing a validity error of an element’s coordinates in PAGE-XML. (Element coordinate polygons must have at least 3 points, and must not self-intersect or be non-contiguous or be negative.)

Construct a new CoordinateValidityError.

Parameters:
  • tag (string) – Level of the offending element (child)

  • ID (string) – ID of the offending element (child)

  • file_id (string) – mets:id of the PAGE file

  • points (string) – Coordinate points

  • reason (string) – description of the problem
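Part of the validity rule above can be illustrated without a geometry library. The following is a minimal sketch, not the module's implementation: `check_points` is a hypothetical helper, it assumes the usual PAGE `points` syntax (`x1,y1 x2,y2 ...`), and it covers only the point-count and negative-coordinate conditions. Self-intersection and contiguity checks are deliberately left out.

```python
def check_points(points):
    """Return a reason string if `points` is invalid, else None.

    `points` is assumed to be a PAGE-style string such as "0,0 10,0 10,10".
    Only two of the documented conditions are checked here.
    """
    pairs = []
    for token in points.split():
        x_str, y_str = token.split(',')
        pairs.append((int(x_str), int(y_str)))
    if len(pairs) < 3:
        return 'has too few points'
    if any(x < 0 or y < 0 for x, y in pairs):
        return 'is negative'
    return None


print(check_points("0,0 10,0 10,10"))
print(check_points("0,0 10,0"))
```

A string like the one returned here is the kind of value that could be passed as `reason` when reporting the problem.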

ocrd_validators.page_validator.compare_without_whitespace(a, b)[source]

Compare two strings, ignoring all whitespace.
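As an illustration of what such a comparison amounts to, here is a minimal sketch using only the standard library. The name `equal_ignoring_whitespace` is hypothetical and the code is not taken from the module.

```python
import re


def _strip_whitespace(text):
    """Remove every run of whitespace (spaces, tabs, newlines) from `text`."""
    return re.sub(r'\s+', '', text)


def equal_ignoring_whitespace(a, b):
    """Return True if `a` and `b` are identical once whitespace is removed."""
    return _strip_whitespace(a) == _strip_whitespace(b)


print(equal_ignoring_whitespace("foo bar\n", "foobar"))
```

A comparison of this kind is useful when two transcriptions should count as matching even though they differ in spacing or line breaks.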

ocrd_validators.page_validator.page_get_reading_order(ro, rogroup)[source]

Add all elements from the given reading order group to the given dictionary.

Given a dict ro from layout element IDs to ReadingOrder element objects, and an object rogroup with additional ReadingOrder element objects, add all references to the dict, traversing the group recursively.
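The recursive traversal described above can be sketched with plain stand-in objects. Everything in this example is an assumption made for illustration: `RefNode` is a hypothetical replacement for the PAGE ReadingOrder element classes, and `collect_reading_order` only mirrors the documented behaviour, not the module's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RefNode:
    """Hypothetical stand-in for a ReadingOrder element (reference or group)."""
    region_ref: str
    children: List["RefNode"] = field(default_factory=list)


def collect_reading_order(ro: Dict[str, RefNode], group: RefNode) -> None:
    """Add every element below `group` to `ro`, keyed by layout element ID."""
    for elem in group.children:
        ro[elem.region_ref] = elem
        # Nested groups carry further references, so descend into them.
        if elem.children:
            collect_reading_order(ro, elem)


ro = {}
root = RefNode("page", [
    RefNode("r1"),
    RefNode("g1", [RefNode("r2"), RefNode("r3")]),
])
collect_reading_order(ro, root)
print(sorted(ro))
```

The dict is filled in place, which matches the description of adding references to a dictionary supplied by the caller.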

ocrd_validators.page_validator.make_poly(polygon_points)[source]

Instantiate a Polygon from a list of point pairs, or return an error string.

ocrd_validators.page_validator.make_line(line_points)[source]

Instantiate a LineString from a list of point pairs, or return an error string.

ocrd_validators.page_validator.validate_consistency(node, page_textequiv_consistency, page_textequiv_strategy, check_baseline, check_coords, report, file_id, joinRelations=None, readingOrder=None, textLineOrder=None, readingDirection=None)[source]

Check whether the text results of an element are consistent with the text results of its child elements, and whether the coordinates of an element are fully within the coordinates of its parent element.

ocrd_validators.page_validator.concatenate(nodes, concatenate_with, page_textequiv_strategy, joins=None)[source]

Concatenate nodes textually according to https://ocr-d.github.io/page#consistency-of-text-results-on-different-levels
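The idea behind the concatenation rule can be sketched as follows. This is a simplified illustration under stated assumptions: the joiner per level (newline for lines in a region, single space for words in a line, empty string for glyphs in a word) is taken from my reading of the linked guidelines, the names `JOINERS`, `expected_text` and `is_consistent` are hypothetical, and the `joins` argument is ignored entirely.

```python
# Assumed joiners per parent level, following the linked OCR-D guidelines.
JOINERS = {
    'TextRegion': '\n',  # text lines of a region
    'TextLine': ' ',     # words of a line
    'Word': '',          # glyphs of a word
}


def expected_text(parent_level, child_texts):
    """Concatenate the children's text with the joiner for `parent_level`."""
    return JOINERS[parent_level].join(child_texts)


def is_consistent(parent_level, parent_text, child_texts):
    """Return True if the parent's text equals the concatenated child text."""
    return parent_text == expected_text(parent_level, child_texts)


print(expected_text('TextLine', ['Hello', 'world']))
print(is_consistent('Word', 'abc', ['a', 'b', 'c']))
```

In this sketch, a mismatch between `parent_text` and the expected string corresponds to the `actual` and `expected` values carried by ConsistencyError.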

ocrd_validators.page_validator.get_text(node, page_textequiv_strategy='first')[source]

Get the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, return the string of the highest scoring result. For the strategy first, return the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, return the empty string.
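The selection rules above can be sketched with a small stand-in type. `TextResult` and `pick_text` are hypothetical names introduced for this example, and the handling of partially scored or partially indexed results is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextResult:
    """Hypothetical stand-in for one text result of an element."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text(results: List[TextResult], strategy: str = 'first') -> str:
    """Return the text chosen by `strategy`, or '' if there are no results."""
    if not results:
        return ''
    if strategy == 'best':
        scored = [r for r in results if r.conf is not None]
        if scored:
            return max(scored, key=lambda r: r.conf).text
    elif strategy == 'first':
        indexed = [r for r in results if r.index is not None]
        if indexed:
            return min(indexed, key=lambda r: r.index).text
    # No scores or indexes available: fall back to the first result.
    return results[0].text


results = [TextResult('b', index=1, conf=0.9), TextResult('a', index=0, conf=0.5)]
print(pick_text(results, 'first'), pick_text(results, 'best'))
```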

ocrd_validators.page_validator.set_text(node, text, page_textequiv_strategy)[source]

Set the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, set the string of the highest scoring result. For the strategy first, set the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, add a new one.

class ocrd_validators.page_validator.PageValidator[source]

Bases: object

Validator for OcrdPage.

static validate(filename=None, ocrd_page=None, ocrd_file=None, page_textequiv_consistency='strict', page_textequiv_strategy='first', check_baseline=True, check_coords=True)[source]

Validates a PAGE file for consistency, given either a filename, an OcrdFile, or an OcrdPage instance directly.

Parameters:
  • filename (string) – Path to PAGE

  • ocrd_page (OcrdPage) – OcrdPage instance

  • ocrd_file (OcrdFile) – OcrdFile instance wrapping OcrdPage

  • page_textequiv_consistency (string) – ‘strict’, ‘lax’, ‘fix’ or ‘off’

  • page_textequiv_strategy (string) – Currently only ‘first’

  • check_baseline (bool) – whether Baseline must be fully within TextLine/Coords

  • check_coords (bool) – whether *Region/TextLine/Word/Glyph must each be fully contained within Border/*Region/TextLine/Word, resp.

Returns:

report (ValidationReport) – Report on the validity
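A call to the validator might look like the sketch below. The wrapper `summarize_page_validation` is hypothetical, and the attributes read from the report (`is_valid`, `errors`, `warnings`) are assumptions about ValidationReport that this page does not document. The import happens inside the function so that defining the helper does not itself require OCR-D to be installed.

```python
def summarize_page_validation(path):
    """Validate the PAGE file at `path` and return (is_valid, messages)."""
    # Lazy import: only needed when the helper is actually called.
    from ocrd_validators import PageValidator

    report = PageValidator.validate(
        filename=path,
        page_textequiv_consistency='lax',
        page_textequiv_strategy='first',
        check_baseline=True,
        check_coords=True,
    )
    # Assumption: the report exposes is_valid plus lists of errors and warnings.
    messages = [str(entry) for entry in report.errors + report.warnings]
    return report.is_valid, messages
```

Passing `ocrd_page=` or `ocrd_file=` instead of `filename=` should work the same way when the PAGE data is already loaded.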