Conventions for PAGE
In addition to these conventions, refer to the PAGE API docs for extensive documentation on the PAGE XML format itself.
The preliminary media type of a PAGE document is
application/vnd.prima.page+xml, which MUST be used as the
MIMETYPE of a
representing a PAGE document.
One page in one PAGE
A single PAGE XML file represents one page in the original document.
<pc:Page> element MUST have an attribute
image which MUST always be the source image.
The PAGE XML root element
<pc:PcGts> MUST have exactly one
URL for imageFilename / filename
imageFilename of the
filename of the
<pg:AlternativeImage> element MUST be a URL. A local filename should be a
All URLs used in
filename MUST be referenced in a fileGrp
Original image as imageFilename
imageFilename attribute of the
<pg:Page> MUST reference the original
image and MUST NOT change between processing steps.
AlternativeImage for derived images
To encode images derived from the original image, the
should be used. Its
filename attribute should reference the URL of the
comments attribute should be one or more (separated by comma) terms of
the following list:
comments attribute of the
<pg:AlternativeImage> attribute should be used
AlternativeImage on sub-page level elements
For the results of image processing that changes the positions of pixels (e.g.
cropping, rotation, dewarping), an
AlternativeImage on page level together with the polygons of
recognized zones is not sufficient for accessing the section of the image that a region is based on,
since coordinates are always relative to the original image.
For such use cases,
<pg:AlternativeImage> may be used as a child of
Attaching text recognition results to elements
A PAGE document can attach recognized text to typographical units of
a page at different levels, such as block (
<pg:TextLine>), word (
<pg:Word>) or glyph (
To attach recognized text to an element
E, it must be encoded as
UTF-8 in a single
U within a
T must be the last element of
Leading and trailing whitespace (
U+000A) in the content of a
<pg:Unicode> is not significant and must be removed from the string by
To encode an actual space character at the start or end of the content
<pg:Unicode>, use a non-breaking space
Text recognition confidence
The confidence score describing the assumed correctness of the text recognition results in a
<pg:TextEquiv> can be expressed in an attribute
@conf as a float value between 0 and 1, where
0 means “certainly wrong” and
1 means “certainly correct”.
Attaching multiple text recognition results to elements
Alternative text recognition results can be expressed by using multiple
<pg:TextEquiv> wherever a single
<pg:TextEquiv> would be allowed. When
<pg:TextEquiv>, they each must have an attribute
an integer number unique per set of
<pg:TextEquiv> that allows ranking them
in order of preference.
@index of the first (preferred)
<pg:TextEquiv> must be
Consistency of text results on different levels
Since text results can be defined on different levels and those levels can be nested, text results information can be redundant. To avoid inconsistencies, the following assertions must be true:
- text of <pg:Word> must be equal to the text of all <pg:Glyph> contained within, concatenated with the empty string
- text of <pg:TextLine> must be equal to the text of all <pg:Word> contained within, concatenated with a single space (U+0020)
- text of <pg:TextRegion> must be equal to the text of all <pg:TextLine> contained within, concatenated with a newline (U+000A)
NOTE: “Concatenation” means joining a list of strings with a separator; no separator is added to the start or end of the resulting string.
These assertions are only to be enforced for the first
<pg:TextEquiv> of both
containing and contained elements, i.e. the only
<pg:TextEquiv> of an element
@index = 1 if multiple text
results are attached.
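The assertions above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes un-namespaced element names (as in the examples of this document), takes a space (U+0020) and a newline (U+000A) as the separators, and strips surrounding whitespace with `str.strip()`. The helper names `first_text` and `expected_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Separator used when joining the children of each container level
# (assumption of this sketch: " " for words, "\n" for lines).
SEPARATORS = {"TextRegion": "\n", "TextLine": " ", "Word": ""}
CHILD_TAG = {"TextRegion": "TextLine", "TextLine": "Word", "Word": "Glyph"}


def first_text(element):
    """Return the text of the preferred TextEquiv of `element`, or None."""
    equivs = element.findall("TextEquiv")  # direct children only
    if not equivs:
        return None
    # With several results the one with index="1" is preferred;
    # a single result may omit the index attribute.
    preferred = next((e for e in equivs if e.get("index") == "1"), equivs[0])
    return (preferred.findtext("Unicode") or "").strip()


def expected_text(element):
    """Join the first text results of the direct children of `element`."""
    children = element.findall(CHILD_TAG[element.tag])
    return SEPARATORS[element.tag].join(first_text(c) or "" for c in children)


word = ET.fromstring(
    "<Word>"
    '<Glyph><TextEquiv index="1"><Unicode>f</Unicode></TextEquiv>'
    '<TextEquiv index="2"><Unicode>t</Unicode></TextEquiv></Glyph>'
    '<Glyph><TextEquiv index="1"><Unicode>o</Unicode></TextEquiv></Glyph>'
    "<Glyph><TextEquiv><Unicode>o</Unicode></TextEquiv></Glyph>"
    "<Glyph><TextEquiv><Unicode>t</Unicode></TextEquiv></Glyph>"
    '<TextEquiv index="1"><Unicode>foof</Unicode></TextEquiv>'
    '<TextEquiv index="2"><Unicode>toot</Unicode></TextEquiv>'
    "</Word>"
)

is_consistent = first_text(word) == expected_text(word)
```

For this word, `first_text(word)` is `foof` while `expected_text(word)` is `foot`, so `is_consistent` is `False`.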
A consistency checker must support four levels of strictness:
- If any of the assertions fail for a PAGE document, an exception should be raised and the document should not be processed any further.
- If any of the assertions fail for a PAGE document, another comparison disregarding all whitespace shall be made. If this still fails, an exception should be raised and the document should not be processed any further.
- If any of the assertions fail for a specific element in a PAGE document, the text results of this element must be recreated by concatenating the text results of its child elements. This algorithm needs to be recursive, i.e. if any of the child elements is itself inconsistent, its text results must be recreated in the same way before concatenation.
These consistency checks are deliberately restrictive so that data which cannot be processed unambiguously is spotted. However, there are valid use cases where the “index-1-consistency” is too narrow, especially in post-correction with language models. For such use cases, it must be possible to disable the consistency validation altogether in the workflow.
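One possible shape for such a checker is sketched below in Python. The mode names `strict`, `lax`, `fix` and `off`, the `Node` class and the `InconsistentText` exception are hypothetical labels chosen for this illustration. A real checker would operate on PAGE elements and their first `<pg:TextEquiv>` rather than on this simplified tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a PAGE element and its first text result."""
    level: str                      # "region", "line", "word" or "glyph"
    text: str = ""
    children: list = field(default_factory=list)


# Separator used to join the children of each container level.
SEPARATORS = {"region": "\n", "line": " ", "word": ""}


class InconsistentText(Exception):
    """Raised when an element's text does not match its children."""


def strip_ws(s):
    """Remove all whitespace from `s`."""
    return "".join(s.split())


def check(node, mode="strict"):
    """Validate `node` (and repair it in "fix" mode); return its text."""
    if mode == "off" or not node.children:
        return node.text
    # Recurse first, so that "fix" repairs children before their parent.
    child_texts = [check(child, mode) for child in node.children]
    joined = SEPARATORS[node.level].join(child_texts)
    if node.text == joined:
        return node.text
    if mode == "fix":
        node.text = joined
        return node.text
    if mode == "lax" and strip_ws(node.text) == strip_ws(joined):
        return node.text
    raise InconsistentText(f"{node.level}: {node.text!r} != {joined!r}")
```

With a word whose text is `foof` and whose glyphs spell `foot`, `check(word, "strict")` raises, `check(word, "off")` returns the text untouched, and `check(word, "fix")` rewrites the word text from its glyphs.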
<Word>
  <Glyph>
    <TextEquiv index="1"><Unicode>f</Unicode></TextEquiv>
    <TextEquiv index="2"><Unicode>t</Unicode></TextEquiv>
  </Glyph>
  <Glyph>
    <TextEquiv index="1"><Unicode>o</Unicode></TextEquiv>
  </Glyph>
  <Glyph>
    <TextEquiv><Unicode>o</Unicode></TextEquiv>
  </Glyph>
  <Glyph>
    <TextEquiv><Unicode>t</Unicode></TextEquiv>
  </Glyph>
  <TextEquiv index="1"><Unicode>foof</Unicode></TextEquiv>
  <TextEquiv index="2"><Unicode>toot</Unicode></TextEquiv>
</Word>
In this example, the
<pg:Word> has text foof, but
the concatenation of the first text results of the contained
<pg:Glyph> elements is foot. As a result:
- Validation should raise an exception for inconsistency.
- Data consumers should assume the text result to be
Typographical information (type, cut etc.) must be documented in PAGE XML using the
See the PAGE documentation on TextStyle for all possible values.
The <TextStyle> element can be used in all relevant elements:
<Word>
  <TextStyle fontFamily="Arial" fontSize="17.0" bold="true"/>
  <!-- [...] -->
</Word>
The pg:TextStyle/@fontFamily attribute can list one or more font
families, separated by commas (,).
font-families    := font-family ("," font-family)*
font-family      := font-family-name (":" confidence)?
font-family-name := ["A" - "Z" | "a" - "z" | "0" - "9"]+
                  | '"' ["A" - "Z" | "a" - "z" | "0" - "9" | " "]+ '"'
confidence       := ("0" | "1")? "." ["0" - "9"]+
Font family names that contain a space must be quoted with double quotes (").
Clusters of typesets
Sometimes it is necessary to express not that an element is typeset in a specific font family, but that it is set in some font family from a cluster of related font groups.
For such typeset clusters, the
pg:TextStyle/@fontFamily attribute should be re-used.
This specification doesn’t restrict the naming of font families. However, we recommend choosing one of the following type group names where applicable:
Font families and confidence
Providing multiple font families means that the element in question is set in one of the font families listed.
It is not possible to declare that multiple font families are used in an element. Instead, data producers are advised to increase output granularity until every element is set in a single font family.
The degree of confidence in the font family can be expressed by concatenating
font family names with a colon (:) followed by a float between
0 (information is certainly wrong) and
1 (information is certainly correct).
If a font family is not suffixed with a confidence value, the confidence is
considered to be
<TextStyle fontFamily="Arial:0.8, Times:0.7, Courier:0.4"/>
<TextStyle fontFamily="Arial:.8, Times:0.5"/>
<TextStyle fontFamily="Arial:1"/>
<TextStyle fontFamily="Arial"/>
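A consumer might parse such attribute values along the following lines. This Python sketch is an illustration only: the function name `parse_font_families` is hypothetical, and the parser is deliberately more lenient than the grammar, accepting a confidence without a decimal point (such as `Arial:1`) because the examples above use that form. A missing confidence is returned as `None` and left to the caller to interpret.

```python
import re

# One entry: a bare or double-quoted name, optionally followed by ":<confidence>".
ENTRY = re.compile(
    r'\s*(?:"([A-Za-z0-9 ]+)"|([A-Za-z0-9]+))(?::([0-9]*\.?[0-9]+))?\s*$'
)


def parse_font_families(value):
    """Split a fontFamily value into (name, confidence) pairs."""
    result = []
    # Names may not contain commas, so a plain split is sufficient.
    for part in value.split(","):
        match = ENTRY.match(part)
        if match is None:
            raise ValueError(f"invalid font family entry: {part!r}")
        quoted, bare, conf = match.groups()
        confidence = float(conf) if conf is not None else None
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {part!r}")
        result.append((quoted or bare, confidence))
    return result


families = parse_font_families("Arial:0.8, Times:0.7, Courier:0.4")
# Pick the candidate with the highest stated confidence.
best = max(families, key=lambda pair: pair[1])
```

Sorting or taking the maximum by confidence, as in the last line, only works when every entry carries a confidence value.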
To model columns, use constructs in the
<pg:ReadingOrder> of the PAGE
A grid layout must be wrapped in a
<pg:OrderedGroup> with a
@caption that has the form column_<horizontal>_<vertical>, where
<vertical> is the number of columns and
<horizontal> is the number of rows.
<OrderedGroup caption="column_1_1"> <!-- the default: single column layout -->
<OrderedGroup caption="column_1_2"> <!-- two-column layout -->
<OrderedGroup caption="column_1_3"> <!-- three-column layout -->
<OrderedGroup caption="column_2_3"> <!-- three-column layout split in top and bottom -->
Regions that belong to the same column must be grouped within
<pg:OrderedGroupIndexed> with a caption that begins with column_<y>_<x>, where
<y> is the row position and
<x> is the column position (counting starts at 1).
<OrderedGroup caption="column_2_2"> <!-- two-column two-row layout -->
  <OrderedGroupIndexed caption="column_1_1">...</OrderedGroupIndexed> <!-- upper-left column -->
  <OrderedGroupIndexed caption="column_1_2">...</OrderedGroupIndexed> <!-- upper-right column -->
  <OrderedGroupIndexed caption="column_2_1">...</OrderedGroupIndexed> <!-- lower-left column -->
  <OrderedGroupIndexed caption="column_2_2">...</OrderedGroupIndexed> <!-- lower-right column -->
</OrderedGroup>
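The caption convention can be checked mechanically. The Python sketch below reads the captions as `column_<rows>_<columns>` for the grid and `column_<row>_<column>` (counting from 1) for each cell, which is how the examples above are annotated. The function names `parse_caption` and `validate_grid` are hypothetical.

```python
import re

CAPTION = re.compile(r"column_(\d+)_(\d+)")


def parse_caption(caption, exact=True):
    """Return the two numbers encoded in a column caption.

    With exact=False only the start of the caption has to match,
    since cell captions merely have to begin with the pattern.
    """
    match = CAPTION.fullmatch(caption) if exact else CAPTION.match(caption)
    if match is None:
        raise ValueError(f"not a column caption: {caption!r}")
    return int(match.group(1)), int(match.group(2))


def validate_grid(grid_caption, cell_captions):
    """Check that every cell position lies inside the declared grid."""
    rows, cols = parse_caption(grid_caption)
    for caption in cell_captions:
        y, x = parse_caption(caption, exact=False)
        if not (1 <= y <= rows and 1 <= x <= cols):
            raise ValueError(f"{caption!r} is outside {grid_caption!r}")
    return rows, cols


grid = validate_grid(
    "column_2_2",
    ["column_1_1", "column_1_2", "column_2_1", "column_2_2"],
)
```

A cell caption such as `column_3_1` inside a `column_2_2` grid would raise a `ValueError` in this sketch.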