Level 1

Particularities of the printing technology and typographical aspects are not taken into account and are not documented in the Ground Truth corpus. Normalization is carried out to a large extent. The following characters and conventions are normalized:
  • long-s to round-s
  • umlauts (e above a vowel) to äöüÄÖÜ
  • sz to ß
  • virgule (/) to comma
  • Quotation marks are converted to modern usage and are not differentiated
  • Separators are converted to modern usage and are not differentiated
  • the round-r in combination with c (ꝛc.) is resolved to etc.
  • The reproduction of spaces is limited to the separation of words.
  • Punctuation marks are always attached directly to the preceding word.
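
The rules listed above can be sketched as a small normalization function. This is only an illustration under several assumptions, not the procedure actually used for the corpus: the function name `normalize_level1` is hypothetical, the input is assumed to be Unicode text in which the long-s is U+017F, the superscript e is the combining character U+0364, the round-r is U+A75B and the virgule is a plain slash, and all double quotation marks are mapped to the ASCII `"`. The sz and separator rules are left out because the list does not say how those characters are encoded.

```python
import re
import unicodedata

# Assumption: every kind of double quotation mark collapses to ASCII '"'.
QUOTE_TABLE = str.maketrans({q: '"' for q in "\u201e\u201c\u201d\u00bb\u00ab"})


def normalize_level1(text: str) -> str:
    """Apply a subset of the Level 1 rules (hypothetical helper)."""
    # round-r + c (U+A75B followed by "c") is resolved to "etc"
    text = text.replace("\ua75bc", "etc")
    # long-s (U+017F) to round-s
    text = text.replace("\u017f", "s")
    # vowel + combining small e above (U+0364) to the precomposed umlaut
    text = re.sub(
        "([aouAOU])\u0364",
        lambda m: unicodedata.normalize("NFC", m.group(1) + "\u0308"),
        text,
    )
    # virgule to comma (assumes the virgule was transcribed as "/")
    text = text.replace("/", ",")
    # quotation marks are not differentiated
    text = text.translate(QUOTE_TABLE)
    # spaces only separate words: collapse runs of whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # punctuation is attached to the preceding word
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text


if __name__ == "__main__":
    sample = "Wa\u017f\u017fer / vnd  Bru\u0364der \ua75bc ."
    print(normalize_level1(sample))  # prints: Wasser, vnd Brüder etc.
```

The order of the steps matters in this sketch: the virgule is replaced before whitespace is collapsed, so that a virgule set off by spaces ends up as a comma attached to the preceding word.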