The OCR-D project

Jun 1, 2016

OCR-D is a coordination project that aims to further develop Optical Character Recognition (OCR) methods for historical prints.

Workflows and methods of automatic text recognition are investigated, described and, where necessary, optimised. A major goal is to lay the conceptual groundwork for transforming prints from the German-speaking countries of the 16th to 19th centuries into electronic full text.

The Herzog August Bibliothek Wolfenbüttel, the Berlin-Brandenburg Academy of Sciences and Humanities in Berlin, the Staatsbibliothek zu Berlin Preußischer Kulturbesitz and the Karlsruhe Institute of Technology are participating in this project. The Bayerische Staatsbibliothek was also involved until 31 August 2016. The project is supported by experts, scientists and libraries.

In recent years, academic libraries in particular have built up extensive collections of digitised images. Searchable full texts can be generated automatically from this image data using OCR procedures. The added value of digital full texts is indispensable in many disciplines today, especially in humanities research.

So far, however, access to the electronic full text is often not possible, or possible only in an inadequate form. Many historical holdings are available in digitised form through the “Verzeichnisse der im deutschen Sprachbereich erschienenen Drucke” (VD). However, the results of common OCR procedures on this material have so far been insufficient. In particular, old typefaces, especially Fraktur, are hardly recognised.

There is a need for development here, which we are identifying in OCR-D. We build on existing tools and investigations. By combining them in new ways, and in rare cases by developing new components, we intend to tailor the OCR process to VD prints. In doing so, we are looking for answers to current technical, information-science and organisational problems.

The project is funded by the German Research Foundation (DFG) and will run for three years until September 2018. In the first phase, needs will be identified and concepts for further development will be drawn up. The cooperation structure will be consolidated and continued in the second phase. In this phase, calls for proposals for pilot projects will be issued, enabling other institutions to participate. At every stage we welcome a lively exchange with colleagues from related projects and institutions as well as with service providers.

By the end of the overall project, a consolidated procedure for the OCR processing of digitised material from the printed German cultural heritage of the 16th to 19th centuries will have been developed.