Building a Pennsylvania German Thesaurus through the Correction of OCR Errors


  • Camilla Balsamo
  • Barbara Hans-Bianchi



digital humanities, Pennsylvania German, OCR, digitalization


The aim of our project is to build an online Thesaurus of Pennsylvania German. This North American minority language has been in contact with American English ever since its emergence around 1800 and still lacks a received standard, which poses a considerable linguistic challenge. The research has been carried out on a range of text-based sources. We opted for open-source software: first, we scanned the source texts and converted the images to text using OCR. Errors occur repeatedly in the conversion phase, due either to flecks on the original text or to the process of machine encoding. The errors vary in kind: non-word errors, word-boundary errors, punctuation failures, missing diacritical marks in the output, tokenization errors, and misrecognition of part-of-speech (POS). We are processing the text with the programming language Python, working on scripts that automatically fix errors without causing further issues. To write the correction algorithms, several techniques have been studied and developed. Some were purely statistical, such as minimum edit distance techniques, but most of the work relied on linguistics: transition probabilities are quite often language-dependent; they represent the probability that a given character (or sequence, or POS) is or is not followed or preceded by some other given character (or sequence, or POS).
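The two families of techniques named above can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names, the tiny word list, the `max_dist` threshold, and the add-one smoothing with an assumed alphabet size are all hypothetical choices made for illustration. The first function computes the minimum edit distance (Levenshtein) between two strings; the others estimate character-to-character transition probabilities from a word list and use the edit distance to propose a correction for an OCR token.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigram_model(words):
    """Count character bigrams; '^' and '$' mark word start and end."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        padded = f"^{word}$"
        for x, y in zip(padded, padded[1:]):
            pair_counts[(x, y)] += 1
            first_counts[x] += 1
    return pair_counts, first_counts


def transition_prob(model, x, y, alphabet_size=30):
    """Add-one smoothed estimate of P(y follows x); alphabet_size is an assumption."""
    pair_counts, first_counts = model
    return (pair_counts[(x, y)] + 1) / (first_counts[x] + alphabet_size)


def correct(token, lexicon, max_dist=2):
    """Return the closest lexicon entry, or the token itself if nothing is close."""
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: (edit_distance(token, w), w))
    return best if edit_distance(token, best) <= max_dist else token


# Illustrative word list only, not data from the project.
lexicon = ["Deitsch", "Haus", "Schul", "Buch"]
model = bigram_model(lexicon)

print(correct("Hans", lexicon))          # 'u' misread as 'n' by the OCR
print(transition_prob(model, "a", "u"))  # attested transition
print(transition_prob(model, "a", "n"))  # unattested transition
```

In this sketch the edit distance is language-independent, while the transition probabilities depend entirely on the word list they are estimated from, which is where the linguistic knowledge enters.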


