The Page-Building a Pennsylvania German Thesaurus through the Correction of OCR Errors


  • Camilla Balsamo
  • Barbara Hans-Bianchi



digital humanities, Pennsylvania German, OCR, digitalization


The aim of our project is to build an online Thesaurus of Pennsylvania German. This North American minority language has been in contact with American English ever since its emergence around 1800 and still lacks a received standard, which makes it quite a linguistic challenge. The research has been carried out on different text-based sources. We opted for open-source software: first, we scanned the source texts and converted the images to machine-readable text using OCR. Errors occur repeatedly in the conversion phase, due either to flecks in the original text or to the process of machine encoding. The errors vary: non-words, wrong word boundaries, punctuation failures, missing diacritical marks in the output, tokenization errors and misrecognition of part-of-speech (POS). We process the text with the programming language Python, working on scripts that fix errors automatically without causing further issues. In order to write the correction algorithms, several techniques have been studied and developed. Some were purely statistical, like minimum edit distance techniques, but most of the work relied on linguistics: transition probabilities are quite often language-dependent; they represent the probability that a given character (or sequence, or POS) is followed or preceded by some other given character (or sequence, or POS).
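The two kinds of technique named above, minimum edit distance and character transition probabilities, can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names, the boundary markers "^" and "$", and the four-word toy lexicon are assumptions made purely for illustration, and the probabilities are plain unsmoothed relative frequencies.

```python
from collections import Counter

# Hypothetical toy word list, invented for this illustration only.
LEXICON = ["deitsch", "haus", "buch", "schul"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def suggest(token, lexicon, max_distance=1):
    """Return the lexicon entry closest to token, or None if no entry
    lies within max_distance edits. Ties are broken alphabetically."""
    distance, word = min(
        ((edit_distance(token, w), w) for w in lexicon),
        default=(max_distance + 1, None),
    )
    return word if distance <= max_distance else None


def transition_probs(words):
    """Estimate P(next | current) for adjacent characters, with "^" and
    "$" marking word start and end (unsmoothed relative frequencies)."""
    pairs, firsts = Counter(), Counter()
    for w in words:
        padded = "^" + w + "$"
        for x, y in zip(padded, padded[1:]):
            pairs[(x, y)] += 1
            firsts[x] += 1
    return {pair: n / firsts[pair[0]] for pair, n in pairs.items()}


def unseen_transitions(token, probs):
    """List the character pairs of token never observed in training;
    such pairs are candidates for OCR misrecognition."""
    padded = "^" + token + "$"
    return [(x, y) for x, y in zip(padded, padded[1:]) if (x, y) not in probs]


if __name__ == "__main__":
    probs = transition_probs(LEXICON)
    print(unseen_transitions("deitscb", probs))  # flags the suspicious pairs
    print(suggest("deitscb", LEXICON))           # proposes a correction
```

In this toy lexicon the letter "c" is always followed by "h", so the pair "cb" in the misread token "deitscb" is flagged as unseen, and the closest entry, "deitsch", lies a single substitution away. A realistic setup would need a much larger lexicon and smoothed probabilities, and could apply the same idea to character sequences or POS tags.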

