The Page-Building a Pennsylvania German Thesaurus through the Correction of OCR Errors

Camilla Balsamo; Barbara Hans-Bianchi

doi:10.13136/2281-4582/2018.i12.393

Keywords:

digital humanities, Pennsylvania German, OCR, digitalization

Abstract

The aim of our project is to build an online Thesaurus of Pennsylvania German. This North American minority language has been in contact with American English ever since its emergence around 1800 and still lacks a received standard, which poses a considerable linguistic challenge. The research has been carried out on a range of text-based sources. We opted for open-source software: first, we scanned images of the source texts and converted them using OCR software. Errors occur repeatedly in the conversion phase, caused either by flecks on the original text or by the process of machine encoding. The errors are of several kinds: non-word errors, word-boundary errors, punctuation failures, missing diacritical marks in the output, tokenization errors and misrecognition of part of speech (POS). We process the text with the programming language Python, working on routines that fix errors automatically without introducing further issues. To write the correction algorithms, several techniques have been studied and developed. Some were purely statistical, such as minimum edit distance techniques, but most of the work relied on linguistics: transition probabilities are quite often language-dependent; they represent the probability that a given character (or sequence, or POS) is or is not followed or preceded by some other given character (or sequence, or POS).
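The two families of techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names, the three-word toy lexicon, the `^` and `$` boundary markers, the add-one smoothing and the distance threshold are all assumptions made for illustration only. The sketch computes a minimum edit distance (Levenshtein) between an OCR token and lexicon entries, and scores words by smoothed character transition probabilities estimated from the same lexicon.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertion, deletion and substitution cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def train_transitions(words):
    """Count character bigrams; '^' and '$' mark word start and end (assumed markers)."""
    pair_counts, first_counts, alphabet = Counter(), Counter(), set()
    for word in words:
        padded = f"^{word}$"
        alphabet.update(padded)
        for x, y in zip(padded, padded[1:]):
            pair_counts[(x, y)] += 1
            first_counts[x] += 1
    return pair_counts, first_counts, len(alphabet)


def transition_log_prob(word, pair_counts, first_counts, alphabet_size):
    """Sum of add-one smoothed log P(next char | current char) over the word."""
    padded = f"^{word}$"
    return sum(
        math.log((pair_counts[(x, y)] + 1) / (first_counts[x] + alphabet_size))
        for x, y in zip(padded, padded[1:])
    )


def correct(token, lexicon, max_distance=2):
    """Return the nearest lexicon entry within max_distance, else the token unchanged."""
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: (edit_distance(token, w), w))
    return best if edit_distance(token, best) <= max_distance else token


# Toy lexicon for illustration only; not taken from the project's data.
lexicon = ["Deitsch", "Buch", "Haus"]
model = train_transitions(lexicon)

print(correct("Deitscb", lexicon))  # prints "Deitsch"
print(transition_log_prob("Deitsch", *model) > transition_log_prob("Deitscb", *model))
```

In this sketch the edit distance proposes candidates and the transition score can be used to rank them or to flag unlikely character sequences. A language-specific model of this kind would need to be trained on verified Pennsylvania German text rather than on a toy word list.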
