The pilot GerManC project: newspaper texts
The initial stage of the project ("GerManC") consisted of a twelve-month pilot study with two aims: to collect a sample corpus of around 100,000 words selected from newspapers of the period, and to develop analytical tools to facilitate linguistic investigation. It took 2000-word samples from five regions (North German, West Central German, East Central German, West Upper German, East Upper German) within three periods of fifty years (1650-1700, 1701-1750 and 1751-1800). Three samples were taken from each period for each region, i.e. 45 samples in all, giving a total size for the pilot corpus of around 100,000 words. This pilot was successfully completed in April 2007. The GerManC corpus of newspaper texts is available through the Oxford Text Archive and is also accessible on this site (see below, under "Corpus").
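The sampling design described above can be written out as a small grid. The following is only an illustrative sketch: the region and period labels come from the text, while the function and variable names are invented here and are not part of any GerManC tooling. With a nominal 2000 words per sample, the grid yields 45 samples and 90,000 words, slightly below the rounded figure of 100,000 words given for the pilot corpus.

```python
from itertools import product

# Regions and periods as listed in the project description.
REGIONS = [
    "North German",
    "West Central German",
    "East Central German",
    "West Upper German",
    "East Upper German",
]
PERIODS = ["1650-1700", "1701-1750", "1751-1800"]
SAMPLES_PER_CELL = 3      # three samples per region and period
WORDS_PER_SAMPLE = 2000   # nominal sample length


def sampling_grid():
    """Return one (region, period, sample_number) tuple per planned sample."""
    return [
        (region, period, n)
        for region, period in product(REGIONS, PERIODS)
        for n in range(1, SAMPLES_PER_CELL + 1)
    ]


grid = sampling_grid()
nominal_words = len(grid) * WORDS_PER_SAMPLE
print(len(grid), nominal_words)  # prints: 45 90000
```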
The texts for the GerManC corpus were digitized by Astrid Ensslin, Alan Scott (a recent postgraduate student now working as a post-doctoral researcher on the Germanic possessive -s project in Manchester) and Martin Durrell, using the double-keying technique: each text was keyed in by two individuals and the results were compared electronically to eliminate errors. Astrid Ensslin annotated the texts in a way compatible with the English corpora and other similar projects. In particular, they were fully marked up in XML (Extensible Markup Language) in accordance with the TEI (Text Encoding Initiative) Lite guidelines.
Paul Bennett and Astrid Ensslin developed a set of analytical tools, notably to tag the corpus for part of speech. Progress was also made in developing software to identify some of the grammatical features of each word and to perform lemmatization, i.e. associating each word with its lexeme and so compiling a dictionary of all the words in the corpus.
The resulting corpus is already an invaluable resource for the study of the development of German, including in comparison with other languages, and the completed GerManC Plus corpus will ultimately form part of a chain of historical corpora of German from the earliest times to the present day. To this end the team has worked closely with the Deutsch Diachron Digital project for a large-scale historical corpus of German, initiated at the Humboldt University in Berlin, and, through collaboration with the Institute for German Language (IDS) in Mannheim, with the German-based TextGrid project for the development of an integrated electronic resource grid to support research in the humanities.
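The electronic comparison step in double-keying can be sketched as follows. This is a minimal illustration, not the project's actual procedure: it assumes the two keyings are available as plain strings and uses Python's standard `difflib` module to report word-level disagreements for a human to resolve. The function name and the two sample strings are invented for the example and are not taken from the corpus.

```python
import difflib


def keying_discrepancies(version_a: str, version_b: str):
    """Return (tag, words_a, words_b) for each stretch where two keyings differ."""
    words_a = version_a.split()
    words_b = version_b.split()
    # autojunk=False so that frequent words are never ignored as "junk".
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample strings: the second keying contains one typing error.
first = "Die Zeitung aus Hamburg vom Jahre 1701"
second = "Die Zeitnng aus Hamburg vom Jahre 1701"

for tag, a, b in keying_discrepancies(first, second):
    print(tag, a, b)  # prints: replace ['Zeitung'] ['Zeitnng']
```

Comparing at word level keeps each reported discrepancy small enough to check against the source page. Because independent typists rarely make the same mistake in the same place, passages on which both keyings agree can be accepted with reasonable confidence.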