School of Languages, Linguistics and Cultures

ARCHER: A Representative Corpus of Historical English Registers

ARCHER 3.2

The current phase of the project, ARCHER 3.2, is intended to enhance the usefulness of the corpus in a number of ways. Work began in the period 2008-11 and is continuing in 2011-13. David Denison and Nuria Yáñez-Bouza are coordinating the work being carried out in the international consortium and will report on progress on this site. We held a British Academy Small Research Grant from 1 June 2010 to 31 December 2011 (extended to 1 June 2012) to help fund some of the work at Manchester. For a graphic summary of planned improvements, see Nuria Yáñez-Bouza's poster (based on the one presented at ICAME 32 and now updated to May 2012).

New texts

Consortium members are contributing new texts to help fill the gaps in coverage in national varieties, genres and periods.

Manfred Krug, now at the University of Bamberg, and Anna Rosen have sent materials from American journals and letters for the period 1800-1849. Bamberg have compiled some further American journals for the periods 1750-1999.

Anne Curzan at the University of Michigan, together with former colleague Chris Palmer and the late Richard W. Bailey, has sent in American science texts 1750-1999.

Marianne Hundt and Pius Meyer at the University of Zürich have sent in British and American sermons (some of them based on books in the John Rylands Library in Manchester). Zürich have rescued a number of British files in advertising and American files in news which had been compiled during the ARCHER-2 phase but never included in the corpus.

Arja Nurmi and Matti Rissanen at the University of Helsinki and Merja Kytö at Uppsala University have identified editions for letters to augment the existing texts. These have been transcribed under the supervision of Christian Mair at Freiburg.

The Research Unit on Variation, Linguistic Change and Grammaticalization at the University of Santiago de Compostela have sent in British legal texts for all periods from 1600 to 1999.

It has been decided to split Journals and Diaries into two separate genres. Nuria Yáñez-Bouza at Manchester has taken charge of assigning existing texts to the appropriate genre and has compiled more materials with the help of assistants at Manchester and Zurich, in addition to the contribution by the Bamberg team.

Manfred Krug's team has collected, typed and proof-read prose and drama texts from 1600-1649. These were not included in ARCHER 3.1, mainly because of the many spelling issues and because genre labels are difficult to assign for these early periods. From versions 1 and 2 we have restored the following American texts: legal texts 1750-1999; advertising 1750-1999; drama 1800-1849 and 1900-1949; fiction 1800-1849 and 1900-1949.

Text categories

Tagging

The principal UK contribution is to add morphological tagging to the entire corpus, both existing texts and new additions. Paul Rayson at Lancaster and Nick Smith at Salford will carry out the automatic phase of the tagging. The tagset will be broadly compatible with that used in the British National Corpus (BNC). The corpus, both tagged and untagged, will be in XML form; Marianne Hundt and Gerold Schneider at Zurich are supervising initial conversion to XML. ARCHER is essentially an original-spelling corpus, albeit mostly based on editions. A search facility that allows use of modern-spelling equivalents may be added, though not in version 3.2.

Correction

There has been a great deal of work, continuing over summer 2012, to improve the accuracy of the corpus text, the consistency of file structure and mark-up, and the bibliographic documentation.

 

Page last updated 16 May 2012.