Computer resources for linguists
Data sources for historical English
Note
- Most historical texts held at Manchester are stored together on one fileserver. You can get to them very easily by using MonoConc Pro, or more long-windedly by navigating using Windows Explorer or My Computer to \\uk-ac-man-ss7\VOL3\shared\hum\LLC. The folder English_corpora contains subfolders with a large number of texts suitable for reading in any Windows program or processing with MonoConc, etc.
- The subfolders are subdivided in a rough-and-ready way by period: OE, ME, eModE, lModE, PDE.
- Manuals (many of them from the 1999 ICAME CD) are stored in the folder corpora-manuals under English_corpora. Click on the link to see the ICAME manuals. There is also a shortcut on the Linguistic Resources menu.
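For anyone who prefers scripting to MonoConc, the period subfolders described above can also be searched with a few lines of Python. What follows is only a minimal sketch of a keyword-in-context search, not a replacement for a real concordancer. The function names, the `*.txt` file pattern and the Latin-1 encoding are assumptions made for illustration; the actual files under English_corpora may use other extensions and encodings.

```python
import re
from pathlib import Path

# Period subfolders named in the note above.
PERIODS = ["OE", "ME", "eModE", "lModE", "PDE"]


def kwic(text, word, width=30):
    """Return keyword-in-context lines for whole-word, case-insensitive matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        # Take up to `width` characters either side; flatten line breaks.
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        hits.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return hits


def concordance_for_period(root, period, word, glob="*.txt", encoding="latin-1"):
    """Search every matching file under root/period (hypothetical layout).

    The file pattern and encoding are assumptions, not facts about the
    fileserver. Returns a list of (file name, KWIC line) pairs.
    """
    folder = Path(root) / period
    if not folder.is_dir():
        return []
    results = []
    for path in sorted(folder.rglob(glob)):
        text = path.read_text(encoding=encoding, errors="replace")
        results.extend((path.name, line) for line in kwic(text, word))
    return results
```

On a machine that can see the fileserver, a call such as `concordance_for_period(r"\\uk-ac-man-ss7\VOL3\shared\hum\LLC\English_corpora", "ME", "kyng")` would return one pair per hit. Spelling variation in the older periods means a single word form will miss many examples, so a regular expression or a list of variants is usually needed in practice.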
Corpora
- ICAME Corpus Collection on CD-ROM
Mostly copied to fileserver April 2003. This contains the following historical collections, samples of which can also be accessed on-line, together with various concordance programs. (See also PDE corpora.)
- Helsinki Corpus (OE, ME and eModE)
DD and NYB have printed copies of the manual, also available online.
- Helsinki Corpus of Older Scots
- Corpus of Early English Correspondence sampler (1418-1680)
- Zurich English Newspapers (1671-1791)
- The Lampeter Corpus of Early Modern English Tracts (1641-1732)
- Complete Toronto Corpus of Old English (OE)
A convenient local version, from ca 1994, is available through MonoConc Pro or other concordancers in the OE folder of English_corpora; it is not quite complete or up-to-date, however. We are subscribed to the current online version, available on campus or using the VPN, from JRUL databases under 'Dictionary of Old English Corpus'. For a handy, hypertext index of texts and editions, click here; an older, downloadable index at M.S.H.S. (Poitiers) categorises texts by date, dialect and whether verse or prose.
- The Penn-York-Helsinki series of parsed historical corpora
These corpora all use the CorpusSearch 2 software, a command-line search engine with a very specific syntax. Once mastered, however, it enables extremely well-focused grammatical searches to be made. There are two shortcuts under Linguistic Resources: one opens the Help pages in your browser, the other opens a window in which you type the commands to run the program CorpusSearch 2, specifying the query, the input files and the output file. Our implementation is PC-based, but CorpusSearch 2 is a Java program which can run on Macs too. The corpora are stored in the most appropriate folder of English_corpora.
- The York-Toronto-Helsinki Parsed Corpus of Old English Prose
100,000 words of OE prose texts selected from the excerpts in the Helsinki Corpus. Acquired December 2003.
- The York-Helsinki Parsed Corpus of Old English Poetry
71,000 words of OE poetic texts selected from the excerpts in the Helsinki Corpus. Acquired August 2009.
- PPCME2: Penn-Helsinki Parsed Corpus of Middle English (2nd edition)
1.3 million words. Licence fee paid June 2002. Installed Feb 2003.
- PPCEME: Penn-Helsinki Parsed Corpus of Early Modern English
1.8 million words. Licence fee paid December 2005, acquired March 2011.
- PPCMBE: Penn-Helsinki Parsed Corpus of Modern British English
950,000 words, 1700-1914. Acquired March 2011.
- PCEEC: The Parsed Corpus of Early English Correspondence
2.2 million words (c. 1410-1681). Acquired August 2009.
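As a rough illustration of the CorpusSearch 2 workflow described above (a query, one or more input files, and an output file), the sketch below builds a small query file and the matching command line in Python. Everything specific here is an assumption to be checked against the CorpusSearch Help pages: the `node:`/`query:` layout and the `iDoms` search function are given as recalled from the CorpusSearch documentation, and the jar name, the `csearch/CorpusSearch` class path, the `-out` flag and all file names are hypothetical placeholders rather than a description of the local installation.

```python
def format_query(node, query):
    """Return the text of a two-line query file.

    Assumed layout: a 'node:' line naming the search domain, then a
    'query:' line with the search expression in parentheses.
    """
    return f"node: {node}\nquery: ({query})\n"


def build_command(jar, query_file, input_files, output_file):
    """Build (but do not run) a command line for CorpusSearch 2.

    The argument order and the '-out' flag are assumptions; check them
    against the Help pages before relying on this.
    """
    return [
        "java", "-classpath", str(jar), "csearch/CorpusSearch",
        str(query_file),
        *[str(f) for f in input_files],
        "-out", str(output_file),
    ]


# Hypothetical example: clauses (IP*) that immediately dominate a subject NP.
query_text = format_query("IP*", "IP* iDoms NP-SBJ*")
command = build_command(
    "CS_2.jar",            # placeholder jar name
    "subjects.q",          # placeholder query file
    ["sample.psd"],        # placeholder parsed input file
    "subjects.out",        # placeholder output file
)
```

The command is returned as a list so that, once the details have been verified, it can be passed to `subprocess.run(command, check=True)` after `query_text` has been saved to the query file.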
- Glossarial Database of Middle English (Chaucer and Gower)
DD never managed to get an ancient standalone version to work, but the online version may work better.
- MEG-C: The Middle English Grammar Corpus (Stavanger and Glasgow)
'The Middle English Grammar Corpus (MEG-C) consists of samples of Middle English texts, transcribed from manuscript or facsimile reproduction. Shorter texts are included in their entirety, and longer ones in 3000-word samples. In the first instance, we include texts localised in the Linguistic Atlas of Late Mediaeval English, from the period 1350-1500. [...] However, the Corpus will eventually also cover earlier texts, as well as texts showing non-regional varieties of Middle English.'
- Corpus of Middle English Prose and Verse (Michigan)
146 texts, packaged with searchable bibliographic information and software that permits simple, proximity and Boolean searches of the texts. Available online from JRUL E-Resources under 'Middle English Compendium'.
- Middle English Medical Texts (MEMT) (John Benjamins, based on work done in Helsinki)
This is "an electronic corpus including 86 texts and 495,322 words from three traditions of medical writing (surgical treatises, specialized texts, and remedy books) from 1375 to 1500, and an appendix of recipes from c. 1330" (from JB website). Available only in the Main Library as a standalone CD, details under JRUL E-Resources.
- ICAMET: Innsbruck Computer Archive of Machine-Readable English Texts
We have the sampler version of the Prose Corpus 1100-1500, 108 works (in 131 files, about 4 million words) and the Letter Corpus 1386-1688 (containing 254 complete letters from different sources, arranged diachronically). DD is authorised to supply the texts and manual (in pdf form) to users from the Department on receipt of a signed copy of the 'Declaration of fair academic use' for sending to Professor Manfred Markus, who has kindly provided the CDs.
- A Corpus of Irish English
From Raymond Hickey's Irish English Resource Centre (under Surveys/data): 'The corpus gathers together the main documents for the English language in Ireland throughout its history. These begin in the early 14th century and continue up to the present-day. There are various genres represented in the corpus, reflecting the diversity of text types to be found in the history of Irish English: poetry, glossaries, sketches and full-length plays.' Available in English_corpora\lModE since 25 Nov 2005.
- ARCHER: A Representative Corpus of Historical English Registers (1650-1990), version 3.1
1.8 million words of British and American English in a number of genres. Available from David Denison for personal use only on signing of user agreement. For copyright reasons, the corpus can only be used at Manchester, Salford or Lancaster in the UK or at specific universities in the USA, Germany, Sweden, Spain and Finland. There is a tagged version of some files. NYB and DD are actively involved in correcting, documenting and tagging ARCHER 3.1 and in adding new data for ARCHER 3.2.
- English language of the north-west in the late Modern English period (1761-90)
300,000 words of plain, local letters. Available either as prettily formatted HTML or as a text file. Send access request form to DD.
- Corpus of late Modern English Prose (1861-1919)
100,000 words of informal educated prose. Available from DD.
- CLMET: The Corpus of Late Modern English Texts (1710-1920)
10 million words gathered from Project Gutenberg and the Oxford Text Archive by Hendrik De Smet (Leuven). Available in English_corpora\lModE since 11 Jan 2005.
- COHA: The Corpus of Historical American English (early 1800s to 2000s)
400 million words. Uses same interface as COCA.
- Nineteenth Century U.S. Newspapers Digital Archive
Available online via JRUL Electronic Resources.
- The Google News Archive search
A very useful site for recent and older newspaper text, allowing you to search for any word, phrase or combination, and within specified date limits.
- Time Magazine, 1923-2006 (100+ million words)
Can provide a large source of very recent data. Uses same interface as COCA.
- The Salamanca Corpus
(English dialects in literature 1500-1950)
'Consisting of documents representative of literary dialects and dialect literature, the Salamanca Corpus has been conceived as an electronic repository of diachronic dialect material which might bridge some of the gaps still existing in the field. It aims to cover a time span of no less than four centuries (c.1500-c.1950), thereby presenting documents in which dialect traits from pre-1974 English counties are documented.'
Literary texts and collections
We have at least the following in-house (stored in the appropriate folder in English_corpora) and can get others for you from the Oxford Text Archive and elsewhere:
- Letters of Jane Austen
- Austen novels
- Milton texts
- Paston Letters of the 15th Century
- Lollard Sermons
- Chaucer's Boece and Treatise on the Astrolabe
- Layamon's Brut
- etc.
There are hundreds of texts - of varying degrees of reliability and legitimacy - available over the internet. Try one of these sites (EEBO is British-based, Literature Online Anglo-American, the rest American-based):
- Middle English Compendium
Access to Corpus of ME Prose and Verse, ME Dictionary and bibliography - available via JRUL E-Resources.
- Early English Books Online (EEBO)
Available via JRUL E-Resources, containing 'most of the books printed in the English Language between 1453 and 1700 in full-text'.
- Eighteenth Century Collections Online (ECCO)
'Every significant English-language and foreign-language title printed in the United Kingdom and beyond in the period 1700-1800, and with multiple full-text search options across all 33 million pages', available via JRUL E-Resources.
- JISC Historic Books allows semantic search of EEBO, ECCO and 19th-century British Library books from a single interface.
- Literature Online (LION)
Available via JRUL E-Resources, containing such Chadwyck-Healey databases as the Bible in English, Early English Prose Fiction, English Drama, English Poetry (searchable by word or phrase).
- Bibliomania
- Great Books Online (searchable by word or phrase)
- Project Gutenberg
- Literary Resources on the Net (Jack Lynch)
- Voice of the Shuttle
Other resources
- Rylands Medieval Collection
'The Special Collections Division of the John Rylands University Library [...] holds outstanding collections of rare books, manuscripts and archives from the Middle Ages. Our Medieval Collection contains complete works of paramount importance in a variety of key subject areas, including History, Theology, Art, Literature, Language, and the History of Science and Medicine.
The basis of the collection is a project, started in October 2008, to digitise and describe our collection of over forty Middle English manuscripts. The collection illustrates progress so far, including a medieval cookery book.'
The images are of very high resolution indeed. Undergraduate and postgraduate students may be able to use them as the basis of a dissertation project under the supervision of Nuria Yáñez-Bouza or David Denison.
- Parker Library on the web
'Corpus Christi College and the Stanford University Libraries welcome you to Parker on the Web - an interactive, web-based workspace designed to support use and study of the manuscripts in the historic Parker Library at Corpus Christi College, Cambridge.' The extraordinarily rich medieval manuscript collection at CCCC.
This page last updated 8 Apr 2012.
