Computer resources for linguists
Data sources for languages other than English
Note
This is an experimental section of the site, which was previously largely confined to English-language resources. Suggestions for additions and updates are welcome.
Corpora
Danish
- Ordnet is the home of the following corpus, plus two dictionaries listed in the next section:
- KorpusDK
A corpus of modern written Danish (56 million tokens).
German
- Schweizer Text Korpus
From announcement on Corpora-List 20 August 2008: "The project aims at constructing a digital corpus of 20th century standard German in Switzerland. It will contain a wide variety of texts of different genres and on different topics. The corpus is going to be an integral part of a distributed corpus developed in cooperation with Austrian, German and South Tyrolian partner projects, on the one hand. On the other hand, it will serve specifically Swiss research interests in lexicography and other linguistic fields."
Romani
- Romani Morpho-Syntax Database
The database is interactive and has a mapping function. The Romani Project is based here in the Department. Its home page says that it "provides information on the Romani language and on linguistic research on Romani".
Swedish
- Corpora at the Department of Linguistics, Göteborg University
This page provides links to a number of corpora (mainly Swedish) and information about others. Information about transcription methods, coding and software connected with the corpora is also provided here.
- Göteborg Spoken Language Corpus
The GSLC is a corpus of transcribed spoken Swedish taken from a variety of social activities (1.4 million tokens). From the description on the corpus's homepage: 'Based on the fact that spoken language varies considerably in different social activities with regard to pronunciation, vocabulary and grammar, the goal of the corpus is to include spoken language from as many social activities as possible.' Registration is necessary.
- Språkbanken (The Language Bank)
This site provides access to various written corpora of Swedish (contemporary and historical) as well as a number of dictionaries and databases. All can be found via the central Språkbanken site; a generic search interface for all the corpora can be accessed here. Two of the most useful corpora are given below, and three dictionaries in the next section:
- the Parole corpus
Ca. 19 million tokens of written Swedish tagged for word class.
- the Stockholm Umeå Corpus Version 2.0
SUC 2.0; ca. 300,000 tokens of written Swedish.
Miscellaneous
- LDC-Online
A large corpus at the Linguistic Data Consortium, including an archive of news text in Arabic and Chinese, available via the Electronic Resources service of JRUL under Databases, "L". A special username or password is needed for this resource; ask PB or DD, or ask in the Library.
Dictionaries
Danish
- Den Danske Ordbog
A corpus-based dictionary of modern Danish (to appear during 2009).
- Ordbog over det danske Sprog
A dictionary of Danish from the period 1700-1950.
Swedish
Links
- OLAC Language Resources Catalogue. 'This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.'
- GerManC
The project, located in the School, aims "to compile a representative historical corpus of written German for the years 1650-1800".
- Discover Irish website, Raymond Hickey
This page last updated 3 Jun 2011.
