Language Corpora
Welcome to Corpus Resources at the Department of Linguistics!
The Department maintains over 150 searchable language corpora, totaling around 8 billion words of annotated text in languages as diverse as Coptic, Hebrew, Japanese, and Mandarin Chinese. We are also members of the Linguistic Data Consortium, so students and faculty can access a broad range of datasets free of charge on request.
We offer access to searchable corpora through our online interfaces. Some datasets are openly available, some require GU logins due to licensing restrictions, and some have additional special restrictions. To browse our already indexed corpora, check out these interfaces:
- CQP Web Interface for part-of-speech tagged corpora:
- ANNIS interface for richly annotated treebanks and other corpora:
- Multilingual parallel corpora:

A corpus search example
For assistance with corpus resources, please contact the department’s Corpus Resources Coordinator.