Language Corpora
Welcome to Corpus Resources at the Department of Linguistics!
The Department maintains over 150 searchable language corpora, totaling around 8 billion words of annotated text in languages as diverse as Coptic, Hebrew, Japanese, and Mandarin Chinese. We are also a member of the Linguistic Data Consortium (https://www.ldc.upenn.edu/), so students and faculty can access a broad range of datasets free of charge on request.
We offer access to searchable corpora through our online interfaces. Some datasets are openly available, some require GU logins due to licensing restrictions, and some have additional special restrictions. To browse the corpora we have already indexed, use these interfaces:
- CQP Web Interface for part-of-speech tagged corpora: https://gucorpling.org/cqp/
- ANNIS interface for richly annotated treebanks and other corpora: https://gucorpling.org/annis-corpora/
- Multilingual parallel corpora: https://gucorpling.org/paravoz/
For assistance with corpus resources, please contact the department’s Corpus Resources Coordinator at corpora@georgetown.edu.