Language Corpora

Welcome to Corpus Resources at the Department of Linguistics!

The Department maintains over 150 searchable language corpora, totaling around 8 billion words of annotated text in a variety of languages, as diverse as Coptic, Hebrew, Japanese and Mandarin Chinese. We are also members of the Linguistic Data Consortium (https://www.ldc.upenn.edu/ ), meaning students and faculty have access to a broad range of datasets for free on request.

We offer access to searchable corpora using our online interfaces – some datasets are openly available, some require GU logins due to licensing restrictions, and some have additional special restrictions. To browse our already indexed corpora check out these interfaces:

CQP Web Interface for part-of-speech tagged corpora: https://gucorpling.org/cqp/
ANNIS interface for richly annotated treebanks and other corpora: https://gucorpling.org/annis-corpora/
Multilingual parallel corpora: https://gucorpling.org/paravoz/

For assistance with corpus resources, please contact the department’s Corpus Resources Coordinator at corpora@georgetown.edu .