Language Corpora

Welcome to Corpus Resources at the Department of Linguistics!

The Department maintains over 150 searchable language corpora, totaling around 8 billion words of annotated text in a variety of languages, as diverse as Coptic, Hebrew, Japanese and Mandarin Chinese. We are also members of the Linguistic Data Consortium (https://www.ldc.upenn.edu/ ), meaning students and faculty have access to a broad range of datasets for free on request. 

We offer access to searchable corpora using our online interfaces – some datasets are openly available, some require GU logins due to licensing restrictions, and some have additional special restrictions. To browse our already indexed corpora check out these interfaces:

A corpus search example

For assistance with corpus resources, please contact the department’s Corpus Resources Coordinator at corpora@georgetown.edu .