Georgetown University Multilayer (GUM) corpus

What is GUM?

GUM is the Georgetown University Multilayer corpus, a collection of richly annotated digital texts used for linguistic research and for building Natural Language Processing applications. The corpus is built and expanded each year by Georgetown students as part of the course LING-4427, Computational Corpus Linguistics. The texts are chosen by the students and currently include interviews, news, travel guides, how-to guides, biographies, short stories, Reddit forum discussions, academic writing, conversations, textbooks, political speeches, and YouTube CC vlogs.

What kinds of analyses do we apply?

Every document in GUM is analyzed on many levels by the student responsible for it. We annotate document structure in XML (paragraphs, headings, lists, images), split the text into words, add part-of-speech tags, and build a dependency syntax tree for each sentence. We also annotate every entity mentioned in the document for type (person, place, etc.), information status, and coreference (where else is it mentioned, and how?), link each entity to its corresponding entry in Wikipedia, and finally add a discourse parse of the document using Rhetorical Structure Theory.
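
To give a rough idea of what working with the word, part-of-speech, and dependency layers can look like, here is a minimal Python sketch. It assumes the syntax layer is available in the tab-separated CoNLL-U layout used by Universal Dependencies; the sample sentence is made up for illustration and is not taken from GUM, and the names Token, parse_sentence, and dependency_edges are hypothetical helpers, not part of any GUM tooling.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position of the word in the sentence
    form: str     # the word as it appears in the text
    upos: str     # universal part-of-speech tag
    head: int     # position of the syntactic head (0 means root)
    deprel: str   # dependency relation to the head


# Illustrative sentence in CoNLL-U layout (invented, not from the corpus).
SAMPLE = """\
# text = Students annotate texts.
1\tStudents\tstudent\tNOUN\tNNS\t_\t2\tnsubj\t_\t_
2\tannotate\tannotate\tVERB\tVBP\t_\t0\troot\t_\t_
3\ttexts\ttext\tNOUN\tNNS\t_\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""


def parse_sentence(block: str) -> list[Token]:
    """Read the word lines of one CoNLL-U sentence block."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        tokens.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    return tokens


def dependency_edges(tokens: list[Token]) -> list[tuple[str, str, str]]:
    """Return (head word, relation, dependent word) for every token."""
    forms = {t.idx: t.form for t in tokens}
    forms[0] = "ROOT"
    return [(forms[t.head], t.deprel, t.form) for t in tokens]


tokens = parse_sentence(SAMPLE)
for head, rel, dep in dependency_edges(tokens):
    print(f"{head} --{rel}--> {dep}")
```

The other layers described above (XML document structure, entities and coreference, discourse parses) are not covered by this sketch and would need their own readers.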

The data, as well as more information about the analyses and how students can contribute, can be found on the GUM website.