Metalinguistic Natural Language Understanding

Principal Investigator: Nathan Schneider

How people use language is a topic of study in a variety of fields, including linguistics, literature, law, and language education. Extensive descriptions of many languages already exist in natural language itself (e.g., grammar textbooks, writing advice, and linguistics articles). Descriptions in (for example) English may include example sentences supplemented with technical terminology and formal notation. This project will develop natural language processing (NLP) algorithms to process and mine these textual resources so that language analysts can better detect and synthesize patterns of interest. First, algorithms will be developed to recognize where a piece of text is commenting on the meaning of a word, or giving an example of how it could be used. Second, algorithms for enriching text with technical descriptions will be improved to report better estimates of their own strengths and weaknesses. Finally, capabilities for retrieving specific uses of an ambiguous word in a large text collection will be developed to aid analysts. The algorithmic contributions in this project are expected to have direct application to technologies in various fields where close analysis of language in text collections is crucial, including law and linguistics. On a wider scale, these capabilities have the potential to be transformative for artificial intelligence (AI), allowing humans and machines to teach each other explicitly about how language works, to deftly access scholarly work about language, and to give and interpret language advice (e.g., writing assistance).

This project develops algorithms and tasks with an eye toward technologies that would enable humans to conduct metalinguistic inquiries about text more efficiently and accurately. Key challenges to be addressed are: (1) Detecting textual metalanguage: This project will formulate tasks and algorithms to recognize metalinguistic descriptions (such as use/mention distinctions, definitions, and linguistic examples) in text, focusing on three genres where they are abundant: law, language discussion forums, and linguistics. A new benchmark dataset and shared task will be developed to compare metalinguistic taggers. (2) Improving model confidence calibration: The focus will be on taggers with long-tail tagsets. Better probability estimates will enable analysts to make informed decisions about how to balance automatic and manual processing and to anticipate the rates of different types of errors. (3) Retrieving usages by example: Query-by-example algorithms will be developed for retrieving specific usages of an ambiguous word or phrase from a large text collection. Tools leveraging such algorithms would open the way to new kinds of corpus-based investigations by linguists, lexicographers, language teachers, and literary scholars.
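To make the notion of confidence calibration in challenge (2) concrete, the sketch below computes expected calibration error (ECE), a common summary of how far a tagger's reported confidences sit from its observed accuracy. It is an illustration only, not the project's method: the function name, the equal-width binning, and the toy inputs are all assumptions, and confidences are assumed to lie in [0, 1].

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (hypothetical helper, not from the project).

    confidences: the tagger's probability for each predicted tag, in [0, 1].
    correct: whether each predicted tag matched the gold tag.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Per bin: [number of predictions, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count == 0:
            continue
        avg_conf = conf_sum / count
        accuracy = hits / count
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (count / total) * abs(avg_conf - accuracy)
    return ece


# Toy usage: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A well-calibrated tagger yields an ECE near zero. With a long-tail tagset, an analyst might also compute the same quantity per tag, since rare tags can be poorly calibrated even when the overall figure looks good.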

This project is supported by an NSF CAREER grant. Learn more about the project and funding on the NSF website.