PIs: Yulia Tsvetkov (CMU), Nathan Schneider (Georgetown), Noah Smith (UW), Shuly Wintner (Haifa)
Most people in the world today are multilingual. Though multilingualism is a matter of degree, previous research has primarily examined text from second language learners who have not yet achieved fluency. This project focuses on text produced by nonnative but highly fluent speakers. Fluent but nonnative language differs subtly from native, monolingual language in the frequencies of certain concepts, constructions, and collocations. This raises the possibility that language technologies, which are typically trained on "standard" native language, are systematically biased in ways that render them less useful for the majority of users. This project develops methods to examine large datasets of fluent nonnative language to detect the subtle influences of the native language, and it delivers natural language processing (NLP) tools for these language varieties. Its methods will apply beyond the populations in this study, for example to NLP-based measurement in social science and to research seeking to better understand cognition in the bilingual mind. Native language identification will enable potential applications in language learning, cybersecurity, geolocation, personalization, and more. The project will openly share implementations and data, and it includes educational activities that bring this research into teaching.
This project advances natural language processing techniques to shed light on differences in language use among fluent speakers with varying linguistic backgrounds: native speakers, highly fluent nonnative speakers, and translators translating from another language into English. Classifiers can be trained to discriminate among these populations with high accuracy, even though humans have difficulty telling them apart. This project focuses on semantic phenomena, which can confound even fluent nonnative speakers. If current NLP models are biased toward native language, then they may not support accurate measurement in nonnative text; the project develops new techniques to mitigate this bias. The project delivers a range of new models for native language identification, new measurement models and multi-variety models for language-variety-aware NLP tools, new semantic annotations in several Englishes, and a study on nonnative annotation. These novel methods for studying variation within a language and building such variation into NLP systems will lead to unprecedented flexibility in computational models of natural language semantics.