Modeling Crosslinguistic Influences Between Language Varieties

Project leads: Nathan Schneider (Georgetown University), Yulia Tsvetkov (Carnegie Mellon University), Noah Smith (University of Washington), Shuly Wintner (University of Haifa)

Most people in the world today are multilingual. Though multilingualism is a gradual phenomenon, previous research has primarily examined text from second language learners who have not yet achieved fluency. This project focuses on text produced by nonnative but highly fluent speakers. Fluent but nonnative language differs subtly from native, monolingual language in the frequencies of certain concepts, constructions, and collocations. This raises the possibility that language technologies — typically trained on “standard” native language — are systematically biased in ways that render them less useful for the majority of users. This project will develop methods to examine large datasets of fluent nonnative language to detect the subtle influences of the native language and deliver natural language processing (NLP) tools for these language varieties. Its methods will be applicable beyond the populations in this study, including NLP-based measurement for social science and research seeking to better understand cognition in the bilingual mind. Native language identification will enable potential applications in language learning, cybersecurity, geolocation, personalization, and more. The project will openly share implementations and data, and will include educational activities that bring research into education.

This project will advance natural language processing techniques to shed light on the differences in language use by fluent speakers with varying linguistic backgrounds: native speakers, highly fluent nonnative speakers, and translators when translating from another language into English. It is known that classifiers can be trained to discriminate with high accuracy among these populations, even though humans have difficulty telling them apart. This project will focus on semantic phenomena, which can confound even fluent nonnative speakers. If current NLP models are biased toward native language, then they may not support accurate measurement in nonnative text; the project will develop new techniques to mitigate this bias. This project will deliver a range of new models for native language identification, new measurement models and multi-variety models for language-variety-aware NLP tools, new semantic annotations in several Englishes, and a study on nonnative annotation. These novel methods for studying variation within a language and building such variation into our NLP systems will lead to unprecedented flexibility in computational models of natural language semantics.

This project is supported by an NSF-BSF grant.