ERTIM research axes

The Équipe de Recherche Textes, Informatique, Multilinguisme (ERTIM) is Inalco's own research unit, created in 2005, working mainly in natural language processing (NLP).

ERTIM's scientific project is organized around the following axes:

Axis 1: Digital humanities

Resp.: Mathieu Valette

Research in the humanities, social sciences and literary disciplines is now strongly encouraged to use digital tools. ERTIM is often called upon in several social science fields (political science, history, art, etc.) to support the use of NLP and corpus linguistics methods in research. The aim of this axis is to develop methods and algorithms in various fields of digital humanities.

Axis 2: Language diversity

Resp.: Pierre Magistry and Ilaine Wang

In line with Inalco's mission to preserve the world's languages, ERTIM studies the issue of language diversity in automatic processing techniques. The focus is on low-resource languages and heritage languages. Two angles of approach are envisaged. The first is to study the specific questions these languages raise for NLP and the adaptations of methods they require (limited corpora and other resources, effects of transfer learning, lack of standardization, etc.). The second is to show what NLP can contribute to the description and preservation of these languages. A language that lacks automatic processing tools, or that is poorly supported by IT systems, is all the more endangered today.

Axis 3: NLP methodology

Resp.: Damien Nouvel

The scientific work carried out at ERTIM relies mainly on NLP methods, which require expertise in the languages and corpora being processed in order to better design, evaluate and make use of the tasks carried out. Technological advances in automatic data processing are undeniable: in recent years, the development and use of AI methods based on neural networks to build language models have become a central issue in NLP, both in industry and in the scientific community. The unit will work on all aspects of these technologies to produce studies, algorithms and resources.

Axis 4: Linguistic information acquisition

Resp.: Kata Gabor

The work in this axis will fall within the field of computational linguistics. The unit will seek to identify, model and extract information relevant to linguistic competence from performance data, which may come from corpora or, more recently, from representations created by corpus-based language models. In this context, the distinction between generalization (statistical or linguistic) and lexical memorization is of particular interest. ERTIM will seek to implement methodologies that enable linguistic patterns to be distilled from data and their validity to be verified, with a critical eye on purely experimental approaches (probes, for example) and particular attention to formulating linguistic hypotheses that facilitate the interpretation of experiments.