Collaborative research project "DALiH - Digitizing Armenian Linguistic Heritage" selected for funding under ANR's AAPG 2021 call

1 February 2022

The collaborative research project (CRP) "DALiH - Digitizing Armenian Linguistic Heritage" (Numérisation du patrimoine linguistique arménien : Corpus multivariationnel d'arménien et traitement des données), led by Victoria Khurshudyan (Inalco, SeDyL, CNRS, IRD), has been selected under the 2021 Generic Call for Proposals (AAPG) of the French National Research Agency (ANR).
Digitizing Armenian Linguistic Heritage - DALiH - logo © DALiH

The 42-month DALiH project is part of ANR research axis CE38 - Digital revolution: relationships to knowledge and culture.

Victoria Khurshudyan is a lecturer in Armenian linguistics and director of the Eurasia department at Inalco, and has been a member of UMR SeDyL (Structure and Dynamics of Languages) since 2012.

A graduate of Brusov State Linguistic University in Yerevan, she defended her doctoral thesis in linguistics at the Institute of Linguistics, Russian State University for the Humanities (РГГУ), Moscow, Russia. Her research areas include Armenian linguistic variation, linguistic typology, and computational linguistics from a natural language processing perspective. From 2006 to 2009, she coordinated the Eastern Armenian National Corpus (EANC) project at the Institute of Language, Russian Academy of Sciences.

Digitizing Armenian Linguistic Heritage (DALiH): Armenian Multivariational Corpus and Data Processing

Collaborative Research Project (CRP) funded by the French National Research Agency (ANR), grant ANR-21-CE38-0006.

The DALiH project aims to build, for the first time, a unified open-access and open-source digital linguistic platform covering the full spectrum of variation in the Armenian language, with annotated corpora for:

1) Classical Armenian;
2) Modern Western Armenian;
3) a pilot corpus of Middle Armenian;
4) three pilot corpora of dialects, and
5) an updated corpus of Modern Eastern Armenian based on the EANC (Eastern Armenian National Corpus).

Research will be carried out from both linguistic and natural language processing (NLP) perspectives to provide full grammatical annotation as well as automatic speech recognition (ASR) models for the target Armenian varieties. New deep-learning and rule-based resources will be developed to process the written and spoken data and to test their validity for further corpus expansion, in a context of multi-parameter linguistic variation for an under-resourced language.

Computational linguistics research, in particular on automatic language identification, measuring the distance between varieties, and lexical and morphological disambiguation, will revisit existing research questions and raise new ones, supported by the written and spoken data the project makes available.

Partners:
Institut national des langues et civilisations orientales (Inalco)
Structure et Dynamique des Langues (SeDyL, CNRS, IRD, Inalco)
Équipe de recherche texte, informatique, multilinguisme (ERTIM, Inalco)
Laboratoire d'Informatique de Paris-Nord (LIPN, CNRS, Université Sorbonne Paris Nord)
Digital Library of Classical Armenian Literature (Digilib, American University of Armenia)
Russian Language Institute, Russian Academy of Sciences (RAS)
Laboratoire d'excellence "Fondements Empiriques de la Linguistique" (Labex EFL)

DALiH project - partner logos © All rights reserved

DALiH project - Visual (2.41 MB, .pdf)