Digitizing Armenian Linguistic Heritage - DALiH

Summary

The project aims to build the first unified, open-access and open-source digital linguistic platform covering all varieties of Armenian. Each variety will be represented by a comprehensive text database with full morphological annotation.

Dates and duration

04/2021 - 09/2026 (54 months)

Scientific coordination and team

Victoria KHURSHUDYAN (SeDyL, Inalco)
Anaïd DONABEDIAN (SeDyL, Inalco)
Nadi TOMEH (LIPN, Université Sorbonne Paris Nord)
Thierry CHARNOIS (LIPN, Université Sorbonne Paris Nord)
Damien NOUVEL (ERTIM, Inalco)
Ilaine WANG (ERTIM, Inalco)
Hovhannes KIZOGHYAN (Digilib, American University of Armenia)
Vladimir PLUNGIAN (Russian Academy of Sciences)
Petr KOCHAROV (Julius-Maximilians-Universität Würzburg)

Partners

Institut National des Langues et Civilisations Orientales (INALCO)
Structure et Dynamique des Langues (SeDyL, CNRS, IRD, INALCO)
Equipe de recherche texte, informatique, multilinguisme (ERTIM, INALCO)
Laboratoire d'Informatique de Paris-Nord (LIPN, CNRS, Université Sorbonne Paris Nord)
Digital Library of Classical Armenian Literature (Digilib, American University of Armenia)
Russian Language Institute, Russian Academy of Sciences (RAS)
Laboratoire d'excellence "Fondements Empiriques de la Linguistique" (Labex EFL)

Objectives

The DALiH project aims to create a unified open-access digital platform for all varieties of the Armenian language: Classical, Middle, Modern Eastern and Modern Western Armenian, as well as three dialects. The aim is to document and digitize this linguistic heritage, to develop morphologically annotated corpora and natural language processing (NLP) models for these under-resourced varieties, and to make these resources available to the scientific community and the general public.

Methodology

The project combines linguistic and computational approaches: collection and optical character recognition (OCR) of written texts, field surveys and transcription of oral data, and morphological annotation using hybrid models (rules, recurrent neural networks, transformers). Automatic speech recognition (ASR) tools will also be developed. An iterative workflow that alternates automatic annotation with manual correction on collaborative platforms is intended to ensure the quality of the data produced.

Expected results

DALiH will produce methodological advances for NLP applied to under-resourced languages with high variation. It will provide a better understanding of variation in Armenian, particularly in endangered varieties such as Western Armenian and the dialects. The new resources produced (annotated corpora, grammatical dictionaries, ASR models) will have scientific, pedagogical and societal impact, and can serve as a reference for other under-resourced languages written in non-Latin scripts.

Deliverables

Multi-variety annotated corpora (~450 million tokens); grammatical dictionaries; annotation and ASR models; open-access web platform; downloadable datasets; scientific publications (ACL, LREC, COLING); workshops and an international conference; teaching materials.

Keywords

Armenian, corpus linguistics, natural language processing, under-resourced languages, morphological annotation, automatic speech recognition, linguistic variation, digital humanities

References

Khurshudyan et al. (2009; 2021)

Donabedian (2018; 2021)

Vidal et al. (2020; 2021)

Arkhangelskiy (2020)

Baevski et al. (2020)

Manjavacas et al. (2019)

EANC: www.eanc.net

Calfa: www.calfa.fr

Funding agency

Agence nationale de la recherche (ANR) - Appel à projets générique - AAPG 2021