Digitizing Armenian Linguistic Heritage - DALiH
Summary
The project aims to build the first unified, open-access and open-source digital linguistic platform covering all varieties of Armenian. Each variety will be represented by a comprehensive text database with full morphological annotation.
Dates and duration
04/2021 - 09/2026 (54 months)
Scientific coordination and team
Victoria KHURSHUDYAN (SeDyL, Inalco)
Anaïd DONABEDIAN (SeDyL, Inalco)
Nadi TOMEH (LIPN, Université Sorbonne Paris Nord)
Thierry CHARNOIS (LIPN, Université Sorbonne Paris Nord)
Damien NOUVEL (ERTIM, Inalco)
Ilaine WANG (ERTIM, Inalco)
Hovhannes KIZOGHYAN (Digilib, American University of Armenia)
Vladimir PLUNGIAN (Russian Academy of Sciences)
Petr KOCHAROV (Julius-Maximilians-Universität Würzburg)
Partners
- Institut National des Langues et Civilisations Orientales (INALCO)
- Structure et Dynamique des Langues (SeDyL, CNRS, IRD, INALCO)
- Equipe de recherche texte, informatique, multilinguisme (ERTIM, INALCO)
- Laboratoire d'Informatique de Paris-Nord (LIPN, CNRS, Université Sorbonne Paris Nord)
- Digital Library of Classical Armenian Literature (Digilib, American University of Armenia)
- Russian Language Institute, Russian Academy of Sciences (RAS)
- Laboratoire d'excellence "Fondements Empiriques de la Linguistique" (Labex EFL)
Objectives
The DALiH project aims to create a unified open-access digital platform for all varieties of the Armenian language: Classical, Middle, Modern Eastern and Modern Western Armenian, as well as three dialects. The aim is to document and digitize this linguistic heritage, to develop morphologically annotated corpora and automatic processing models for these under-resourced varieties, and to make these resources available to the scientific community and the general public.
Methodology
The project combines linguistic and computational approaches: collection of written texts and their conversion by optical character recognition (OCR), field surveys and transcription of oral data, and morphological annotation using hybrid models (rules, recurrent neural networks, transformers). Automatic speech recognition (ASR) tools will also be developed. Data quality is ensured by an iterative methodology in which automatic annotation is followed by manual correction on collaborative platforms.
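The iterative cycle of automatic annotation and manual correction can be illustrated with a minimal sketch. It is not the project's actual pipeline: the lexicon, the function names, the confidence threshold and the UD-style tags are all assumptions made for illustration, and a toy suffix heuristic stands in for the neural tagger.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    token: str
    tag: str
    confidence: float
    source: str  # "rule", "model" or "manual"


def model_predict(token):
    """Placeholder for a neural tagger (hypothetical); returns (tag, confidence)."""
    # Toy heuristic only: treat the plural suffix -ներ as a sign of a noun.
    if token.endswith("ներ"):
        return "NOUN", 0.9
    return "X", 0.3


def annotate(tokens, lexicon, threshold=0.8):
    """Apply rules first, fall back to the model, and queue uncertain tokens."""
    annotated, review_queue = [], []
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            ann = Annotation(tok, lexicon[tok], 1.0, "rule")
        else:
            tag, conf = model_predict(tok)
            ann = Annotation(tok, tag, conf, "model")
            if conf < threshold:
                review_queue.append(i)  # needs manual correction
        annotated.append(ann)
    return annotated, review_queue


def apply_corrections(annotated, corrections, lexicon):
    """Record manual corrections and feed them back into the rule lexicon."""
    for i, tag in corrections.items():
        tok = annotated[i].token
        annotated[i] = Annotation(tok, tag, 1.0, "manual")
        lexicon[tok] = tag
    return annotated


# Hypothetical starting lexicon of unambiguous forms.
lexicon = {"եւ": "CCONJ", "է": "AUX"}
tokens = ["քաղաքներ", "եւ", "տուն"]

annotated, queue = annotate(tokens, lexicon)
apply_corrections(annotated, {i: "NOUN" for i in queue}, lexicon)
```

In this sketch, each manual correction is added to the lexicon, so a second pass over the same tokens leaves nothing in the review queue. A real system would instead retrain or fine-tune the model on the corrected data.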
Expected results
DALiH will produce methodological advances for NLP applied to low-resource languages with high internal variation. It will provide a better understanding of variation in Armenian, particularly in endangered varieties such as Western Armenian and the dialects. The new resources produced (annotated corpora, grammatical dictionaries, ASR models) will have a scientific, pedagogical and societal impact, and will serve as a reference for other low-resource languages written in non-Latin scripts.
Deliverables
Annotated corpora covering multiple varieties (~450 million tokens); grammatical dictionaries; annotation and ASR models; an open-access web platform; downloadable datasets; scientific publications (ACL, LREC, COLING); workshops and an international conference; teaching aids.
Keywords
Armenian, corpus linguistics, natural language processing, low-resource languages, morphological annotation, automatic speech recognition, linguistic variation, digital humanities
References
Khurshudyan et al. (2009; 2021)
Donabedian (2018; 2021)
Vidal et al. (2020; 2021)
Arkhangelskiy (2020)
Baevski et al. (2020)
Manjavacas et al. (2019)
EANC: www.eanc.net
Calfa: www.calfa.fr
Funding agency
Agence nationale de la recherche (ANR) - Appel à projets générique - AAPG 2021