Sinitic Languages Diversity and Digital Humanities - DLS-HN
Abstract
The development of digital humanities (DH) practices and the growing availability of corpora are opening up new areas of application and new challenges for natural language processing (NLP). Sinitic languages are no exception, although they do present some specific problems. We propose a project that aims to describe and address some of these challenges by approaching the issue from the angle of variation.
We will distinguish three axes of variation: temporal (diachronic), geographical (dialectal/diatopic) and grapholinguistic (language-writing relationship). In this way, we wish to question the formal representations (data normalization and vectorization) and corpus choices at the basis of any processing of Sinitic languages.
We will study several situations of variation and different applications of NLP to DH and to heritage languages.
Our contribution will be twofold. On the one hand, it will concern the evaluation and design of NLP methods on data located at different positions along these axes, and on the other, the dissemination of these methods and their applications. We will work on both written and oral data.
The temporal axis will be explored mainly through the corpus of the Shun-Pao, the first daily newspaper printed in sinograms (Chinese characters), published from 1872 to 1949. This corpus enables us to address both linguistic and historical issues, and will be worked on in collaboration with historians involved in the ENP-China project.
The geographical axis will be studied through the cases of Taiwanese Hokkien and Teochew (with a focus on the variety spoken in France). These are two languages from the same family, relatively close to each other and distant from Mandarin. They are, however, in quite different sociolinguistic situations, which will enable us to explore transfer methods in NLP. This part will be carried out in collaboration with Taiwanese colleagues and Wikimedia France, to help bring the results back to the speaker communities.
Dates and duration
12/2023 - 05/2027 (48 months)
Scientific coordination
Pierre MAGISTRY (ERTIM, Inalco)
Objectives
To study the limits of methods and tools for the automatic processing of low-resource Sinitic languages with little or no standardization, in particular Teochew, Taiwanese (Taigi) and various stages of Classical Chinese.
Methodology
- Collection and analysis of oral and written corpora
- Training and evaluation of speech (synthesis, recognition) and text processing models (language models, segmentation, entity recognition, virtual keyboards)
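As an illustration of what evaluating a segmentation model can involve, here is a minimal sketch of word-level precision, recall and F1 computed from character spans. It is not the project's evaluation protocol: the function names `spans` and `segmentation_prf` are hypothetical, and the example sentence is an invented Mandarin phrase used only to show the mechanics.

```python
def spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def segmentation_prf(gold, pred):
    """Word-level precision, recall and F1 for one segmented sentence."""
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted segmentations must cover the same text")
    g, p = spans(gold), spans(pred)
    correct = len(g & p)  # words whose start and end both match the reference
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits the second word.
gold = ["我", "喜歡", "台灣"]
pred = ["我", "喜", "歡", "台灣"]
print(segmentation_prf(gold, pred))
```

Comparing spans rather than word strings means that a predicted word only counts as correct when both of its boundaries agree with the reference, which is the usual convention for scoring segmentation of unspaced scripts.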
Expected results
- Language technology tools for the languages studied
- Dissemination of models
- Evaluation of existing models
- Promotion of minority languages (notably Teochew as spoken in France)
Key words
Sinitic languages (Minnan, Classical Chinese), natural language processing (NLP), language technologies
References
HAL CV of the project coordinator: https://cv.hal.science/pierre-magistry (articles and posters)
Funder
National Research Agency (ANR) - Generic call for projects - AAPG 2023