Cross-lingual information retrieval for scientific datasets in less-resourced languages - CLingS

Abstract

This project aims to develop a cross-lingual information retrieval system tailored to scientific literature in under-represented languages. It addresses critical challenges in language technologies, including the absence of annotated scientific datasets, the lack of language-specific models for less-resourced languages, and the dominance of English in scientific communication. The project will focus on constructing comparable corpora in targeted domains (linguistics, medicine, mathematics, geography, and jurisprudence), training language models adapted to scientific discourse, and aligning them into a shared multilingual embedding space. These models will power a retrieval system that enables structured information access across languages, supported by graph-based retrieval-augmented generation (graph-RAG) and multi-agent architectures.
Scientifically, the project introduces a novel approach to multilingual scientific computing by bridging language gaps at both the formal and conceptual levels. It investigates the structure of scientific discourse across typologically and sociolinguistically diverse languages, contributing to both NLP and intellectual history. The system will support linguistic self-representation in scientific contexts, enrich terminological resources, and provide tools for analyzing the diffusion and transformation of knowledge across cultures and time. Methodologies will ensure reproducibility and rigorous evaluation, laying the groundwork for inclusive, linguistically diverse knowledge systems.
Practically, the project will result in: (1) curated datasets in the targeted scientific domains for seven languages – Belarusian, Estonian, Punjabi, Slovak, Taiwanese (Tâigí), Ukrainian, Yiddish; (2) encoder and decoder language models fine-tuned for scientific texts in each selected language; (3) shared multilingual vector spaces for scientific information alignment; (4) an interactive AI-based platform for scientific information retrieval and generation; and (5) benchmarks and evaluation protocols suitable for multilingual scientific discourse.
The system developed will not only facilitate cross-lingual research but also support terminology development, translation practices, and multilingual education. It will provide open access tools to educators, researchers, and policy makers in communities historically excluded from global scientific discourse. Through documentation and open-source sharing, the project promotes replicability and adoption across other low-resource contexts.

Dates and duration

01/2026 - 12/2028 (36 months)

Scientific coordination and team

Partner | Country, city | Responsible
Inalco (coordinating institution) | France, Paris | Valentina FEDCHENKO
Institute of Informatics of the Slovak Academy of Sciences | Slovakia, Bratislava | Milan RUSKO
Constantine the Philosopher University in Nitra (unfunded partner) | Slovakia, Nitra | Martin DIWEG-PUKANEC
National Academy for Educational Research | Taiwan, Taipei | Ka-I LIM

Objectives

Compile and annotate scientific corpora covering the following fields: linguistics/philology, medicine, mathematics, geography and jurisprudence.

Train and deploy a cross-lingual scientific information retrieval system enabling document- and passage-level searching.

Develop a terminology mapping and alignment engine capable of learning correspondences between scientific lexicons.

Conduct a user-centered evaluation involving domain experts and linguists.

Make all datasets, models and tools available under a permissive open source license.

Methodology

- Creation and enrichment of specialized multilingual corpora.
- Development of a cross-lingual information retrieval system.
- Terminology alignment and evaluation tools.
- Reproducible model adaptation pipelines (tokenization, fine-tuning, evaluation).
- Contrastive learning and dense retrieval training on scientific corpora.
- Integration of terminological annotations for conceptual accuracy.
- Validation loops with domain experts (human-in-the-loop).
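
To make the contrastive and dense retrieval steps above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's implementation: the three-dimensional vectors stand in for embeddings from a shared multilingual space, the passage identifiers are hypothetical, and the InfoNCE loss is one common contrastive objective chosen here for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passages):
    """Dense retrieval: sort passage ids by descending cosine similarity."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passages.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored]


def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE loss for one query: low when the positive outscores the negatives."""
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, neg) / temperature for neg in negative_vecs]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Toy embeddings (assumed values); ids are hypothetical passage names.
passages = {
    "uk_med_01": [0.9, 0.1, 0.0],
    "et_math_07": [0.1, 0.9, 0.1],
    "sk_geo_03": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

print(rank_passages(query, passages))
print(info_nce_loss(query, passages["uk_med_01"],
                    [passages["et_math_07"], passages["sk_geo_03"]]))
```

In a training setting, the loss would be minimized over pairs of a query and a relevant passage written in different languages, so that semantically matching texts end up close together regardless of language.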

Expected results

- Centralized cross-lingual scientific research platform.
- Multi-agent architecture for distributed content processing.
- Multilingual query interfaces.
- APIs for integration with existing scientific databases.
- Transferable methods for adapting models to less-resourced scientific domains.
- Evaluation protocols for multilingual scientific IR.
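
As an illustration of what an evaluation protocol for multilingual scientific IR might compute, the sketch below implements two standard retrieval metrics, recall@k and mean reciprocal rank (MRR). The choice of metrics, the function names, and the toy query and passage identifiers are assumptions made for this example; the project's actual protocols are not specified here.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, qrels, k=10):
    """Average both metrics over every query that has relevance judgments.

    runs:  query id -> ranked list of passage ids returned by the system
    qrels: query id -> set of passage ids judged relevant
    """
    query_ids = list(qrels)
    if not query_ids:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(runs.get(q, []), qrels[q], k) for q in query_ids)
    mrr = sum(reciprocal_rank(runs.get(q, []), qrels[q]) for q in query_ids)
    return {"recall@k": recall / len(query_ids), "mrr": mrr / len(query_ids)}


# Hypothetical toy run: two queries with three retrieved passages each.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
qrels = {"q1": {"b"}, "q2": {"z", "w"}}
print(evaluate(runs, qrels, k=2))
```

Reporting such scores per language pair and per domain, rather than as a single average, would show where cross-lingual retrieval is weakest.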

Deliverables

- Standardized scientific corpora in seven less-resourced languages.
- Language-specific scientific embedding models.
- Multilingual terminology graphs linking concepts.
- Reference datasets for multilingual scientific information retrieval evaluation.
- Scientific publications (articles).
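
The multilingual terminology graphs listed among the deliverables could, in their simplest form, link language-specific terms to shared concept nodes. The following Python sketch shows one possible data structure; the class name `TermGraph`, its methods, the concept identifier, and the sample entries are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


class TermGraph:
    """Terms in several languages attached to shared concept nodes (a sketch)."""

    def __init__(self):
        # concept id -> language code -> set of terms
        self._terms = defaultdict(dict)
        # (language code, case-folded term) -> concept id
        self._index = {}

    def add(self, concept_id, lang, term):
        """Attach a term in a given language to a concept node."""
        self._terms[concept_id].setdefault(lang, set()).add(term)
        self._index[(lang, term.casefold())] = concept_id

    def equivalents(self, lang, term, target_lang):
        """Return the target-language terms sharing a concept with `term`."""
        concept_id = self._index.get((lang, term.casefold()))
        if concept_id is None:
            return set()
        return set(self._terms[concept_id].get(target_lang, set()))


# Illustrative entries for the linguistic concept "morpheme".
graph = TermGraph()
graph.add("ling:morpheme", "uk", "морфема")
graph.add("ling:morpheme", "sk", "morféma")
graph.add("ling:morpheme", "et", "morfeem")

print(graph.equivalents("uk", "морфема", "sk"))
```

A fuller graph would also carry relations between concepts (broader, narrower, related) and provenance for each term, which this sketch leaves out.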

Keywords

information retrieval, RAG, scientific literature, moderately resourced languages, terminological alignment

References

Fedchenko V. et al. (eds.). Elye Falkovitsh. Jiddisch. Phonetik, Graphemik, Lexik und Grammatik. Düsseldorf: De Gruyter, 2024.

Yen-Chun Lai et al. Construction of Large Language Models for Taigi and Hakka Using Transfer Learning. 27th Conference of the Oriental COCOSDA, 2024.

Funding agency

Agence nationale de la recherche (ANR) & CHIST-ERA - ERA-NET call for projects "Science in your Own Language" (SOL)
