Aquilign Multilingual Segmenter

Aquilign Multilingual Segmenter is a token-classification model for phrase-level segmentation of medieval and historical texts.

The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the Aquilign alignment workflow.

Model Description

The segmenter is a fine-tuned BertForTokenClassification model, built with the Hugging Face transformers library.

It was fine-tuned on historical prose from the Multilingual Segmentation Dataset to identify phrase-level segmentation boundaries.
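As a rough illustration of how token-classification output can be turned into phrase-level segments, the following is a minimal sketch and not the Aquilign implementation. It assumes a binary label scheme in which label id 1 marks the first word of a new segment; the actual label mapping should be checked in the model's configuration. The helper names (word_labels_from_subwords, split_on_boundaries, segment) and the MODEL_ID placeholder are hypothetical.

```python
from typing import List, Optional, Sequence

# Assumption: label id 1 marks a word that opens a new segment.
# Check the model's config (id2label) for the real mapping.
BOUNDARY_LABEL = 1


def word_labels_from_subwords(
    word_ids: Sequence[Optional[int]],
    subword_labels: Sequence[int],
    n_words: int,
) -> List[int]:
    """Keep, for each word, the label predicted for its first sub-word."""
    labels = [0] * n_words
    seen = set()
    for wid, lab in zip(word_ids, subword_labels):
        # Skip special tokens (None) and continuation pieces of a word.
        if wid is None or wid in seen:
            continue
        seen.add(wid)
        labels[wid] = lab
    return labels


def split_on_boundaries(words: Sequence[str], labels: Sequence[int]) -> List[str]:
    """Start a new segment at every word labelled as a boundary."""
    segments: List[List[str]] = []
    current: List[str] = []
    for word, lab in zip(words, labels):
        if lab == BOUNDARY_LABEL and current:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]


def segment(text: str, model, tokenizer) -> List[str]:
    """Run a token-classification model on whitespace-split words.

    `model` and `tokenizer` are expected to be loaded by the caller, e.g.
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        model = BertForTokenClassification.from_pretrained(MODEL_ID)
    where MODEL_ID is a placeholder for this model's repository id.
    """
    import torch  # imported lazily so the helpers above work without it

    words = text.split()
    enc = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**enc).logits
    subword_labels = logits.argmax(dim=-1)[0].tolist()
    # Words cut off by truncation keep label 0 and join the last segment.
    labels = word_labels_from_subwords(enc.word_ids(0), subword_labels, len(words))
    return split_on_boundaries(words, labels)
```

Long passages exceed the input limit of a BERT model, so in practice the text would need to be processed in windows; the sketch leaves that out for brevity.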

Supported Languages

Latin
French
Castilian
Portuguese
Catalan
English
Italian

Intended Use

This model is intended for:

phrase-level segmentation of medieval texts
preprocessing parallel corpora before alignment
multilingual medieval text alignment workflows
digital philology and computational humanities research

It is designed primarily for use with Aquilign.
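To illustrate the preprocessing step listed above, here is a small sketch of how already-segmented witnesses of a parallel text might be written out with an explicit delimiter before alignment. The delimiter character, the function names, and the dictionary layout are assumptions made for illustration; the input format Aquilign actually expects should be taken from its own documentation.

```python
from typing import Dict, List

# Hypothetical boundary marker; use whatever delimiter the aligner expects.
DELIMITER = "|"


def mark_boundaries(segments: List[str], delimiter: str = DELIMITER) -> str:
    """Join segments into one string with a delimiter between them."""
    cleaned = [seg.strip() for seg in segments if seg.strip()]
    return f" {delimiter} ".join(cleaned)


def prepare_witnesses(
    witnesses: Dict[str, List[str]], delimiter: str = DELIMITER
) -> Dict[str, str]:
    """Apply the same delimiter to every language version of a text."""
    return {
        name: mark_boundaries(segments, delimiter)
        for name, segments in witnesses.items()
    }
```

Using one delimiter across all witnesses keeps the segment units comparable from one language version to the next, which is what an alignment step operates on.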

Related Resources

Citation

If you use this model, please cite the related dataset and publication.

Dataset

@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}

Related Publication

@inproceedings{ing-etal-2026-phrase,
  title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
  author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {936--946},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  doi = {10.63317/32huzuuokpfr}
}

Safetensors

Model size: 0.2B params
Tensor type: F32