Aquilign Multilingual Segmenter

Aquilign Multilingual Segmenter is a token-classification model for phrase-level segmentation of medieval and historical texts.

The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the Aquilign alignment workflow.

Model Description

The segmenter is a BertForTokenClassification model from Hugging Face’s transformers library.

It was fine-tuned on historical prose from the Multilingual Segmentation Dataset to identify phrase-level segmentation boundaries.
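
As a rough illustration of how a BERT token-classification checkpoint like this one could be queried, the sketch below runs the model on whitespace-split words and keeps one predicted label per word. It is a minimal sketch under assumptions, not the Aquilign implementation: the function names and the `model_id` parameter are hypothetical, the repository identifier is not given in this card and must be supplied by the caller, and the meaning of each label id should be read from the checkpoint's `config.id2label` rather than assumed. It also assumes a fast tokenizer, since `word_ids` is only available there.

```python
from typing import List, Optional, Sequence


def word_level_labels(
    word_ids: Sequence[Optional[int]], token_labels: Sequence[int]
) -> List[int]:
    """Keep the label of the first sub-token of each word.

    Special tokens such as [CLS] and [SEP] have word id None and are skipped.
    """
    labels: List[int] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels.append(label)
    return labels


def predict_word_labels(text: str, model_id: str) -> List[int]:
    """Return one predicted label id per whitespace-separated word.

    `model_id` is a placeholder for the checkpoint's Hub identifier or a
    local path; it is not specified in this card.
    """
    # Imported lazily so word_level_labels works without torch/transformers.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    words = text.split()
    encoding = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits  # (1, sequence_length, num_labels)
    token_labels = logits.argmax(dim=-1)[0].tolist()
    return word_level_labels(encoding.word_ids(batch_index=0), token_labels)
```

Taking the first sub-token's label per word is one common convention for token classification; the scheme used during fine-tuning may differ. Inputs longer than the model's maximum length are truncated here, so long texts would need to be processed in chunks.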

Supported Languages

  • Latin
  • French
  • Castilian
  • Portuguese
  • Catalan
  • English
  • Italian

Intended Use

This model is intended for:

  • phrase-level segmentation of medieval texts
  • preprocessing parallel corpora before alignment
  • multilingual medieval text alignment workflows
  • digital philology and computational humanities research

It is designed primarily for use with Aquilign.
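
For the preprocessing use case, word-level boundary predictions have to be turned into segments before alignment. The following sketch shows one way to do that, assuming each word carries a flag saying whether it opens a new segment. The function names, the flag convention, and the "|" delimiter are illustrative assumptions and are not taken from Aquilign.

```python
from typing import List, Sequence


def split_at_boundaries(
    words: Sequence[str], starts_segment: Sequence[bool]
) -> List[str]:
    """Group words into segments; a True flag opens a new segment."""
    segments: List[List[str]] = []
    for word, is_start in zip(words, starts_segment):
        # The first word always opens a segment, whatever its flag says.
        if is_start or not segments:
            segments.append([])
        segments[-1].append(word)
    return [" ".join(segment) for segment in segments]


def mark_boundaries(
    words: Sequence[str], starts_segment: Sequence[bool], delimiter: str = "|"
) -> str:
    """Return the text as one line with a delimiter between segments."""
    return f" {delimiter} ".join(split_at_boundaries(words, starts_segment))


words = "In principio creavit Deus caelum et terram".split()
flags = [True, False, False, False, True, False, False]
segments = split_at_boundaries(words, flags)
marked = mark_boundaries(words, flags)
```

With the hand-written flags above, `segments` holds two phrases, "In principio creavit Deus" and "caelum et terram". In practice the flags would come from the model's predictions rather than being written by hand.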

Related Resources

Citation

If you use this model, please cite the related dataset and publication.

Dataset

@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}

Related Publication

@inproceedings{ing-etal-2026-phrase,
  title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
  author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {936--946},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  doi = {10.63317/32huzuuokpfr}
}