Arabic — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Arabic Wikipedia by Wikilangs.

🌍 Language Page · 🎮 Playground · 📊 Full Research Report

Language Samples

Example sentences drawn from the Arabic Wikipedia corpus:

تصغير K \ كي \ هو الحرف الحادي العشر في الأبجدية The Oxford English Dictionary, 2nd ed., online ويمثل هذا الحرف الصوت الطبقي الوقفي المهموس في الكيمياء، يرمز K لعنصر البوتاسيوم مراجع لاتينية

: إحدى ولايات الولايات المتحدة الأمريكية. مدينة نيويورك: أكبر مدن الولايات المتحدة الأمريكية وإحدى أكبرها في العالم. مقاطعة نيويورك: إحدى مقاطعات ولاية نيويورك. توضيح أسماء أماكن

أبو إبراهيم الفارابي أديب نحوي لغوي أبو نصر محمد الفارابي فيلسوف مشائي مسلم وطبيب

إسحاق نيوتن عالم إنجليزي نيوتن وحدة قياس القوة. ذكور إنجليزية توضيح أسماء أماكن

بوتان (مملكة) بوتان مملكة في جبال الهمالايا بين الهند والصين. بوتان (كيمياء) أحد الألكانات، يتكون من أربع ذرات كربون.

Quick Start

Load the Tokenizer

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")

text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
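The compression figures reported in the metrics section are characters of raw text per emitted token. A minimal sketch of that computation; a toy whitespace tokenizer stands in for the SentencePiece model so the snippet runs without the model file:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters of raw text per token produced by the tokenizer."""
    return len(text) / len(tokens)

text = "جمهورية الكونغو الديمقراطية"
toy_tokens = text.split()  # stand-in for sp.EncodeAsPieces(text)
print(f"{compression_ratio(text, toy_tokens):.2f} chars/token")  # 9.00 chars/token
```

With the real tokenizer, replace `toy_tokens` with `sp.EncodeAsPieces(text)`; larger vocabularies produce fewer, longer pieces and hence a higher ratio.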
Tokenization examples

Sample 1: استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند…

Vocab Tokens Count
8k ▁است ودي وه ات ▁أفلام ▁والت ▁دي ز ني ▁أفلام … (+22 more) 32
16k ▁است ودي وهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منت … (+10 more) 20
32k ▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more) 17
64k ▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more) 17

Sample 2: باسكال قد تعني: الباسكال، وحدة قياس الضغط لغة باسكال، لغة برمجة الفيلسوف باسكال،…

Vocab Tokens Count
8k ▁با سك ال ▁قد ▁تعني : ▁البا سك ال ، … (+29 more) 39
16k ▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+18 more) 28
32k ▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more) 25
64k ▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more) 25

Sample 3: جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا. جمهورية الكونغو، عاص…

Vocab Tokens Count
8k ▁جمهورية ▁الكون غو ▁الديمقراطية ، ▁ز ائ ير ▁سابق ًا … (+21 more) 31
16k ▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁ز ائ ير ▁سابقًا ، ▁عاصمتها … (+16 more) 26
32k ▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائ ير ▁سابقًا ، ▁عاصمتها ▁كينشاسا … (+12 more) 22
64k ▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائير ▁سابقًا ، ▁عاصمتها ▁كينشاسا . … (+10 more) 20

Load Word Embeddings

from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

similar = wv.most_similar("مدينة", topn=5)  # query must be an in-vocabulary Arabic word; "مدينة" = "city"
for word, score in similar:
    print(f"  {word}: {score:.3f}")
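Because the aligned vectors share one coordinate system with English, cross-lingual lookup reduces to a cosine-similarity search. A self-contained sketch, with toy 4-d vectors standing in for the released 128-d aligned embeddings:

```python
import numpy as np

# Toy vectors: in the aligned space, an Arabic word and its English
# translation should sit near each other.
ar_vecs = {"مدينة": np.array([0.9, 0.1, 0.0, 0.1])}
en_vecs = {
    "city":  np.array([0.85, 0.15, 0.05, 0.1]),
    "river": np.array([0.1, 0.9, 0.1, 0.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest English neighbour of the Arabic query vector.
query = ar_vecs["مدينة"]
best = max(en_vecs, key=lambda w: cosine(query, en_vecs[w]))
print(best)  # city
```

The alignment R@1/R@5/R@10 figures in the metrics table measure exactly this retrieval: how often the correct translation lands in the top 1, 5, or 10 neighbours.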

Load N-gram Model

import pyarrow.parquet as pq

df = pq.read_table("ar_3gram_word.parquet").to_pandas()
print(df.head())
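The parquet file is a plain n-gram count table. The column names below (`w1`, `w2`, `w3`, `count`) are an assumption for illustration — inspect `df.columns` on the real file — but the recipe for turning counts into a conditional next-word distribution is the same:

```python
import pandas as pd

# Toy 3-gram counts with assumed column names (w1, w2, w3, count).
df = pd.DataFrame({
    "w1":    ["وحدة", "وحدة"],
    "w2":    ["قياس", "قياس"],
    "w3":    ["القوة", "الضغط"],
    "count": [30, 10],
})

# P(w3 | w1="وحدة", w2="قياس"): normalize counts within the context.
ctx = df[(df.w1 == "وحدة") & (df.w2 == "قياس")]
probs = ctx.set_index("w3")["count"] / ctx["count"].sum()
print(probs.to_dict())  # {'القوة': 0.75, 'الضغط': 0.25}
```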

Models Overview

Performance Dashboard

Category Assets
Tokenizers BPE at 8k, 16k, 32k, 64k vocab sizes
N-gram models 2 / 3 / 4 / 5-gram (word & subword)
Markov chains Context 1–5 (word & subword)
Embeddings 32d, 64d, 128d — mono & aligned
Vocabulary Full frequency list + Zipf analysis
Statistics Corpus & model statistics JSON
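A Markov chain of context n is essentially a table from the last n tokens to a next-token distribution. A minimal sketch of a context-2 word chain with toy counts (not the released model files):

```python
import random

# Toy context-2 chain: (previous two words) -> {next word: count}.
chain = {
    ("وحدة", "قياس"): {"القوة": 3, "الضغط": 1},
}

def sample_next(context: tuple[str, str], rng: random.Random) -> str:
    """Sample the next word proportionally to its count under the context."""
    dist = chain[context]
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_next(("وحدة", "قياس"), rng))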

Metrics Summary

Component Model Key Metric Value
Tokenizer 8k BPE Compression 3.25x
Tokenizer 16k BPE Compression 3.65x
Tokenizer 32k BPE Compression 4.03x
Tokenizer 64k BPE Compression 4.35x 🏆
N-gram 2-gram (subword) Perplexity 426 🏆
N-gram 2-gram (word) Perplexity 359,826
N-gram 3-gram (subword) Perplexity 4,163
N-gram 3-gram (word) Perplexity 775,988
N-gram 4-gram (subword) Perplexity 27,277
N-gram 4-gram (word) Perplexity 1,494,234
N-gram 5-gram (subword) Perplexity 133,736
N-gram 5-gram (word) Perplexity 1,059,510
Markov ctx-1 (subword) Predictability 0.0%
Markov ctx-1 (word) Predictability 0.0%
Markov ctx-2 (subword) Predictability 17.3%
Markov ctx-2 (word) Predictability 67.4%
Markov ctx-3 (subword) Predictability 29.5%
Markov ctx-3 (word) Predictability 89.5%
Markov ctx-4 (subword) Predictability 35.2%
Markov ctx-4 (word) Predictability 96.5% 🏆
Vocabulary full Size 986,324
Vocabulary full Zipf Rยฒ 0.9920
Embeddings mono_32d Isotropy 0.8111
Embeddings mono_64d Isotropy 0.7841
Embeddings mono_128d Isotropy 0.7556
Embeddings aligned_32d Isotropy 0.8111 🏆
Embeddings aligned_64d Isotropy 0.7841
Embeddings aligned_128d Isotropy 0.7556
Alignment aligned_32d R@1 / R@5 / R@10 13.4% / 35.0% / 48.6%
Alignment aligned_64d R@1 / R@5 / R@10 28.6% / 54.0% / 65.6%
Alignment aligned_128d R@1 / R@5 / R@10 37.2% / 65.0% / 76.6% 🏆
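One way to read the perplexity rows: log2(perplexity) is the model's average cross-entropy in bits per token. Note that subword and word perplexities are not directly comparable, since they score different token inventories and sequence lengths. A small check using the 2-gram numbers from the table:

```python
import math

# log2(perplexity) = average cross-entropy in bits per token.
for name, ppl in [("2-gram subword", 426), ("2-gram word", 359_826)]:
    print(f"{name}: perplexity {ppl:,} -> {math.log2(ppl):.2f} bits/token")
```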

📊 Full ablation study, per-model breakdowns, and interpretation guide →


About

Trained on wikipedia-monthly — monthly snapshots of 300+ Wikipedia languages.

A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs

Citation

@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}

License

MIT — free for academic and commercial use.


Generated by Wikilangs Pipeline · 2026-03-04 13:56:39
