So you basically still want ASR-style transcription before the LLM kicks in (perhaps to reduce hallucination, or for some other purpose?), but would like the representation to be richer, so a downstream LLM can still reason about pronunciation, pauses, and so on?
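For concreteness, here is a minimal sketch of what such a "richer than plain text" ASR output could look like: a transcript that keeps per-word timing, phonemes, and pauses, then flattens them into a prompt a downstream LLM can consume. The `RichToken` structure and `to_prompt` serialization are hypothetical names of my own, just one possible way to carry pronunciation and prosody into text.

```python
# Hypothetical sketch of a "rich" ASR transcript (not a real library API):
# each word keeps its timing, phoneme sequence, and the pause that follows,
# so a downstream LLM can reason about pronunciation and prosody.
from dataclasses import dataclass

@dataclass
class RichToken:
    word: str
    start: float         # seconds from utterance start
    end: float
    phonemes: list[str]  # e.g. ARPAbet symbols from a forced aligner
    pause_after: float   # silence before the next word, in seconds

def to_prompt(tokens: list[RichToken]) -> str:
    """Flatten the rich transcript into plain text for an LLM prompt."""
    parts = []
    for t in tokens:
        parts.append(f"{t.word} [/{' '.join(t.phonemes)}/]")
        if t.pause_after >= 0.3:  # mark perceptually salient pauses only
            parts.append(f"<pause {t.pause_after:.1f}s>")
    return " ".join(parts)

transcript = [
    RichToken("well", 0.00, 0.30, ["W", "EH", "L"], 0.8),
    RichToken("maybe", 1.10, 1.55, ["M", "EY", "B", "IY"], 0.1),
]
print(to_prompt(transcript))
# well [/W EH L/] <pause 0.8s> maybe [/M EY B IY/]
```

The threshold and the inline markup are arbitrary; the point is only that timing and phoneme information survives into the text the LLM sees.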
Omar Kamali (omarkamali) replied to their post:
I just might have cracked tokenizer-free LLMs. No vocab, no softmax.
I'm training a 22M-parameter LLM right now to test this "thing", and it's able to formulate coherent sentences 🤯
Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.
Check the explainer video to understand what's happening. Feedback welcome on this approach!
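The post doesn't describe the architecture, so the following is purely a speculative reading of "no vocab, no softmax", not Omar's design: operate on raw UTF-8 bytes (which also gives language universality for free, since every script is just bytes) and, instead of a softmax over a vocabulary, regress the next byte's embedding and decode by nearest neighbour. Everything here (`ByteRegressionLM`, the regression head) is my assumption for illustration.

```python
# Speculative sketch of one tokenizer-free, softmax-free design (NOT the
# architecture from the post): bytes in, predicted embeddings out, decoded
# by nearest-neighbour lookup instead of a softmax over a vocabulary.
import torch
import torch.nn as nn

class ByteRegressionLM(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)  # fixed byte alphabet, not a learned vocab
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_model)   # emits an embedding, not vocab logits

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(byte_ids.size(1))
        h = self.backbone(self.embed(byte_ids), mask=causal)
        return self.out(h)  # (batch, seq, d_model): predicted next-byte embeddings

    def decode(self, pred: torch.Tensor) -> torch.Tensor:
        # Nearest-neighbour match against the 256 byte embeddings: no softmax anywhere.
        dists = (pred.unsqueeze(-2) - self.embed.weight).pow(2).sum(-1)
        return dists.argmin(-1)

model = ByteRegressionLM()
x = torch.tensor([list(b"hello")])   # raw UTF-8 bytes as input, any language
next_bytes = model.decode(model(x))  # predicted next byte at each position
```

Training such a model would presumably use an MSE or cosine loss between the predicted embedding and the target byte's embedding instead of cross-entropy; that detail is also an assumption, not something stated in the post.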


