Activity Feed

AI & ML interests

Large-scale distributed AI model training, model parallelisation, low-level GPU acceleration, making GPUs go brrrrr

Recent Activity

nouamanetazi
posted an update 3 months ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. šŸ”„

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most likely a misuse of the hardware. šŸ› ļø

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook šŸ“–: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 24.1 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
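If you want to sanity-check your own cluster the same way, a few lines of torch.distributed are enough to estimate all-reduce bus bandwidth. This is a minimal sketch in the spirit of nccl-tests, not the playbook's actual benchmark harness; the script name, buffer size, and iteration count are placeholders:

```python
# allreduce_bench.py -- hypothetical sketch; launch with e.g.:
#   torchrun --nproc_per_node=8 allreduce_bench.py
import time

import torch
import torch.distributed as dist


def allreduce_busbw(numel: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    """Estimate all-reduce bus bandwidth (GB/s) for a float32 buffer."""
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    x = torch.randn(numel, device="cuda")

    for _ in range(5):  # warm-up so NCCL setup costs don't pollute timings
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # A ring all-reduce moves ~2*(world-1)/world of the buffer per GPU;
    # this is the "bus bandwidth" convention used by nccl-tests.
    busbw = x.numel() * x.element_size() * 2 * (world - 1) / world / elapsed / 1e9
    if rank == 0:
        print(f"world={world}  busbw={busbw:.1f} GB/s")
    dist.destroy_process_group()
    return busbw


if __name__ == "__main__":
    allreduce_busbw()
```

Run it on one node, then across several, and you can see for yourself how much the effective bandwidth drops once traffic crosses the inter-node network.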

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

š“š”šž š’š¦šØš„ š“š«ššš¢š§š¢š§š  šš„ššš²š›šØšØš¤: https://lnkd.in/e5MKXUHS

Shared with ā¤ļø by the HuggingFace team
eliebak
posted an update 5 months ago
Super excited to announce that our research team at Hugging Face will be doing an AMA on Reddit's r/LocalLLaMA.

Come ask any questions to the team behind SmolLM, FineWeb and more! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST šŸ¤—

eliebak
posted an update 5 months ago
The Motif 2.6B tech report is pretty insane; it's the first time I've seen a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" that continuously adjusts the mixture over training.
> They use WSD with a "simple moving average", averaging the last 6 checkpoints every 8B tokens (see the sketch after this list).
> They trained on FineMath, FineWeb2, DCLM, and TxT360.
> Lots of detail on the finetuning data they used; for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm, and Cross-Layer Attention.
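Below is a minimal sketch of what that kind of checkpoint averaging could look like. The uniform weighting, the window of 6, and the file names are assumptions on my side, not Motif's exact recipe:

```python
# Hypothetical checkpoint SMA: uniformly average the last k saved checkpoints.
# Assumes each checkpoint file is a plain state dict of tensors.
from collections import deque

import torch


def average_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Return a state dict that is the uniform average of several checkpoints."""
    avg: dict[str, torch.Tensor] = {}
    for path in paths:
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            contrib = tensor.float() / len(paths)
            avg[name] = avg[name] + contrib if name in avg else contrib
    return avg


# Rolling window: keep the 6 most recent checkpoints, saved e.g. every 8B tokens.
recent: deque[str] = deque(maxlen=6)
for ckpt in ["ckpt_0008B.pt", "ckpt_0016B.pt", "ckpt_0024B.pt"]:  # illustrative names
    recent.append(ckpt)
# averaged = average_checkpoints(list(recent))
# torch.save(averaged, "ckpt_sma.pt")
```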

Motif-Technologies/Motif-2.6B

dark-mode

#82 opened 12 months ago by
serhany

typos

#119 opened 6 months ago by
kashif

order button

#118 opened 6 months ago by
lvwerra

lvwerra
updated a Space 6 months ago
julien-c
updated a Space 6 months ago
lvwerra
in nanotron/book 6 months ago

Update README.md

#1 opened 6 months ago by
lvwerra

Update README.md

#2 opened 6 months ago by
lvwerra

Update README.md

#3 opened 6 months ago by
lvwerra