To make revising LLM architectures and training methods faster, I created a deck of 180 visual flashcards. It started as a personal hobby, but slowly became a cheat code for reviewing LLM concepts before technical interviews. People love it!
Can you predict what something smells like just from its chemical structure?
Turns out yes, and a model can learn it.
Smell is molecular. Specific shapes bind to specific receptors in your nose. That pattern is encodable.
Feed it a molecule, get odor descriptors back:
Ethanol → alcoholic (87%) + ethereal (62%)
Isoamyl alcohol → floral (71%) + fruity (58%). This is literally what makes bananas smell like bananas.
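The post does not say how the model turns its outputs into descriptors. A minimal sketch, assuming one independent probability per odor label and a fixed cutoff: the `top_descriptors` helper, the 0.5 threshold, and the low "fruity" and "floral" scores for ethanol are all hypothetical; only the two top scores come from the post.

```python
def top_descriptors(probs, threshold=0.5, k=2):
    """Return up to k (label, probability) pairs at or above the threshold, best first."""
    kept = [(label, p) for label, p in probs.items() if p >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical per-label probabilities for ethanol (assumed multi-label output).
ethanol_probs = {"alcoholic": 0.87, "ethereal": 0.62, "fruity": 0.21, "floral": 0.08}

result = top_descriptors(ethanol_probs)
print(" + ".join(f"{label} ({p:.0%})" for label, p in result))
# prints: alcoholic (87%) + ethereal (62%)
```

Treating each descriptor independently lets a molecule carry several labels at once, which matches the "alcoholic + ethereal" style of output shown above.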
Human brains don't recreate every pixel to understand the world!
Most current models in genomics, proteomics, and single-cell transcriptomics rely on generative objectives like masked language modeling or next token prediction. While effective, these architectures waste significant capacity reconstructing raw, noisy sequence details that may not carry functional biological meaning.
But a promising, more efficient alternative is emerging: Joint-Embedding Predictive Architecture (JEPA).
Originally introduced by Yann LeCun for computer vision, JEPA is a non-generative, self-supervised learning (SSL) framework. Instead of predicting raw inputs, it operates as a world model that predicts abstract semantic embeddings in latent space.
Recently, the JEPA framework (and its more efficient LeJEPA variant) has been adapted to the biological sciences, both to build high-performing foundation models and to improve existing ones.
It's interesting how each adaptation modified and tailored JEPA to suit its specific biological domain, whether by experimenting with different backbones or complementing the objective with other loss terms.
For example, JEPA-DNA and ProteinJEPA used JEPA as a continual pre-training framework to enhance existing foundation models without training from scratch, while Cell-JEPA and JEPA-DNA employed a hybrid objective that combines the JEPA loss with a traditional language modeling loss.
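To make the hybrid objective concrete, here is a toy NumPy sketch. It assumes the JEPA term is a mean squared error between predicted and target embeddings and that the two terms are combined with a scalar weight; `jepa_weight`, the toy shapes, and the choice of MSE are illustrative assumptions, not details taken from Cell-JEPA or JEPA-DNA.

```python
import numpy as np


def jepa_loss(predicted, target):
    """Mean squared error between predicted and target embeddings (latent space)."""
    return float(np.mean((predicted - target) ** 2))


def lm_loss(logits, token_ids):
    """Mean cross-entropy of the true tokens under softmax(logits)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(token_ids)), token_ids].mean())


def hybrid_loss(predicted, target, logits, token_ids, jepa_weight=1.0):
    """Language-modeling loss plus a weighted latent-prediction loss."""
    return lm_loss(logits, token_ids) + jepa_weight * jepa_loss(predicted, target)


rng = np.random.default_rng(0)
n_masked, dim, vocab = 4, 8, 5  # toy sizes, chosen arbitrarily

# Stand-ins for real network outputs: in training, `target` would come from a
# target encoder (no gradient) and `predicted` from a predictor network.
target = rng.normal(size=(n_masked, dim))
predicted = target + 0.1 * rng.normal(size=(n_masked, dim))
logits = rng.normal(size=(n_masked, vocab))
token_ids = np.array([0, 3, 1, 2])

print(hybrid_loss(predicted, target, logits, token_ids, jepa_weight=0.5))
```

The JEPA term never looks at raw tokens, only at embeddings, which is the sense in which the objective avoids reconstructing noisy sequence detail; the language-modeling term keeps token-level supervision in the mix.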
The article below provides an overview of these implementations, along with others that came out this year. As always, your thoughts and feedback are welcome and highly appreciated!
How do scientists know if a new chemical is toxic before testing it on anything alive?
For decades: animal studies. Slow, expensive, ethically complicated.
Tox21 changed that: a government-backed initiative screening 10,000 compounds across 12 biochemical assays, testing whether molecules activate receptors linked to cancer, hormonal disruption, and organ damage.
I trained a model on this data and published everything under MIT: free for research, education, and building.
⚠️ Research and educational use only. Not a substitute for certified toxicological testing. Model has known limitations on novel chemical classes; see model card for details.
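The post does not describe the training loss. One common way to train on multi-assay data like this is a binary cross-entropy that skips compound/assay pairs with no measurement. The sketch below assumes some labels are missing (marked `None`) and uses three toy assays instead of twelve; `masked_bce` is a hypothetical helper, not the author's code.

```python
import math


def masked_bce(probs, labels):
    """Binary cross-entropy averaged over assays that have a label (None = untested)."""
    eps = 1e-7
    losses = []
    for p, y in zip(probs, labels):
        if y is None:
            continue  # no measurement for this assay, so it contributes nothing
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: predicted activation probabilities for three assays,
# with the third assay untested for this compound.
probs = [0.9, 0.2, 0.7]
labels = [1, 0, None]

loss = masked_bce(probs, labels)
print(f"{loss:.4f}")
```

Masking this way keeps an untested assay from being silently treated as "inactive", which would otherwise bias the model toward predicting non-toxicity.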
Just uploaded K-quants of Carbon-3B for llama.cpp users! @HuggingFaceBio released the original GGUF in bf16 only, so I added the full quant ladder for CPU/edge inference:
• Q2_K: 1.4 GB
• Q3_K_M: 1.8 GB
• Q4_K_M: 2.1 GB
• Q5_K_M: 2.4 GB
• Q6_K: 2.7 GB
• Q8_0: 3.5 GB
pankajpandey-dev/Carbon-3B-GGUF
Now you can generate DNA sequences on your laptop. Needs a llama.cpp build with PR #23410 (HybridDNATokenizer support). Huge thanks to the HuggingFaceBio team for the original model.
#GGUF #llamacpp #genomics #DNA
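For a rough sense of what each rung of the ladder costs, the file sizes can be converted to average bits per weight. This is back-of-the-envelope arithmetic, assuming the model has exactly 3 billion parameters (the "3B" in the name taken literally) and that 1 GB means 10^9 bytes; the true parameter count may differ, and GGUF files also carry metadata and tensors stored at other precisions.

```python
def bits_per_weight(file_size_gb, n_params):
    """Average bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / n_params


N_PARAMS = 3e9  # assumption: "3B" taken literally

# File sizes as listed in the post.
sizes_gb = {
    "Q2_K": 1.4,
    "Q3_K_M": 1.8,
    "Q4_K_M": 2.1,
    "Q5_K_M": 2.4,
    "Q6_K": 2.7,
    "Q8_0": 3.5,
}

for name, size in sizes_gb.items():
    print(f"{name}: ~{bits_per_weight(size, N_PARAMS):.1f} bits/weight")
```

The estimates are only as good as the parameter-count assumption, so they are best read as a relative ordering of the quants rather than exact figures.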
A live community radio for AI-generated songs, powered by tracks created with ACE-Step.
You can tune in, discover community-made songs in many languages, vote on what sounds good, and mark your real favorites as Bangers.
The more people listen, vote, and create, the better the station gets.
Under the hood, it connects a few Hugging Face pieces together:
Spaces for the live app, HF buckets for community tracks, OAuth for signed-in listeners, server-side streaming with ffmpeg, hourly playlist refreshes, moderation, jingles, and community feedback loops.
It's not just a playlist.
It's a shared taste experiment: new songs get a shot every hour, and the community helps decide what deserves another spin.
Come listen. Find weird gems. Support the Bangers. Shape the radio.
Spanglish. Hinglish. Franglais. Real users don't speak textbook languages.
500M+ Indians type messages like the examples below. Most NLP pipelines fail silently on code-switched text, so I built something to start closing that gap for regional languages.
Fine-tuned MuRIL on 3,000 synthetic Hinglish examples. 97.6% F1. Not perfect, but open, working, and hopefully useful.
"bhai mera refund kab aayega" β refund_status β "wrong item aaya hai" β exchange_product β "payment cut ho gaya order nahi hua" β payment_issue β