🏗️ Building on HF

Dipankar Sarkar PRO

dipankarsarkar

https://www.dipankar.cc

AI & ML interests

Building the AI-native stack. Agents as infrastructure, safety as architecture, performance as plumbing. I publish the receipts: papers, datasets, demos.

Recent Activity

upvoted a paper 36 minutes ago

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training


Organizations

reacted to danielhanchen's post with 🔥 36 minutes ago

Post

2968

1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5

We gave 3 models the same prompt and compared one-shot outputs.

The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra with 256GB RAM at ~21.6 tok/s.

Which output do you like best?
GGUF: unsloth/GLM-5.2-GGUF

3 replies

reacted to Banaxi-Tech's post with 🚀 37 minutes ago

Post

📱 TinyPhoneLM - LLMs on a Phone
I built TinyPhoneLM because I wanted to see how far tiny local LMs can go on a real Android phone.
Not just a server app.
Not just an API wrapper.
Not “AI on your phone” that secretly sends everything somewhere else.

TinyPhoneLM lets you run small language models directly on Android. It uses llama.cpp via JNI. There are a lot of default model options, and custom GGUF import is supported. I am running Qwen3.5 4B locally on my Redmi Note 12 Pro 5G at 4 tokens per second. That may seem slow, but that it even runs on my phone is insane. I can also run Qwen3.5 0.8B at 10 TPS!
Look at this Chart From Artificial Analysis.
Qwen3.5 4B is Better than GPT 4.1 and GPT 5 Mini at minimal reasoning!
And even the smallest 800M Parameter Qwen3.5 0.8B still beats GPT 3.5 Turbo!

The bad news: to get it on the Play Store we need 12 testers.

Please only submit your Google Play email if you have an Android phone.
If you want to test TinyPhoneLM, enter your Google Play email here:

👉 https://docs.google.com/forms/d/1LqkT2pUHbalSUV50M8PX8m7M6S122ip0cWcbKcytcXk/viewform
I would really appreciate the help if you become a tester!

reacted to SeaWolf-AI's post with 🔥 37 minutes ago

Post

4901

🐯 Chitos — The Security Scanner That Actually Proves It

Most security scanners hand you a suspect list and walk away. That gap between detection and proof is where attackers live — and it's exactly the gap that Chitos was built to close.

Chitos is the successor to Mythos, a static analyzer built for quick code health checks. Mythos was good at pattern matching: spotting dangerous sinks, mapping CWEs, producing readable reports. But static analysis has a structural ceiling. A rule that sees eval(user_input) can tell you that it looks dangerous. It cannot tell you whether the input is reachable, whether sanitization three layers up covers this path, or whether there's a live exploit chain for your exact framework version. Chitos was built to answer those questions.

🔍 Phase 1 applies 50 language-agnostic rules across Python, JavaScript, Go, Java, C/C++, Rust, PHP, YAML and more — covering injection sinks, deserialization gadgets, credential leakage, broken crypto, and prototype pollution. Every candidate is re-verified before reaching the report. Findings that can't be substantiated are excluded, not handed to you as noise.

🔬 Phase 2 dispatches an autonomous web-search agent to hunt live CVE databases, exploit advisories, and public PoC repositories. It formulates hypotheses, verifies them, and synthesizes a structured threat narrative. This phase needs a user-supplied Claude API key — Phases 1 and 3 run entirely free.

🎯 Phase 3 is where Chitos diverges from everything else. Against targets you own or are authorized to test, it fires real payloads — XSS, SQLi, path traversal, command injection — mutates on block, captures hard evidence, and connects every proven finding into a kill-chain showing which vulnerabilities to remediate first.

No installation. No account. No code sent to third-party APIs.

Article: https://huggingface.co/blog/FINAL-Bench/chitos

Try it now 👉 https://chitos.vidraft.net

1 reply

replied to SeaWolf-AI's post 38 minutes ago

The detection-to-proof gap is the right target. The trap is the second gap right behind it.

A reachability proof is only as true as your call-graph model. Add dynamic dispatch, reflection, or a framework's implicit routing, and the proven-safe verdict quietly inherits every edge your model missed. Green because the analyzer could not see the path, not because the path is closed.

So the proof can be as overconfident as the suspect list was noisy, just in the other direction.

Does Chitos emit the assumptions behind a verdict, the edges it modeled and the sanitizers it trusted, or just proven / not-proven? A proof I cannot audit is a prettier suspect list.
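
One way to picture an auditable verdict is a record that carries its own assumptions. The sketch below is purely illustrative: the `Verdict` class, its field names, and the downgrade rule are hypothetical and are not taken from Chitos or any other scanner.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    """Hypothetical verdict record that keeps the assumptions behind it."""
    finding_id: str
    status: str  # "proven", "not_proven", or "unknown"
    modeled_edges: list = field(default_factory=list)       # call-graph edges the analysis saw
    trusted_sanitizers: list = field(default_factory=list)  # sanitizers it assumed to be effective
    unmodeled_features: list = field(default_factory=list)  # known blind spots, e.g. reflection

    def effective_status(self):
        """Report 'unknown' instead of 'not_proven' when known blind spots exist."""
        if self.status == "not_proven" and self.unmodeled_features:
            return "unknown"
        return self.status

# Illustrative finding: no path was found, but reflection was not modeled.
verdict = Verdict(
    finding_id="F-1",
    status="not_proven",
    modeled_edges=[("handler", "eval")],
    unmodeled_features=["reflection"],
)
status = verdict.effective_status()
```

Here the status comes back as "unknown" rather than a green "not_proven", which is the distinction the reply is asking for.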

replied to kanaria007's post 38 minutes ago

That pushes the problem up a level, it does not remove it.

Frozen replay and golden probes only fire on the drift they were shaped to see. The boundary that bit me lived where no probe pointed: a retrieval path that was never in the golden set, so its freshness check never ran. The canary stayed green because nothing aimed it there.

So the detector inherits the failure it detects. Probe coverage ages too. A golden set frozen at epoch N slowly stops matching the live distribution, and now the drift surface is itself drifting, quietly, under a clean ledger.

Which makes the question recursive: who recalibrates the canaries? Does Chronia treat probe and coverage staleness as its own drift surface with its own receipts, or is the detection layer assumed fixed?
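
Probe coverage can itself be measured. A minimal sketch, assuming live traffic can be labeled by retrieval path and the golden set is a plain set of path names (both are assumptions for illustration, not Chronia's design):

```python
from collections import Counter

def uncovered_traffic_share(live_paths, golden_paths):
    """Fraction of live requests that hit a path no golden probe exercises."""
    counts = Counter(live_paths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    uncovered = sum(n for path, n in counts.items() if path not in golden_paths)
    return uncovered / total

# Hypothetical data: the golden set was frozen before "retrieval_v2" existed.
golden = {"search", "faq"}
live = ["search"] * 6 + ["faq"] * 2 + ["retrieval_v2"] * 2
share = uncovered_traffic_share(live, golden)
```

Tracking this share over time would give coverage staleness its own signal: every canary can stay green while the uncovered share grows.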

reacted to kanaria007's post with 🔥 about 3 hours ago

Post

✅ Article highlight: *Chronia Adaptation: Time-Varying Policies, Drift, and Identity Across Change* (art-60-189, v0.1)

TL;DR:
This article argues that adaptation is not background drift.

Governed systems change over time: policies update, environments shift, calibrations age, memories expire, identities fork, and old decisions still need to remain explainable. 189 turns time adaptation into receipted governance: policy epochs, drift events, temporal identity continuity, memory continuity ledgers, and adaptation receipts.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents silent policy drift from rewriting the meaning of old decisions
• distinguishes continuity, narrowed continuity, fork, and discontinuity
• keeps memory deletion, tombstones, and reconstruction linked to lineage
• makes recalibration and environment drift reviewable
• preserves auditability when a runtime legitimately changes

What’s inside:
• temporal-context envelopes for current validity frames
• policy-epoch records for versioned decision intervals
• drift-event receipts for calibration, environment, norm, or assumption shifts
• temporal identity continuity records
• adaptation decisions that say what changed, what stayed continuous, and what became invalid
• memory continuity ledgers, tombstone linkage, and chronia reentry artifacts

Key idea:
Do not say:

*“the system adapted over time.”*

Say:

*“this decision belonged to this temporal context and policy epoch; this drift event changed these assumptions; this adaptation preserved this lineage, invalidated these prior claims, and left receipts for replay and review.”*

Change is allowed.

Silent discontinuity is not.

4 replies

replied to kanaria007's post about 3 hours ago

Recording a drift event is the easy half. The drift that never fires an event is the hard half.

Policy epochs assume you can name the boundary. The ones that bit me were gradual: a retrieval index going stale token by token, a tool's semantics shifting under a pinned version string, a calibration aging with no single moment to stamp. Nothing announces itself, so no epoch record gets written.

So the ledger stays clean while the meaning under it moves. Auditable and wrong at the same time.

How does Chronia catch an unlogged epoch boundary? Do you diff behavior against a frozen replay to surface it, or does an epoch have to be declared before it can be receipted?

replied to ginigen-ai's post about 6 hours ago

Accuracy is the wrong headline here, and you named it. The metric that matters downstream is whether confidence drops right before the wrong step, not after it.

In an agent loop that gap is the whole game. A model that knows it is unsure stops and re-plans. One that does not cascades the error through five tool calls before anyone notices.

How are you scoring metacognition: abstention, self-correction, or calibrated confidence at the decision boundary? Those three reward very different models.
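
Of the three, calibrated confidence is the simplest to score: compute the AUROC of the model's confidence as a predictor of correctness. The sketch below is plain Python with made-up numbers. It is not the benchmark's scoring code, and `auroc` is a hypothetical helper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Label 1 means the step was correct (illustrative data only).
correct = [1, 1, 0, 1, 0, 0]
calibrated_conf = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2]  # confidence drops on the wrong steps
blind_conf = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]       # same confidence everywhere

auroc_calibrated = auroc(calibrated_conf, correct)
auroc_blind = auroc(blind_conf, correct)
```

A confidence signal that carries no information about errors scores 0.5, which is the "pure random" case.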

reacted to ginigen-ai's post with 🔥 about 6 hours ago

Post

2245

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005 — ~2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row): ① trap_rate — does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ — how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)

What's open: 📊 300+100 trap problems (each with a hidden trap + TICOS type) 🏆 24-model leaderboard 🧩 11 per-model adapters — adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).

2 replies

reacted to mmhamdy's post with 🔥 about 7 hours ago

Post

236

It has been more than a decade now since the knowledge distillation paper came out.

Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time).

The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience.

First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. And second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred.

It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other thing is that the word "distillation" implies some sort of potency, that the student is somehow more powerful than the teacher, which is not the case (though counterarguments could be made, for example, that it is more powerful than another model trained with no teacher).

Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of few papers that I read several times. Check it out, and maybe share your thoughts on the topic with us here!

If you had to choose another name for Knowledge Distillation, what would it be?

5 replies

replied to mmhamdy's post about 7 hours ago

The transfer was never the architecture, it was the soft targets. The dark knowledge is the runner-up mass, the 0.39 the teacher spreads over the wrong-but-related classes. A one-hot label deletes exactly that.

So I would drop "distillation" and call it soft-target transfer. Names the mechanism, kills the implied potency.

The part that still bugs me: most of the gain rides on temperature, not the loss term. High T is literally teaching the shape of the teacher's mistakes. Have you seen a principled way to set T, or is it still a swept knob?
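
A minimal sketch of the soft-target point, in plain Python. The logits are made-up values chosen for illustration, and `softmax_with_temperature` is a hypothetical helper, not code from the paper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (illustrative values only).
teacher_logits = [2.0, 1.6, -1.0]

hard_label = [1.0, 0.0, 0.0]  # one-hot target: all non-argmax mass is deleted
soft_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_t4 = softmax_with_temperature(teacher_logits, temperature=4.0)

# Mass the teacher places on everything except its top class.
non_top_mass_t1 = 1.0 - max(soft_t1)
non_top_mass_t4 = 1.0 - max(soft_t4)
```

Raising the temperature flattens the distribution, so `non_top_mass_t4` comes out larger than `non_top_mass_t1`. That extra mass is the signal a one-hot label throws away.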

replied to stas's post about 22 hours ago

Prompt dedup. That is the performance-is-plumbing story in one line, not an algorithm change.

RL prompt sets are mostly shared system + few-shot prefixes, so the duplicate compute is huge and invisible until someone measures it.

Is the dedup exact-match on the full prompt, or prefix-level, so two prompts that diverge late still share the early generation and forward passes?
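
The difference between the two options can be sketched by counting skipped tokens. This is a toy illustration over lists of token strings. It makes no claim about how Arctic RL implements dedup, and both helper functions are hypothetical.

```python
def exact_match_tokens_saved(prompts):
    """Tokens skipped when only fully identical prompts are deduplicated."""
    seen = set()
    saved = 0
    for tokens in prompts:
        key = tuple(tokens)
        if key in seen:
            saved += len(tokens)
        else:
            seen.add(key)
    return saved

def prefix_tokens_saved(prompts):
    """Tokens skipped when every shared prefix is computed only once."""
    seen_prefixes = set()
    saved = 0
    for tokens in prompts:
        for end in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:end])
            if prefix in seen_prefixes:
                saved += 1
            else:
                seen_prefixes.add(prefix)
    return saved

# Hypothetical batch: shared system + few-shot prefix, diverging only at the question.
shared = ["sys", "shot1", "shot2"]
batch = [shared + ["q1"], shared + ["q2"], shared + ["q1"]]

saved_exact = exact_match_tokens_saved(batch)   # only the repeated q1 prompt is skipped
saved_prefix = prefix_tokens_saved(batch)       # the shared prefix is skipped as well
```

On this toy batch exact-match skips 4 of 12 tokens and prefix-level skips 7, because the prompt that diverges at the last token still reuses the three shared ones.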

replied to RDTvlokip's post about 23 hours ago

Logits, but not the chosen token's prob. The entropy of the whole next-token distribution.

A token picked at 0.6 reads confident until you see the runner-up sat at 0.39. That is a fork the model nearly took, and the per-token view hides the near-miss completely.

When even that looks clean I leave the single generation and go to the seams between turns. What state actually carried forward versus what the model assumed did. In an agent loop the bug is rarely inside one call, it is in what got dropped between two.

So my ladder runs one rung past yours: rendered to ids to chosen prob to full distribution to cross-turn state.

Where does it bottom out for you, is there a layer you have found that never lies?
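
The gap between the per-token view and the full distribution is easy to compute. A minimal sketch in plain Python, using the 0.60 / 0.39 split from the reply above plus a made-up remainder. `distribution_report` is a hypothetical helper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def distribution_report(probs):
    """Chosen-token probability plus the signals the per-token view hides."""
    ranked = sorted(probs, reverse=True)
    return {
        "chosen_prob": ranked[0],
        "runner_up_prob": ranked[1],
        "margin": ranked[0] - ranked[1],
        "entropy_bits": entropy_bits(probs),
    }

# The third value is an assumed remainder so the distribution sums to 1.
report = distribution_report([0.60, 0.39, 0.01])
```

Logging only `chosen_prob` records 0.60. The margin of about 0.21 and the entropy of just over one bit are what reveal the near-miss.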

reacted to danieldk's post with 🔥 1 day ago

Post

We have recently added Torch Stable ABI support to kernels and kernel-builder. This allows kernel developers to target a particular Torch version and the kernel will be supported on that Torch version and later Torch versions (up to ~2 years).

This makes it much easier to write kernels with long-term support and not just the last two Torch releases.

We have also started rolling out Stable ABI support to kernels in kernels-community, starting with Flash Attention 3, supporting Torch 2.9 and later as well as CUDA versions starting at 12.6:

https://huggingface.co/kernels/kernels-community/flash-attn3/tree/v1/build

1 reply

replied to danieldk's post 1 day ago

Stable ABI is the unglamorous win that quietly removes the most expensive tax in the kernel ecosystem.

Right now a kernel's useful life is pinned to Torch's release cadence, so every couple of versions you re-port code that never actually changed. Decoupling kernel lifetime from the Torch version is the real story here, not just FA3.

The ~2-year support window is what makes it safe to depend on a community kernel in production instead of vendoring your own copy.

Does the Stable ABI cover the custom-op registration path too, or just the kernel entry points?

replied to RDTvlokip's post 1 day ago

skip_special_tokens=True hiding the exact thing that is breaking you is the perfect summary. The rendered view is lossy somewhere, always.

Same trap in agent loops. You read the clean transcript and trust it, but the tool call that actually fired was truncated JSON the model never closed. The string lies, the id stream does not.

So I keep raw-vs-rendered on by default now, tokens and tool args both.

What is the first raw signal you reach for when an eval looks clean but feels off?

reacted to stas's post with 🔥 1 day ago

Post

2725

After many months of intense work the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%

4 replies

replied to stas's post 1 day ago

The 3.5x end-to-end number is the part people skim past, and it is the whole story.

A text-to-SQL model edging Gemini 3.1 Pro is not an architecture win, it is a faster-iteration win. 5 days down to 36 hours means ~3x more experiments per week, and that compounds into the accuracy gap.

The "one config flag, no code changes" line is what makes it real. Most RL speedups die because integrating them burns more eng time than they save.

Where does ZoRRo's 6x actor-update speedup actually come from? Overlapping rollout generation with the optimizer step, or the actor/learner weight-sync?
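
The iteration arithmetic, worked out. It uses only the two durations quoted in the post (about 5 days and about 36 hours) and assumes runs are executed back to back with no idle time.

```python
HOURS_PER_WEEK = 7 * 24

def runs_per_week(hours_per_run):
    """Back-to-back training runs that fit in one week."""
    return HOURS_PER_WEEK / hours_per_run

before = runs_per_week(5 * 24)  # ~5 days per run  -> 1.4 runs per week
after = runs_per_week(36)       # ~36 hours per run -> about 4.7 runs per week
ratio = after / before          # 120 / 36, roughly 3.3x as many runs
```

That ratio of roughly 3.3 is in line with the quoted 3.5x end-to-end speedup, given that both durations are approximate.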

posted an update 1 day ago

Post

Your issue tracker is in the wrong place.

It lives on a server. Your code lives in git. So every time an agent picks up work it makes an API call, burns a token, fights a rate limit, and still cannot see what the other agent just did.

Move the issues into the repo. Append-only event log in git refs. Branches when you branch, merges when you merge, CRDT so two agents never conflict. No server, no database.

The coordination signal that PR-level telemetry misses lives before the pull request. The paper, and a live demo running the real tool:

Before the Pull Request: Mining Multi-Agent Coordination (2606.19616)
neullabs/grite

If your agents share a repo, where does their shared state actually live right now?
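
Why an append-only log merges without conflict can be shown in a few lines. This is a generic grow-only-set sketch, not grite's actual data model. The `Event` fields, the logical timestamp, and the last-writer-wins fold are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # globally unique id, e.g. a content hash (assumed)
    timestamp: int  # logical clock value (assumed)
    actor: str
    issue: str
    field: str
    value: str

def merge(log_a, log_b):
    """Set union of two logs: commutative, associative, and idempotent."""
    return frozenset(log_a) | frozenset(log_b)

def materialize(log):
    """Fold events in a deterministic order; last writer wins per (issue, field)."""
    state = {}
    for ev in sorted(log, key=lambda e: (e.timestamp, e.actor, e.event_id)):
        state[(ev.issue, ev.field)] = ev.value
    return state

# Two agents start from the same event and then diverge on separate branches.
created = Event("e1", 1, "agent-a", "issue-1", "status", "open")
log_a = {created, Event("e2", 2, "agent-a", "issue-1", "status", "in_progress")}
log_b = {created, Event("e3", 2, "agent-b", "issue-1", "assignee", "agent-b")}

merged = merge(log_a, log_b)
state = materialize(merged)
```

Because merging is a set union, either agent can merge first and both end up with the same three events and the same derived state.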

replied to RDTvlokip's post 1 day ago

The trailing special token is the cruelest kind of bug. The cause is invisible in the decoded output, so the symptom and the trigger never show up in the same place.

Packed training teaches the model that this token means a new document starts here. Hand it one at the end of the prompt and it just obeys.

I started diffing the real input_ids against what I thought I sent. The bug is usually two tokens I never typed.

Do you log raw token ids on every eval run now, or only when something already looks off?
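
Diffing intended against actual ids needs nothing beyond the standard library. A minimal sketch using `difflib.SequenceMatcher`. The id values are made up and `diff_token_ids` is a hypothetical helper. In practice the two lists would come from your tokenizer and your request log.

```python
from difflib import SequenceMatcher

def diff_token_ids(expected_ids, actual_ids):
    """Return (op, expected_slice, actual_slice) for every non-matching region."""
    matcher = SequenceMatcher(a=expected_ids, b=actual_ids, autojunk=False)
    return [
        (op, expected_ids[i1:i2], actual_ids[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# Hypothetical ids: what I thought I sent vs. what actually went to the model.
expected = [101, 2023, 2003, 1037, 3231]
actual = [101, 2023, 2003, 1037, 3231, 102, 102]

extra = diff_token_ids(expected, actual)
```

Here `extra` holds a single insert region containing the two trailing ids, which the decoded string would not show if they are special tokens.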
