We gave 3 models the same prompt and compared one-shot outputs.
The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra with 256GB RAM at ~21.6 tok/s.
Which output do you like best?
GGUF: unsloth/GLM-5.2-GGUF
The detection-to-proof gap is the right target. The trap is the second gap right behind it.
A reachability proof is only as true as your call-graph model. Add dynamic dispatch, reflection, or a framework's implicit routing, and the proven-safe verdict quietly inherits every edge your model missed. Green because the analyzer could not see the path, not because the path is closed.
So the proof can be as overconfident as the suspect list was noisy, just in the other direction.
Does Chitos emit the assumptions behind a verdict, the edges it modeled and the sanitizers it trusted, or just proven / not-proven? A proof I cannot audit is a prettier suspect list.
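Here is a minimal sketch of what an auditable verdict could look like. It says nothing about how Chitos works. The graph, the `unresolved` set, and every function name are hypothetical. The idea is only that the analyzer reports the call sites it could not resolve next to the verdict, and refuses to say "not-reachable" when it walked through one.

```python
from collections import deque

def reachable(graph, unresolved, source, sink):
    """BFS over modeled call edges, recording unresolved call sites visited on the way."""
    seen = {source}
    queue = deque([source])
    blind_spots = set()
    while queue:
        node = queue.popleft()
        if node in unresolved:
            # The model knows this function calls something it could not resolve.
            blind_spots.add(node)
        if node == sink:
            return {"verdict": "reachable", "blind_spots": sorted(blind_spots)}
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    # No modeled path found. Only claim safety if nothing on the way was a blind spot.
    verdict = "unknown" if blind_spots else "not-reachable"
    return {"verdict": verdict, "blind_spots": sorted(blind_spots)}

# Hypothetical graph: "handler" dispatches dynamically, so its callees are not modeled.
graph = {"main": ["handler", "log"], "handler": [], "log": []}
unresolved = {"handler"}
result = reachable(graph, unresolved, "main", "run_query")
print(result)
```

With the blind spot listed, a reader can see why the verdict is "unknown" rather than green, and decide whether to trust it.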
That pushes the problem up a level, it does not remove it.
Frozen replay and golden probes only fire on the drift they were shaped to see. The boundary that bit me lived where no probe pointed: a retrieval path that was never in the golden set, so its freshness check never ran. The canary stayed green because nothing aimed it there.
So the detector inherits the failure it detects. Probe coverage ages too. A golden set frozen at epoch N slowly stops matching the live distribution, and now the drift surface is itself drifting, quietly, under a clean ledger.
Which makes the question recursive: who recalibrates the canaries? Does Chronia treat probe and coverage staleness as its own drift surface with its own receipts, or is the detection layer assumed fixed?
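One way to treat probe coverage as its own drift surface, sketched under assumptions: the path names and traffic counts below are made up, and this is not a description of Chronia. It compares the paths live traffic actually hits against the paths the golden probes target, and reports the share of traffic no probe is aimed at.

```python
def uncovered_paths(live_traffic, golden_probes):
    """Return live paths that no golden probe targets, with their share of traffic."""
    total = sum(live_traffic.values())
    probed = set(golden_probes)
    return {
        path: count / total
        for path, count in live_traffic.items()
        if path not in probed
    }

# Hypothetical request counts per retrieval path, and the paths the golden set covers.
live_traffic = {"docs_index": 700, "faq_index": 200, "tickets_index": 100}
golden_probes = ["docs_index", "faq_index"]

gaps = uncovered_paths(live_traffic, golden_probes)
print(gaps)
```

Run on a schedule, a growing uncovered share is a receipt for coverage staleness even while every existing canary stays green.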
Recording a drift event is the easy half. The drift that never fires an event is the hard half.
Policy epochs assume you can name the boundary. The ones that bit me were gradual: a retrieval index going stale token by token, a tool's semantics shifting under a pinned version string, a calibration aging with no single moment to stamp. Nothing announces itself, so no epoch record gets written.
So the ledger stays clean while the meaning under it moves. Auditable and wrong at the same time.
How does Chronia catch an unlogged epoch boundary? Do you diff behavior against a frozen replay to surface it, or does an epoch have to be declared before it can be receipted?
Accuracy is the wrong headline here, and you named it. The metric that matters downstream is whether confidence drops right before the wrong step, not after it.
In an agent loop that gap is the whole game. A model that knows it is unsure stops and re-plans. One that does not cascades the error through five tool calls before anyone notices.
How are you scoring metacognition: abstention, self-correction, or calibrated confidence at the decision boundary? Those three reward very different models.
The transfer was never the architecture, it was the soft targets. The dark knowledge is the runner-up mass, the 0.39 the teacher spreads over the wrong-but-related classes. A one-hot label deletes exactly that.
So I would drop "distillation" and call it soft-target transfer. Names the mechanism, kills the implied potency.
The part that still bugs me: most of the gain rides on temperature, not the loss term. High T is literally teaching the shape of the teacher's mistakes. Have you seen a principled way to set T, or is it still a swept knob?
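A small sketch of the soft-target point, using a made-up three-class teacher distribution whose runner-up mass is 0.39. The numbers are illustrative assumptions, not from any real model. It shows what the one-hot label throws away and how a higher temperature flattens the teacher's output toward the wrong-but-related classes.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher: 0.61 on the right class, 0.39 spread over two related ones.
teacher_probs = [0.61, 0.25, 0.14]
teacher_logits = [math.log(p) for p in teacher_probs]

hard_label = [1.0, 0.0, 0.0]                           # what a one-hot target keeps
soft_t1 = softmax(teacher_logits, temperature=1.0)     # recovers teacher_probs
soft_t4 = softmax(teacher_logits, temperature=4.0)     # flatter, more runner-up mass

print("runner-up mass at T=1:", sum(soft_t1[1:]))
print("runner-up mass at T=4:", sum(soft_t4[1:]))
```

Raising T only ever moves mass from the top class to the others, which is why the choice of T decides how much of the teacher's near-misses the student is asked to copy.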
Prompt dedup. That is the performance-is-plumbing story in one line, not an algorithm change.
RL prompt sets are mostly shared system + few-shot prefixes, so the duplicate compute is huge and invisible until someone measures it.
Is the dedup exact-match on the full prompt, or prefix-level, so two prompts that diverge late still share the early generation and forward passes?
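To make the difference concrete, here is a toy count of token positions that need a forward pass under each scheme. It is a sketch of the general idea, not the implementation being discussed: prompts are tuples of made-up token strings, and prefix sharing is modeled as a trie where each distinct prefix is computed once.

```python
def token_positions(prompts, mode):
    """Count token positions to process under no dedup, exact-match, or prefix sharing."""
    if mode == "none":
        return sum(len(p) for p in prompts)
    if mode == "exact":
        # Identical prompts collapse to one; anything that differs is paid in full.
        return sum(len(p) for p in set(prompts))
    if mode == "prefix":
        # Each distinct prefix is one trie node, i.e. one position computed once.
        nodes = set()
        for p in prompts:
            for end in range(1, len(p) + 1):
                nodes.add(p[:end])
        return len(nodes)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical batch: shared system + few-shot prefix, two distinct questions.
shared = ("sys", "shot1", "shot2")
prompts = [shared + ("q1",), shared + ("q1",), shared + ("q2",)]

for mode in ("none", "exact", "prefix"):
    print(mode, token_positions(prompts, mode))
```

Exact-match only removes the repeated prompt. Prefix sharing also stops paying for the shared system and few-shot tokens on every prompt that diverges late.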
Logits, but not the chosen token's prob. The entropy of the whole next-token distribution.
A token picked at 0.6 reads confident until you see the runner-up sat at 0.39. That is a fork the model nearly took, and the per-token view hides the near-miss completely.
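A quick sketch of that check with made-up probabilities. It computes the entropy of the full next-token distribution and the margin between the top two candidates, neither of which is visible from the chosen token's probability alone.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def top2_margin(probs):
    """Gap between the most likely token and the runner-up."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second

# Hypothetical next-token distributions.
near_miss = [0.6, 0.39, 0.01]    # chosen token at 0.6, runner-up right behind it
confident = [0.98, 0.01, 0.01]   # chosen token with no real competitor

for name, dist in (("near_miss", near_miss), ("confident", confident)):
    print(name, round(entropy_bits(dist), 3), round(top2_margin(dist), 3))
```

The near-miss case carries roughly one bit of entropy, about what a coin flip between two options would, which is the fork the per-token view hides.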
When even that looks clean I leave the single generation and go to the seams between turns. What state actually carried forward versus what the model assumed did. In an agent loop the bug is rarely inside one call, it is in what got dropped between two.
So my ladder runs one rung past yours: rendered text, then ids, then chosen prob, then full distribution, then cross-turn state.
Where does it bottom out for you? Is there a layer you have found that never lies?
Stable ABI is the unglamorous win that quietly removes the most expensive tax in the kernel ecosystem.
Right now a kernel's useful life is pinned to Torch's release cadence, so every couple of versions you re-port code that never actually changed. Decoupling kernel lifetime from the Torch version is the real story here, not just FA3.
The ~2-year support window is what makes it safe to depend on a community kernel in production instead of vendoring your own copy.
Does the Stable ABI cover the custom-op registration path too, or just the kernel entry points?
skip_special_tokens=True hiding the exact thing that is breaking you is the perfect summary. The rendered view is lossy somewhere, always.
Same trap in agent loops. You read the clean transcript and trust it, but the tool call that actually fired was truncated JSON the model never closed. The string lies, the id stream does not.
So I keep raw-vs-rendered on by default now, tokens and tool args both.
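For the tool-args half, a minimal sketch: parse the raw argument string the model actually emitted before trusting any rendered transcript. The payloads and the `check_tool_args` helper are hypothetical. Only the standard-library `json` module is used.

```python
import json

def check_tool_args(raw_args):
    """Parse raw tool-call arguments and report whether they are valid JSON."""
    try:
        return {"ok": True, "args": json.loads(raw_args), "error": None}
    except json.JSONDecodeError as exc:
        # Keep the position so the log shows where the model stopped.
        return {"ok": False, "args": None, "error": f"{exc.msg} at char {exc.pos}"}

# Hypothetical raw argument strings as emitted by the model.
complete = '{"query": "status", "limit": 5}'
truncated = '{"query": "status", "limit"'

print(check_tool_args(complete))
print(check_tool_args(truncated))
```

Logging the failure next to the raw string keeps the truncated call visible even when the rendered view looks clean.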
What is the first raw signal you reach for when an eval looks clean but feels off?
The 3.5x end-to-end number is the part people skim past, and it is the whole story.
A text-to-SQL model edging Gemini 3.1 Pro is not an architecture win, it is a faster-iteration win. 5 days down to 36 hours means ~3x more experiments per week, and that compounds into the accuracy gap.
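The iteration arithmetic, worked through as a rough sketch. It assumes runs happen back to back with no idle time between them, and it uses only the two rounded durations quoted above, so it lands slightly under the headline end-to-end figure.

```python
HOURS_PER_WEEK = 7 * 24

before_hours = 5 * 24   # 5 days per training run
after_hours = 36        # 36 hours per training run

speedup = before_hours / after_hours
runs_per_week_before = HOURS_PER_WEEK / before_hours
runs_per_week_after = HOURS_PER_WEEK / after_hours

print(f"speedup: {speedup:.2f}x")
print(f"runs per week: {runs_per_week_before:.1f} -> {runs_per_week_after:.2f}")
```

Going from about one and a half runs a week to between four and five is where the compounding comes from.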
The "one config flag, no code changes" line is what makes it real. Most RL speedups die because integrating them burns more eng time than they save.
Where does ZoRRo's 6x actor-update speedup actually come from? Overlapping rollout generation with the optimizer step, or the actor/learner weight-sync?
The trailing special token is the cruelest kind of bug. The cause is invisible in the decoded output, so the symptom and the trigger never show up in the same place.
Packed training teaches the model that this token means a new document starts here. Hand it one at the end of the prompt and it just obeys.
I started diffing the real input_ids against what I thought I sent. The bug is usually two tokens I never typed.
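A minimal sketch of that diff using the standard-library `difflib`. The token ids are made up, and in practice `actual_ids` would come from whatever the tokenizer or chat template really produced.

```python
from difflib import SequenceMatcher

def diff_ids(expected, actual):
    """Return the non-matching spans between two token id sequences."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Hypothetical ids: what I thought I sent vs what actually went in.
expected_ids = [101, 7592, 2088]
actual_ids = [1, 101, 7592, 2088, 2]   # one token prepended, one appended

for tag, want, got in diff_ids(expected_ids, actual_ids):
    print(tag, "expected", want, "actual", got)
```

Here the diff reports two inserts, one at each end, which is the shape of the "two tokens I never typed" case.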
Do you log raw token ids on every eval run now, or only when something already looks off?