TrueACT: A Different Kind of Neuron

Transformers use the same MLP on every token. Every time. Same weights, same math, no memory of where it's been in the sequence, no sense of how confident it is. TrueACT chucks that. Replaces the MLP with a small recurrent block that loops. Reads the token, updates a hidden state, checks if it's confident enough, either answers or loops again. Keeps going til it hits 0.99 confidence or runs out of steps.
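Here's a minimal sketch of that loop shape, not the actual TrueACT code. `act_loop`, `step_fn`, and `toy_step` are made-up names, and the way confidence accumulates (each step claims a fraction of whatever confidence is still missing) is an assumption. The 0.99 threshold and the 32-step cap are the only numbers taken from this writeup.

```python
def act_loop(step_fn, state, threshold=0.99, max_steps=32):
    """Call step_fn until accumulated confidence reaches threshold or steps run out.

    step_fn(state) -> (new_state, halt_prob) is a hypothetical callback that
    stands in for one pass through the recurrent block.
    """
    confidence = 0.0
    steps = 0
    while confidence < threshold and steps < max_steps:
        state, halt_prob = step_fn(state)
        # Assumed rule: claim halt_prob of the confidence that is still missing.
        confidence += (1.0 - confidence) * halt_prob
        steps += 1
    return state, confidence, steps


def toy_step(state):
    # Toy stand-in: bumps the state and always reports a 50% halting probability.
    return state + 1, 0.5


state, confidence, steps = act_loop(toy_step, state=0)
print(steps, confidence)
```

An easy token would report a high halting probability and leave after a step or two. A hard one keeps looping until the cap.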

It's not a standard neuron. It's not an attention head. It's a loop with a router that picks between four specialized operations depending on what the token needs.

Half the parameters of the equivalent standard model. Same loss. 2.8x slower to train because of the loop. You're trading FLOPs for parameter efficiency.

the four experts

Each TrueACT layer has four experts. The router picks which combination to use per token, not per layer. So different tokens in the same batch can fire different experts. Here's what they do:

Think Cell — this is the actual recurrent part. Updates the latent state, which is basically working memory that persists across steps within the same layer. Think of it like scratch paper the model scribbles on while reasoning.

Standard — plain linear pattern matching. Same job the normal MLP would do. Catches the easy stuff.

Fancy — this is the weird one. Does math in log-space. For multiplication, log(a*b) = log(a) + log(b). Addition is something a linear layer can already do. So instead of needing a pile of neurons to approximate a multiplication curve, one Fancy expert can do it cleanly. Log then add then exp. Multiplication, division, ratios, chained operations.
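To make the log trick concrete, here's a tiny sketch of one log-space unit. `log_space_neuron` is a hypothetical name, not the real Fancy expert. It also only works for strictly positive inputs, since log isn't defined at zero or below; how the actual expert deals with signs and zeros isn't covered here.

```python
import math


def log_space_neuron(inputs, weights):
    """Compute exp(sum_i w_i * log(x_i)), which equals the product of x_i ** w_i.

    Every input must be strictly positive.
    """
    return math.exp(sum(w * math.log(x) for x, w in zip(inputs, weights)))


# Weights (1, 1) give multiplication, (1, -1) give division.
product = log_space_neuron([6.0, 7.0], [1.0, 1.0])
ratio = log_space_neuron([6.0, 3.0], [1.0, -1.0])
print(product, ratio)
```

The only thing that changes between multiply and divide is the weight vector, and that weight vector is exactly what a linear layer learns.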

Memory Vault — key → value associative lookup. A dedicated place to store facts instead of smearing them across all the weights. Retrieve, don't approximate.
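One common way to build a differentiable key → value lookup is a softmax over query-key dot products. The sketch below uses that as a stand-in; whether the Memory Vault does it this way is an assumption. `vault_lookup` and `sharpness` are hypothetical names.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vault_lookup(query, keys, values, sharpness=10.0):
    """Soft lookup: weight each stored value by how well its key matches the query."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[42.0], [7.0]]
retrieved = vault_lookup([1.0, 0.0], keys, values)
print(retrieved)
```

With a high `sharpness` the lookup is close to a hard table read: the query matching the first key gets back something very near 42.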

The router takes [input, latent_state, step_count], sticks it through a linear layer plus softmax, and that's the expert selection. Standard and Fancy spend from an action budget. The loop stops once the budget is gone or it hits the 32-step cap, whichever comes first.
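A rough sketch of that router, in plain Python so the shapes are easy to see. The function name `route`, the toy dimensions, and the random weights are all placeholders; only "concat, linear, softmax over four experts" comes from the description above.

```python
import math
import random

EXPERTS = ("think", "standard", "fancy", "vault")


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, h, step_frac, weight, bias):
    """Linear layer over [input, latent_state, step] followed by a softmax."""
    features = x + h + [step_frac]  # list concatenation
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return dict(zip(EXPERTS, softmax(logits)))


rng = random.Random(0)
x, h = [0.2, -0.1, 0.5], [0.0, 0.3]
# One weight row per expert; random values stand in for trained weights.
weight = [[rng.uniform(-1, 1) for _ in range(len(x) + len(h) + 1)] for _ in EXPERTS]
bias = [0.0] * len(EXPERTS)
probs = route(x, h, 3 / 32, weight, bias)
print(probs)
```

Because the step count is part of the input, the same token can get a different expert mix on step 1 than on step 10.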

why this works

A standard transformer neuron is a linear approximator. For something like a * b = c, you'd need a big pile of neurons approximating a curve. It works eventually but it's wasteful. The weights end up encoding the same multiplication table across hundreds of parameters, and there's no clean way to just do the math.

The Fancy expert sidesteps that. Goes to log-space, adds, comes back. One neuron doing what used to take a crowd.

The Memory Vault is the same idea from the other direction. Instead of memorizing facts by storing them implicitly in weight matrices, just do a key-value lookup. Store it once, retrieve it when needed.

The Think Cell ties it together. Gives the model a place to hold intermediate state while it loops through the experts. Without it, each token is a one-shot guess. With it, the model can go "hmm let me think about this" and take another step.

So the model gets more mileage per parameter. The tradeoff is sequential compute. You can't parallelize a loop that depends on its own output. That's where the 2.8x slowdown comes from.

the numbers

3-layer LLaMA-style comparison at d=384:

metric	Standard	TrueACT
loss	0.0884	0.0880
params	852,864	428,652
train time (relative)	1x	2.8x

Same loss, roughly half the weights. The slowdown is real — the loop is sequential, can't be parallelized. But you're getting the same quality out of half the parameter budget.

1-layer arithmetic reasoner: 12/12 on a fixed 12-expression benchmark. 91.6% on 500 random expressions. The misses are mostly multi-digit multiplication: 42*88=3524 type stuff, where the right answer is 3696. Structure (parentheses, operator precedence, intermediate steps) comes out clean. The model actually writes out the worked steps: ((5*5)+(10*2))=(25+(10*2))=(25+20)=45|
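If you want to score outputs like that yourself, one way is to check that every step in the chain evaluates to the same number. This is a sketch of such a checker, not the benchmark script from the project; `evaluate` and `chain_is_consistent` are hypothetical helpers, and it only handles integers with +, -, * and parentheses.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expr):
    """Evaluate an integer expression using only +, -, * and parentheses."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported syntax in {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def chain_is_consistent(chain):
    """True if every '='-separated step in the chain has the same value."""
    steps = chain.rstrip("|").split("=")
    values = [evaluate(step) for step in steps]
    return all(v == values[0] for v in values)


print(chain_is_consistent("((5*5)+(10*2))=(25+(10*2))=(25+20)=45|"))  # True
print(chain_is_consistent("42*88=3524|"))  # False, 42*88 is 3696
```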

the router in action

The routing stats tell you what the model's doing under the hood. Example from the inference CLI:

Prompt > ((5*5)+(10*2))=
TrueACT : ((5*5)+(10*2))=(25+(10*2))=(25+20)=45|
         [Think: 15% | Stand: 30% | Fancy: 45% | Vault: 10%]

For arithmetic, Fancy gets most of the budget. Makes sense — multiplication is the expensive operation and Fancy handles it in log-space. Standard catches the easy pattern matching (digits, parens, equals signs). Think Cell does the state tracking across steps. Memory Vault probably handles the number facts.

The router isn't pre-programmed. It learns which expert to use for which kind of token during training. The routing stats are emergent.

how training works

Data is an infinite stream of generated arithmetic — +, -, *, parentheses, multi-step chains. Format is ((5*5)+(10*2))=(25+(10*2))=(25+20)=45|. The model sees a random position in the chain and has to predict the next character.
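The generator can be sketched roughly like this. It's a guess at the shape, not the real data pipeline: `random_expr`, `worked_chain`, and `sample_example` are hypothetical names, the operand range is arbitrary, and the reduction order (leftmost innermost parenthesis first) is inferred from the example chain. How the real stream treats negative intermediate results is unknown; this version just writes them with a leading minus.

```python
import random
import re

# Matches an innermost "(a op b)" whose operands are plain integers.
INNERMOST = re.compile(r"\((-?\d+)([+\-*])(-?\d+)\)")
APPLY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def random_expr(rng, depth):
    """Fully parenthesized expression with at least one operator (depth >= 1)."""
    def operand():
        if depth > 1 and rng.random() < 0.5:
            return random_expr(rng, depth - 1)
        return str(rng.randint(0, 10))
    left, right = operand(), operand()
    return f"({left}{rng.choice('+-*')}{right})"


def worked_chain(expr):
    """Reduce the leftmost innermost pair one step at a time, joined by '='."""
    steps = [expr]
    while True:
        match = INNERMOST.search(steps[-1])
        if match is None:
            break
        a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
        current = steps[-1]
        steps.append(current[:match.start()] + str(APPLY[op](a, b)) + current[match.end():])
    return "=".join(steps) + "|"


def sample_example(rng, chain, window=64):
    """Pick a random position; the context is the text before it, the target the next char."""
    pos = rng.randrange(1, len(chain))
    return chain[max(0, pos - window):pos], chain[pos]


rng = random.Random(0)
chain = worked_chain(random_expr(rng, 3))
context, target = sample_example(rng, chain)
print(chain, repr(context), repr(target))
```

Since every chain is generated on the fly, there's no fixed dataset to overfit to.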

Context window is 64 chars, one-hot encoded. The alphabet is 12 characters (digits, operators, parens, equals, pipe), so each window flattens to 64 × 12 = 768 input dimensions.

Batch size 8192. AdamW, lr 5e-4, weight decay 0.01. 1-3 layer TrueACTStack, t_dim=256, max 32 ACT steps per layer.

Training also runs a StandardStack (same structure, ordinary SiLU MLPs) side by side as the control group. Checkpoints save both every 500 steps.

the architecture, deeper

TrueACTLayer: concat input x and latent h → xh. Router reads [xh, step_frac] → softmax over 4 experts. Compute expert outputs, gate by router_prob * remaining_budget, accumulate into the result, update h through the Think Cell. Repeat til budget hits zero or 32 steps.
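Here's the TrueACTLayer description as a plain-Python sketch with scalar expert outputs. Treat it as a reading of the paragraph above, not the real layer: `router`, `experts`, and `think_update` are hypothetical callables, the `eps` cutoff stands in for "budget hits zero", and letting only Standard and Fancy spend budget follows the budget notes elsewhere in this writeup.

```python
import math

EXPERTS = ("think", "standard", "fancy", "vault")
BUDGETED = {"standard", "fancy"}  # assumed: only these two spend budget


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trueact_layer(x, h, router, experts, think_update, max_steps=32, eps=1e-3):
    """router(xh, step_frac) -> 4 logits; experts[name](xh) -> float;
    think_update(h, xh) -> new latent state."""
    budget, result, steps = 1.0, 0.0, 0
    while steps < max_steps and budget > eps:
        xh = x + h  # list concatenation: input followed by latent state
        probs = softmax(router(xh, steps / max_steps))
        spent = 0.0
        for name, prob in zip(EXPERTS, probs):
            gate = prob * budget  # router probability times remaining budget
            result += gate * experts[name](xh)
            if name in BUDGETED:
                spent += gate
        h = think_update(h, xh)
        budget -= spent
        steps += 1
    return result, h, steps


def uniform_router(xh, step_frac):
    return [0.0, 0.0, 0.0, 0.0]  # equal logits, so 25% per expert


toy_experts = {name: (lambda xh: sum(xh)) for name in EXPERTS}
result, h, steps = trueact_layer(
    [1.0, 2.0], [0.0], uniform_router, toy_experts,
    lambda h, xh: [0.5 * v + 0.1 for v in h],
)
print(steps, result)
```

With the uniform router, Standard and Fancy together take half of the remaining budget each step, so the loop exits well before the cap. A router that leans hard on Think Cell barely spends anything and runs all 32 steps.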

TrueACTStack: N of those layers with residual connections. Input projection to model dim at the bottom, output projection to vocab at the top.

StandardStack: same structure but with normal SiLU MLPs instead of the TrueACT loop. The control group.

The budget mechanism matters. Standard and Fancy both consume budget when used. Think Cell and Memory Vault, from how they're structured, seem to be state management rather than compute, so they don't appear to draw from the budget. The model can think (Think Cell) and retrieve (Memory Vault) freely. Only the expensive ops cost steps.

how it started

This thing began as one log-space neuron trying to learn x*y=z. That's it. One neuron doing multiplication in log-space.

41 notes later in MEMORY.md. Mode collapse. Gradient explosions. Dead architecture after dead architecture. Full rewrites. Things that almost worked before falling apart at higher dimensions.

The 41 notes on what didn't work are arguably more valuable than what did. Every dead end, every fix, every "wait that shouldn't have helped" moment. Built for AI agents to read so they don't repeat the same mistakes.

The four-expert router, the Think Cell, the budget gating, the step cap — none of that was in the original idea. Each piece got added because something broke without it.

looking forward

The toy results are promising. Half the params, same loss. The next step is figuring out if this scales past small arithmetic models, and what the loop overhead looks like at bigger sizes. That's the open question.

the 41st attempt finally worked. go read MEMORY.md if you wanna avoid the first 40.

tldr

Swap the transformer MLP for a recurrent block that loops til it's confident
Four experts: Think Cell (working memory), Standard (linear matching), Fancy (log-space math), Memory Vault (key→value lookup)
Router picks which experts fire per token based on input, latent state, and step count
Half the params (429k vs 853k), same loss (0.0880 vs 0.0884), 2.8x slower
1-layer solves 12/12 on fixed benchmark, 91.6% on 500 random arithmetic expressions
Fancy expert does log(a*b) = log(a)+log(b) — one neuron doing what used to take a crowd
Started as one log-space neuron, 41 failed notes later it's a whole architecture
Tradeoff: sequential compute for parameter efficiency
