Kurtis-EON1
Kurtis-EON (codename Kurtis-EON1) is an experimental 486M-parameter instruction-tuned language model built on the custom Echo-DSRN (Dual State Recurrent Neural Network) architecture.
This repository will host the Supervised Fine-Tuned (SFT) and aligned iteration of the model.
The foundational pre-trained weights will be hosted separately at ethicalabs/Echo-DSRN-486M.
Work in Progress: This model is currently under active development.
The Architectural Philosophy: Transformers vs. Echo-DSRN
O(1) Memory & "Infinite" Context
Kurtis-EON1 replaces the Transformer's $O(N^2)$ attention and its ever-growing KV-cache with a continuously evolving Recurrent State. It can process input streams of unbounded length by compressing history into a dense, bounded vector, ensuring constant inference cost and no memory blow-up.
- Transformer: Acts as a photographic memory. It stores every single token perfectly in a massive cache, but becomes increasingly expensive in memory and compute as the context window grows.
- Kurtis-EON1 (Echo-DSRN): Mimics human memory and Predictive Coding. It compresses the past into a semantic "feeling" (State) rather than a raw recording (Cache). You remember the gist of your life, not every single word spoken to you. The model operates on the same principle, saving immense hardware resources.
Think of the model like human memory. You can live for 80 years (infinite context), but you don't remember exactly what you ate for breakfast in Berlin on February 2, 2016, or why you were working on LSTMs/RNNs in an empty flat at the time, trying to build a chatbot because you felt alone. You remember the gist of your life. The model compresses the past into a feeling (State), rather than a recording (Cache).
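The bounded-memory claim can be sketched in a few lines. The recurrence below is a hypothetical toy stand-in (the actual Echo-DSRN update rule is more elaborate, and `d = 64` is an illustrative size, not the model's real hidden dimension); the point is that the state never grows with sequence length:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative hidden size, not the model's real dimension

# Toy recurrence standing in for the Echo-DSRN state update:
# every incoming token is folded into one fixed-size vector.
W = rng.standard_normal((d, d)) / np.sqrt(d)

state = np.zeros(d)
for _ in range(10_000):                  # an arbitrarily long input stream
    token_emb = rng.standard_normal(d)
    state = np.tanh(W @ state + token_emb)

# Unlike a KV-cache, memory is O(1): still a single (64,) vector.
print(state.shape)
```

A Transformer processing the same stream would hold 10,000 key/value pairs per layer; here the entire history lives in one bounded tensor.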
Scaling Strategy: The 114M Prototyping Sandbox
Before expending massive compute budgets on half-billion or billion-parameter runs, the Echo-DSRN architecture relies on a strict prototyping scale.
The 114M parameter version (hosted at ethicalabs/Echo-DSRN-114M-v0.1.2-Base) acts as our architectural wind tunnel. It allows for the rapid iteration of the complex physics governing the continuous memory state—testing the softplus stability of the surprise gates, the Test-Time Training (TTT) meta-learning loops, and custom Stage 1/Stage 2 SFT loss masking—in hours instead of weeks on single-node hardware.
Once the mathematical physics are proven and stabilized at the 114M scale, the exact same architecture is deterministically upscaled to ~0.5B (486M), ~1B, and ~3B parameter classes to absorb enterprise-grade latent knowledge.
Overview: The "Surprise-Gated" Mechanism
Unlike standard recurrent models or hybrid SSMs that use opaque learned gates, the Echo-DSRN architecture mathematically anchors its memory to Information Entropy:
- Internal Prediction: The model constantly attempts to predict the next token representation based on its hidden state.
- Surprise $\lambda$ (Lambda): It calculates the quadratic error between its prediction and reality. If a word is highly predictable (filler words), the memory gate stays shut. If the word is highly novel or complex (the "Surprise"), the gate flies open, explicitly prioritizing the $O(1)$ state capacity for high-value information.
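The two steps above can be illustrated with a toy gate. This is a hedged sketch, not the repository's actual code: the real Echo-DSRN gate is learned, and the exact parameterization is internal; here the quadratic error is simply squashed through a numerically stable softplus so that a perfectly predicted token yields a closed gate:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + e^x) without overflow for large x.
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def surprise_gate(pred, target):
    """Toy surprise gate: 0 when the token was perfectly predicted,
    approaching 1 as the prediction error (the "surprise") grows.
    Hypothetical sketch; the real Echo-DSRN gate is learned."""
    lam = np.sum((target - pred) ** 2)             # quadratic surprise (lambda)
    return np.tanh(softplus(lam) - softplus(0.0))  # shift so lam=0 -> gate 0

pred = np.array([0.9, 0.1, 0.0])
print(surprise_gate(pred, pred))                       # predictable: gate stays shut
print(surprise_gate(pred, np.array([0.0, 0.0, 5.0])))  # novel: gate flies open
```

Filler words produce a near-zero gate and barely touch the state; surprising tokens push the gate toward 1, claiming the bounded state capacity for high-value information.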
Interactive Demos
Experience the architecture in real-time through our public Gradio Spaces:
Echo-DSRN-114M-Base: Next Word Prediction
Watch the continuous memory operate. This standalone interactive widget visualizes the raw logit confidence and structural state transitions token-by-token.
Echo-DSRN-114M: The Semantic Compressor
Watch the continuous memory physically compress long-context history into a dense, semantic state without expanding the standard attention cache.
Data & Status
- Architecture: Hybrid Echo-DSRN (Surprise-Gated Slow State + RoPE Sliding Window Fast State)
- Base Pre-training: Trained from scratch on FineWeb-EDU and Smoltalk2.
- Instruct Alignment: Fine-tuned on multiple datasets using the Muon optimizer.
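The hybrid architecture listed above can be sketched as two parallel memory paths: a bounded slow state updated on every token, and a fast path that only ever looks at the last few tokens. Everything in this sketch is a toy stand-in (mean pooling instead of real RoPE sliding-window attention, hand-picked decay constants, illustrative sizes), but it shows why the combined footprint stays constant:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)
d, window = 16, 8                      # illustrative sizes, not the real config

slow_state = np.zeros(d)               # surprise-gated slow state (toy update)
fast_cache = deque(maxlen=window)      # only the last `window` tokens are kept

for _ in range(1_000):
    emb = rng.standard_normal(d)
    slow_state = np.tanh(0.9 * slow_state + 0.1 * emb)  # bounded recurrence
    fast_cache.append(emb)

    # Toy "fast path": mean over the sliding window, standing in for
    # RoPE sliding-window attention in the real architecture.
    fast_out = np.mean(fast_cache, axis=0)
    hidden = slow_state + fast_out     # hybrid combination (illustrative)

print(len(fast_cache), slow_state.shape)  # cache capped at 8; state fixed size
```

The fast path gives precise local recall over recent tokens; the slow state carries the compressed long-range gist. Neither grows with total sequence length.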
Instruct Model (Kurtis-EON1-Echo-DSRN-486M)
Work in progress. Stand by for updated telemetry and safety evaluations following the integration of our Muon-optimizer SFT and DPO sweeps, completion-only loss masking, and hybrid attention patches.
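Completion-only loss masking, mentioned above, is the standard SFT trick of computing cross-entropy only on the assistant's tokens. A minimal sketch, assuming the conventional `-100` ignore index (the repository's actual masking code, including its Stage 1/Stage 2 variants, may differ):

```python
import numpy as np

IGNORE_INDEX = -100  # conventional ignore index for cross-entropy losses

def completion_only_labels(token_ids, prompt_len):
    """Return labels with prompt tokens masked out, so the loss is
    computed only on the completion (assistant) span.
    Hypothetical helper; not the repo's actual implementation."""
    labels = np.array(token_ids, dtype=np.int64)
    labels[:prompt_len] = IGNORE_INDEX
    return labels

# Prompt occupies the first 4 positions; loss applies to the last 3 only.
labels = completion_only_labels([5, 8, 2, 9, 14, 3, 7], prompt_len=4)
print(labels)  # [-100 -100 -100 -100   14    3    7]
```

Masking the prompt prevents the model from spending gradient budget on re-learning to echo user text, which matters most for short completions attached to long prompts.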