Sequential-Hidden-Decoding-8B-n2

This is the n=2 variant of Sequential Hidden Decoding, a method that scales sequence length by n× while adding only Embedding parameters: the Transformer is unchanged, and each token receives more compute.

  • Base model: Qwen3-8B-Base
  • Scale: 2×
  • Additional Embedding Params: 1.9B
  • Training Tokens: 75B
  • Dtype: bfloat16

Note: This is a base model (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. The next-token loss is computed only at the position of each token's last embedding, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
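The interleaving step can be illustrated with a toy sketch. This is not the released implementation: the function names, the toy sizes, and the exact ordering (all n embeddings of one token placed next to each other, followed by those of the next token) are assumptions made for illustration.

```python
import random


def interleave_embeddings(token_ids, tables):
    """Return the n*T embedding vectors that would be fed to the Transformer.

    token_ids: list of T token ids
    tables:    list of n embedding tables, each a list of vocab_size vectors
    Assumed order: [tok0/table0, tok0/table1, ..., tok1/table0, tok1/table1, ...]
    """
    return [table[tok] for tok in token_ids for table in tables]


def loss_mask(num_tokens, n):
    """True only at the last of the n positions that belong to each token."""
    return [(pos % n) == n - 1 for pos in range(num_tokens * n)]


random.seed(0)
vocab_size, d_model, n = 10, 4, 2  # toy sizes, not the model's real dimensions

# n independent embedding tables with random toy weights
tables = [
    [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]
    for _ in range(n)
]
token_ids = [3, 7, 1]

sequence = interleave_embeddings(token_ids, tables)
mask = loss_mask(len(token_ids), n)

print(len(sequence))  # 6 positions for 3 tokens when n = 2
print(mask)           # [False, True, False, True, False, True]
```

Under these assumptions, a 3-token input becomes a 6-position sequence for n=2, and only positions 1, 3, and 5 would contribute to the next-token loss.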

Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Serving (SGLang)

This model requires a patched version of SGLang for inference. See the project page for installation options (Docker image, forked repo, or manual patch).

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128

# Query the running server through its OpenAI-compatible completions endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
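Because this is a base model, few-shot prompting is a common way to steer completions toward a task. The helper below is a minimal sketch: `build_few_shot_prompt` and its "Question"/"Answer" template are hypothetical and are not the format used for the benchmark results above.

```python
def build_few_shot_prompt(examples, query, q_label="Question", a_label="Answer"):
    """Format (question, answer) pairs followed by an unanswered query.

    The template is illustrative only; adapt the labels to your task.
    """
    blocks = [f"{q_label}: {q}\n{a_label}: {a}" for q, a in examples]
    # Leave the final answer empty so the model completes it
    blocks.append(f"{q_label}: {query}\n{a_label}:")
    return "\n\n".join(blocks)


examples = [("What is 2 + 3?", "5"), ("What is 7 - 4?", "3")]
prompt = build_few_shot_prompt(examples, "What is 6 + 1?")
print(prompt)
```

The resulting string can be passed as the `prompt` argument of `client.completions.create`. Adding `stop=["\n\n"]` to that call is one way to end generation after a single answer.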

All Models

| Model | Scale | Embedding Params | Training Tokens |
|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | 2× | 1.9B | 75B |
| Sequential-Hidden-Decoding-8B-n4 | 4× | 3.1B | 150B |
| Sequential-Hidden-Decoding-8B-n8 | 8× | 5.6B | 187B |

Citation

@article{hidden_decoding_2026,
  title   = {Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

License

This model is released under the License Terms of Sequential-Hidden-Decoding.
