Instructions to use roskosmos19/Rhea-4B-Coding with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use roskosmos19/Rhea-4B-Coding with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="roskosmos19/Rhea-4B-Coding")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Rhea-4B-Coding is a text-only Qwen3 model, so the causal-LM auto class is used here
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("roskosmos19/Rhea-4B-Coding")
model = AutoModelForCausalLM.from_pretrained("roskosmos19/Rhea-4B-Coding")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use roskosmos19/Rhea-4B-Coding with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "roskosmos19/Rhea-4B-Coding"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "roskosmos19/Rhea-4B-Coding",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/roskosmos19/Rhea-4B-Coding

SGLang

How to use roskosmos19/Rhea-4B-Coding with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "roskosmos19/Rhea-4B-Coding" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "roskosmos19/Rhea-4B-Coding",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "roskosmos19/Rhea-4B-Coding" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "roskosmos19/Rhea-4B-Coding",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use roskosmos19/Rhea-4B-Coding with Docker Model Runner:
```
docker model run hf.co/roskosmos19/Rhea-4B-Coding
```

Rhea 4B Coding

Rhea-4B-Coding is an optimized version of Aquiles-ai/Athenea-4B-Coding, specialized in code reasoning, debugging, agentic tools and multi-pass problem solving.

Trained on high-quality programming data with explicit reasoning traces delimited by `<think>` and `</think>` tags, the model is designed to perform detailed 3-pass reasoning for software development, algorithm design, and code comprehension tasks:

Pass 1: First implementation
Pass 2: Self-review for bugs, edge cases, security, performance
Pass 3: Final optimized version with identical functionality

⚠️ Important Note: This model uses an uncensored base version, providing full expressive freedom and unrestricted output generation. Users are fully responsible for any use or content produced by the model. It is intended exclusively for research and experimentation purposes.

🎯 Model Description

Rhea-4B-Coding extends Athenea-4B-Coding's structured reasoning capabilities into programming-related domains with multi-pass processing, showing strong performance on logical problem-solving, code completion, debugging scenarios, and iterative code refinement.

Key features:

Multi-Pass Processing: 3-step reasoning with <think>, <review>, and <final> tags
Agentic Tools for AI Agents
Step-by-step code reasoning within thinking blocks
Self-review capabilities for bug detection and optimization
Specialization in algorithmic and debugging tasks
Uncensored output generation for full reasoning visibility
Improved logical consistency through focused fine-tuning
Compatible with open inference frameworks (Transformers, vLLM, etc.)

The model was fine-tuned using the dataset Aquiles-ai/Athenea-Coding-100k, which includes diverse programming challenges, structured reasoning chains, and natural language explanations across multiple programming languages.

🔄 Multi-Pass Architecture

The model uses special tokens for structured reasoning:

| Token | Purpose |
|---|---|
| `<think>` | Start of self-review phase (Pass 2) |
| `</think>` | End of self-review phase |
| `<review>` | Start of review results documentation |
| `</review>` | End of review results |
| `<final>` | Start of final optimized version (Pass 3) |
| `</final>` | End of final version |

This structure ensures identical functionality across all passes while improving code structure, comments, and robustness.
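
To work with the tagged sections programmatically, the output can be split on the tokens listed above. The following is a minimal sketch, assuming each section appears at most once and is properly closed; the helper name `split_passes` and the sample text are illustrative, not part of the model's tooling, and real output may be laid out differently.

```python
import re

# Tag names come from the token table; the layout of real output is an assumption.
SECTION_TAGS = ("think", "review", "final")

def split_passes(text: str) -> dict:
    """Return the content of each tagged section, or None if it is absent or unclosed."""
    sections = {}
    for tag in SECTION_TAGS:
        # DOTALL lets a section span multiple lines; the match is non-greedy.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections

# Hand-written sample standing in for decoded model output.
sample = (
    "def add(a, b): return a + b\n"
    "<think>No input validation.</think>\n"
    "<review>Add type hints.</review>\n"
    "<final>def add(a: int, b: int) -> int:\n    return a + b</final>"
)
parsed = split_passes(sample)
print(parsed["review"])
```

Decoding with `skip_special_tokens=False`, as the usage examples do, keeps the tags in the text so that a parser like this has something to match.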

💻 Usage

Installation

uv pip install transformers torch accelerate

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("Roskosmos19/Rhea-4B-Coding",
        dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        attn_implementation="flash_attention_2") # Requires flash-attn

# Without flash-attn:
# model = AutoModelForCausalLM.from_pretrained("Roskosmos19/Rhea-4B-Coding",
#     dtype="auto",
#     device_map="auto"
# )

tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, write a Python function that calculates the factorial of a number recursively."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto"

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=16384,  # Increased for multi-pass output
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print the output
print(tokenizer.decode(output[0], skip_special_tokens=False))

Multi-Pass Inference (Recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("Roskosmos19/Rhea-4B-Coding",
        dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True)

def generate_multi_pass(prompt, max_tokens_per_pass=4096):
    """
    Generate code with 3-pass reasoning:
    Pass 1: First implementation
    Pass 2: Self-review
    Pass 3: Final optimized version
    """
    
    # Pass 1: First implementation
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    
    with torch.no_grad():
        output1 = model.generate(
            **inputs,
            max_new_tokens=max_tokens_per_pass,
            temperature=0.4,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    pass1 = tokenizer.decode(output1[0], skip_special_tokens=False)
    
    # Pass 2: Self-review
    review_prompt = pass1 + "\n<think>\n### PASS 2 - Self-Review:\n"
    inputs2 = tokenizer(review_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        output2 = model.generate(
            **inputs2,
            max_new_tokens=2048,
            temperature=0.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    review = tokenizer.decode(output2[0], skip_special_tokens=False)
    
    # Pass 3: Final version
    final_prompt = review + "\n<final>\n### PASS 3 - Final Version:\n"
    inputs3 = tokenizer(final_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        output3 = model.generate(
            **inputs3,
            max_new_tokens=max_tokens_per_pass,
            temperature=0.2,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    final = tokenizer.decode(output3[0], skip_special_tokens=False)
    
    return {
        "pass1": pass1,
        "review": review,
        "pass3": final
    }

# Example usage
result = generate_multi_pass("Write a Python function for binary search")
print("=== PASS 1 ===")
print(result["pass1"])
print("\n=== REVIEW ===")
print(result["review"])
print("\n=== FINAL ===")
print(result["pass3"])
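
Each value returned by `generate_multi_pass` holds the full decoded sequence, so the review and final entries repeat everything that came before them. The sketch below shows one possible variant that keeps only the newly generated text per pass. The `generate` callable is a hypothetical stand-in that takes a text prompt and returns only the continuation; with Transformers it could wrap `model.generate` and decode the tokens after the prompt length. A stub is used here so the control flow can be checked without loading the model.

```python
from typing import Callable, Dict

def run_three_passes(prompt: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Chain three generations, storing only the new text from each pass."""
    pass1 = generate(prompt)

    # Pass 2: append the self-review tag and heading used in the function above.
    review_prompt = f"{prompt}\n{pass1}\n<think>\n### PASS 2 - Self-Review:\n"
    review = generate(review_prompt)

    # Pass 3: continue from the review and open the final section.
    final_prompt = f"{review_prompt}{review}\n<final>\n### PASS 3 - Final Version:\n"
    final = generate(final_prompt)

    return {"pass1": pass1, "review": review, "pass3": final}

def fake_generate(text: str) -> str:
    """Stub generator (not the model): answers based on the last tag in the prompt."""
    if "<final>" in text:
        return "final code"
    if "<think>" in text:
        return "review notes"
    return "first draft"

result = run_three_passes("Write binary search", fake_generate)
print(result)
```

Passing the generator in as an argument keeps the orchestration independent of the backend, so the same flow could be driven by a local model or by an OpenAI-compatible server.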

Streaming Inference

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model = AutoModelForCausalLM.from_pretrained("Roskosmos19/Rhea-4B-Coding",
        dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        attn_implementation="flash_attention_2")

tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, write a Python function that implements the binary search algorithm recursively."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Create the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

# Build kwargs for generate
generate_kwargs = dict(
    **inputs,
    max_new_tokens=16384,  # Increased for multi-pass output
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)

def _generate_thread(model, kwargs):
    with torch.no_grad():
        model.generate(**kwargs)

thread = Thread(target=_generate_thread, args=(model, generate_kwargs))
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)

# Wait for the generation thread to finish before exiting
thread.join()

Production Deployment with vLLM

Start server:

vllm serve roskosmos19/Rhea-4B-Coding \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=262144 \
  --async-scheduling \
  --gpu-memory-utilization=0.90

Request to the server from the OpenAI client:

from openai import OpenAI
client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1")
stream = client.chat.completions.create(
    model="roskosmos19/Rhea-4B-Coding",
    messages=[{
        "role": "user",
        "content": "Hey, write a Python function that determines if a string is a palindrome, ignoring case, spaces, and punctuation."
    }],
    max_tokens=16384,  # Increased for multi-pass output
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
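
When the full response is needed as a single string (for example to parse the tagged sections afterwards), the streamed deltas can be accumulated instead of printed. This is a minimal sketch: `collect_stream` is an illustrative helper, and the fake chunks below only mimic the attribute layout used in the loop above (`choices[0].delta.content`). It also skips chunks with an empty `choices` list, on the assumption that a server may send such chunks.

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Join the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # assumed possible, e.g. a trailing metadata chunk
            continue
        content = chunk.choices[0].delta.content
        if content:  # content may be None on role-only or final chunks
            parts.append(content)
    return "".join(parts)

def _fake_chunk(content):
    """Build an object shaped like a streamed chunk (for demonstration only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    _fake_chunk("def "),
    _fake_chunk(None),
    _fake_chunk("f(): pass"),
    SimpleNamespace(choices=[]),
]
text = collect_stream(fake_stream)
print(text)
```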

vLLM benefits: 20-30x faster inference (reported; the baseline is not specified), an OpenAI-compatible API, continuous batching, and async scheduling.

📝 Model Configuration

| Parameter | Value | Description |
|---|---|---|
| `temperature` | 0.4 | Balanced creativity and consistency |
| `max_new_tokens` | 32768 | Full multi-pass output capacity |
| `repetition_penalty` | 1.0 | No penalty for intentional code repetition |
| `no_repeat_ngram_size` | 0 | Allows code structure repetition |
| `use_cache` | true | Faster inference for long outputs |
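
The values in this table can be collected into keyword arguments for `model.generate`. The sketch below is one way to do that; the helper `generation_kwargs` is illustrative. One assumption is added that the table does not state: in Transformers, `temperature` only takes effect when sampling is enabled, so `do_sample` is switched on whenever the temperature is positive, unless the caller sets it explicitly.

```python
# Values copied from the configuration table.
RECOMMENDED = {
    "temperature": 0.4,
    "max_new_tokens": 32768,
    "repetition_penalty": 1.0,
    "no_repeat_ngram_size": 0,
    "use_cache": True,
}

def generation_kwargs(**overrides) -> dict:
    """Return the recommended settings, with optional per-call overrides."""
    kwargs = dict(RECOMMENDED)
    kwargs.update(overrides)
    # Assumption: enable sampling so that the temperature is actually applied.
    kwargs.setdefault("do_sample", kwargs["temperature"] > 0)
    return kwargs

kw = generation_kwargs(max_new_tokens=4096)
print(kw)
```

A call would then look like `model.generate(**inputs, **generation_kwargs(max_new_tokens=4096))`.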

⚙️ Files Modified for Multi-Pass

| File | Changes |
|---|---|
| `generation_config.json` | Extended tokens, optimized for multi-pass |
| `config.json` | Enabled caching, full context window |
| `tokenizer_config.json` | Added `<think>`, `<review>`, `<final>` tokens |
| `special_tokens_map.json` | Registered new special tokens |
| `chat_template.jinja` | 3-pass prompt structure |
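
A quick way to confirm that a loaded tokenizer actually carries the multi-pass tokens is to look them up in its vocabulary. The helper below is a hypothetical check, not part of the repository; it works on any mapping of token strings, and the example vocabulary is made up for illustration.

```python
# Opening and closing tags for the three sections named in the table.
EXPECTED_TOKENS = [
    "<think>", "</think>",
    "<review>", "</review>",
    "<final>", "</final>",
]

def missing_tokens(vocab, expected=EXPECTED_TOKENS) -> list:
    """Return the expected tokens that are not present in the vocabulary."""
    return [tok for tok in expected if tok not in vocab]

# Made-up vocabulary that lacks the review tags.
example_vocab = {"<think>": 1, "</think>": 2, "<final>": 3, "</final>": 4}
print(missing_tokens(example_vocab))
```

With a real tokenizer, the call would be `missing_tokens(tokenizer.get_vocab())`, assuming the added tokens are included in the returned vocabulary; an empty list means all six tags are registered.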

🤝 Credits

Base model: Aquiles-ai/Athenea-4B-Coding
Dataset: Aquiles-ai/Athenea-Coding-100k
Architecture: Qwen3 4B

Roskosmos19

Downloads last month: 229

Safetensors

Model size: 4B params

Tensor type: BF16

Duplicated from Aquiles-ai/Athenea-4B-Coding
