# Qwen3.5-397B-A17B LoRA SFT v3

LoRA adapter for Qwen/Qwen3.5-397B-A17B fine-tuned on AMD GPU kernel engineering trajectories using LLaMA-Factory.

## What This Adapter Does

Specializes Qwen3.5-397B-A17B for AMD GPU kernel optimization tasks -- writing Triton kernels, debugging ROCm issues, and optimizing performance on AMD Instinct GPUs. Trained on 104 multi-turn agent trajectories from the amdpilot dataset.

## Version History

| Version | Train Loss | Eval Loss | Key Change | HuggingFace |
|---------|------------|-----------|------------|-------------|
| v1 | 0.163 | n/a | Baseline pipeline | v1 |
| v2 | 0.085 | n/a | 3-view data extraction (-48% loss) | v2 |
| v3 | 0.059 | 0.044 | Recipe fix: 10x steps, 2x rank, eval (-31% loss) | this repo |

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3.5-397B-A17B (MoE, 17B active) |
| Hardware | 8x AMD Instinct MI355X (ROCm 7.2) |
| LoRA rank / alpha | 32 / 64 |
| Target modules | all (13 types) |
| Trainable params | 128.5M / 396.9B (0.032%) |
| Dataset | 296 examples (3-view from 104 trajectories) |
| Cutoff length | 32,768 tokens |
| Epochs / Steps | 10 / 130 |
| Batch size | 8 (1 per device x 8 GPUs) |
| Learning rate | 2e-5 (cosine schedule) |
| Weight decay | 0.01 |
| Training time | 5h 10min |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 + PEFT 0.18.1 |
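
For reference, a LLaMA-Factory run matching these hyperparameters could be launched as below. This is a sketch, not the exact command used for v3: the dataset name `amdpilot_sft`, the validation split size, and the output directory are assumptions, and the dataset must first be registered in LLaMA-Factory's `data/dataset_info.json`.

```bash
# Sketch: LoRA SFT with the table's hyperparameters (assumed dataset name).
llamafactory-cli train \
  --stage sft \
  --do_train \
  --model_name_or_path Qwen/Qwen3.5-397B-A17B \
  --dataset amdpilot_sft \
  --template qwen3_5_nothink \
  --finetuning_type lora \
  --lora_rank 32 \
  --lora_alpha 64 \
  --lora_target all \
  --cutoff_len 32768 \
  --num_train_epochs 10 \
  --per_device_train_batch_size 1 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --weight_decay 0.01 \
  --bf16 true \
  --val_size 0.1 \
  --eval_strategy steps \
  --eval_steps 20 \
  --deepspeed examples/deepspeed/ds_z3_config.json \
  --output_dir saves/qwen35-397b-lora-sft-v3
```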

## Eval Loss Trajectory

| Step | Epoch | Eval Loss |
|------|-------|-----------|
| 20 | 1.5 | 0.0618 |
| 40 | 3.1 | 0.0539 |
| 60 | 4.6 | 0.0491 |
| 80 | 6.2 | 0.0461 |
| 100 | 7.7 | 0.0446 |
| 120 | 9.2 | 0.0443 |
| 130 | 10.0 | 0.0442 |

Eval loss decreases monotonically through all 130 steps, with no sign of overfitting; full training curves are in the wandb run.

## Usage

### Load with PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenizer ships with the adapter repo; weights come from the base model.
tokenizer = AutoTokenizer.from_pretrained("JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v3")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-397B-A17B", device_map="auto", torch_dtype="bfloat16"
)
# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(model, "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v3")
```
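
After loading, generation goes through the standard `transformers` chat interface. A minimal sketch (the prompt and generation settings are illustrative, not from this repo):

```python
# Build a chat-formatted prompt and generate from the adapted model.
messages = [{"role": "user", "content": "Write a Triton kernel for fused add + ReLU."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
# Strip the prompt tokens before decoding the completion.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```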

### Serve with vLLM (LoRA hot-loading)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-397B-A17B \
  --enable-lora \
  --lora-modules amdpilot=JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v3 \
  --tensor-parallel-size 8
```
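
Once the server is running, requests select the adapter by its registered name through the OpenAI-compatible API. A sketch assuming the default port (8000):

```bash
# Route the request to the LoRA adapter by using its name as the model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amdpilot",
        "messages": [{"role": "user", "content": "Why does this Triton kernel stall on MI355X?"}],
        "max_tokens": 512
      }'
```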

### Merge with LLaMA-Factory

```bash
llamafactory-cli export \
  --model_name_or_path Qwen/Qwen3.5-397B-A17B \
  --adapter_name_or_path JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v3 \
  --template qwen3_5_nothink \
  --finetuning_type lora \
  --export_dir saves/qwen35-397b-merged
```
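
The merged checkpoint in `saves/qwen35-397b-merged` then loads as a standalone model with no PEFT dependency. A minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The merged weights behave like a plain causal LM; no adapter step needed.
tokenizer = AutoTokenizer.from_pretrained("saves/qwen35-397b-merged")
model = AutoModelForCausalLM.from_pretrained(
    "saves/qwen35-397b-merged", device_map="auto", torch_dtype="bfloat16"
)
```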

## Dataset

JinnP/amdpilot-lora-sft-dataset -- 104 multi-turn agent trajectories:

- 94 KernelBench Triton kernel optimization tasks
- 4 SGLang/vLLM bugfix and feature tasks
- 4 frontier bugfix trajectories
- Processed into 296 training examples using 3-view extraction (bookend + full + solution chunks); see the sketch below
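
The exact extraction code isn't published with this card; the sketch below shows one plausible reading of "3-view" (bookend + full + solution), assuming each trajectory is a list of chat messages. The function name and splitting rules are mine, and since 296 < 3 x 104, some views were evidently filtered out in the real pipeline.

```python
# Illustrative 3-view split (assumed semantics, not the actual pipeline).
def three_views(trajectory: list[dict]) -> list[list[dict]]:
    """trajectory: chat messages like {"role": "user", "content": "..."}."""
    system, *turns = trajectory

    # View 1 (bookend): task statement plus the closing exchange.
    bookend = [system, turns[0], *turns[-2:]]

    # View 2 (full): the complete multi-turn trajectory.
    full = trajectory

    # View 3 (solution chunk): final assistant answer as a single turn.
    final = next(m for m in reversed(turns) if m["role"] == "assistant")
    solution = [system, turns[0], final]

    return [bookend, full, solution]
```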

## Framework Versions

- PEFT 0.18.1
- Transformers 5.2.0
- PyTorch 2.9.1+rocm7.2.0
- Datasets 4.0.0
- Tokenizers 0.22.2