SykoLLM-V5.7-Mini

SykoLLM-V5.7-Mini is a small language model trained from scratch, with Turkish and English support and code-understanding capability. It is based on the Phi-3 architecture and uses a custom BPE tokenizer.

Model Architecture

Property	Value
Architecture	Phi-3 (Phi3ForCausalLM)
Total Parameters	~277 million
Hidden Size	1024
Number of Layers	18
Attention Heads	8 (2 KV heads, GQA)
Vocabulary Size	50,000
Maximum Context	1024 tokens
Activation Function	SiLU
Training Steps	~3,900
Approximate Training Samples	~512,000+
Training Hardware	2x NVIDIA Tesla T4
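
To make the grouped-query attention (GQA) numbers concrete, the sketch below derives the per-head and projection sizes implied by the table (hidden size 1024, 8 attention heads, 2 KV heads, 50,000-token vocabulary). It assumes bias-free projections and does not try to reproduce the full ~277M total, since the MLP intermediate size and whether the output head shares the embedding matrix are not stated here.

```python
# Values taken from the architecture table.
hidden_size = 1024
num_attention_heads = 8
num_kv_heads = 2
vocab_size = 50_000

# Each attention head works on an equal slice of the hidden state.
head_dim = hidden_size // num_attention_heads

# With GQA, several query heads share one key/value head.
queries_per_kv_head = num_attention_heads // num_kv_heads

# Output widths of the query and key/value projections.
q_width = num_attention_heads * head_dim
kv_width = num_kv_heads * head_dim

# Weights of the Q, K, V and output projections in one layer
# (assumption: no bias terms).
attn_params_per_layer = hidden_size * (q_width + 2 * kv_width) + q_width * hidden_size

# One token-embedding matrix; a separate output head would add the same amount.
embedding_params = vocab_size * hidden_size

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}, kv_width={kv_width}")
print(f"attention params per layer={attn_params_per_layer:,}")
print(f"embedding params={embedding_params:,}")
```

The key and value projections are a quarter of the width they would have with 8 full KV heads, which also shrinks the KV cache by the same factor during generation.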

Tokenizer

The model uses a custom BPE tokenizer trained from scratch, developed independently of the pretrained tokenizers available on Hugging Face.

Type: Byte-Level BPE
Vocabulary Size: 50,000
Normalizer: NFKC
Special Tokens: <|endoftext|>, <|user|>, <|assistant|>, <|end|>, <|pad|>
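
As a rough illustration of what the NFKC normalizer and the byte-level stage do to input text, the sketch below uses only the Python standard library. It is not the model's tokenizer: `normalize_nfkc` and `to_byte_ids` are hypothetical helper names, and the learned BPE merges that follow these two steps are not reproduced.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization, the normalizer listed above."""
    return unicodedata.normalize("NFKC", text)


def to_byte_ids(text: str) -> list:
    """Return the UTF-8 byte values a byte-level BPE starts from."""
    return list(text.encode("utf-8"))


# Compatibility characters collapse to their plain forms.
print(normalize_nfkc("\ufb01le"))      # "fi" ligature followed by "le"
print(normalize_nfkc("\uff21"))        # full-width Latin capital A

# A decomposed Turkish letter (c + combining cedilla) becomes one code point.
composed = normalize_nfkc("c\u0327")
print(len(composed))

# Non-ASCII letters occupy more than one byte before any merges are applied.
print(to_byte_ids("çay"))
```

Because the base alphabet is bytes rather than characters, any input string can be encoded without an unknown-token fallback; Turkish letters such as "ç" simply start out as two bytes each.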

Training Data

Dataset	Content	Language
uonlp/CulturaX	General Turkish web text	Turkish
HuggingFaceTB/cosmopedia	Synthetic educational material	English
roneneldan/TinyStories	Short stories	English
nampdn-ai/tiny-textbooks	Synthetic textbooks	English
nampdn-ai/tiny-codes	Code samples	Code
ise-uiuc/Magicoder-Evol-Instruct-110K	Code instructions	Code
theblackcat102/evol-codealpaca-v1	Code instructions	Code
turkish-nlp-suite/InstrucTurca	Turkish instructions	Turkish

Training Details

Parameter	Value
Optimizer	AdamW 8-bit (bitsandbytes)
Learning Rate	3e-4
LR Scheduler	Cosine
Warmup Steps	200
Batch Size	8 per device x 2 GPUs
Gradient Accumulation	8 (effective batch: 128)
Max Steps	3,900
Precision	FP16
Max Grad Norm	1.0
Weight Decay	0.05
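
The batch and schedule figures in this table can be checked with a few lines of arithmetic. The sketch below recomputes the effective batch size and models the learning rate as linear warmup followed by cosine decay to zero. That schedule shape is an assumption: the table only says "Cosine" with 200 warmup steps, and `lr_at` is a hypothetical helper, not the training script's code.

```python
import math

# Values taken from the training details table.
PER_DEVICE_BATCH = 8
NUM_GPUS = 2
GRAD_ACCUM = 8
MAX_STEPS = 3_900
WARMUP_STEPS = 200
PEAK_LR = 3e-4

# Sequences contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM

# Sequences processed over the whole run at that batch size.
sequences_seen = effective_batch * MAX_STEPS


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (assumed: linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch: {effective_batch}")
print(f"sequences seen:  {sequences_seen:,}")
for step in (0, 100, 200, 2050, 3900):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With these inputs the run covers 499,200 sequences, slightly below the ~512,000+ samples listed in the architecture table; the source does not explain the difference.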

Usage

Installation

pip install transformers torch accelerate

Text Generation

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SykoSLM/SykoLLM-V5.7-Mini"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # needs the accelerate package installed
    trust_remote_code=True
)

prompt = "Türkiye'nin başkenti"  # Turkish for "The capital of Turkey"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Format

The model was trained with the <|user|> / <|assistant|> prompt format:

# The question is Turkish for "How do I write the Fibonacci sequence in Python?"
prompt = "<|user|>\nPython'da Fibonacci dizisini nasıl yazarım?<|end|>\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
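outputs = model.generate(
    **inputs,
    max_new_tokens=300,  # prompt plus new tokens must fit the 1024-token context
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,  # same padding setup as the text-generation example
)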
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
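
The single-turn prompt above can be generalized with a small helper. The sketch below is a minimal illustration: `build_prompt` is a hypothetical function, and extending the format to multiple turns (each turn closed with <|end|>) is an assumption, since only the single-turn layout is shown here.

```python
from typing import Dict, List


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages into the <|user|> / <|assistant|> prompt format.

    Assumption: every turn, including earlier assistant replies, is closed
    with <|end|> and a newline, mirroring the single-turn example.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "user", "content": "Python'da Fibonacci dizisini nasıl yazarım?"},
])
print(prompt)
```

For a single user message this yields exactly the prompt string used in the example above, so it can be passed to the tokenizer in the same way.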

Limitations

The model is intended for research and development; it is not recommended for production use.
Because it was trained for relatively few steps, it may make errors on long, complex reasoning tasks.
The maximum context length is limited to 1024 tokens.
The model has not been aligned for harmful-content filtering (no RLHF/DPO was applied).
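
One practical consequence of the 1024-token limit is that the prompt and the generated continuation share the same budget. The sketch below shows one way to trim a tokenized prompt before generation. `fit_prompt` is a hypothetical helper, and keeping the most recent tokens rather than the earliest ones is a design assumption, not something the model card prescribes.

```python
from typing import List

MAX_CONTEXT = 1024  # maximum context length stated for the model


def fit_prompt(token_ids: List[int], max_new_tokens: int,
               max_context: int = MAX_CONTEXT) -> List[int]:
    """Keep only as many trailing prompt tokens as leave room for generation."""
    if not 0 < max_new_tokens < max_context:
        raise ValueError("max_new_tokens must be between 1 and max_context - 1")
    budget = max_context - max_new_tokens
    # Slicing from the end keeps the most recent part of the prompt.
    return token_ids[-budget:]


long_prompt = list(range(2000))          # stand-in for real token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=300)
print(len(trimmed), trimmed[0], trimmed[-1])
```

Short prompts pass through unchanged; only prompts longer than the remaining budget lose their earliest tokens.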

License

This model is released under the Apache 2.0 license.

Citation

@misc{sykollm-v57-mini-2025,
  author    = {SykoSLM},
  title     = {SykoLLM-V5.7-Mini: A Small Multilingual Causal Language Model Trained from Scratch},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/SykoSLM/SykoLLM-V5.7-Mini}
}
