VTSNLP/vietnamese_curated_dataset
This repository contains a ByteLevel BPE tokenizer trained from scratch specifically for the Vietnamese language, designed for decoder-only language model pretraining.
# Load the fast tokenizer from the Hugging Face Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tranhuyHoang/mini_VN_decoder_tokenizer",
    use_fast=True,
)
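For readers unfamiliar with byte-level BPE, the following is a minimal pure-Python sketch of the idea, not the training code used for this repository. It assumes a toy in-memory corpus, and the helper names (`train_byte_bpe`, `apply_merge`, `encode`, `decode`) are hypothetical. It skips details that real byte-level tokenizers typically add, such as remapping bytes to printable characters and regex pre-tokenization. The point it illustrates is that, because the base vocabulary is the 256 possible byte values, Vietnamese diacritics and even characters never seen in training can always be encoded without an unknown token.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_byte_bpe(corpus, num_merges):
    """Toy trainer: learn up to `num_merges` merges over UTF-8 byte sequences."""
    # Split on whitespace; each word starts as a tuple of single-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for line in corpus
        for word in line.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by byte order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for word, freq in words.items():
            merged_words[apply_merge(word, best)] += freq
        words = merged_words
    return merges


def encode(text, merges):
    """Start from raw UTF-8 bytes and apply the learned merges in order."""
    tokens = tuple(bytes([b]) for b in text.encode("utf-8"))
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return list(tokens)


def decode(tokens):
    """Concatenate the byte tokens and decode back to text."""
    return b"".join(tokens).decode("utf-8")


# Illustrative corpus only; not the data this tokenizer was trained on.
corpus = ["xin chào Việt Nam", "xin chào các bạn", "tiếng Việt"]
merges = train_byte_bpe(corpus, num_merges=20)

text = "chào bạn 🙂"  # the emoji never appears in the toy corpus
tokens = encode(text, merges)
print(tokens)
print(decode(tokens) == text)
```

The emoji falls back to its individual UTF-8 bytes, while frequent Vietnamese syllables collapse into fewer, longer tokens as merges are learned.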