legacy-datasets/wikipedia
Updated • 122k • 629
YAML Metadata Error:"tags[0]" must be a string
For the moment, only the tokenizer is available. The tokenizer is based on SentencePiece with Unigram language model segmentation algorithm.
Taking into account certain characteristics of the language, we chose that:
This tokenizer is adapted to the Bengali language. You can use it to pre-train an Albert model on the Bengali language.
To tokenize:
from transformers import AlbertTokenizer
tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')
text = "পোকেমন জাপানী ভিডিও গেম কোম্পানি নিনটেন্ডো কর্তৃক প্রকাশিত একটি মিডিয়া ফ্র্যাঞ্চাইজি।"
encoded_input = tokenizer(text, return_tensors='pt')
Provide examples of latent issues and potential remediations.
The tokenizer was trained on a random subset of 4M sentences of Bengali Oscar and Bengali Wikipedia.
The tokenizer was trained with the SentencePiece on 8 x Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz with 16GB RAM and 36GB SWAP.
import sentencepiece as spm
config = {
"input": "./dataset/oscar_bn.txt,./dataset/wikipedia_bn.txt",
"input_format": "text",
"model_type": "unigram",
"vocab_size": 32000,
"self_test_sample_size": 0,
"character_coverage": 0.9995,
"shuffle_input_sentence": true,
"seed_sentencepiece_size": 1000000,
"shrinking_factor": 0.75,
"num_threads": 8,
"num_sub_iterations": 2,
"max_sentencepiece_length": 16,
"max_sentence_length": 4192,
"split_by_unicode_script": true,
"split_by_number": true,
"split_digits": true,
"control_symbols": "[MASK]",
"byte_fallback": false,
"vocabulary_output_piece_score": true,
"normalization_rule_name": "nmt_nfkc_cf",
"add_dummy_prefix": true,
"remove_extra_whitespaces": true,
"hard_vocab_limit": true,
"unk_id": 1,
"bos_id": 2,
"eos_id": 3,
"pad_id": 0,
"bos_piece": "[CLS]",
"eos_piece": "[SEP]",
"train_extremely_large_corpus": true,
"split_by_whitespace": true,
"model_prefix": "./spiece",
"input_sentence_size": 4000000,
"user_defined_symbols": "(,),-,.,–,£,।",
}
spm.SentencePieceTrainer.train(**config)