Instructions to use bigcode/santacoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigcode/santacoder with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/santacoder", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True)
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/santacoder with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/santacoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/santacoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/bigcode/santacoder
```
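The curl request above can also be issued from Python. Below is a minimal sketch using only the standard library, assuming the vLLM server started by the commands above is listening on `localhost:8000`; the `completion_payload` and `complete` helpers are my own names, not part of vLLM.

```python
import json
import urllib.request


def completion_payload(prompt: str, model: str = "bigcode/santacoder",
                       max_tokens: int = 512, temperature: float = 0.5) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the server and return the first completion's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(completion_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Requires a running server:
# text = complete("Once upon a time,")
```

The same client works unchanged against the SGLang server below; only the port (30000 instead of 8000) differs, since both expose the OpenAI-compatible completions API.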
- SGLang
How to use bigcode/santacoder with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/santacoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/santacoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "bigcode/santacoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/santacoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use bigcode/santacoder with Docker Model Runner:
```shell
docker model run hf.co/bigcode/santacoder
```
Token inconsistency with StarCoder: fim_ or fim-
This model's special tokens start with "fim-", while the StarCoder model's tokens start with "fim_". The VSCode client works with StarCoder by default, so it uses the "fim_" tokens. This breaks SantaCoder when the VSCode endpoint is switched to it: the "fim_..." tokens are parsed as plain text, and the model occasionally adds them to its output.
Workaround: change the token names from "fim_" to "fim-" in the VSCode extension settings when using SantaCoder.
Proposal: change "fim-" to "fim_" for this model.
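To make the one-character mismatch concrete, here is a sketch of how a fill-in-the-middle prompt is assembled for each spelling, using the token names quoted in this thread; the `build_fim_prompt` helper is my own illustration, not an API of either model.

```python
def build_fim_prompt(prefix: str, suffix: str, style: str = "santacoder") -> str:
    """Assemble a fill-in-the-middle prompt.

    SantaCoder spells its FIM tokens with a hyphen (<fim-prefix>, ...),
    while StarCoder spells them with an underscore (<fim_prefix>, ...).
    """
    sep = "-" if style == "santacoder" else "_"
    return (f"<fim{sep}prefix>{prefix}"
            f"<fim{sep}suffix>{suffix}"
            f"<fim{sep}middle>")


# The two prompts differ only by one character per token, which is easy to miss:
print(build_fim_prompt("def add(a, b):\n    return ", "\n", "santacoder"))
print(build_fim_prompt("def add(a, b):\n    return ", "\n", "starcoder"))
```

If a client sends the "fim_" spelling to SantaCoder, the tokenizer sees ordinary text rather than the special tokens, which is exactly the misbehavior described above.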
Hello @yuryya, are you certain you have configured the following settings to the right values?
If so, please open an issue in https://github.com/huggingface/llm-vscode with the details of your problem.
Hello! Sure, I mentioned that approach as the "workaround" in my proposal.
The problem is that the workaround is not obvious. Since StarCoder and SantaCoder are from the same vendor and built for the same task, there is no apparent reason to check the config again. Moreover, a difference like <fim_prefix> vs <fim-prefix> is too hard for a human to notice, and the error does not manifest every time.
Yes, the problem can be solved by adding a separate template for SantaCoder in https://github.com/huggingface/llm-vscode. It will work for default configurations, even though the model interfaces will remain different. That is better than nothing, so I will create a PR when I have time.
Maybe we could also add a note to the README like: "this model uses different tokens compared to StarCoder (fim- instead of fim_), so be careful when migrating between them".
Hello, both models are by BigCode, but they are not the same family of models; e.g. all StarCoder variants (15B, 7B, 3B, ...) share the same FIM tokens. But I added the note you suggested to the "How to use FIM" section in the readme: https://huggingface.co/bigcode/santacoder/discussions/42.
Oh man... I am one human who totally missed the _ vs -. I wish they used the same token style.