Context length and regeneration
What was the context length used to train this head?
Did you regenerate the data first?
We used context length 4096 and regenerated the sequences using the large model.
Thank you for the information.
I think there is a risk that after the first 4k tokens of reasoning, the acceptance ratio will drop significantly, since the issue is the training data distribution rather than RoPE/YaRN.
So this head is fine for easier reasoning tasks, but not for AIME, GPQA, etc., where we sometimes need to generate more than 30k tokens before reaching the answer.
What do you think?
Sure, that may be an issue. We haven't tested very long-context tasks, but please let us know how it goes if you do!
Hello, I've run into similar problems here. When testing AIME with this EAGLE head, the acceptance rate is 50-60% with k=3 when the generated sequence is short, but it drops significantly to under 10% once the sequences get longer.
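To verify that the drop correlates with the training context length, it may help to bucket per-token acceptance by generation position. This is just a sketch, not code from the repo: `acceptance_by_bucket` and its `(position, accepted)` record format are hypothetical names; you'd collect the real records from your speculative-decoding loop.

```python
# Hedged sketch: bucket per-token draft-acceptance records by generation
# position to check whether the rate degrades past the 4096-token
# training context length. `records` is a hypothetical log format.
from collections import defaultdict

def acceptance_by_bucket(records, bucket_size=4096):
    """records: iterable of (position, accepted) pairs, where
    `accepted` is True if the draft token was accepted."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pos, accepted in records:
        b = pos // bucket_size
        totals[b] += 1
        hits[b] += int(accepted)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Synthetic example: ~60% acceptance below 4096 tokens, ~10% above.
records = [(p, p % 10 < 6) for p in range(4096)] + \
          [(p, p % 10 < 1) for p in range(4096, 8192)]
print(acceptance_by_bucket(records))  # bucket 0 ≈ 0.6, bucket 1 ≈ 0.1
```

If the real curve shows a sharp cliff right at the 4k boundary, that would support the data-distribution explanation over a positional-encoding one.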