Missing SWA implementation?
#3
by hell0ks - opened
Hello,
I'm currently implementing Trillion architecture support for llama.cpp.
However, during testing I found the model is unstable at long context. Through trial and error, it appears the model was trained with SWA (sliding window attention) at a window size of 4096, as the model card says, but the corresponding implementation is missing from the transformers modeling code.
Can you confirm this is correct? Thanks.
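For context, SWA limits each query token to attending over only the most recent `window` key positions instead of the full causal prefix. A minimal sketch of what the boolean attention mask would look like (hypothetical helper for illustration, not the model's actual code):

```python
def sliding_window_causal_mask(seq_len, window=4096):
    # mask[i][j] is True when query position i may attend to key position j:
    # the usual causal constraint (j <= i), plus the sliding-window
    # constraint (i - j < window) that drops keys older than `window` tokens.
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]
```

Without this extra `i - j < window` constraint at inference, positions beyond the trained window see key/value pairs the model never learned to handle, which would match the long-context instability described above.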
hell0ks changed discussion status to closed