This is a great tutorial, and I particularly like your three TPU optimizations. Beyond SPMD initialization, FSDP, and padding to prevent graph recompilation, what other TPU optimization techniques are possible if you want to squeeze further performance out of a TPU?
I'm curious about your take on applying these to coding-specific models. Since we can automatically verify code, KTO seems like it could bypass the bottleneck of creating preference pairs. Do you think the 'unpaired' nature of KTO makes it the superior choice for technical domains where 'bad' data is easy to generate but 'good' data is expensive?