Bridging the Data Provenance Gap Across Text, Speech and Video • Paper • 2412.17847 • Published Dec 19, 2024 • 13
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text • Paper • 2506.05209 • Published Jun 5, 2025 • 61
YaRN: Efficient Context Window Extension of Large Language Models • Paper • 2309.00071 • Published Aug 31, 2023 • 85