Aurora-M/MDEL

community

https://aurora-lm.github.io/posts/about-us/

AI & ML interests

Formerly MDEL, we have renamed ourselves after Aurora-M, the model we deployed. Visit us here: https://huggingface.co/aurora-m

Recent Activity

huu-ontocord authored a paper about 1 month ago

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

cabbage972 authored a paper about 1 month ago

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

cabbage972 authored a paper about 1 month ago

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

ajibawa-2023

posted an update 24 days ago

Post

2103

Stitched-Reasoning-Trajectories-7M

Dataset: ajibawa-2023/Stitched-Reasoning-Trajectories-7M
Stitched-Reasoning-Trajectories-7M is a massive-scale, synthetic multi-hop reasoning dataset. It was built by algorithmically "stitching" together discrete reasoning traces from the original glaiveai/reasoning-v1-20m dataset into continuous, coherent, and logically structured multi-agent trajectories.

By extracting internal sub-questions from <think> blocks and mapping high-information keyword overlaps, this dataset transforms single-turn Q&A pairs into deep, multi-step research plans. To ensure high quality and eliminate "topic drift," every trajectory has been verified using a dense semantic embedding model (BAAI/bge-large-en-v1.5).

The resulting dataset consists of 709 .jsonl files containing over 7.2 million entirely deduplicated, highly coherent reasoning chains.
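The stitching idea described above (pull sub-questions out of `<think>` blocks, then link them to other questions by keyword overlap) can be illustrated with a minimal sketch. This is not the dataset author's pipeline: the function names, the stopword list, the Jaccard score, and the threshold are all assumptions made for illustration, and the embedding-based verification step with BAAI/bge-large-en-v1.5 is not reproduced here.

```python
import re
from typing import List, Optional, Set

# Hypothetical helpers for illustration only; the real pipeline is not shown in the post.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
STOPWORDS = {"the", "and", "for", "what", "how", "why", "does", "are", "who"}


def extract_sub_questions(trace: str) -> List[str]:
    """Return sentences ending in '?' found inside <think>...</think> blocks."""
    questions = []
    for block in THINK_RE.findall(trace):
        # Split the block into sentences, keeping the closing punctuation.
        for sentence in re.findall(r"[^.?!]+[.?!]", block):
            sentence = sentence.strip()
            if sentence.endswith("?"):
                questions.append(sentence)
    return questions


def keywords(text: str) -> Set[str]:
    """Lowercased alphabetic tokens longer than two characters, minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the keyword sets of two strings (an assumed metric)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_follow_up(trace: str, candidates: List[str], threshold: float = 0.3) -> Optional[str]:
    """Pick the candidate question that best matches any sub-question in the trace."""
    best, best_score = None, threshold
    for sub_q in extract_sub_questions(trace):
        for cand in candidates:
            score = overlap(sub_q, cand)
            if score >= best_score:
                best, best_score = cand, score
    return best
```

Calling `best_follow_up` on one record's trace with the questions of other records as candidates yields the next hop of a chain, or `None` when nothing clears the threshold. A real pipeline would still need deduplication and a semantic check to guard against topic drift.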

huu-ontocord

authored a paper about 1 month ago

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Paper • 2603.01209 • Published Mar 1 • 1

ajibawa-2023

posted an update about 1 month ago

Post

1303

Ruby-Code-Large
Dataset: ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamic programming, object-oriented design, and rapid application development—areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
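Since the code corpora in this feed are described as stored in .jsonl format, a minimal sketch of streaming such a file with the Python standard library may be useful. The field name `code` is a hypothetical placeholder; the post does not state the actual schema, so check the dataset card before relying on it.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_samples(path: str, field: str = "code") -> int:
    """Count records whose `field` (an assumed column name) is a non-empty string."""
    return sum(
        1
        for record in iter_jsonl(path)
        if isinstance(record.get(field), str) and record[field].strip()
    )
```

Reading line by line keeps memory use flat, which matters for files holding hundreds of thousands of samples.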

ajibawa-2023

posted an update about 1 month ago

Post

6114

Go-Code-Large
Dataset: ajibawa-2023/Go-Code-Large

Go-Code-Large is a large-scale corpus of Go (Golang) programming language source code, comprising 316,427 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, cloud-native systems, and modern backend software engineering.

By offering a focused and curated dataset for Go, this corpus enables experimentation in concurrent programming, distributed systems, and performance-oriented backend services—domains where Go is widely adopted.

Go-Code-Large addresses the relative scarcity of large, language-specific datasets for Go, enabling targeted research into idiomatic Go patterns, concurrency primitives, and scalable system design.

2 replies

cabbage972

authored 4 papers about 1 month ago

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Paper • 2507.12367 • Published Jul 16, 2025 • 7

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Paper • 2509.25531 • Published Sep 29, 2025 • 10

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Paper • 2603.01209 • Published Mar 1 • 1

FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

Paper • 2510.04852 • Published Oct 13, 2025

JeanKaddour

submitted a paper to Daily Papers about 1 month ago

Target Policy Optimization

Paper • 2604.06159 • Published Apr 7 • 23

ajibawa-2023

posted an update 2 months ago

Post

2797

C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.

ajibawa-2023

posted an update 3 months ago

Post

3859

Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

3 replies

ajibawa-2023

posted an update 3 months ago

Post

3537

Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

1 reply

ajibawa-2023

posted an update 3 months ago

Post

2588

PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023

posted an update 3 months ago

Post

3268

JavaScript-Code-Large
Dataset: ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.

liangyuch

authored a paper 3 months ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Paper • 2602.12279 • Published Feb 12 • 20

liangyuch

submitted a paper to Daily Papers 3 months ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Paper • 2602.12279 • Published Feb 12 • 20

ajibawa-2023

posted an update 3 months ago

Post

3143

Java-Code-Large (ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million Java code samples. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

Taishi-N324

authored a paper 4 months ago

On the Optimal Reasoning Length for RL-Trained Language Models

Paper • 2602.09591 • Published Feb 10 • 6

Ziyang

authored a paper 5 months ago

DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

Paper • 2601.03559 • Published Jan 7 • 14

terryyz

authored a paper 6 months ago

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 304

Team members 86

Multi-Domain-Expert-Learning's activity