Activity Feed

Recent Activity

ajibawa-2023 
posted an update 7 days ago
Stitched-Reasoning-Trajectories-7M

Dataset: ajibawa-2023/Stitched-Reasoning-Trajectories-7M
Stitched-Reasoning-Trajectories-7M is a massive-scale, synthetic multi-hop reasoning dataset. It was built by algorithmically "stitching" together discrete reasoning traces from the original glaiveai/reasoning-v1-20m dataset into continuous, coherent, and logically structured multi-agent trajectories.

By extracting internal sub-questions from <think> blocks and mapping high-information keyword overlaps, this dataset transforms single-turn Q&A pairs into deep, multi-step research plans. To ensure high quality and eliminate "topic drift," every trajectory has been verified using a dense semantic embedding model (BAAI/bge-large-en-v1.5).
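
The steps described above can be illustrated with a minimal sketch. It is not the dataset author's pipeline: the stopword list, the Jaccard overlap score, the 0.5 drift threshold, and all function names are assumptions made for illustration, and the vectors passed to `has_topic_drift` stand in for embeddings that a model such as BAAI/bge-large-en-v1.5 would produce.

```python
import math
import re

# Hypothetical stopword list; the post does not say how
# "high-information" keywords are selected.
STOPWORDS = {"what", "does", "that", "this", "with", "from", "then", "when", "which", "have"}
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_sub_questions(response: str) -> list[str]:
    """Collect sentences ending in '?' from every <think>...</think> block."""
    questions = []
    for block in THINK_BLOCK.findall(response):
        for sentence in re.findall(r"[^.?!]*\?", block):
            questions.append(sentence.strip())
    return questions


def keywords(text: str) -> set[str]:
    """Lowercased words longer than three characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two keyword sets (an assumed scoring rule)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def has_topic_drift(step_vectors: list[list[float]], threshold: float = 0.5) -> bool:
    """True if any pair of consecutive steps falls below the (assumed) threshold."""
    pairs = zip(step_vectors, step_vectors[1:])
    return any(cosine(a, b) < threshold for a, b in pairs)
```

In this sketch, two traces would be candidates for stitching when a sub-question from one scores a high `keyword_overlap` with the question of another, and a stitched trajectory would be discarded when `has_topic_drift` returns `True`.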

The resulting dataset consists of 709 .jsonl files containing over 7.2 million entirely deduplicated, highly coherent reasoning chains.
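
Since the data ships as .jsonl shards, each line is one JSON record. A minimal reader using only the standard library might look like the following; the `id` key in the sample and the shard file name in the comment are hypothetical, as the post does not list the record fields or file names.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines source."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


# In-memory sample with a hypothetical "id" field. For a downloaded shard,
# pass an open file instead, for example:
#     with open("shard.jsonl", encoding="utf-8") as handle:
#         for record in iter_jsonl(handle): ...
sample = ['{"id": 1}', "", '{"id": 2}']
records = list(iter_jsonl(sample))
```

Reading line by line keeps memory use flat, which matters when a corpus is split across hundreds of files.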
ajibawa-2023 
posted an update 23 days ago
Ruby-Code-Large
Dataset: ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamic programming, object-oriented design, and rapid application development—areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
ajibawa-2023 
posted an update 24 days ago
Go-Code-Large
Dataset: ajibawa-2023/Go-Code-Large

Go-Code-Large is a large-scale corpus of Go (Golang) programming language source code, comprising 316,427 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, cloud-native systems, and modern backend software engineering.

By offering a focused and curated dataset for Go, this corpus enables experimentation in concurrent programming, distributed systems, and performance-oriented backend services—domains where Go is widely adopted.

Go-Code-Large addresses the relative scarcity of large, language-specific datasets for Go, enabling targeted research into idiomatic Go patterns, concurrency primitives, and scalable system design.
ajibawa-2023 
posted an update about 2 months ago
C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.

codelion 
posted an update 2 months ago
Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
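
The training figures can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 78-hour figure is approximate in the post, so the derived throughput is an estimate, and the variable names are illustrative.

```python
DATASET_TOKENS = 10.2e9      # size of codelion/sutra-10B, from the post
EPOCHS = 3
WALL_CLOCK_HOURS = 78        # stated as "~78 hours", so treat as approximate
MODEL_PARAMETERS = 70e6      # SmolLM2-70M

total_tokens = DATASET_TOKENS * EPOCHS
tokens_per_second = total_tokens / (WALL_CLOCK_HOURS * 3600)
tokens_per_parameter = total_tokens / MODEL_PARAMETERS

print(f"total tokens seen:    {total_tokens / 1e9:.1f}B")
print(f"average throughput:   {tokens_per_second:,.0f} tokens/s")
print(f"tokens per parameter: {tokens_per_parameter:,.0f}")
```

Three epochs over 10.2B tokens gives the 30.6B total quoted above, which works out to roughly 109k tokens per second on average and about 437 training tokens per model parameter.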

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
alvarobartt 
posted an update 2 months ago
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! 💥

> 🕒 60-minute single-pass processing, no chunking or stitching
> 👤 Customized hotwords to guide recognition on domain-specific content
> 📝 Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> 🤗 Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr
ajibawa-2023 
posted an update 2 months ago
Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.
ajibawa-2023 
posted an update 2 months ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
ajibawa-2023 
posted an update 3 months ago
PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
mitkox 
posted an update 3 months ago
My USB charger has a Blackwell GPU and 128GB RAM.
What. A. Time. To. Be. Alive.
People in Sofia: “It’s freezing.”
Me: sitting next to 3kW of space AI heaters on my desk 👀
1x GLM-5, 2x MiniMax-M2.5, 1x Qwen3 Coder Next; all on a single Aibrix/K8s cluster
ajibawa-2023 
posted an update 3 months ago
JavaScript-Code-Large
Dataset: ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.
mitkox 
posted an update 3 months ago
134,614 tok/sec max input prefill
1,031 tok/sec max output generation
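
To put those two rates in perspective, a rough lower bound on request latency can be computed from them. This sketch assumes the peak rates apply to a single request, which the post does not state (they may be aggregate figures across concurrent requests); the function name and the example request sizes are illustrative.

```python
PREFILL_TOKENS_PER_SECOND = 134_614     # peak input prefill rate, from the post
GENERATION_TOKENS_PER_SECOND = 1_031    # peak output generation rate, from the post


def min_latency_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Best-case time to read the prompt and then generate the output."""
    prefill_time = prompt_tokens / PREFILL_TOKENS_PER_SECOND
    generation_time = output_tokens / GENERATION_TOKENS_PER_SECOND
    return prefill_time + generation_time


# Hypothetical request: a 100,000-token prompt and a 1,000-token reply.
estimate = min_latency_seconds(100_000, 1_000)
print(f"best-case latency: {estimate:.2f} s")
```

Under that assumption, reading a 100,000-token prompt takes less time than generating a 1,000-token reply.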

At these local AI speeds, there is no User Interface for humans. My human UI is the Radicle distributed Git issues queue

On my GPU workstation:
- Z8 Fury G5 4x A6000
- MiniMax-M2.5
- Claude Code to localhost:8000
ajibawa-2023 
posted an update 3 months ago
Java-Code-Large (ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million Java code samples. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.
mitkox 
posted an update 3 months ago
I just pushed Claude Code Agent Swarm with 20 coding agents on my desktop GPU workstation.

With local AI, I don’t have the /fast CC switch, but I have /absurdlyfast:
- 100’499 tokens/second read, yeah 100k, not a typo | 811 tok/sec generation
- KV cache: 707’200 tokens
- Hardware: 5+ year old GPUs 4xA6K gen1; It’s not the car. It’s the driver.

Qwen3 Coder Next AWQ with cache at BF16. Scores 82.1% in C# on 29-years-in-dev codebase vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.

My current bottleneck is my 27" monitor. Can't fit all 20 Theos on screen without squinting.