Language Decoded LoRA
QLoRA adapters fine-tuned on multilingual code conditions for the Language Decoded project (part of Cohere's Tiny Aya Expedition).
Submitted paper title (2026-05-26): Language, Decoded: Exploring the Impact of Fine-Tuning a Multilingual Model on Native-Language Code
⚠️ Phase 3 eval numbers: read the experiments repo before citing
Original Phase 3 `_summary_*.json` files on legesher/language-decoded-experiments under-report cond-5 SIB-200 accuracy by 20 to 35 pp because the strict inference-time extractor refused native-script answers. Cite the `_summary_reparsed_*.json` siblings (refined extractor) instead. Five Phase 3 SIB-200 conclusions also flip win/loss against baseline once the extractor is corrected (cond-2-es-5k, cond-2-es-20k, cond-2-ur-20k, cond-2-zh-20k, cond-3-zh-5k), and cond-2-ur-5k's gain deflates 4.4×. See the banner on the experiments repo (top of the README) for the full picture.
Research Question
How does fine-tuning Tiny Aya on non-English code (whether transpiled, mixed-native, or fully translated) affect its multilingual reasoning and instruction-following, and how does that impact differ from fine-tuning on English code?
The hypothesis is not that non-English code matches or exceeds English code as a generic reasoning aid; rather, it is that the kind of effect non-English code produces depends on the target language, the data structure, and how the corpus was constructed. See legesher/language-decoded-experiments for the full project context.
Base Model
All adapters are trained on CohereLabs/tiny-aya-base (3.35B parameters). Tiny Aya was chosen because it is small (deployable on a single 16 GB T4 GPU via QLoRA), accessible (Apache 2.0-licensed), and supports 70+ languages with explicit emphasis on lower-resourced ones, which is what makes the experimental ladder viable for Urdu (ur) at all.
Adapter Inventory
This repo holds adapters from two generations of the project, kept side by side and clearly separated by folder. See the Provenance & Manifest section for a complete path → phase → source-corpus map, and MANIFEST.md for the machine-readable version.
- Paper adapters (Phase 3 · The Stack v2-dedup): live under the `tiny-aya-base/` prefix. These are the adapters cited in the submitted paper; cond-1, cond-2, and cond-5 were re-trained from scratch on the cleaner `bigcode/the-stack-v2-dedup` corpus.
- Preliminary adapters (Phase 2 · The Stack v1): live as flat top-level folders (`condition-1-en-32k/`, `condition-2-zh-5k/`, …). These are the original March-2026 hackathon adapters trained on `bigcode/the-stack` (v1, non-dedup), retained for reproducibility. Do not cite these for the paper.
Paper adapters: Phase 3 · The Stack v2-dedup
Each subdirectory under `tiny-aya-base/` is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under Training Details.
| Subdirectory (under `tiny-aya-base/`) | Condition | Training data | Seeds |
|---|---|---|---|
| `tiny-aya-base/condition-1-en-5k-seed{42,123,456}/` | 1 | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset) | 42, 123, 456 |
| `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | Raw English Python (20k file subset) | 42 |
| `tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/` | 2 | The same 5k subset as cond-1, processed through Legesher v0.7.3: Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
| `tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/` | 2 | The same 20k subset as cond-1, processed through Legesher v0.7.3 | 42 |
| `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42 |
| `tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/` | 5 | The same 5k subset as cond-1, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42 |
Condition 4 ("Community-Contributed Native Code") is pending sufficient direct community contributions to the legesher/legesher-native-code HF Space; no cond-4 adapter exists yet.
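The brace patterns in the Phase 3 table can be expanded into concrete subfolder names. The following is a minimal sketch that only mirrors the naming scheme shown above; the function name `paper_adapter_subfolders` is hypothetical, and whether every expanded path is actually present in the repo should be checked against MANIFEST.md.

```python
from itertools import product

PREFIX = "tiny-aya-base"
LANGS = ("zh", "es", "ur")
SEEDS_5K = (42, 123, 456)


def paper_adapter_subfolders() -> list[str]:
    """Expand the brace patterns from the Phase 3 table into subfolder paths."""
    names = []
    # cond-1: English code, three seeds at 5k, seed 42 only at 20k
    names += [f"condition-1-en-5k-seed{seed}" for seed in SEEDS_5K]
    names.append("condition-1-en-20k-seed42")
    # cond-2: one adapter per language and seed at 5k, seed 42 only at 20k
    names += [
        f"condition-2-{lang}-5k-seed{seed}"
        for lang, seed in product(LANGS, SEEDS_5K)
    ]
    names += [f"condition-2-{lang}-20k-seed42" for lang in LANGS]
    # cond-3: community-collected Chinese code, single seed
    names.append("condition-3-zh-5k-native-code-seed42")
    # cond-5: fully translated code, one adapter per language, single seed
    names += [
        f"condition-5-{lang}-5k-c4ai-aya-expanse-32b-seed42" for lang in LANGS
    ]
    return [f"{PREFIX}/{name}" for name in names]


subfolders = paper_adapter_subfolders()
print(len(subfolders))  # 20 paths implied by the table
```

Each returned string is in the form expected by the `subfolder=` argument shown in the Usage section.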
Preliminary adapters: Phase 2 · The Stack v1
These flat top-level folders are the original hackathon adapters, trained on bigcode/the-stack (v1, non-dedup) with Legesher v0.5.1 / v0.6.0. They are superseded by the tiny-aya-base/ Phase 3 adapters above and are kept only for reproducibility of the preliminary results. The 32k size and the single-seed setup are Phase 2 signatures.
| Subdirectory (top level) | Condition | Source corpus | Notes |
|---|---|---|---|
| `condition-1-en-32k/` | 1 | `bigcode/the-stack` (v1) | Phase 2 32k tier; no Phase 3 equivalent |
| `condition-1-en-5k/` | 1 | `bigcode/the-stack` (v1) | Preliminary; use `tiny-aya-base/condition-1-en-5k-seed42/` for the paper |
| `condition-2-es-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
| `condition-2-ur-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
| `condition-2-zh-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
| `condition-3-zh-5k/` | 3 | Community-collected raw Chinese code | Preliminary; corpus unchanged across phases |
The standalone per-adapter repos that previously published these Phase 2 / v1 adapters (`legesher/language-decoded-lora-condition-*`) have been renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this umbrella repo. Their old URLs continue to resolve via Hugging Face redirects.
Source-file control
Cond-1, cond-2, and cond-5 all train on the same 5,000-file subset drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception: its source files are a different population by design.
The experimental ladder
- Baseline → cond-1: Does code help at all? (Replicates Aryabumi et al., 2024.)
- Cond-1 → cond-2: Does translating Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) into the target language change the model's behavior? User logic and library calls remain English-derived.
- Cond-2 → cond-3: Does code pulled from real-world public-source repositories (code humans actually wrote in or with the target language) add value beyond Legesher's mechanical translation?
- Cond-2 → cond-5: Cond-2 translates only Python's reserved words; cond-5 goes further by translating the rest of the file's content (identifiers, comments, docstrings, string literals) via `c4ai-aya-expanse-32b`. Logic and structure are preserved.
- Cond-3 → cond-5 (implicit): Human-authored vs. machine-synthesized native code.
For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see legesher/language-decoded-experiments.
Provenance & Manifest
The two adapter generations are distinguished by folder location and source corpus, matching the convention used across the project's repos (`phase-2-the-stack-v1-*` on language-decoded-data, `phase2/` and `phase3/` on language-decoded-experiments):
| Generation | Location in this repo | Source corpus | Legesher | Tier / seeds | Cite for paper? |
|---|---|---|---|---|---|
| Phase 3 (paper) | `tiny-aya-base/…-seed*/` | `bigcode/the-stack-v2-dedup` | v0.7.3 | 5k (3 seeds) + 20k (1 seed) | Yes |
| Phase 2 (preliminary) | flat top-level `condition-*/` | `bigcode/the-stack` (v1) | v0.5.1 / v0.6.0 | 5k / 32k (1 seed) | No |
A complete, machine-readable path → phase → corpus → condition map is in MANIFEST.md. Training-data provenance for each condition is detailed on language-decoded-data; the phase comparison is in the "Phase 2 → Phase 3 at a glance" table on the experiments repo.
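Because the generation is encoded in the folder location, a path alone is enough to tell whether an adapter is citable for the paper. The sketch below applies only the two location rules from the table above; `AdapterGeneration` and `adapter_generation` are hypothetical names, and MANIFEST.md remains the authoritative map.

```python
from typing import NamedTuple


class AdapterGeneration(NamedTuple):
    phase: int
    cite_for_paper: bool


def adapter_generation(subfolder: str) -> AdapterGeneration:
    """Classify an adapter path by the folder convention described above."""
    path = subfolder.strip("/")
    if path.startswith("tiny-aya-base/"):
        # Phase 3 paper adapters live under the tiny-aya-base/ prefix
        return AdapterGeneration(phase=3, cite_for_paper=True)
    if path.startswith("condition-"):
        # Phase 2 preliminary adapters are flat top-level condition-* folders
        return AdapterGeneration(phase=2, cite_for_paper=False)
    raise ValueError(f"unrecognized adapter path: {subfolder!r}")


print(adapter_generation("tiny-aya-base/condition-1-en-5k-seed42/"))
print(adapter_generation("condition-2-zh-5k/"))
```

Raising on unknown paths, rather than guessing a phase, keeps a mistyped subfolder from being silently treated as citable.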
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
# Load a paper (Phase 3 · Stack v2-dedup) adapter, e.g. cond-1 (English code, seed 42, 5k tier).
# Paper adapters live under the `tiny-aya-base/` prefix.
model = PeftModel.from_pretrained(
base_model,
"legesher/language-decoded-lora",
subfolder="tiny-aya-base/condition-1-en-5k-seed42",
)
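# Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42).
# The blocks below are alternatives: reload base_model before attaching a different
# adapter, since PEFT injects the LoRA layers into the base model in place.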
model = PeftModel.from_pretrained(
base_model,
"legesher/language-decoded-lora",
subfolder="tiny-aya-base/condition-2-zh-5k-seed42",
)
# Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
model = PeftModel.from_pretrained(
base_model,
"legesher/language-decoded-lora",
subfolder="tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
)
# To load a *preliminary* Phase 2 / Stack v1 adapter instead, use the flat top-level
# folder (no `tiny-aya-base/` prefix), e.g. the original cond-2 Chinese hackathon adapter:
model = PeftModel.from_pretrained(
base_model,
"legesher/language-decoded-lora",
subfolder="condition-2-zh-5k",
)
Training Details
| Parameter | Value |
|---|---|
| Base model | CohereLabs/tiny-aya-base (3.35B params, 70+ languages, low-resource emphasis) |
| Method | QLoRA 4-bit (NF4), ~5.4 GB VRAM, Unsloth-accelerated |
| Hardware | Kaggle T4 (16 GB) |
| Tokenizer | CohereLabs/tiny-aya-base |
| Transpilation tool | Legesher v0.7.3 (Phase 3); v0.5.1 / v0.6.0 used in Phase 2 |
| Cond-5 translation | c4ai-aya-expanse-32b accessed via the Cohere API (made possible by Cohere credits awarded to Legesher) |
| Training data | legesher/language-decoded-data |
QLoRA hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (`r`) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj |
| Bias | none |
| Task type | CAUSAL_LM |
| PEFT version | 0.18.1 |
| Quantization | NF4 (4-bit) via Unsloth |
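For readers who want these settings in code, the table maps directly onto the keyword arguments of PEFT's `LoraConfig`. This is only a sketch of an equivalent PEFT configuration: training itself went through Unsloth, whose exact call is not shown in this card, and `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
# Values copied from the QLoRA hyperparameters table above.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Build a peft.LoraConfig from the table values (requires `peft`)."""
    from peft import LoraConfig  # imported lazily so the dict is usable without peft

    return LoraConfig(**LORA_KWARGS)


# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0
```

With alpha set to twice the rank, the adapter update is applied at a scale of 2.0, a common default pairing for rank-16 LoRA.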
Evaluation
Phase 3 models are evaluated on four multilingual benchmarks under template1 (English-prompt) and template2 (native-prompt) across the full data_lang × instr_lang matrix:
| Benchmark | What it measures | Examples per language |
|---|---|---|
| XNLI | Natural-language inference | ~5,000 |
| X-CSQA | Commonsense reasoning | ~1,000 |
| SIB-200 | Topic classification | ~204 |
| Belebele | Reading comprehension | ~900 |
MGSM was used in Phase 2 and dropped from Phase 3: at 3.35B parameters and 250 examples per language, scores ranged from 2.8% to 10.8% across all conditions, with most condition-to-condition differences within noise. A useful null result; budget was reallocated to SIB-200 and Belebele.
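To get a feel for what "within noise" means at 250 examples, one can compute the binomial standard error of an accuracy estimate. This is a back-of-the-envelope sketch under a simple assumption (independent examples, normal approximation); it is not the project's own significance analysis, and `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1.0 - p) / n)


# The two ends of the MGSM score range quoted above, at 250 examples each.
for p in (0.028, 0.108):
    se = accuracy_standard_error(p, 250)
    half_width = 1.96 * se  # approximate 95% interval half-width
    print(f"accuracy={p:.1%}  SE={se * 100:.2f} pp  95% +/- {half_width * 100:.2f} pp")
```

Under this approximation the 95% interval is roughly ±2 to ±4 percentage points wide across that range, so a gap of a few points between two conditions is hard to distinguish from sampling variation.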
Paper-grade evaluation results live on legesher/language-decoded-experiments β see the refined-tables and the writeup at expedition-tiny-aya/analysis/phase-3/phase3-refined-evaluation.md.
Limitations
- Single base model: All adapters are trained on `CohereLabs/tiny-aya-base` (3.35B params). Results may not generalize to larger or architecturally different models. Future iterations will expand to additional base models.
- Per-language fine-tuning only: Every condition is per-language; each `cond-2-{zh,es,ur}-5k` (and `cond-5-{zh,es,ur}-5k`) is a separate training run. Combined-language training is a planned future condition.
- Limited training data: 5k and 20k file tiers are constrained by Kaggle T4 hardware limits. 103k variants exist on the training data repo but no 103k adapters have been trained yet.
- Consumer hardware: Training on Kaggle T4 (16 GB) with 4-bit quantization introduces approximation that may affect adapter quality compared to full-precision training.
- Extractor coverage: when citing Phase 3 results, use the refined-extractor scores. See the banner at the top of this card and the experiments repo for full details.
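The extractor caveat can be enforced mechanically when collecting result files. The sketch below assumes that a refined file differs from its original only by the `reparsed_` infix, which is what the `_summary_*.json` / `_summary_reparsed_*.json` patterns suggest; the helper names and the example file names are hypothetical.

```python
ORIGINAL_PREFIX = "_summary_"
REPARSED_PREFIX = "_summary_reparsed_"


def reparsed_sibling(filename: str) -> str:
    """Return the assumed refined-extractor sibling of a summary file."""
    if filename.startswith(REPARSED_PREFIX):
        return filename  # already the refined version
    if not (filename.startswith(ORIGINAL_PREFIX) and filename.endswith(".json")):
        raise ValueError(f"not a Phase 3 summary file: {filename!r}")
    return REPARSED_PREFIX + filename[len(ORIGINAL_PREFIX):]


def citable_summaries(filenames):
    """Map each original summary to its reparsed sibling, or None if absent."""
    present = set(filenames)
    result = {}
    for name in sorted(present):
        if name.startswith(REPARSED_PREFIX) or not name.startswith(ORIGINAL_PREFIX):
            continue  # skip refined files themselves and unrelated files
        sibling = reparsed_sibling(name)
        result[name] = sibling if sibling in present else None
    return result


# Hypothetical file names, for illustration only.
files = ["_summary_run-a.json", "_summary_reparsed_run-a.json", "_summary_run-b.json"]
print(citable_summaries(files))
```

A `None` value flags a run whose refined scores are missing, so its original numbers are not cited by accident.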
Related Resources
- Experiment tracking and results: legesher/language-decoded-experiments (canonical project source-of-truth)
- Training data: legesher/language-decoded-data
- Community native code: legesher/language-decoded-community
- Cond-4 contribution interface: `legesher/legesher-native-code` HF Space
- Transpilation tool: Legesher on GitHub
Citation
@misc{language-decoded-2026,
title={Language Decoded: Exploring the Impact of Native Code on Multilingual Models},
author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/legesher/language-decoded-lora}
}
License
Apache 2.0