Language Decoded LoRA

QLoRA adapters fine-tuned on multilingual code conditions for the Language Decoded project (part of Cohere's Tiny Aya Expedition).

Submitted paper title (2026-05-26): Language, Decoded: Exploring the Impact of Fine-Tuning a Multilingual Model on Native-Language Code

⚠️ Phase 3 eval numbers: read the experiments repo before citing

The original Phase 3 _summary_*.json files on legesher/language-decoded-experiments under-report cond-5 SIB-200 accuracy by 20–35pp because the strict inference-time extractor rejected native-script answers. Cite the _summary_reparsed_*.json siblings (refined extractor) instead. Five Phase 3 SIB-200 conclusions also flip from win to loss against baseline once the extractor is corrected (cond-2-es-5k, cond-2-es-20k, cond-2-ur-20k, cond-2-zh-20k, cond-3-zh-5k), and cond-2-ur-5k's gain deflates 4.4×. See the banner at the top of the experiments repo README for the full picture.

Research Question

How does fine-tuning Tiny Aya on non-English code (whether transpiled, mixed-native, or fully translated) affect its multilingual reasoning and instruction-following, and how does that impact differ from fine-tuning on English code?

The hypothesis is not that non-English code matches or exceeds English code as a generic reasoning aid. Rather, it is that the kind of effect non-English code produces depends on the target language, the data structure, and how the corpus was constructed. See legesher/language-decoded-experiments for the full project context.

Base Model

All adapters are trained on CohereLabs/tiny-aya-base (3.35B parameters). Tiny Aya was chosen because it is small (deployable on a single 16 GB T4 GPU via QLoRA), accessible (Apache 2.0-licensed), and supports 70+ languages with explicit emphasis on lower-resourced ones, which is what makes the experimental ladder viable for Urdu (ur) at all.

Adapter Inventory

This repo holds adapters from two generations of the project, kept side by side and clearly separated by folder. See the Provenance & Manifest section for a complete path → phase → source-corpus map, and MANIFEST.md for the machine-readable version.

  • Paper adapters (Phase 3 · The Stack v2-dedup): live under the tiny-aya-base/ prefix. These are the adapters cited in the submitted paper; cond-1, cond-2, and cond-5 were re-trained from scratch on the cleaner bigcode/the-stack-v2-dedup corpus.
  • Preliminary adapters (Phase 2 · The Stack v1): live as flat top-level folders (condition-1-en-32k/, condition-2-zh-5k/, …). These are the original March-2026 hackathon adapters trained on bigcode/the-stack (v1, non-dedup), retained for reproducibility. Do not cite these for the paper.

Paper adapters (Phase 3 · The Stack v2-dedup)

Each subdirectory under tiny-aya-base/ is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under Training Details.

| Subdirectory (under tiny-aya-base/) | Condition | Training data | Seeds |
| --- | --- | --- | --- |
| tiny-aya-base/condition-1-en-5k-seed{42,123,456}/ | 1 | Raw English Python from bigcode/the-stack-v2-dedup (5k file subset) | 42, 123, 456 |
| tiny-aya-base/condition-1-en-20k-seed42/ | 1 | Raw English Python (20k file subset) | 42 |
| tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/ | 2 | The same 5k subset as cond-1, processed through Legesher v0.7.3: Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
| tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/ | 2 | The same 20k subset as cond-1, processed through Legesher v0.7.3 | 42 |
| tiny-aya-base/condition-3-zh-5k-native-code-seed42/ | 3 | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42 |
| tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/ | 5 | The same 5k subset as cond-1, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through c4ai-aya-expanse-32b via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42 |

Condition 4 ("Community-Contributed Native Code") is pending sufficient direct community contributions to the legesher/legesher-native-code HF Space; no cond-4 adapter exists yet.
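The brace patterns in the table above can be expanded into concrete subfolder names. The following is a minimal sketch, assuming each pattern expands to every language × seed combination it lists; the function name `paper_adapter_subfolders` is hypothetical, and MANIFEST.md remains the authoritative list.

```python
from itertools import product

LANGS = ["zh", "es", "ur"]      # target languages from the inventory table
SEEDS_5K = [42, 123, 456]       # the 5k tier lists three seeds; other tiers list seed 42 only
PREFIX = "tiny-aya-base"


def paper_adapter_subfolders():
    """Expand the inventory table's brace patterns into subfolder paths.

    Assumption: every language/seed combination named in a pattern exists.
    """
    paths = [f"{PREFIX}/condition-1-en-5k-seed{s}" for s in SEEDS_5K]
    paths.append(f"{PREFIX}/condition-1-en-20k-seed42")
    paths += [
        f"{PREFIX}/condition-2-{lang}-5k-seed{s}"
        for lang, s in product(LANGS, SEEDS_5K)
    ]
    paths += [f"{PREFIX}/condition-2-{lang}-20k-seed42" for lang in LANGS]
    paths.append(f"{PREFIX}/condition-3-zh-5k-native-code-seed42")
    paths += [
        f"{PREFIX}/condition-5-{lang}-5k-c4ai-aya-expanse-32b-seed42"
        for lang in LANGS
    ]
    return paths


if __name__ == "__main__":
    for path in paper_adapter_subfolders():
        print(path)
```

Each returned string can be passed as the `subfolder` argument shown in the Usage section.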

Preliminary adapters (Phase 2 · The Stack v1)

These flat top-level folders are the original hackathon adapters, trained on bigcode/the-stack (v1, non-dedup) with Legesher v0.5.1 / v0.6.0. They are superseded by the tiny-aya-base/ Phase 3 adapters above and are kept only for reproducibility of the preliminary results. The 32k size and the single-seed setup are Phase 2 signatures.

| Subdirectory (top level) | Condition | Source corpus | Notes |
| --- | --- | --- | --- |
| condition-1-en-32k/ | 1 | bigcode/the-stack (v1) | Phase 2 32k tier; no Phase 3 equivalent |
| condition-1-en-5k/ | 1 | bigcode/the-stack (v1) | Preliminary; use tiny-aya-base/condition-1-en-5k-seed42/ for the paper |
| condition-2-es-5k/ | 2 | bigcode/the-stack (v1), Legesher transpiled | Preliminary |
| condition-2-ur-5k/ | 2 | bigcode/the-stack (v1), Legesher transpiled | Preliminary |
| condition-2-zh-5k/ | 2 | bigcode/the-stack (v1), Legesher transpiled | Preliminary |
| condition-3-zh-5k/ | 3 | Community-collected raw Chinese code | Preliminary; corpus unchanged across phases |

The standalone per-adapter repos that previously published these Phase 2 / v1 adapters (legesher/language-decoded-lora-condition-*) have been renamed to legesher/language-decoded-lora-phase-2-the-stack-v1-condition-* and deprecated in favor of this umbrella repo. Their old URLs continue to resolve via Hugging Face redirects.

Source-file control

Cond-1, cond-2, and cond-5 all train on the same 5,000-file subset drawn from bigcode/the-stack-v2-dedup (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception: its source files are a different population by design.

The experimental ladder

  • Baseline → cond-1: Does code help at all? (Replicates Aryabumi et al., 2024.)
  • Cond-1 → cond-2: Does translating Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) into the target language change the model's behavior? User logic and library calls remain English-derived.
  • Cond-2 → cond-3: Does code pulled from real-world public-source repositories (code humans actually wrote in or with the target language) add value beyond Legesher's mechanical translation?
  • Cond-2 → cond-5: Cond-2 translates only Python's reserved words; cond-5 goes further by translating the rest of the file's content (identifiers, comments, docstrings, string literals) via c4ai-aya-expanse-32b. Logic and structure are preserved.
  • Cond-3 → cond-5 (implicit): Human-authored vs. machine-synthesized native code.

For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see legesher/language-decoded-experiments.

Provenance & Manifest

The two adapter generations are distinguished by folder location and source corpus, matching the convention used across the project's repos (phase-2-the-stack-v1-* on language-decoded-data, phase2/ and phase3/ on language-decoded-experiments):

| Generation | Location in this repo | Source corpus | Legesher | Tier / seeds | Cite for paper? |
| --- | --- | --- | --- | --- | --- |
| Phase 3 (paper) | tiny-aya-base/…-seed*/ | bigcode/the-stack-v2-dedup | v0.7.3 | 5k (3 seeds) + 20k (1 seed) | ✅ Yes |
| Phase 2 (preliminary) | flat top-level condition-*/ | bigcode/the-stack (v1) | v0.5.1 / v0.6.0 | 5k / 32k (1 seed) | ❌ No |

A complete, machine-readable path → phase → corpus → condition map is in MANIFEST.md. Training-data provenance for each condition is detailed on language-decoded-data; the phase comparison is in the "Phase 2 → Phase 3 at a glance" table on the experiments repo.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

# Load a paper (Phase 3 · Stack v2-dedup) adapter, e.g. cond-1 (English code, seed 42, 5k tier).
# Paper adapters live under the `tiny-aya-base/` prefix.
model = PeftModel.from_pretrained(
    base_model,
    "legesher/language-decoded-lora",
    subfolder="tiny-aya-base/condition-1-en-5k-seed42",
)

# Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
model = PeftModel.from_pretrained(
    base_model,
    "legesher/language-decoded-lora",
    subfolder="tiny-aya-base/condition-2-zh-5k-seed42",
)

# Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
model = PeftModel.from_pretrained(
    base_model,
    "legesher/language-decoded-lora",
    subfolder="tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
)

# To load a *preliminary* Phase 2 / Stack v1 adapter instead, use the flat top-level
# folder (no `tiny-aya-base/` prefix), e.g. the original cond-2 Chinese hackathon adapter:
model = PeftModel.from_pretrained(
    base_model,
    "legesher/language-decoded-lora",
    subfolder="condition-2-zh-5k",
)
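If GPU memory is tight, the base model can also be loaded in 4-bit NF4 before the adapter is attached. The sketch below is one way to do that with `BitsAndBytesConfig` from transformers. It assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available; the function name `load_adapter_4bit` and the float16 compute dtype are choices made here for illustration, not settings taken from the project.

```python
PAPER_ADAPTER = "tiny-aya-base/condition-2-zh-5k-seed42"  # any Phase 3 subfolder works


def load_adapter_4bit(
    subfolder=PAPER_ADAPTER,
    repo_id="legesher/language-decoded-lora",
    base_id="CohereLabs/tiny-aya-base",
):
    """Load the base model in 4-bit NF4 and attach a single adapter.

    Heavy imports are deferred so this module can be imported without a GPU stack.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumption: float16 compute, a common choice on T4-class GPUs.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_id)

    # Attach one adapter to this freshly loaded base model.
    model = PeftModel.from_pretrained(base_model, repo_id, subfolder=subfolder)
    model.eval()
    return model, tokenizer
```

Calling `load_adapter_4bit()` once per adapter, each time with a freshly loaded base model, keeps different adapters from being stacked on the same model object.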

Training Details

| Parameter | Value |
| --- | --- |
| Base model | CohereLabs/tiny-aya-base (3.35B params, 70+ languages, low-resource emphasis) |
| Method | QLoRA 4-bit (NF4), ~5.4 GB VRAM, Unsloth-accelerated |
| Hardware | Kaggle T4 (16 GB) |
| Tokenizer | CohereLabs/tiny-aya-base |
| Transpilation tool | Legesher v0.7.3 (Phase 3); v0.5.1 / v0.6.0 used in Phase 2 |
| Cond-5 translation | c4ai-aya-expanse-32b accessed via the Cohere API (made possible by Cohere credits awarded to Legesher) |
| Training data | legesher/language-decoded-data |

QLoRA hyperparameters

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj |
| Bias | none |
| Task type | CAUSAL_LM |
| PEFT version | 0.18.1 |
| Quantization | NF4 (4-bit) via Unsloth |
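For readers who want to reproduce the adapter shape outside Unsloth, the hyperparameters above map onto a plain PEFT `LoraConfig` roughly as follows. This is a sketch, not the project's training script: the names `LORA_KWARGS` and `build_lora_config` are hypothetical, and the actual runs went through Unsloth, whose wrapper may set additional options.

```python
# Values copied from the QLoRA hyperparameters table.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Create a peft.LoraConfig from the table values (import deferred)."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


# With standard (non-rank-stabilized) LoRA, the adapter update is scaled by alpha / r.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
```

With rank 16 and alpha 32, the standard LoRA scaling factor works out to 2.0.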

Evaluation

Phase 3 models are evaluated on four multilingual benchmarks under template1 (English-prompt) and template2 (native-prompt) across the full data_lang × instr_lang matrix:

| Benchmark | What it measures | Examples per language |
| --- | --- | --- |
| XNLI | Natural-language inference | ~5,000 |
| X-CSQA | Commonsense reasoning | ~1,000 |
| SIB-200 | Topic classification | ~204 |
| Belebele | Reading comprehension | ~900 |

MGSM was used in Phase 2 and dropped from Phase 3: at 3.35B parameters and 250 examples per language, scores ranged from 2.8% to 10.8% across all conditions, with most condition-to-condition differences within noise. This is a useful null result; the budget was reallocated to SIB-200 and Belebele.
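A back-of-the-envelope calculation shows why differences of a few points on 250 examples are hard to separate from noise. The sketch below uses the binomial standard error with a normal approximation, which is rough at very low accuracies and assumes independent examples; the 7% accuracy is an illustrative value inside the reported range, not a measured score, and the function names are hypothetical.

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy estimate (as a fraction)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def ci95_half_width_pp(accuracy, n_examples):
    """Approximate 95% confidence half-width, in percentage points."""
    return 1.96 * accuracy_standard_error(accuracy, n_examples) * 100.0


if __name__ == "__main__":
    # Illustrative MGSM-like setting: 7% accuracy on 250 examples.
    print(f"MGSM-like: +/- {ci95_half_width_pp(0.07, 250):.1f} pp")
    # Illustrative SIB-200-like setting: 50% accuracy on 204 examples.
    print(f"SIB-200-like: +/- {ci95_half_width_pp(0.50, 204):.1f} pp")
```

Under these assumptions the interval for the MGSM-like case is roughly ±3 percentage points, which is comparable in size to much of the 2.8% to 10.8% spread.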

Paper-grade evaluation results live on legesher/language-decoded-experiments; see the refined-tables and the writeup at expedition-tiny-aya/analysis/phase-3/phase3-refined-evaluation.md.

Limitations

  • Single base model: All adapters are trained on CohereLabs/tiny-aya-base (3.35B params). Results may not generalize to larger or architecturally different models. Future iterations will expand to additional base models.
  • Per-language fine-tuning only: Every condition is per-language; each cond-2-{zh,es,ur}-5k (and cond-5-{zh,es,ur}-5k) is a separate training run. Combined-language training is a planned future condition.
  • Limited training data: 5k and 20k file tiers are constrained by Kaggle T4 hardware limits. 103k variants exist on the training data repo but no 103k adapters have been trained yet.
  • Consumer hardware: Training on Kaggle T4 (16 GB) with 4-bit quantization introduces approximation that may affect adapter quality compared to full-precision training.
  • Extractor coverage: When citing Phase 3 results, use the refined-extractor scores. See the banner at the top of this card and the experiments repo for full details.
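The extractor-coverage limitation can be illustrated with a toy example. This is not the project's extractor: the label subset, the alias table, and both function names are hypothetical, chosen only to show how an extractor that accepts English labels alone scores a correct native-script answer as a miss.

```python
# Hypothetical subset of topic labels, for illustration only.
EN_LABELS = {"sports", "health", "politics"}

# Hypothetical native-script aliases mapped back to the English label.
NATIVE_ALIASES = {
    "deportes": "sports",   # Spanish
    "salud": "health",      # Spanish
    "体育": "sports",        # Chinese
    "健康": "health",        # Chinese
    "کھیل": "sports",        # Urdu
    "صحت": "health",         # Urdu
}


def strict_extract(model_output):
    """Accept only exact English labels; anything else is unparseable."""
    answer = model_output.strip().lower()
    return answer if answer in EN_LABELS else None


def refined_extract(model_output):
    """Accept English labels and known native-script aliases."""
    answer = model_output.strip().lower()
    if answer in EN_LABELS:
        return answer
    return NATIVE_ALIASES.get(answer)


if __name__ == "__main__":
    for output in ["Sports", "体育", "salud", "unknown"]:
        print(repr(output), strict_extract(output), refined_extract(output))
```

A model fine-tuned on fully translated code is more likely to answer in the target script, so a strict extractor of this kind would penalize exactly the conditions under study.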

Related Resources

Citation

@misc{language-decoded-2026,
  title={Language Decoded: Exploring the Impact of Native Code on Multilingual Models},
  author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/legesher/language-decoded-lora}
}

License

Apache 2.0
