Ollama support - io.yaml files and GGUF weights

#6

Summary

Adds Ollama support for all 4 GPT-OSS-20B LoRA adapters (answerability, citations, hallucination_detection, query_rewrite).

What's included

  • io.yaml config files for each LoRA adapter (converted from the original Python io.yaml format to an Ollama-compatible format)
  • Lora-q8_0.gguf – pre-converted GGUF LoRA weights (Q8_0 quantization) for each adapter
  • Modelfile – Ollama Modelfile for each adapter (references the gpt-oss:20b base model)
  • run_ollama.sh – script to load all LoRA adapters into a running Ollama instance
  • _ollama/convert_to_gguf.sh – conversion script that downloads the base model, clones llama.cpp, and converts all LoRA adapters from safetensors to GGUF
  • _ollama/convert_io_yaml_files.py – helper to convert io.yaml files
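
For concreteness, each per-adapter Modelfile presumably pairs the base model with its GGUF adapter via Ollama's FROM and ADAPTER directives. A minimal sketch (the adapter directory name, output file name, and model tag below are illustrative assumptions, not the exact contents of this PR):

```shell
#!/usr/bin/env sh
# Hypothetical sketch: generate an Ollama Modelfile for one LoRA adapter.
# Directory and file names are illustrative.
ADAPTER=answerability
cat > Modelfile.sketch <<EOF
FROM gpt-oss:20b
ADAPTER ./${ADAPTER}/lora-q8_0.gguf
EOF
# With a running Ollama instance this would then be registered via:
#   ollama create "gpt-oss-20b-${ADAPTER}" -f Modelfile.sketch
```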

llama.cpp patches required

The convert_to_gguf.sh script clones llama.cpp from master, but two local patches are needed for the answerability and query_rewrite adapters (which target MoE expert layers via PEFT target_parameters):

  1. convert_lora_to_gguf.py – remap PEFT's base_layer naming convention to the actual HuggingFace tensor names; split the interleaved gate_up_proj LoRA into separate gate and up LoRAs; bypass the MXFP4 codepath for LoRA tensors
  2. gguf-py/gguf/gguf_writer.py – handle 2D expert LoRA tensors in parameter counting

These patches are not yet upstreamed; no existing issue or PR on ggml-org/llama.cpp addresses this. The GGUF files in this PR were generated with these patches applied.
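
The overall conversion flow described above might look roughly like the following sketch. The repository URL is real; the patch file names, adapter directory layout, and exact converter flags are assumptions, and the sketch records commands instead of executing them by default:

```shell
#!/usr/bin/env sh
# Illustrative sketch of the convert_to_gguf.sh flow: clone llama.cpp,
# apply the two local patches, then convert each adapter to GGUF.
# By default (DRY_RUN=1) each command is only logged, not executed.
: > commands.log
run() {
  echo "+ $*" | tee -a commands.log
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run git clone https://github.com/ggml-org/llama.cpp
# Hypothetical patch files carrying the two fixes described above.
run git -C llama.cpp apply ../patches/convert_lora_to_gguf.patch
run git -C llama.cpp apply ../patches/gguf_writer.patch

for adapter in answerability citations hallucination_detection query_rewrite; do
  run python llama.cpp/convert_lora_to_gguf.py \
    --base ./gpt-oss-20b --outtype q8_0 "./${adapter}"
done
```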

Usage

# 1. Have Ollama running with gpt-oss:20b loaded
# 2. From the repo root:
bash run_ollama.sh
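
A loader like run_ollama.sh presumably loops over the four adapters and registers each one with the running Ollama instance. A minimal sketch of that loop (the model naming scheme is an assumption; the actual ollama call is shown as a comment since it needs a live server):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the loader loop a script like run_ollama.sh
# might contain; model names are illustrative.
loaded=""
for adapter in answerability citations hallucination_detection query_rewrite; do
  loaded="${loaded} gpt-oss-20b-${adapter}"
  # With a running Ollama instance this would be:
  #   ollama create "gpt-oss-20b-${adapter}" -f "${adapter}/Modelfile"
done
echo "would create:${loaded}"
```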

From https://github.com/ibm-granite/granite-common/pull/134#issuecomment-3994228106

Ok, it turns out Ollama has a new engine that gpt-oss requires but that does not yet support LoRA adapters. Claude found the issue in the Ollama code:

Ollama has two inference runners: the older llamarunner (C++ based, supports LoRA) and the newer ollamarunner (Go based, where LoRA is still a TODO). Models like gpt-oss, deepseek2, gemma3, qwen3, llama4, etc. are hardcoded to require the ollamarunner via OllamaEngineRequired(), so LoRA adapters cannot be used with any of these models. The code carries a TODO(jessegross): LoRA loading comment, but no issue or PR tracks it.

Closing this PR until support is available.

kndtran changed pull request status to closed
