Ollama support - io.yaml files and GGUF weights

#6

Summary

Adds Ollama support for all 4 GPT-OSS-20B LoRA adapters (answerability, citations, hallucination_detection, query_rewrite).

What's included

  • io.yaml config files for each LoRA adapter (converted from the original Python io.yaml format to an Ollama-compatible format)
  • Lora-q8_0.gguf – pre-converted GGUF LoRA weights (Q8_0 quantization) for each adapter
  • Modelfile – Ollama Modelfile for each adapter (references the gpt-oss:20b base model)
  • run_ollama.sh – script to load all LoRA adapters into a running Ollama instance
  • _ollama/convert_to_gguf.sh – conversion script that downloads the base model, clones llama.cpp, and converts all LoRA adapters from safetensors to GGUF
  • _ollama/convert_io_yaml_files.py – helper to convert io.yaml files
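
For concreteness, each per-adapter Modelfile presumably pairs the base model with its GGUF adapter via Ollama's FROM and ADAPTER directives. A minimal sketch (the adapter directory name, output file name, and model tag below are illustrative assumptions, not the exact contents of this PR):

```shell
#!/usr/bin/env sh
# Hypothetical sketch: generate an Ollama Modelfile for one LoRA adapter.
# Directory and file names are illustrative.
ADAPTER=answerability
cat > Modelfile.sketch <<EOF
FROM gpt-oss:20b
ADAPTER ./${ADAPTER}/lora-q8_0.gguf
EOF
# With a running Ollama instance this would then be registered via:
#   ollama create "gpt-oss-20b-${ADAPTER}" -f Modelfile.sketch
```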

llama.cpp patches required

The convert_to_gguf.sh script clones llama.cpp from master, but two local patches are needed for the answerability and query_rewrite adapters (which target MoE expert layers via PEFT target_parameters):

  1. convert_lora_to_gguf.py – remap PEFT's base_layer naming convention to the actual HuggingFace tensor names; split the interleaved gate_up_proj LoRA into separate gate and up LoRAs; bypass the MXFP4 codepath for LoRA tensors
  2. gguf-py/gguf/gguf_writer.py – handle 2D expert LoRA tensors in parameter counting

These patches are not yet upstreamed; no existing issue or PR on ggml-org/llama.cpp addresses this. The GGUF files in this PR were generated with these patches applied.
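
The overall conversion flow described above might look roughly like the following sketch. The repository URL is real; the patch file names, adapter directory layout, and exact converter flags are assumptions, and the sketch records commands instead of executing them by default:

```shell
#!/usr/bin/env sh
# Illustrative sketch of the convert_to_gguf.sh flow: clone llama.cpp,
# apply the two local patches, then convert each adapter to GGUF.
# By default (DRY_RUN=1) each command is only logged, not executed.
: > commands.log
run() {
  echo "+ $*" | tee -a commands.log
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

run git clone https://github.com/ggml-org/llama.cpp
# Hypothetical patch files carrying the two fixes described above.
run git -C llama.cpp apply ../patches/convert_lora_to_gguf.patch
run git -C llama.cpp apply ../patches/gguf_writer.patch

for adapter in answerability citations hallucination_detection query_rewrite; do
  run python llama.cpp/convert_lora_to_gguf.py \
    --base ./gpt-oss-20b --outtype q8_0 "./${adapter}"
done
```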

Usage

# 1. Have Ollama running with gpt-oss:20b loaded
# 2. From the repo root:
bash run_ollama.sh
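
A loader like run_ollama.sh presumably loops over the four adapters and registers each one with the running Ollama instance. A minimal sketch of that loop (the model naming scheme is an assumption; the actual ollama call is shown as a comment since it needs a live server):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the loader loop a script like run_ollama.sh
# might contain; model names are illustrative.
loaded=""
for adapter in answerability citations hallucination_detection query_rewrite; do
  loaded="${loaded} gpt-oss-20b-${adapter}"
  # With a running Ollama instance this would be:
  #   ollama create "gpt-oss-20b-${adapter}" -f "${adapter}/Modelfile"
done
echo "would create:${loaded}"
```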

From https://github.com/ibm-granite/granite-common/pull/134#issuecomment-3994228106

Ok, it turns out Ollama has a new engine that gpt-oss requires but that does not yet support LoRA adapters. Claude found the issue in the Ollama code:

Ollama has two inference runners: the older llamarunner (C++ based, supports LoRA) and the newer ollamarunner (Go based, where LoRA is still a TODO). Models like gpt-oss, deepseek2, gemma3, qwen3, llama4, etc. are hardcoded to require the ollamarunner via OllamaEngineRequired(), so LoRA adapters cannot be used with any of these models. The code carries a TODO(jessegross): LoRA loading comment, but no issue or PR tracks it.

Closing this PR until support is available.

kndtran changed pull request status to closed
