Orthoformer Models

This repository contains pre-trained Orthoformer foundation models for function-centric representation learning of microbial and viral genomes.

Unlike sequence-based protein or nucleotide models, Orthoformer operates on orthologous group composition and abundance, treating functional units as tokens and learning genome-level embeddings that capture evolutionary, metabolic, and ecological signals.
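For intuition, the model's input is a sequence of orthologous-group (OG) tokens rather than nucleotides or amino acids. A minimal sketch of this representation, using hypothetical COG-style identifiers and an ad-hoc vocabulary (the repository's actual preprocessing defines both):

```python
# Hypothetical sketch: a genome represented as a sequence of orthologous-group
# (OG) tokens. Real identifiers and the vocabulary come from the repository's
# preprocessing pipeline; these COG-style names are placeholders.
genome_profile = ["COG0001", "COG0123", "COG0123", "COG2207", "COG0605"]

# Map each OG to an integer token id via a vocabulary (ad hoc here; ids 0-3
# are reserved for special tokens by assumption).
vocab = {og: i for i, og in enumerate(sorted(set(genome_profile)), start=4)}
input_ids = [vocab[og] for og in genome_profile]
print(input_ids)  # [4, 5, 5, 7, 6]
```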

The models are trained on approximately 3 million microbial and viral genomes, encoded as functional profiles derived from orthologous gene groups.


🧬 Model Families

All Orthoformer models learn a functional embedding space that supports:

  • Alignment-free phylogeny and taxonomy (see the sketch after this list)
  • Functional convergence and divergence
  • Metabolic and biosynthetic capacity prediction
  • Genome-level phenotype inference
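As an example of the first item, alignment-free phylogeny reduces to distance computations in the embedding space. A minimal sketch, assuming genome embeddings are already available as rows of a NumPy array (random placeholder data here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder: in practice these rows would be Orthoformer genome embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 512))  # 5 genomes, 512-dim (v8-sized)

# Pairwise cosine distances give an alignment-free dissimilarity matrix,
# which can feed standard tree-building methods (e.g., neighbor joining).
dist_matrix = squareform(pdist(embeddings, metric="cosine"))
print(dist_matrix.shape)  # (5, 5)
```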

πŸ“¦ Available Models

🧠 Foundation Models

| Model | Training Genomes | Max Length | Hidden Size | Layers | Heads | Description |
|---|---|---|---|---|---|---|
| model_3M_2048_v8 | 3M | 2048 | 512 | 6 | 8 | Base Orthoformer foundation model |
| model_3M_2048_v10 | 3M | 2048 | 1024 | 12 | 16 | Large Orthoformer foundation model |
| model_140k_2048_v18 | 140k | 2048 | 512 | 6 | 8 | Compact foundation model |

All foundation models use:

  • ALiBi positional encoding: enables long-context modeling across variable-length microbial genomes, preserving functional relationships between orthologous groups.
  • Span-masked language modeling (span-MLM, span=3): 15% of OG tokens are masked or corrupted following a BERT-style scheme, allowing the model to learn co-occurrence patterns, functional modules, and evolutionary dependencies in a self-supervised manner. A simplified masking sketch follows this list.
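The masking code itself is not part of this repository; the following is a simplified sketch of span masking with span=3 and a 15% budget (the mask token id and the exact corruption mix are assumptions):

```python
import random

def span_mask(input_ids, mask_token_id=3, mask_rate=0.15, span=3):
    """Mask contiguous spans of OG tokens until ~mask_rate of the sequence
    is covered. A simplified sketch of span-MLM; the released models may
    differ in details (corruption mix, span-length sampling)."""
    ids = list(input_ids)
    labels = [-100] * len(ids)      # -100 = position ignored by the MLM loss
    budget = max(1, int(mask_rate * len(ids)))
    masked = 0
    while masked < budget:
        start = random.randrange(len(ids))
        for i in range(start, min(start + span, len(ids))):
            if labels[i] == -100:   # don't re-mask the same position
                labels[i] = ids[i]  # the model must predict the original token
                ids[i] = mask_token_id
                masked += 1
    return ids, labels

masked_ids, labels = span_mask(list(range(10, 30)))
```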

🎯 Task-Specific Models

| Model | Task | Initialized From |
|---|---|---|
| Orthoformer_CRISPR_model | CRISPR-associated genome prediction | model_3M_2048_v10 |
| BGC_abundance_regression_model | Biosynthetic gene cluster abundance | model_3M_2048_v10 |

These models adapt the foundation embeddings to organism-level functional phenotypes.
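As a rough illustration of that adaptation, a task head can be trained on pooled foundation embeddings. A hypothetical PyTorch sketch, not the released architecture (layer shapes assume model_3M_2048_v10's 1024-dim hidden size):

```python
import torch
import torch.nn as nn

class GenomePhenotypeHead(nn.Module):
    """Hypothetical task head over pooled Orthoformer token embeddings;
    hidden_dim=1024 matches model_3M_2048_v10's hidden size."""
    def __init__(self, hidden_dim=1024, num_labels=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, num_labels),
        )

    def forward(self, token_embeddings, attention_mask):
        # Mean-pool token embeddings into one genome-level vector,
        # ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.classifier(pooled)

head = GenomePhenotypeHead()
logits = head(torch.randn(2, 2048, 1024), torch.ones(2, 2048))
print(logits.shape)  # torch.Size([2, 2])
```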

Download Methods

Method 1: Using Hugging Face CLI

```bash
# Install huggingface-hub
pip install huggingface-hub

# Download the entire model repository
huggingface-cli download jackkuo/Orthoformer --local-dir ./model

# Or download a specific model via an include pattern
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v8/*" --local-dir ./model
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v10/*" --local-dir ./model
```

Method 2: Using Python Code

```python
from huggingface_hub import snapshot_download

# Download the entire model repository
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    local_dir="./model",
    local_dir_use_symlinks=False,
)

# Or download a specific model
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    allow_patterns="model_3M_2048_v8/*",
    local_dir="./model",
    local_dir_use_symlinks=False,
)
```

Method 3: Using Git LFS

```bash
# Recommended for large model files
git lfs install
git xet install || true  # optional; skipped silently if git-xet is unavailable
git clone https://huggingface.co/jackkuo/Orthoformer ./model
```

Model Usage

After downloading, you can use feature_extraction_example.py to load the models and extract features:

```bash
# Using model_3M_2048_v8 (ALiBi positional encoding)
python feature_extraction_example.py --model_dir model/model_3M_2048_v8 --use_alibi
```
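If you prefer to embed the extraction step in your own code, the general pattern resembles the sketch below; the checkpoint format, tokenizer behavior, and `trust_remote_code` usage are assumptions here, so treat feature_extraction_example.py as the authoritative reference:

```python
# Hypothetical sketch; see feature_extraction_example.py for the actual API.
import torch
from transformers import AutoModel, AutoTokenizer

model_dir = "model/model_3M_2048_v8"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True)
model.eval()

# A genome encoded as space-separated OG tokens (placeholder identifiers).
inputs = tokenizer("COG0001 COG0123 COG2207", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
embedding = hidden.mean(dim=1)                  # mean-pooled genome vector
```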

πŸ“œ License

These models are released under the MIT License.


πŸ“– Citation

If you use these models, please cite:

```bibtex
@dataset{xxx,
  title = {Orthoformer: xxx},
  author = {xxx},
  year = {2025},
}
```

πŸ”— Related Resources


Notes

  • Model files are large; make sure you have sufficient disk space.
  • Download speed depends on your network connection; a stable connection is recommended.
  • If a download is interrupted, re-run the same command; the tool will resume automatically.