TARA-WorldModel-VICReg

Joint environment–proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared latent space.

Model Description

Architecture

Two two-layer MLP encoders, one per modality, trained with a VICReg alignment loss:

Environment branch: Input(env_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
PFAM branch:        Input(pfam_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
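The two branches above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's training code: the class name `BranchEncoder` and the sizes `in_dim=16`, `in_dim=32`, and `hidden_dim=128` are hypothetical, since the card does not state `env_dim` or the hidden width.

```python
import torch
from torch import nn


class BranchEncoder(nn.Module):
    """One branch: Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(latent)."""

    def __init__(self, in_dim: int, hidden_dim: int,
                 latent_dim: int = 32, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Hypothetical input and hidden sizes, chosen only so the sketch runs.
env_encoder = BranchEncoder(in_dim=16, hidden_dim=128)
pfam_encoder = BranchEncoder(in_dim=32, hidden_dim=128)

# eval() disables dropout so the forward pass is deterministic.
env_encoder.eval()
pfam_encoder.eval()
with torch.no_grad():
    z_env = env_encoder(torch.randn(8, 16))
    z_pfam = pfam_encoder(torch.randn(8, 32))
```

Both branches map inputs of different widths into the same 32-dimensional space, which is what allows an alignment loss to compare them directly.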

Latent dimension: 32
Parameters: ~53K–64K depending on PFAM input dimensionality
VICReg loss weights: variance = 25.0, invariance = 25.0, covariance = 1.0
Prediction head alpha: 1.0
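For orientation, the three weighted terms can be written out following the formulation in Bardes et al. (2022). The sketch below is an assumption about how the loss is assembled, not the repository's code: the function names, the target standard deviation `gamma=1.0`, and `eps=1e-4` are taken from the paper's defaults rather than from this card, and the prediction head is not modeled.

```python
import torch
import torch.nn.functional as F


def off_diagonal_sq_sum(m: torch.Tensor) -> torch.Tensor:
    """Sum of squared off-diagonal entries of a square matrix."""
    return m.pow(2).sum() - m.diagonal().pow(2).sum()


def vicreg_loss(z_a, z_b, w_var=25.0, w_inv=25.0, w_cov=1.0,
                gamma=1.0, eps=1e-4):
    """VICReg loss between two batches of embeddings of shape (n, d)."""
    n, d = z_a.shape

    # Invariance: mean squared error between paired embeddings.
    inv = F.mse_loss(z_a, z_b)

    # Center each branch over the batch dimension.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)

    # Variance: hinge that pushes each dimension's std toward gamma.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.relu(gamma - std_a).mean() + torch.relu(gamma - std_b).mean()

    # Covariance: penalize off-diagonal covariance within each branch.
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    cov = off_diagonal_sq_sum(cov_a) / d + off_diagonal_sq_sum(cov_b) / d

    return w_inv * inv + w_var * var + w_cov * cov


# Synthetic embeddings standing in for the two branch outputs.
torch.manual_seed(0)
z_env = torch.randn(64, 32)
z_pfam = z_env + 0.1 * torch.randn(64, 32)
loss = vicreg_loss(z_env, z_pfam)
```

The variance term is what prevents the trivial solution in which both encoders emit a constant vector: in that case the invariance and covariance terms are zero, but the variance hinge stays large.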

Training Data

Source: 1,151 of 1,810 total TARA Oceans samples, namely those with complete productivity data (chlorophyll-a; particulate organic carbon, POC; normalized fluorescence line height, NFLH)
Environmental features: Google Earth Engine oceanographic variables
PFAM features: CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions
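The PFAM preprocessing (centered log-ratio transform followed by PCA) can be sketched with NumPy alone. Several details here are assumptions, because the card does not specify them: the `pseudocount` used to handle zero abundances, the function names, and the synthetic count matrix.

```python
import numpy as np


def clr_transform(abundances, pseudocount=1.0):
    """Centered log-ratio transform applied row-wise (one row per sample).

    The pseudocount is an assumed way to avoid log(0).
    """
    x = np.asarray(abundances, dtype=float) + pseudocount
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


def pca_reduce(x, n_components):
    """Project x onto its top principal components using an SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic domain counts: 100 samples x 200 domains (illustrative only).
rng = np.random.default_rng(0)
abundances = rng.poisson(lam=5.0, size=(100, 200))

clr = clr_transform(abundances)
pfam20 = pca_reduce(clr, 20)   # analogous to the "pfam20" configuration
```

After the CLR step every sample's values sum to zero, which removes the dependence on total sequencing depth before PCA is applied.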

Performance

This model represents an exploratory methodological approach that did not outperform the primary XGBoost bidirectional framework used in the ELF-NET study. Results are reported for transparency and reproducibility.

6-Fold Leave-One-Basin-Out (LOBO) CV

Target	Joint Model R²	Env-Only Baseline R²	Cohen's d	p-value
POC	0.532	0.422	0.026	0.38
Chl-a	0.516	0.561	—	—
NFLH	0.560	0.700	—	—

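The POC improvement (+0.110 R²) was not statistically significant (Cohen's d = 0.026, p = 0.38). For Chl-a and NFLH, the joint model scored below the env-only baseline (ΔR² = −0.045 and −0.140, respectively).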
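A leave-one-basin-out split holds out every sample from one basin per fold and trains on the rest. The generator below is a minimal sketch of that idea; the basin labels are hypothetical placeholders and the helper name `leave_one_group_out` is not from the repository.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each unique group."""
    groups = np.asarray(groups)
    for group in np.unique(groups):
        test_mask = groups == group
        yield group, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical basin labels, one per sample.
basins = np.array(["A", "A", "B", "B", "B", "C"])
folds = list(leave_one_group_out(basins))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx.tolist(), "test:", test_idx.tolist())
```

With six distinct basin labels this yields the six folds reported in the table; each test fold contains samples from a region the model never saw during training.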

9-Fold Spatial Block CV (matching primary XGBoost design)

PFAM dim	XGB Baseline R²	VICReg R²	ΔR²
pfam20	0.417	−2.045	−2.462
pfam32	0.417	−4.217	−4.634
pfam64	0.417	−1.262	−1.679

The MLP architecture produced strongly negative R² (worse than predicting the held-out mean) on spatially distinctive held-out basins (Mediterranean, mid-Pacific), where the distribution shift between training and test basins appears to exceed what a shallow neural network can extrapolate across. XGBoost's tree-based partitioning handled this regime far more effectively at N ≈ 1,100.
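R² values far below zero follow directly from the definition R² = 1 − SS_res / SS_tot: whenever the squared prediction error exceeds the variance of the held-out targets, the score goes negative without bound. The toy numbers below are invented purely to show the arithmetic and are not drawn from the TARA data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the held-out mean for every sample gives R² = 0.
r2_mean = r2_score(y_true, [2.5] * 4)

# A constant offset of +3, as might arise under distribution shift,
# gives SS_res = 36 against SS_tot = 5, so R² = 1 - 36/5 = -6.2.
r2_shifted = r2_score(y_true, [t + 3.0 for t in y_true])

print(r2_mean, r2_shifted)
```

A systematic offset on a held-out basin is enough to produce scores of this magnitude, even when the predictions track the targets' relative ordering perfectly.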

Interpretation

The poor performance under spatial CV is driven by an architecture confound (MLP vs. XGBoost for small tabular data), not necessarily by the absence of a PFAM alignment signal. A fair comparison would require training XGBoost on the VICReg embeddings, which was not evaluated. The XGBoost bidirectional framework was therefore retained as the primary modeling approach.

Repository Contents

Model checkpoints (.pt files) for each fold and PFAM dimensionality
Hyperparameter sweep results
Per-fold training curves and metrics

Usage

import torch

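# Load a VICReg checkpoint.
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoint files from a source you trust.
checkpoint = torch.load(
    "path/to/vicreg_checkpoint.pt",
    map_location="cpu",
    weights_only=False,
)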
state_dict = checkpoint["model_state_dict"]
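Because the card does not list the parameter names or layer sizes stored in each checkpoint, a reasonable first step is to print the tensor shapes in `state_dict` and read the input and hidden dimensions off them. The sketch below assumes only the `"model_state_dict"` key shown above; the stand-in model and the `summarize_state_dict` helper are hypothetical and exist so the example runs without the real files.

```python
import os
import tempfile

import torch
from torch import nn


def summarize_state_dict(state_dict):
    """Return [(name, shape), ...] and the total number of stored values."""
    shapes = [(name, tuple(t.shape)) for name, t in state_dict.items()]
    total = sum(t.numel() for t in state_dict.values())
    return shapes, total


# Stand-in checkpoint with made-up layer sizes (not the real model).
demo_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 32)
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_checkpoint.pt")
    torch.save({"model_state_dict": demo_model.state_dict()}, path)
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)

shapes, total = summarize_state_dict(checkpoint["model_state_dict"])
for name, shape in shapes:
    print(f"{name:12s} {shape}")
print("total values:", total)
```

For a `Linear` layer the weight shape is `(out_features, in_features)`, so the first weight matrix of each branch reveals that branch's input dimensionality.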

Related Models

TARA-XGBoost-Bidirectional — Primary bidirectional models used in the ELF-NET study
algaGPT — Protein classification model used for proteome extraction

References

Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.

Authors

David R. Nelson, Kourosh Salehi-Ashtiani

Green Genomics Lab, New York University Abu Dhabi

Citation

@article{elfnet2026,
  title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
  author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
  journal={Forthcoming},
  year={2026}
}

Contact

Kourosh Salehi-Ashtiani — ksa3@nyu.edu
