TARA-WorldModel-VICReg

Joint environment–proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared latent space.

Model Description

Architecture

Two two-layer MLP encoders, one per input modality, trained with a VICReg alignment loss:

Environment branch: Input(env_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
PFAM branch:        Input(pfam_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
  • Latent dimension: 32
  • Parameters: ~53K–64K depending on PFAM input dimensionality
  • VICReg loss weights: variance = 25.0, invariance = 25.0, covariance = 1.0
  • Prediction head alpha: 1.0
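The two branches and the weighted VICReg objective listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code from this repository. The hidden width of 256 is an assumption: the card does not state it, and 256 is chosen only because the roughly 11K spread in reported parameter counts would match the extra first-layer weights between 20- and 64-dimensional PFAM inputs (44 × 256 = 11,264). The environment input size, batch size, `eps`, and the helper names are likewise hypothetical, the prediction head is omitted, and normalisation conventions for the loss terms differ between the VICReg paper and public implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 32   # from the model card
HIDDEN = 256      # ASSUMPTION: hidden width is not stated in the card


def make_branch(in_dim: int, hidden: int = HIDDEN, p_drop: float = 0.3) -> nn.Sequential:
    """Input(in_dim) -> Linear(hidden) -> ReLU -> Dropout(0.3) -> Linear(32)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden, LATENT_DIM),
    )


def n_params(module: nn.Module) -> int:
    """Count all parameters in a module."""
    return sum(p.numel() for p in module.parameters())


def _variance_term(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # Hinge on the per-dimension standard deviation: penalise std below 1.
    std = torch.sqrt(z.var(dim=0) + eps)
    return F.relu(1.0 - std).mean()


def _covariance_term(z: torch.Tensor) -> torch.Tensor:
    # Sum of squared off-diagonal covariance entries, scaled by the dimension.
    n, d = z.shape
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum() / d


def vicreg_loss(z_env, z_pfam, w_var=25.0, w_inv=25.0, w_cov=1.0):
    """Weighted sum of variance, invariance, and covariance terms."""
    inv = F.mse_loss(z_env, z_pfam)  # invariance: paired embeddings should match
    var = _variance_term(z_env) + _variance_term(z_pfam)
    cov = _covariance_term(z_env) + _covariance_term(z_pfam)
    return w_var * var + w_inv * inv + w_cov * cov


torch.manual_seed(0)
ENV_DIM, PFAM_DIM, BATCH = 40, 32, 16  # hypothetical sizes for the demo
env_branch = make_branch(ENV_DIM)
pfam_branch = make_branch(PFAM_DIM)
z_env = env_branch(torch.randn(BATCH, ENV_DIM))
z_pfam = pfam_branch(torch.randn(BATCH, PFAM_DIM))
loss = vicreg_loss(z_env, z_pfam)
print(z_env.shape, z_pfam.shape, float(loss))
```

Both branches map into the same 32-dimensional space, so the invariance term can compare paired samples directly, while the variance and covariance terms are applied to each branch separately to discourage collapsed or redundant dimensions.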

Training Data

  • Source: 1,151 samples with complete productivity data (chlorophyll-a, POC, NFLH) from 1,810 total TARA Oceans samples
  • Environmental features: Google Earth Engine oceanographic variables
  • PFAM features: CLR-transformed domain abundances reduced via PCA to 20, 32, or 64 dimensions
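The PFAM preprocessing described above (CLR transform followed by PCA) might look roughly like the NumPy sketch below. The pseudocount of 1.0, the synthetic count matrix, and the function names are assumptions for illustration. PCA is done here with a plain SVD, swapped in for whichever PCA implementation the authors actually used.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform, applied per sample (row).

    ASSUMPTION: a pseudocount is added to avoid log(0); the card does not
    say how zeros were handled.
    """
    logx = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)


def pca_fit(X, n_components):
    """Return the feature means and the top principal axes via SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project centered data onto the principal axes."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(200, 300))  # synthetic: 200 samples x 300 domains
X_clr = clr(counts)

# One reduced matrix per PFAM dimensionality named in the card.
reduced = {k: pca_transform(X_clr, *pca_fit(X_clr, k)) for k in (20, 32, 64)}
for k, Z in reduced.items():
    print(k, Z.shape)
```

Each CLR-transformed row sums to zero by construction, which is why the transform is a common choice for compositional abundance data before a linear method such as PCA.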

Performance

This model represents an exploratory methodological approach that did not outperform the primary XGBoost bidirectional framework used in the ELF-NET study. Results are reported for transparency and reproducibility.

6-Fold Leave-One-Basin-Out (LOBO) CV

Target   Joint Model R²   Env-Only Baseline R²   Cohen's d   p-value
POC      0.532            0.422                  0.026       0.38
Chl-a    0.516            0.561
NFLH     0.560            0.700

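The POC improvement (+0.110 R²) was not statistically significant (Cohen's d = 0.026, p = 0.38). For Chl-a and NFLH, the joint model scored below the environment-only baseline.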
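For readers unfamiliar with leave-one-basin-out splitting, the sketch below shows the general idea: every sample from one basin is held out in turn, and the model is trained on the rest. The basin labels, sample counts, and function name are hypothetical placeholders, not the actual TARA Oceans basin assignments.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical basin labels: six basins with 20 samples each.
basin_names = ["basin_A", "basin_B", "basin_C", "basin_D", "basin_E", "basin_F"]
basins = np.repeat(basin_names, 20)

folds = list(leave_one_group_out(basins))
for name, train_idx, test_idx in folds:
    print(f"held out {name}: train={len(train_idx)}, test={len(test_idx)}")
```

Because whole basins are withheld, each fold tests extrapolation to an unseen region, which is a stricter setting than a random split of samples.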

9-Fold Spatial Block CV (matching primary XGBoost design)

PFAM dim   XGB Baseline R²   VICReg R²   ΔR²
pfam20     0.417             −2.045      −2.462
pfam32     0.417             −4.217      −4.634
pfam64     0.417             −1.262      −1.679

The MLP produced strongly negative R² (worse than predicting the held-out mean) on spatially distinctive held-out basins (Mediterranean, mid-Pacific), where distribution shift between training and test basins appears to defeat the shallow network. XGBoost's tree-based partitioning handled this regime far more effectively at N ≈ 1,100.
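A negative R² means the predictions have a larger squared error than simply predicting the mean of the held-out targets. The toy example below illustrates this with made-up numbers (not data from this study) and then recomputes the ΔR² column from the reported table values. The function name `r_squared` is only illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy held-out targets (made up for illustration).
y_held_out = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the held-out mean everywhere gives R² = 0.
r2_mean = r_squared(y_held_out, np.full(4, y_held_out.mean()))

# A constant offset of +5 gives SS_res = 100 against SS_tot = 5, so R² = -19.
r2_shifted = r_squared(y_held_out, y_held_out + 5.0)
print(r2_mean, r2_shifted)

# ΔR² = VICReg R² minus XGBoost baseline R², using the values in the table.
xgb_baseline = 0.417
vicreg_r2 = {"pfam20": -2.045, "pfam32": -4.217, "pfam64": -1.262}
delta_r2 = {k: round(v - xgb_baseline, 3) for k, v in vicreg_r2.items()}
print(delta_r2)
```

A systematic offset on a held-out basin is enough to push R² far below zero even when predictions track the right trend, which is one way distribution shift can produce values like those in the table.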

Interpretation

The poor performance under spatial CV is likely driven by an architecture confound (MLP vs. XGBoost on small tabular data) and does not by itself show that the PFAM alignment signal is absent. A fair comparison would require fitting XGBoost on the VICReg embeddings, which was not evaluated. The XGBoost bidirectional framework was retained as the primary modeling approach.

Repository Contents

  • Model checkpoints (.pt files) for each fold and PFAM dimensionality
  • Hyperparameter sweep results
  • Per-fold training curves and metrics

Usage

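import torch

# Load a VICReg checkpoint onto the CPU.
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust.
checkpoint = torch.load(
    "path/to/vicreg_checkpoint.pt",
    map_location="cpu",
    weights_only=False,
)
# The model weights are stored under the "model_state_dict" key.
state_dict = checkpoint["model_state_dict"]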
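Since the layer names inside the checkpoints are not documented here, a reasonable first step after loading is to list the parameter names and shapes in the state dict and then rebuild a matching module. The sketch below demonstrates that workflow on a small stand-in model saved to a temporary file. The stand-in architecture, its sizes, and the file name are hypothetical; only the `"model_state_dict"` key comes from the snippet above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in model, NOT the real encoder layout.
stand_in = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(16, 32),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_checkpoint.pt")
    # Save in the same {"model_state_dict": ...} layout as the usage snippet.
    torch.save({"model_state_dict": stand_in.state_dict()}, path)
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)

state_dict = checkpoint["model_state_dict"]

# Inspect parameter names and shapes to infer the layer sizes.
shapes = {name: tuple(t.shape) for name, t in state_dict.items()}
for name, shape in shapes.items():
    print(name, shape)

# Once a module with matching names and shapes exists, load the weights and
# switch to eval mode so that dropout is disabled at inference time.
stand_in.load_state_dict(state_dict)
stand_in.eval()
with torch.no_grad():
    z = stand_in(torch.zeros(1, 8))
print(z.shape)
```

For a real checkpoint, the printed shapes of the first and last linear layers would reveal the input dimensionality and confirm the 32-dimensional latent output.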

Related Models

References

  • Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.

Authors

David R. Nelson, Kourosh Salehi-Ashtiani

Green Genomics Lab, New York University Abu Dhabi

Citation

@article{elfnet2026,
  title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
  author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
  journal={Forthcoming},
  year={2026}
}

Contact

Kourosh Salehi-Ashtiani — ksa3@nyu.edu
