TARA-WorldModel-VICReg
Joint environment–proteome embedding model using VICReg (Variance-Invariance-Covariance Regularization) self-supervised learning, applied to the TARA Oceans metagenomic dataset. This model aligns environmental and Pfam protein domain representations in a shared latent space.
Model Description
Architecture
Two two-layer MLP encoders (one per modality), trained with a VICReg alignment loss:
Environment branch: Input(env_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
PFAM branch: Input(pfam_dim) → Linear(hidden) → ReLU → Dropout(0.3) → Linear(32)
- Latent dimension: 32
- Parameters: ~53K–64K depending on PFAM input dimensionality
- VICReg loss weights: variance = 25.0, invariance = 25.0, covariance = 1.0
- Prediction head alpha: 1.0
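The layer layout and loss weights listed above could be sketched in PyTorch roughly as follows. This is a minimal sketch, not the released training code: the hidden width (`hidden=128`), the input sizes, the names `Branch` and `vicreg_loss`, the `eps` value, and the target standard deviation of 1 are assumptions, with the loss terms following the formulation in Bardes et al. (2022). The prediction head is omitted because the card does not describe it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Branch(nn.Module):
    """One encoder branch: Linear -> ReLU -> Dropout(0.3) -> Linear(32)."""

    def __init__(self, in_dim: int, hidden: int = 128, latent: int = 32):
        super().__init__()  # hidden=128 is an assumed width, not from the card
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, latent),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def vicreg_loss(z_a, z_b, var_w=25.0, inv_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg loss on paired embeddings of shape (batch, latent)."""
    n, d = z_a.shape

    def variance(z):
        # Hinge that pushes each latent dimension's std toward at least 1.
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()

    def covariance(z):
        # Sum of squared off-diagonal covariance entries, scaled by 1/d.
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    invariance = F.mse_loss(z_a, z_b)  # pulls paired embeddings together
    return (
        inv_w * invariance
        + var_w * (variance(z_a) + variance(z_b))
        + cov_w * (covariance(z_a) + covariance(z_b))
    )


torch.manual_seed(0)
env_enc = Branch(in_dim=16)   # env_dim=16 is illustrative only
pfam_enc = Branch(in_dim=32)  # corresponds to the pfam32 variant
env_x, pfam_x = torch.randn(8, 16), torch.randn(8, 32)
z_env, z_pfam = env_enc(env_x), pfam_enc(pfam_x)
loss = vicreg_loss(z_env, z_pfam)
```

Each row of `env_x` and `pfam_x` is assumed to come from the same sample, so the invariance term aligns the two modalities while the variance and covariance terms guard against collapsed or redundant latent dimensions.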
Training Data
- Source: 1,151 of 1,810 total TARA Oceans samples, restricted to those with complete productivity data: chlorophyll-a (Chl-a), particulate organic carbon (POC), and normalized fluorescence line height (NFLH)
- Environmental features: Google Earth Engine oceanographic variables
- PFAM features: centered log-ratio (CLR) transformed domain abundances, reduced via PCA to 20, 32, or 64 dimensions
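The PFAM preprocessing described above (CLR transform followed by PCA) can be illustrated with a short NumPy sketch. The pseudocount of 1.0, the function names, and the synthetic abundance matrix are assumptions for illustration; the card does not state how zeros were handled or which PCA implementation was used.

```python
import numpy as np


def clr(counts: np.ndarray, pseudocount: float = 1.0) -> np.ndarray:
    """Centered log-ratio: log abundances minus their per-sample mean."""
    # The pseudocount (an assumption) avoids log(0) for absent domains.
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)


def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of x onto the top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Synthetic stand-in: 100 samples x 200 Pfam domains.
abundances = rng.poisson(5.0, size=(100, 200)).astype(float)
clr_abundances = clr(abundances)
pfam20 = pca_reduce(clr_abundances, n_components=20)
```

In a cross-validated setting, the PCA projection would normally be fit on the training folds only and then applied to the held-out fold, to avoid leaking held-out information into the features.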
Performance
This model represents an exploratory methodological approach that did not outperform the primary XGBoost bidirectional framework used in the ELF-NET study. Results are reported for transparency and reproducibility.
6-Fold Leave-One-Basin-Out (LOBO) CV
| Target | Joint Model R² | Env-Only Baseline R² | Cohen's d | p-value |
|---|---|---|---|---|
| POC | 0.532 | 0.422 | 0.026 | 0.38 |
| Chl-a | 0.516 | 0.561 | — | — |
| NFLH | 0.560 | 0.700 | — | — |
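The POC improvement (+0.110 R²) was not statistically significant (Cohen's d = 0.026, p = 0.38). For Chl-a and NFLH, the joint model scored below the environment-only baseline.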
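For readers unfamiliar with leave-one-basin-out splitting, the sketch below shows the general idea: every sample from one basin is held out in turn while the model trains on the rest. The basin labels and the function name `leave_one_basin_out` are hypothetical; the actual basin assignment used in the study is not given in this card.

```python
import numpy as np


def leave_one_basin_out(basins):
    """Yield (held_out_basin, train_idx, test_idx) for each unique basin."""
    basins = np.asarray(basins)
    for basin in np.unique(basins):
        test_mask = basins == basin
        yield basin, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical labels: six basins, two samples each, giving six folds.
basins = ["A", "B", "C", "D", "E", "F"] * 2
folds = list(leave_one_basin_out(basins))
for basin, train_idx, test_idx in folds:
    print(basin, "train:", len(train_idx), "test:", len(test_idx))
```

Because whole basins are withheld, each fold tests extrapolation to a region the model has never seen, which is a harder setting than random K-fold splitting.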
9-Fold Spatial Block CV (matching primary XGBoost design)
| PFAM dim | XGB Baseline R² | VICReg R² | ΔR² |
|---|---|---|---|
| pfam20 | 0.417 | −2.045 | −2.462 |
| pfam32 | 0.417 | −4.217 | −4.634 |
| pfam64 | 0.417 | −1.262 | −1.679 |
The MLP architecture produced catastrophically negative R² on spatially distinctive held-out basins (Mediterranean, mid-Pacific), where distribution shift defeats shallow neural networks. XGBoost's tree-based partitioning handles this regime far more effectively at N ≈ 1,100.
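A negative R² means the model's squared error on the held-out data exceeds that of simply predicting the held-out mean. The toy example below uses made-up numbers and the standard definition of R²; how per-fold scores were aggregated in the study is not stated in this card. It shows how a systematic offset, of the kind distribution shift can produce, drives R² well below zero even when predictions track the trend.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_pred = np.full(4, y_true.mean())  # always predict the mean
shifted_pred = y_true + 3.0            # correct trend, constant offset of 3

r2_mean = r2_score(y_true, mean_pred)        # 0.0
r2_shifted = r2_score(y_true, shifted_pred)  # 1 - 36/5 = -6.2
print(r2_mean, r2_shifted)
```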
Interpretation
The poor performance under spatial CV is driven by the architecture confound (MLP vs. XGBoost for small tabular data), not necessarily by absence of the PFAM alignment signal. A fair comparison would require XGBoost-with-VICReg-embeddings, which was not evaluated. The XGBoost bidirectional framework was retained as the primary modeling approach.
Repository Contents
- Model checkpoints (.pt files) for each fold and PFAM dimensionality
- Hyperparameter sweep results
- Per-fold training curves and metrics
Usage
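import torch

# Load a VICReg checkpoint onto the CPU.
# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
checkpoint = torch.load(
    "path/to/vicreg_checkpoint.pt",
    map_location="cpu",
    weights_only=False,
)
# The model weights are stored under the "model_state_dict" key.
state_dict = checkpoint["model_state_dict"]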
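Since this card does not document the module class or parameter names needed to rebuild the model, one practical next step is to inspect the tensor shapes in the state dict, which reveal the input, hidden, and latent sizes. The helper `summarize_state_dict` below is hypothetical, and the small `nn.Sequential` model only stands in for a loaded checkpoint.

```python
import torch
import torch.nn as nn


def summarize_state_dict(state_dict):
    """Return ({name: shape}, total element count) for a state dict."""
    shapes = {name: tuple(t.shape) for name, t in state_dict.items()}
    total = sum(t.numel() for t in state_dict.values())
    return shapes, total


# Stand-in for checkpoint["model_state_dict"]; sizes are illustrative.
demo_state = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32)
).state_dict()

shapes, total = summarize_state_dict(demo_state)
for name, shape in shapes.items():
    print(name, shape)
print("total parameters:", total)  # 3424 for this stand-in model
```

Applied to a real checkpoint, the total should fall in the ~53K–64K range quoted above if the state dict holds only the model parameters.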
Related Models
- TARA-XGBoost-Bidirectional — Primary bidirectional models used in the ELF-NET study
- algaGPT — Protein classification model used for proteome extraction
References
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
Authors
David R. Nelson, Kourosh Salehi-Ashtiani
Green Genomics Lab, New York University Abu Dhabi
Citation
@article{elfnet2026,
title={ELF-NET: Environment-Linked Functional Network for marine microalgal domain-environment coupling},
author={Nelson, David R. and Salehi-Ashtiani, Kourosh},
journal={Forthcoming},
year={2026}
}
Contact
Kourosh Salehi-Ashtiani — ksa3@nyu.edu