SLAF (Sparse Lazy Array Format)
SLAF is a high-performance format for single-cell transcriptomics data built on top of the Lance table format and Polars. For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch, but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.
pip install slafdb[ml]
Why SLAF?
Single-cell transcriptomics datasets have scaled 2,000-fold in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
Traditional analytic workloads are stuck in in-memory, single-node operations. Today we need to run cell/gene filtering, normalization, PCA/UMAP, and differential expression at 2,000x that scale.
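A quick back-of-the-envelope estimate makes the scale concrete. Assuming roughly 2,000 expressed genes per cell (an illustrative figure, not from the SLAF docs) and a CSR sparse matrix with float32 values and int32 column indices, the in-memory approach breaks down well before 100M cells:

```python
# Rough memory estimate for a CSR sparse expression matrix.
# Assumption (illustrative): ~2,000 nonzero genes per cell,
# float32 values + int32 column indices = 8 bytes per nonzero.

def csr_gib(n_cells: int, nnz_per_cell: int = 2_000) -> float:
    nnz = n_cells * nnz_per_cell
    # data + indices, plus one int64 row pointer per cell boundary
    bytes_total = nnz * (4 + 4) + (n_cells + 1) * 8
    return bytes_total / 2**30

print(f"50k cells:  {csr_gib(50_000):8.1f} GiB")       # fits on a laptop
print(f"100M cells: {csr_gib(100_000_000):8.1f} GiB")  # does not fit in RAM
```

Under these assumptions, 50k cells is under 1 GiB while 100M cells is on the order of 1.5 TiB, which is why network, storage, and memory all become bottlenecks.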
New AI-native workloads have arrived:
- cell typing with nearest neighbor search on embeddings
- transformer-based foundation model training with efficient tokenization
- distributed training that streams random batches concurrently across nodes or GPUs
For these, we need cloud-native, zero-copy, query-in-place storage (without maintaining multiple copies per user, workload, application, or node) while retaining the numpy-like sparse matrix slicing and scanpy pipelines we already use.
Who is SLAF for?
- Bioinformaticians: struggling with OOM errors and data transfer on 10M+ cell datasets. SLAF eliminates the bottleneck with lazy evaluation.
- Foundation Model Builders: SLAF enables cloud-native streaming and removes data duplication.
- Tech Leaders & Architects: SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user.
- Tool Builders: SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences.
- Atlas Builders: SLAF provides cloud-native, zero-copy storage for global distribution.
- Data Integrators: SLAF's SQL-native design enables complex data integration with pushdown optimization.
Quick examples
Query with SQL (no full download):
from slaf import SLAFArray
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
SELECT
cytokine,
cell_type,
AVG(gene_count) as avg_gene_count
FROM cells
WHERE donor = 'Donor10'
AND cytokine IN ('C5a', 'CD40L')
GROUP BY cytokine, cell_type
ORDER BY cytokine, avg_gene_count DESC
""")
Lazy Scanpy-style slicing:
from slaf.integrations import read_slaf
adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
(
(adata.obs.cell_type == "CD8 Naive") &
(adata.obs.cytokine == "C5a") &
(adata.obs.donor == "Donor10")
), :
]
expression = subset[:10, :].X.compute() # Only now is data loaded
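Each slice above builds up a query plan rather than touching data; nothing is read until `.compute()`. A minimal sketch of that lazy-evaluation pattern in plain Python (not SLAF's actual implementation, which pushes the recorded predicates down into Lance):

```python
class LazyRows:
    """Defers filtering until compute() is called."""

    def __init__(self, load, predicates=()):
        self._load = load              # callable that actually reads the data
        self._predicates = predicates  # filters recorded so far

    def select(self, predicate):
        # Record the predicate; perform no I/O yet.
        return LazyRows(self._load, self._predicates + (predicate,))

    def compute(self):
        # Only now is data loaded, then all recorded filters applied.
        rows = self._load()
        for pred in self._predicates:
            rows = [r for r in rows if pred(r)]
        return rows

loads = []

def load():
    loads.append(1)  # track how many times data is actually read
    return [1, 2, 3, 4]

subset = LazyRows(load).select(lambda r: r % 2 == 0)
assert loads == []           # slicing alone triggered no I/O
print(subset.compute())      # load happens here, exactly once
```

Chaining further `select` calls stays free; the cost is paid once, at `compute()`, over the smallest possible result.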
Stream tokenized batches for training:
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
slaf_array=slaf_array,
tokenizer_type="geneformer",
batch_size=32,
max_genes=2048,
vocab_size=50000,
prefetch_batch_size=1_000_000
)
for batch in dataloader:
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
# Your training code here
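The "multi-threaded async prefetcher" mentioned above can be pictured as a background thread filling a bounded queue while the training loop consumes from it. A simplified sketch of the idea (SLAF's internals are more sophisticated):

```python
import queue
import threading

def prefetching_loader(batch_iter, prefetch=4):
    """Yield batches from batch_iter, fetched ahead by a background thread."""
    q = queue.Queue(maxsize=prefetch)  # bounded: caps memory held in flight
    SENTINEL = object()

    def worker():
        for batch in batch_iter:
            q.put(batch)  # blocks when the buffer is full (backpressure)
        q.put(SENTINEL)   # signal exhaustion

    threading.Thread(target=worker, daemon=True).start()
    while (batch := q.get()) is not SENTINEL:
        yield batch

# Toy usage: I/O-bound fetching overlaps with the consumer's work.
batches = ({"input_ids": [i] * 4} for i in range(3))
for batch in prefetching_loader(batches):
    print(batch["input_ids"])
```

The bounded queue is the key design choice: the producer runs ahead just far enough to hide fetch latency, but backpressure prevents it from buffering the whole dataset.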