LTX-Video 2B (distilled) → Core AI — the zoo's first VIDEO model

Lightricks/LTX-Video, config ltxv-2b-0.9.6-distilled — text → video via an 8-step distilled flow-matching DiT. All three neural nets run as Core AI .aimodel bundles; only the FlowMatch sampler loop runs on host.

This repo holds the Core AI bundles (each .aimodel is a directory). The conversion scripts, runner, and Mac app live in coreai-model-zoo (conversion/ltxvideo/, apps/CoreAIVideo/).

Sample

512×768 · 49 frames · 8 steps · ~14 s on a Mac GPU (Apple silicon). Prompt: "A clear glass of water on a wooden table, slow motion droplet falling into it creating ripples, cinematic."

Bundles

| net | shape (demo 512×768×49f) | dtype, size | bundle |
|---|---|---|---|
| T5-XXL text encoder | ids(1,256)+mask(1,256) → (1,256,4096) | bf16, 8.9 G | `t5_bf16.aimodel` |
| DiT denoiser (one step) | latent(1,N,128)+grid(1,3,N)+text(1,256,4096)+mask(1,256)+t(1,1) → (1,N,128) | fp16, 3.6 G | `dit_fp16.aimodel` |
| Causal video VAE decoder | latent(1,128,lf,lh,lw)+t(1,) → pixels(1,3,F,H,W) | fp16, 1.0 G | `vae_fp16.aimodel` |

N = lf·lh·lw, lf=(F-1)//8+1, lh=H/32, lw=W/32 (VAE 32× spatial / 8× temporal). DiT + VAE are fixed-shape (here 512×768×49f → N=2688) — re-convert per target resolution; T5 is resolution-independent (seq 256), so t5_bf16 is reused. At guidance_scale=1 (distilled) CFG is off and stg_scale=0, so the DiT runs batch-1, single-conditioning.
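The shape arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `ltx_latent_shape` is made up for illustration, and the divisibility check on H and W is an assumption read off the exact divisions in the formula, not something taken from the conversion scripts.

```python
from typing import Tuple


def ltx_latent_shape(frames: int, height: int, width: int) -> Tuple[int, int, int, int]:
    """Return (lf, lh, lw, N) for an output of `frames` x `height` x `width`."""
    # Assumption: H and W must divide evenly by the 32x spatial factor.
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    lf = (frames - 1) // 8 + 1   # 8x temporal compression, first frame kept
    lh = height // 32            # 32x spatial compression
    lw = width // 32
    return lf, lh, lw, lf * lh * lw


if __name__ == "__main__":
    # Demo resolution from this card: 512x768, 49 frames.
    print(ltx_latent_shape(49, 512, 768))
```

For the demo configuration this gives lf=7, lh=16, lw=24 and N=2688, the token count the DiT and VAE bundles here were converted for.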

Numerics

Per-net converted-vs-eager cosine is 1.000000 for T5, DiT, and VAE. The DiT also reproduces torch on each of the 8 real sampler steps (cos 1.000000, max|Δ| ~1e-3). End-to-end pixel cosine against a reference is only ~0.93, but that is stochastic-sampler variance, not conversion error: two torch runs (MPS vs CPU, same seed) also agree only to cos 0.9325. Gate on per-step cosine plus visual inspection, not on end-to-end pixel cosine.
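A per-step gate of this kind might look like the sketch below. It works on flattened tensors passed as plain Python sequences; the function names and the thresholds (`min_cos`, `max_delta`) are illustrative assumptions, not the values used by the conversion scripts.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened tensors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def steps_pass(ref_steps, test_steps, min_cos=0.9999, max_delta=1e-2) -> bool:
    """True if every sampler step matches the reference within tolerance.

    Thresholds are hypothetical defaults chosen for this sketch.
    """
    return all(
        cosine(r, t) >= min_cos and max_abs_diff(r, t) <= max_delta
        for r, t in zip(ref_steps, test_steps)
    )
```

Feeding both runtimes the same input latent at every step keeps the comparison independent of sampler drift, which is why it is a tighter check than comparing final pixels.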

T5 must be bf16 or fp32: the encoder overflows in fp16, which shows up as washed-out video. bf16 keeps fp32's exponent range at half the size. DiT and VAE are clean in fp16. The shipped set totals 13.5 G (vs 27 G in fp32).
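The range difference between fp16 and bf16 can be illustrated with the standard library alone. This sketch does not touch the T5 encoder; it only shows that a value above the fp16 maximum (65504) cannot be stored in fp16 but survives a bf16 round trip with reduced precision. The bf16 conversion here truncates a float32 to its top 16 bits, which is a simplification (real converters usually round to nearest), and the helper names are hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(x: float) -> bool:
    """True if x can be packed as a finite fp16 value."""
    try:
        struct.pack("<e", x)  # 'e' is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Round-trip x through bf16 by keeping the top 16 bits of its float32 form.

    Simplification: truncates instead of rounding to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    activation = 70000.0  # illustrative magnitude, not a measured T5 value
    print(fits_fp16(activation))  # overflow check for fp16
    print(to_bf16(activation))    # coarse, but finite, in bf16
```

bf16 keeps the 8 exponent bits of fp32 and gives up mantissa bits instead, so large activations lose precision rather than overflowing.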

Run it

The 3 bundles run on coreai.runtime. For GPU, load each one with an explicit SpecializationOptions.default(); AIModel.load(path, None) trips an MPSGraph error. Keep the AIModel refs alive. The host reuses LTX's real FlowMatch sampler / patchify / indices_grid / decode-noise. See conversion/ltxvideo/_run_coreai.py (CLI) and apps/CoreAIVideo/ (a SwiftUI Mac app: type a prompt → video, ~14 s).
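To show what the host-side loop is responsible for, here is a generic Euler flow-matching update in plain Python. It is a sketch under stated assumptions, not LTX's sampler and not the code in _run_coreai.py: `denoise` is a hypothetical stand-in for one call into the DiT bundle, the latent is a flat list of floats, and the linear sigma schedule in the usage example is illustrative only.

```python
from typing import Callable, List, Sequence


def euler_flow_sample(
    latent: Sequence[float],
    sigmas: Sequence[float],
    denoise: Callable[[List[float], float], List[float]],
) -> List[float]:
    """Integrate a flow-matching ODE with one Euler step per sigma interval.

    `sigmas` holds len(steps)+1 values, from the noisiest level down to 0.
    `denoise(x, sigma)` returns the predicted velocity for latent x.
    """
    x = list(latent)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        velocity = denoise(x, sigma)
        # Euler update: move along the predicted velocity by the sigma delta.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]
    return x


if __name__ == "__main__":
    # 8 steps need 9 sigma values; a linear schedule is assumed here.
    sigmas = [1.0 - i / 8 for i in range(9)]
    # Dummy denoiser with constant velocity, standing in for the DiT call.
    print(euler_flow_sample([2.0, 3.0], sigmas, lambda x, s: [1.0] * len(x)))
```

Only this loop and the surrounding tensor bookkeeping stay on the host; each `denoise` call corresponds to one fixed-shape DiT invocation.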

On-device note

iPhone is a stretch: T5-XXL is 4.76 B params and the DiT's video-latent attention working set grows with frame count. Mac is the shipped path.
