LTX-Video 2B (distilled) - Core AI - the zoo's first VIDEO model
Lightricks/LTX-Video, config
ltxv-2b-0.9.6-distilled: text → video via an 8-step distilled flow-matching DiT. All three
neural nets run as Core AI .aimodel bundles; only the FlowMatch sampler loop runs on the host.
This repo holds the Core AI bundles (each .aimodel is a directory). The conversion scripts,
runner, and Mac app are in the coreai-model-zoo repo
(conversion/ltxvideo/, apps/CoreAIVideo/).
Sample
512×768 · 49 frames · 8 steps · ~14 s on a Mac GPU (Apple silicon). Prompt: "A clear glass of water on a wooden table, slow motion droplet falling into it creating ripples, cinematic."
Bundles
| net | shape (demo 512×768×49f) | dtype, size | bundle |
|---|---|---|---|
| T5-XXL text encoder | ids(1,256)+mask(1,256) → (1,256,4096) | bf16, 8.9 G | t5_bf16.aimodel |
| DiT denoiser (one step) | latent(1,N,128)+grid(1,3,N)+text(1,256,4096)+mask(1,256)+t(1,1) → (1,N,128) | fp16, 3.6 G | dit_fp16.aimodel |
| Causal video VAE decoder | latent(1,128,lf,lh,lw)+t(1,) → pixels(1,3,F,H,W) | fp16, 1.0 G | vae_fp16.aimodel |
N = lf·lh·lw, where lf = (F-1)//8+1, lh = H/32, lw = W/32 (the VAE compresses 32× spatially and
8× temporally). The DiT and VAE are fixed-shape (here 512×768×49f → N=2688), so re-convert them
for each target resolution; T5 is resolution-independent (seq 256), so t5_bf16 is reused. At
guidance_scale=1 (distilled), CFG is off and stg_scale=0, so the DiT runs batch-1 with a single
conditioning.
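The shape arithmetic above can be checked before re-converting for a new resolution. This is a minimal sketch: the function names `latent_grid` and `dit_token_count` are hypothetical helpers, not part of the conversion scripts, and rejecting sizes that are not multiples of 32 is an assumption drawn from the H/32 and W/32 terms.

```python
def latent_grid(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (lf, lh, lw) for 8x temporal / 32x spatial VAE compression."""
    # Assumption: H and W must divide evenly by 32, since lh = H/32 and lw = W/32.
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    lf = (frames - 1) // 8 + 1
    lh = height // 32
    lw = width // 32
    return lf, lh, lw


def dit_token_count(frames: int, height: int, width: int) -> int:
    """Return N, the number of latent tokens the DiT sees per step."""
    lf, lh, lw = latent_grid(frames, height, width)
    return lf * lh * lw


# Demo shape from the table: 512x768, 49 frames.
print(latent_grid(49, 512, 768))      # (7, 16, 24)
print(dit_token_count(49, 512, 768))  # 2688
```

Because N is baked into the DiT and VAE bundles, a different (F, H, W) yields a different N and therefore needs its own converted bundles.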
Numerics
Per-net converted-vs-eager cosine = 1.000000 (T5, DiT, VAE). The DiT also reproduces torch on every one of the 8 real sampler steps (cos 1.000000, max|Δ| ~1e-3). End-to-end pixel cosine vs a reference is ~0.93, but that is stochastic-sampler variance, not error: two torch runs (MPS vs CPU, same seed) also give cos 0.9325. Gate on per-step cosine plus a visual check.
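A per-step gate of this kind could look like the sketch below. It is pure Python on flat lists so it runs anywhere; the helper names and both thresholds (`min_cos`, `max_delta`) are illustrative assumptions, not the values used in the conversion scripts.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened tensors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero tensor")
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference, the max|delta| figure."""
    return max(abs(x - y) for x, y in zip(a, b))


def step_passes(ref: Sequence[float], test: Sequence[float],
                min_cos: float = 0.9999, max_delta: float = 5e-3) -> bool:
    """Gate one sampler step: reference output vs converted output.

    Both thresholds are hypothetical; pick them from your own measurements.
    """
    if len(ref) != len(test):
        raise ValueError("tensors must have the same number of elements")
    return cosine(ref, test) >= min_cos and max_abs_diff(ref, test) <= max_delta
```

Running the gate on each of the 8 DiT outputs, rather than on final pixels, keeps sampler variance out of the comparison.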
T5 must be bf16 or fp32: the encoder overflows in fp16 (washed-out video), while bf16 has fp32's exponent range at half the size. DiT + VAE are fp16-clean. The ship set is 13.5 G (vs 27 G in fp32).
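The range difference is easy to see with the standard library alone. In this sketch the value 70000.0 is an arbitrary stand-in for a large activation, not a number measured from T5, and `to_bf16` emulates bf16 by truncating an fp32 bit pattern (real hardware usually rounds to nearest).

```python
import struct


def fits_fp16(x: float) -> bool:
    """True if x can be packed as an IEEE 754 half (max finite value 65504)."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the fp32 encoding.

    bf16 keeps fp32's 8 exponent bits and drops 16 mantissa bits, so the
    range is preserved and only precision is lost. Truncation is a
    simplification of the rounding real hardware performs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


big = 70000.0              # hypothetical activation above the fp16 maximum
print(fits_fp16(big))      # False
print(to_bf16(big))        # 69632.0

# Bundle sizes from the table, in the same "G" unit.
sizes = {"t5_bf16": 8.9, "dit_fp16": 3.6, "vae_fp16": 1.0}
print(round(sum(sizes.values()), 1))  # 13.5
```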
Run it
The 3 bundles run on coreai.runtime. Load them with an explicit SpecializationOptions.default()
for GPU, because AIModel.load(path, None) trips an MPSGraph error, and keep the AIModel refs
alive. The host reuses LTX's real FlowMatch sampler / patchify / indices_grid / decode-noise. See
conversion/ltxvideo/_run_coreai.py (CLI) and apps/CoreAIVideo/ (a SwiftUI Mac app: type a
prompt → video, ~14 s).
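To show what "only the sampler loop runs on host" means in practice, here is a generic Euler flow-matching loop. It is a sketch under stated assumptions, not LTX's scheduler code: the linear 8-step sigma schedule, the `denoise` callable signature, and the toy velocity field are all hypothetical stand-ins for the real sampler and the dit_fp16 bundle.

```python
from typing import Callable, List, Sequence

Latent = List[float]


def euler_flow_match(latent: Sequence[float],
                     sigmas: Sequence[float],
                     denoise: Callable[[Latent, float], Latent]) -> Latent:
    """Integrate from sigmas[0] down to sigmas[-1], one denoiser call per step."""
    x = list(latent)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        velocity = denoise(x, sigma)      # in the real runner: one DiT bundle call
        dt = sigma_next - sigma           # negative, since sigma decreases to 0
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Hypothetical linear schedule: 9 sigma values give 8 steps.
sigmas = [1.0 - i / 8 for i in range(9)]

# Toy check: on the straight path x = (1 - sigma) * data + sigma * noise the
# velocity is the constant (noise - data), so Euler recovers data exactly.
data = [0.5, 0.25]
noise = [1.0, -2.0]


def toy_denoise(x: Latent, sigma: float) -> Latent:
    return [n - d for n, d in zip(noise, data)]


print(euler_flow_match(noise, sigmas, toy_denoise))  # [0.5, 0.25]
```

In the actual pipeline the loop body would also carry the text embedding, mask, and indices grid into each DiT call, and the final latent would go to the VAE decoder.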
On-device note
iPhone is a stretch: T5-XXL is 4.76 B params and the DiT's video-latent attention working set grows with frame count. Mac is the shipped path.