LTX-Video 2B (distilled) → Core AI: the zoo's first VIDEO model

Lightricks/LTX-Video, config ltxv-2b-0.9.6-distilled: text → video via an 8-step distilled flow-matching DiT. All three neural nets run as Core AI .aimodel bundles; only the FlowMatch sampler loop runs on the host.

This repo holds the Core AI bundles (each .aimodel is a directory). The conversion scripts, runner, and Mac app live in the coreai-model-zoo (conversion/ltxvideo/, apps/CoreAIVideo/).

Sample

512×768 · 49 frames · 8 steps · ~14 s on a Mac GPU (Apple silicon). Prompt: "A clear glass of water on a wooden table, slow motion droplet falling into it creating ripples, cinematic."

Bundles

| net | shape (demo 512×768×49f) | dtype | size | bundle |
|---|---|---|---|---|
| T5-XXL text encoder | ids(1,256)+mask(1,256) → (1,256,4096) | bf16 | 8.9 G | t5_bf16.aimodel |
| DiT denoiser (one step) | latent(1,N,128)+grid(1,3,N)+text(1,256,4096)+mask(1,256)+t(1,1) → (1,N,128) | fp16 | 3.6 G | dit_fp16.aimodel |
| Causal video VAE decoder | latent(1,128,lf,lh,lw)+t(1,) → pixels(1,3,F,H,W) | fp16 | 1.0 G | vae_fp16.aimodel |

N = lf·lh·lw, lf = (F-1)//8+1, lh = H/32, lw = W/32 (the VAE compresses 32× spatially and 8× temporally). The DiT and VAE are fixed-shape (here 512×768×49f → N=2688), so re-convert them for each target resolution; T5 is resolution-independent (seq 256), so t5_bf16 is reused. At guidance_scale=1 (distilled), CFG is off and stg_scale=0, so the DiT runs batch-1 with a single conditioning.
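As a quick check of the shape formulas above, here is a minimal sketch that computes the latent grid and token count for a given video size. The function name `latent_shape` and the divisibility check are illustrative assumptions, not part of the conversion scripts.

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Return (lf, lh, lw, N) for a pixel-space video of F x H x W.

    Uses the formulas from this card: 8x temporal and 32x spatial
    compression. The ValueError below is an assumption of this sketch:
    it treats H/32 and W/32 as exact divisions.
    """
    if height % 32 != 0 or width % 32 != 0:
        raise ValueError("height and width must be multiples of 32")
    lf = (frames - 1) // 8 + 1   # latent frames
    lh = height // 32            # latent height
    lw = width // 32             # latent width
    return lf, lh, lw, lf * lh * lw


# Demo shape from this card: 512x768, 49 frames.
print(latent_shape(49, 512, 768))  # (7, 16, 24, 2688)
```

Because N is baked into the DiT and VAE bundles, a helper like this is mainly useful for confirming which N a re-converted bundle will expect.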

Numerics

Per-net converted-vs-eager cosine = 1.000000 (T5, DiT, VAE). The DiT also reproduces torch on every one of the 8 real sampler steps (cos 1.000000, max|Δ| ~1e-3). End-to-end pixel cosine vs a reference is ~0.93, but that is stochastic-sampler variance, not conversion error: two torch runs (MPS vs CPU, same seed) also agree only to cos 0.9325. Gate on per-step cosine plus visual inspection.
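A per-step gate of this kind could look roughly like the sketch below, written in plain Python over flattened tensors. The helper names and the thresholds (`min_cos`, `max_delta`) are assumptions chosen for illustration; they are not the values used by the conversion scripts.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest elementwise absolute difference (max|delta|)."""
    return max(abs(x - y) for x, y in zip(a, b))


def step_passes(converted, reference, min_cos=0.9999, max_delta=5e-3):
    """Hypothetical gate: one sampler step passes if both metrics are in range."""
    return (cosine(converted, reference) >= min_cos
            and max_abs_diff(converted, reference) <= max_delta)
```

Running the gate on each of the 8 DiT outputs, rather than on final pixels, keeps sampler variance out of the comparison.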

T5 must be bf16 or fp32: the encoder overflows in fp16 (washed-out video), while bf16 has fp32's exponent range at half the size. DiT + VAE are fp16-clean. Ship set: 13.5 G (vs 27 G in fp32).
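The range difference is easy to see with a standard-library sketch. The value 70000.0 is an arbitrary stand-in for a large activation (it is not taken from T5), and the bf16 conversion here truncates rather than rounds, which is a simplification.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision.

    struct refuses to pack values beyond the fp16 range, so this sketch
    maps that case to +/-inf, which is what a hardware cast would yield.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round-trip through bfloat16 by keeping the top 16 bits of fp32.

    bf16 shares fp32's 8-bit exponent, so the range survives; only
    mantissa precision is lost. Truncation is used for simplicity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


big = 70000.0          # above the fp16 maximum of 65504
print(to_fp16(big))    # inf
print(to_bf16(big))    # finite, close to 70000
```

Once an activation becomes inf in fp16, everything downstream of it is corrupted, which is consistent with the washed-out output described above.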

Run it

The 3 bundles run on coreai.runtime. For GPU, load with an explicit SpecializationOptions.default(); AIModel.load(path, None) trips an MPSGraph error. Keep the AIModel refs alive. The host reuses LTX's real FlowMatch sampler / patchify / indices_grid / decode-noise. See conversion/ltxvideo/_run_coreai.py (CLI) and apps/CoreAIVideo/ (a SwiftUI Mac app: type a prompt → video, ~14 s).
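To illustrate what "sampler on host, nets in bundles" means, here is a simplified sketch of a host-side flow-matching loop using plain Euler steps. It is not the LTX sampler: the linear sigma schedule, the function names, and the `denoise` callable (standing in for one call to the DiT bundle) are all assumptions, and the real sampler uses its own timestep schedule.

```python
from typing import Callable, List, Sequence

# Stand-in for one DiT call: (latent, sigma) -> predicted velocity.
Velocity = Callable[[List[float], float], List[float]]


def linear_sigmas(num_steps: int) -> List[float]:
    """Hypothetical schedule: num_steps + 1 values from 1.0 down to 0.0."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def sample(denoise: Velocity, noise: Sequence[float],
           sigmas: Sequence[float]) -> List[float]:
    """Integrate from pure noise (sigma=1) to a clean latent (sigma=0)."""
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = denoise(x, sigma)  # one denoiser call per step
        # Euler update along the predicted velocity.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    noise = [1.0, 1.0, 1.0]
    # Toy denoiser that knows the exact velocity (noise - target).
    exact = lambda x, s: [n - t for n, t in zip(noise, target)]
    print(sample(exact, noise, linear_sigmas(8)))  # recovers target
```

With 8 steps the loop makes 8 denoiser calls, matching the step count of the distilled config; text encoding happens once before the loop and VAE decoding once after it.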

On-device note

iPhone is a stretch: T5-XXL is 4.76 B params and the DiT's video-latent attention working set grows with frame count. Mac is the shipped path.
