mlx-video-with-audio
Generate videos with synchronized audio on Apple Silicon using MLX. Supports text-to-video (T2V) and image-to-video (I2V) generation with the LTX-2 model.
Features
- Text-to-Video (T2V) generation with synchronized audio
- Image-to-Video (I2V) generation with synchronized audio
- Video-only generation (without audio)
- Two-stage generation pipeline for high-quality output
- 2x spatial upscaling for images and videos
- Optimized for Apple Silicon using MLX
- Cross-modal attention for audio-video synchronization
Installation
Install from PyPI:
pip install mlx-video-with-audio
Or with uv:
uv pip install mlx-video-with-audio
Install from source:
pip install git+https://github.com/james-see/mlx-video-with-audio.git
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python >= 3.11
- MLX >= 0.22.0
- ffmpeg (for audio-video muxing)
Install ffmpeg if not already installed:
brew install ffmpeg
Supported Models
- LTX-2 — 19B parameter video generation model from Lightricks with audio generation capabilities; two-stage pipeline with 2x spatial upscaling
- Wan2.1 — 1.3B / 14B parameter T2V models (single-model pipeline); flow-matching diffusion with classifier-free guidance
- Wan2.2 — T2V-14B, TI2V-5B, and I2V-14B models (dual-model pipeline); flow-matching diffusion with classifier-free guidance
All model families support text-to-video generation and are optimized for Apple Silicon using MLX.
Recommended: for LTX-2, use the unified MLX model notapalindrome/ltx2-mlx-av (~42GB). It avoids downloading the full Lightricks/LTX-2 (~150GB) by using the MLX-community Gemma model for the text encoder.
Important: the Gemma text encoder is required for the normal generation embeddings (video/audio conditioning), even when prompt enhancement is disabled.
Prompt enhancement is a separate, optional step that rewrites the prompt text before generation. If prompt enhancement fails, generation falls back to the original prompt automatically.
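The prompt-enhancement fallback described above can be sketched as follows (the helper and parameter names are hypothetical, not the package's actual internals):

```python
def maybe_enhance(prompt, enhancer=None):
    """Optionally rewrite a prompt; fall back to the original on any failure.

    `enhancer` stands in for the Gemma-based rewriter (hypothetical name).
    """
    if enhancer is None:
        return prompt
    try:
        return enhancer(prompt)
    except Exception:
        # Enhancement failed: generate with the original prompt instead.
        return prompt
```

With no enhancer (or a failing one), the original prompt passes through unchanged, so generation always has valid text to condition on.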
LTX-2
This project uses uv for dependency management and isolation.
Text-to-Video with Audio (T2V+Audio)
Generate videos with synchronized audio from text descriptions:
uv run mlx_video.generate_av --prompt "A jazz band playing in a smoky club"
With custom settings:
uv run mlx_video.generate_av \
--prompt "Ocean waves crashing on a beach at sunset" \
--height 768 \
--width 768 \
--num-frames 65 \
--seed 123 \
--output-path my_video.mp4
Image-to-Video with Audio (I2V+Audio)
Generate videos from an input image with synchronized audio:
uv run mlx_video.generate_av \
--prompt "A person dancing to upbeat music" \
--image photo.jpg \
--image-strength 0.8
Video-Only Generation (no audio)
For video generation without audio:
uv run mlx_video.generate --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" -n 100 --width 768
CLI Reference
generate_av (Audio-Video)
| Option | Default | Description |
|---|---|---|
| --prompt, -p | (required) | Text description of the video/audio |
| --height, -H | 512 | Output height (must be divisible by 64) |
| --width, -W | 512 | Output width (must be divisible by 64) |
| --num-frames, -n | 65 | Number of frames (must be 1 + 8*k) |
| --seed, -s | 42 | Random seed for reproducibility |
| --fps | 24 | Frames per second |
| --output-path | output_av.mp4 | Output video path |
| --output-audio | (auto) | Output audio path (default: same as video, with .wav extension) |
| --image, -i | None | Path to conditioning image for I2V |
| --image-strength | 1.0 | Conditioning strength (1.0 = full denoise) |
| --image-frame-idx | 0 | Frame index to condition (0 = first frame) |
| --enhance-prompt | false | Enhance the prompt using Gemma |
| --tiling | auto | VAE tiling mode (auto/none/default/aggressive/conservative) |
| --model-repo | notapalindrome/ltx2-mlx-av | Model repo (~42GB unified MLX; no Lightricks download) |
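The table's constraints on height, width, and frame count can be encoded in a small sanity-check helper (illustrative only; not part of the package):

```python
def validate_av_args(height, width, num_frames):
    # generate_av constraints: spatial dims divisible by 64,
    # frame count of the form 1 + 8*k.
    if height % 64 or width % 64:
        raise ValueError("height and width must be divisible by 64")
    if (num_frames - 1) % 8:
        raise ValueError("num_frames must be 1 + 8*k (e.g. 33, 65, 97)")
    return True
```

For example, the defaults (512x512, 65 frames) pass, while 500x512 or 64 frames would be rejected before any model loading happens.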
generate (Video-Only)
| Option | Default | Description |
|---|---|---|
| --prompt, -p | (required) | Text description of the video |
| --height, -H | 512 | Output height (must be divisible by 64) |
| --width, -W | 512 | Output width (must be divisible by 64) |
| --num-frames, -n | 100 | Number of frames |
| --seed, -s | 42 | Random seed for reproducibility |
| --fps | 24 | Frames per second |
| --output, -o | output.mp4 | Output video path |
| --save-frames | false | Save individual frames as images |
| --model-repo | notapalindrome/ltx2-mlx-av | Model repo (~42GB unified MLX; no Lightricks download) |
convert (Model Conversion)
Convert HuggingFace models to unified MLX format for faster loading:
uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av
| Option | Default | Description |
|---|---|---|
| --hf-path | Lightricks/LTX-2 | HuggingFace model path or repo ID |
| --mlx-path | mlx_model | Output path for the MLX model |
| --dtype | bfloat16 | Target dtype (float16/float32/bfloat16) |
| --no-audio | false | Exclude audio components (video-only model) |
Using Pre-Converted MLX Models
For faster loading, you can use pre-converted MLX models instead of converting on-the-fly.
Option 1: Use a Pre-Converted Model from HuggingFace
# Use a community-converted MLX model (replace with actual repo)
uv run mlx_video.generate_av \
--prompt "A jazz band playing in a smoky club" \
--model-repo username/ltx2-mlx-av
Option 2: Convert Your Own Model
1. Convert the model (one-time, ~42GB output):
uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av
2. Use the converted model:
uv run mlx_video.generate_av \
--prompt "A jazz band playing in a smoky club" \
--model-repo ~/models/ltx2-mlx-av
Benefits of Unified MLX Format
- Faster loading: Single file vs multiple scattered files
- Pre-sanitized weights: No on-the-fly key transformation
- Smaller footprint: Only includes necessary weights (no quantized variants)
- Easy sharing: Upload to HuggingFace for others to use
How It Works (LTX-2)
Video Generation Pipeline
The pipeline uses a two-stage generation process:
- Stage 1: Generate at half resolution (e.g., 384x384) with 8 denoising steps
- Upsample: 2x spatial upsampling via LatentUpsampler
- Stage 2: Refine at full resolution (e.g., 768x768) with 3 denoising steps
- Decode: VAE decoder converts latents to RGB video
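The two-stage schedule above implies simple resolution math: stage 1 runs at half the requested size, and stage 2 refines at full size after upsampling. A sketch (step counts taken from the pipeline description; actual sizing logic may differ):

```python
def stage_plan(height, width):
    # Stage 1 denoises at half spatial resolution (8 steps); the
    # LatentUpsampler doubles it, and Stage 2 refines at the
    # requested output size (3 steps).
    stage1 = (height // 2, width // 2, 8)   # (h, w, denoising steps)
    stage2 = (height, width, 3)
    return stage1, stage2
```

For a 768x768 request, this yields a 384x384 stage 1 followed by a 768x768 refinement pass, matching the example resolutions above.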
Audio Generation Pipeline
Audio is generated in sync with video through:
- Joint Denoising: Video and audio latents are denoised together
- Cross-Modal Attention: Bidirectional attention between video and audio
- Audio Decoding: Audio VAE converts latents to mel spectrogram
- Vocoder: HiFi-GAN converts mel spectrogram to waveform
- Muxing: ffmpeg combines video and audio
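The final muxing step can be reproduced manually with ffmpeg. A sketch that builds the command (file names are placeholders, not the pipeline's internal call):

```python
import subprocess

def build_mux_cmd(video_path, audio_path, out_path):
    # Stream-copy the video, encode the WAV audio to AAC for MP4
    # compatibility, and trim to the shorter of the two streams.
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",
        "-c:a", "aac",
        "-shortest",
        out_path,
    ]

# To actually mux:
# subprocess.run(build_mux_cmd("output_av.mp4", "output_av.wav", "muxed.mp4"), check=True)
```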
Architecture
Text Prompt
│
▼
┌─────────────────────────────────────────────┐
│ Text Encoder (Gemma 3 12B) │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Video │ │ Audio │ │
│ │ Connector │ │ Connector │ │
│ │ (4096-dim) │ │ (2048-dim) │ │
│ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────────┼──────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ LTX Transformer (48 layers) │
│ ┌─────────────┐ ◄──► ┌─────────────┐ │
│ │ Video Path │ │ Audio Path │ │
│ │ (4096-dim) │ │ (2048-dim) │ │
│ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────────┼──────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Video VAE │ │ Audio VAE │
│ Decoder │ │ Decoder │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
Video Frames ┌─────────────┐
│ Vocoder │
│ (HiFi-GAN) │
└──────┬──────┘
│
▼
Audio Waveform
Model Specifications
Video Path
- Transformer: 48 layers, 32 attention heads, 128 dim per head (4096 total)
- Latent channels: 128
- Text encoder: Gemma 3 with 3840-dim features, projected to 4096-dim
- RoPE: Split mode with double precision
Audio Path
- Transformer: 48 layers, 32 attention heads, 64 dim per head (2048 total)
- Latent channels: 8 (patchified to 128)
- Mel bins: 16 (latent), 64 (decoded)
- Sample rate: 24kHz output, 16kHz internal
- Audio latents per second: 25
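Given 25 audio latents per second, the number of audio latents for a clip follows from its duration. A rough sketch (rounding in the real pipeline may differ):

```python
import math

def audio_latent_count(num_frames, fps=24, latents_per_second=25):
    # Clip duration in seconds, then latents at 25 per second,
    # rounded up to cover the full clip.
    duration = num_frames / fps
    return math.ceil(latents_per_second * duration)
```

A default 65-frame clip at 24 fps lasts about 2.7 s, so roughly 68 audio latents cover it.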
Cross-Modal Attention
- Bidirectional attention between video and audio paths
- Separate timestep conditioning for cross-attention
- Gated attention output for controlled mixing
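The gated mixing of cross-attention output can be sketched in scalar form. This is illustrative: the gate function and per-channel application here are assumptions, and the real model operates on tensors:

```python
import math

def gated_residual(x, attn_out, gate):
    # A tanh gate scales how much cross-modal attention output is
    # mixed back into each path; gate = 0 leaves the path unchanged.
    g = math.tanh(gate)
    return [xi + g * ai for xi, ai in zip(x, attn_out)]
```

Learning the gate lets the model start near zero (paths independent) and gradually open cross-modal influence during training.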
Project Structure
mlx_video/
├── generate.py # Video-only generation pipeline
├── generate_av.py # Audio-video generation pipeline
├── convert.py # Weight conversion (PyTorch -> MLX)
├── postprocess.py # Video post-processing utilities
├── utils.py # Helper functions
├── conditioning/ # I2V conditioning utilities
└── models/
└── ltx/
├── ltx.py # Main LTXModel (DiT transformer)
├── config.py # Model configuration
├── transformer.py # Transformer blocks with cross-modal attention
├── attention.py # Multi-head attention with RoPE
├── text_encoder.py # Text encoder with video/audio connectors
├── upsampler.py # 2x spatial upsampler
├── video_vae/ # Video VAE encoder/decoder
└── audio_vae/ # Audio VAE decoder and vocoder
Tips for Best Results
- Prompt Quality: Use detailed, descriptive prompts that include both visual and audio elements
- Frame Count: Use frame counts of the form 1 + 8*k (e.g. 33, 65, 97) for optimal quality
- Resolution: Higher resolutions (768x768) produce better results but require more memory
- Tiling: For large videos, use --tiling aggressive to reduce memory usage
- Audio Sync: Audio is automatically synchronized to the video duration
License
MIT