mlx-video-with-audio

Generate videos with synchronized audio on Apple Silicon using MLX. Supports text-to-video (T2V) and image-to-video (I2V) generation with the LTX-2 model.

Features

  • Text-to-Video (T2V) generation with synchronized audio
  • Image-to-Video (I2V) generation with synchronized audio
  • Video-only generation (without audio)
  • Two-stage generation pipeline for high-quality output
  • 2x spatial upscaling for images and videos
  • Optimized for Apple Silicon using MLX
  • Cross-modal attention for audio-video synchronization

Installation

Install from PyPI:

pip install mlx-video-with-audio

Or with uv:

uv pip install mlx-video-with-audio

Install from source:

pip install git+https://github.com/james-see/mlx-video-with-audio.git

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python >= 3.11
  • MLX >= 0.22.0
  • ffmpeg (for audio-video muxing)

Install ffmpeg if not already installed:

brew install ffmpeg

Supported Models

  • LTX-2 — 19B parameter video generation model from Lightricks
  • Wan2.1 — 1.3B / 14B parameter T2V models (single-model pipeline)
  • Wan2.2 — T2V-14B, TI2V-5B, and I2V-14B models (dual-model pipeline)

LTX-2

LTX-2 is a 19B parameter video generation model from Lightricks with audio generation capabilities.

Recommended: use the unified MLX model notapalindrome/ltx2-mlx-av (~42GB). It avoids downloading the full Lightricks/LTX-2 repository (~150GB) by using the MLX-community Gemma checkpoint for the text encoder.

Important: the Gemma text encoder is required for the generation embeddings (video/audio conditioning) even when prompt enhancement is disabled.
Prompt enhancement is a separate, optional step that rewrites the prompt text before generation. If prompt enhancement fails, generation falls back to the original prompt automatically.
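The fallback behavior can be sketched as follows (function names here are illustrative, not the package's actual API):

```python
def enhance_with_fallback(prompt, enhancer):
    """Try to rewrite the prompt; fall back to the original on any failure."""
    try:
        enhanced = enhancer(prompt)
        # Guard against empty or whitespace-only rewrites as well.
        return enhanced if enhanced and enhanced.strip() else prompt
    except Exception:
        return prompt

# Example: an enhancer that raises still yields a usable prompt.
def broken_enhancer(prompt):
    raise RuntimeError("Gemma rewrite failed")
```

Either way, generation always receives a non-empty prompt string.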

  • Text-to-video generation with multiple model families
  • LTX-2: Two-stage pipeline with 2x spatial upscaling
  • Wan2.1/2.2: Flow-matching diffusion with classifier-free guidance
  • Optimized for Apple Silicon using MLX

Usage

This project uses uv for dependency management and isolation.

Text-to-Video with Audio (T2V+Audio)

Generate videos with synchronized audio from text descriptions:

uv run mlx_video.generate_av --prompt "A jazz band playing in a smoky club"

With custom settings:

uv run mlx_video.generate_av \
    --prompt "Ocean waves crashing on a beach at sunset" \
    --height 768 \
    --width 768 \
    --num-frames 65 \
    --seed 123 \
    --output-path my_video.mp4

Image-to-Video with Audio (I2V+Audio)

Generate videos from an input image with synchronized audio:

uv run mlx_video.generate_av \
    --prompt "A person dancing to upbeat music" \
    --image photo.jpg \
    --image-strength 0.8

Video-Only Generation (no audio)

For video generation without audio:

uv run mlx_video.generate --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" -n 100 --width 768
Poodles demo

CLI Reference

generate_av (Audio-Video)

Option               Default                      Description
--prompt, -p         (required)                   Text description of the video/audio
--height, -H         512                          Output height (must be divisible by 64)
--width, -W          512                          Output width (must be divisible by 64)
--num-frames, -n     65                           Number of frames (must be 1 + 8*k)
--seed, -s           42                           Random seed for reproducibility
--fps                24                           Frames per second
--output-path        output_av.mp4                Output video path
--output-audio       (auto)                       Output audio path (default: same as video with .wav)
--image, -i          None                         Path to conditioning image for I2V
--image-strength     1.0                          Conditioning strength (1.0 = full denoise)
--image-frame-idx    0                            Frame index to condition (0 = first frame)
--enhance-prompt     false                        Enhance prompt using Gemma
--tiling             auto                         VAE tiling mode (auto/none/default/aggressive/conservative)
--model-repo         notapalindrome/ltx2-mlx-av   Model repo (~42GB unified MLX, no Lightricks download)
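The constraints on --height/--width and --num-frames can be checked before launching a long generation. A small standalone sketch (not part of the package's API):

```python
def validate_av_args(height, width, num_frames):
    """Check generate_av constraints: dims divisible by 64, frames = 1 + 8*k."""
    if height % 64 or width % 64:
        raise ValueError("height and width must be divisible by 64")
    if (num_frames - 1) % 8:
        raise ValueError("num_frames must be of the form 1 + 8*k")

def nearest_valid_frames(n):
    """Round an arbitrary frame count to the nearest valid 1 + 8*k value."""
    k = max(0, round((n - 1) / 8))
    return 1 + 8 * k
```

For example, a requested 100 frames rounds to 97, the nearest value of the form 1 + 8*k.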

generate (Video-Only)

Option            Default                      Description
--prompt, -p      (required)                   Text description of the video
--height, -H      512                          Output height (must be divisible by 64)
--width, -W       512                          Output width (must be divisible by 64)
--num-frames, -n  100                          Number of frames
--seed, -s        42                           Random seed for reproducibility
--fps             24                           Frames per second
--output, -o      output.mp4                   Output video path
--save-frames     false                        Save individual frames as images
--model-repo      notapalindrome/ltx2-mlx-av   Model repo (~42GB unified MLX, no Lightricks download)

convert (Model Conversion)

Convert HuggingFace models to unified MLX format for faster loading:

uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av

Option      Default           Description
--hf-path   Lightricks/LTX-2  HuggingFace model path or repo ID
--mlx-path  mlx_model         Output path for MLX model
--dtype     bfloat16          Target dtype (float16/float32/bfloat16)
--no-audio  false             Exclude audio components (video-only model)

Using Pre-Converted MLX Models

For faster loading, you can use pre-converted MLX models instead of converting on-the-fly.

Option 1: Use a Pre-Converted Model from HuggingFace

# Use a community-converted MLX model (replace with actual repo)
uv run mlx_video.generate_av \
    --prompt "A jazz band playing in a smoky club" \
    --model-repo username/ltx2-mlx-av

Option 2: Convert Your Own Model

  1. Convert the model (one time, ~42GB output):

uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av

  2. Use the converted model:

uv run mlx_video.generate_av \
    --prompt "A jazz band playing in a smoky club" \
    --model-repo ~/models/ltx2-mlx-av

Benefits of Unified MLX Format

  • Faster loading: Single file vs multiple scattered files
  • Pre-sanitized weights: No on-the-fly key transformation
  • Smaller footprint: Only includes necessary weights (no quantized variants)
  • Easy sharing: Upload to HuggingFace for others to use

How It Works (LTX-2)

Video Generation Pipeline

The pipeline uses a two-stage generation process:

  1. Stage 1: Generate at half resolution (e.g., 384x384) with 8 denoising steps
  2. Upsample: 2x spatial upsampling via LatentUpsampler
  3. Stage 2: Refine at full resolution (e.g., 768x768) with 3 denoising steps
  4. Decode: VAE decoder converts latents to RGB video
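The stage plan above can be written down as a small sketch. Step counts and the half-resolution start come from the list above; the function and tuple layout are illustrative, not the package's internals:

```python
def plan_pipeline(height, width):
    """Two-stage plan: half-res generate, 2x latent upsample, full-res refine.
    Each entry is (stage_name, height, width, denoising_steps)."""
    return [
        ("stage1_generate", height // 2, width // 2, 8),   # 8 denoising steps
        ("upsample_2x",     height,      width,      None), # LatentUpsampler, no denoising
        ("stage2_refine",   height,      width,      3),    # 3 denoising steps
    ]
```

For a 768x768 target, stage 1 runs at 384x384 and stage 2 refines at 768x768.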

Audio Generation Pipeline

Audio is generated in sync with video through:

  1. Joint Denoising: Video and audio latents are denoised together
  2. Cross-Modal Attention: Bidirectional attention between video and audio
  3. Audio Decoding: Audio VAE converts latents to mel spectrogram
  4. Vocoder: HiFi-GAN converts mel spectrogram to waveform
  5. Muxing: ffmpeg combines video and audio
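The final muxing step corresponds to an ffmpeg invocation roughly like the following; the exact flags the package passes may differ, so treat this as a sketch:

```python
import subprocess

def mux_av(video_path, audio_path, out_path, run=True):
    """Combine a silent video with a WAV file: the video stream is copied
    untouched, and the audio is encoded to AAC for MP4 compatibility."""
    cmd = [
        "ffmpeg", "-y",
        "-i", video_path,   # video stream (no audio)
        "-i", audio_path,   # generated waveform
        "-c:v", "copy",     # no re-encode of video bits
        "-c:a", "aac",      # raw WAV isn't valid in MP4; transcode to AAC
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]
    if run:
        subprocess.run(cmd, check=True)
    return cmd
```

With `run=False` the function just returns the argument list, which is handy for inspecting or logging the command.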

Architecture

Text Prompt
    │
    ▼
┌─────────────────────────────────────────────┐
│          Text Encoder (Gemma 3 12B)         │
│  ┌─────────────┐      ┌─────────────┐       │
│  │   Video     │      │   Audio     │       │
│  │ Connector   │      │ Connector   │       │
│  │  (4096-dim) │      │  (2048-dim) │       │
│  └──────┬──────┘      └──────┬──────┘       │
└─────────┼────────────────────┼──────────────┘
          │                    │
          ▼                    ▼
┌─────────────────────────────────────────────┐
│        LTX Transformer (48 layers)          │
│  ┌─────────────┐ ◄──► ┌─────────────┐       │
│  │ Video Path  │      │ Audio Path  │       │
│  │  (4096-dim) │      │  (2048-dim) │       │
│  └──────┬──────┘      └──────┬──────┘       │
└─────────┼────────────────────┼──────────────┘
          │                    │
          ▼                    ▼
┌─────────────────┐    ┌─────────────────┐
│   Video VAE     │    │   Audio VAE     │
│   Decoder       │    │   Decoder       │
└────────┬────────┘    └────────┬────────┘
         │                      │
         ▼                      ▼
    Video Frames          ┌─────────────┐
                          │   Vocoder   │
                          │  (HiFi-GAN) │
                          └──────┬──────┘
                                 │
                                 ▼
                           Audio Waveform

Model Specifications

Video Path

  • Transformer: 48 layers, 32 attention heads, 128 dim per head (4096 total)
  • Latent channels: 128
  • Text encoder: Gemma 3 with 3840-dim features, projected to 4096-dim
  • RoPE: Split mode with double precision

Audio Path

  • Transformer: 48 layers, 32 attention heads, 64 dim per head (2048 total)
  • Latent channels: 8 (patchified to 128)
  • Mel bins: 16 (latent), 64 (decoded)
  • Sample rate: 24kHz output, 16kHz internal
  • Audio latents per second: 25
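Given the 25-latents-per-second rate above, the audio latent length follows from the video duration. A quick sketch (the pipeline's actual rounding may differ):

```python
import math

def audio_latent_count(num_frames, fps=24, latents_per_second=25):
    """Audio latents needed to cover the video duration, rounded up."""
    duration_s = num_frames / fps
    return math.ceil(duration_s * latents_per_second)
```

The default 65 frames at 24 fps is about 2.71 seconds of video, so roughly 68 audio latents.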

Cross-Modal Attention

  • Bidirectional attention between video and audio paths
  • Separate timestep conditioning for cross-attention
  • Gated attention output for controlled mixing
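The gated mixing described above can be sketched in NumPy. This is an illustration, not the model's code: the real model projects video (4096-dim) and audio (2048-dim) tokens before attending, whereas the sketch assumes matching dimensions and a single scalar gate:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(video, audio, gate):
    """One direction of the bidirectional mixing: video tokens attend to
    audio tokens; a learned gate controls how much flows back in."""
    d = video.shape[-1]
    scores = video @ audio.T / np.sqrt(d)  # (Tv, Ta) attention logits
    mixed = softmax(scores) @ audio        # audio info per video token
    return video + gate * mixed            # gated residual update
```

With the gate at zero the path contributes nothing, which is how a gated residual can be initialized to leave each modality's own features intact.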

Project Structure

mlx_video/
├── generate.py             # Video-only generation pipeline
├── generate_av.py          # Audio-video generation pipeline
├── convert.py              # Weight conversion (PyTorch -> MLX)
├── postprocess.py          # Video post-processing utilities
├── utils.py                # Helper functions
├── conditioning/           # I2V conditioning utilities
└── models/
    └── ltx/
        ├── ltx.py          # Main LTXModel (DiT transformer)
        ├── config.py       # Model configuration
        ├── transformer.py  # Transformer blocks with cross-modal attention
        ├── attention.py    # Multi-head attention with RoPE
        ├── text_encoder.py # Text encoder with video/audio connectors
        ├── upsampler.py    # 2x spatial upsampler
        ├── video_vae/      # Video VAE encoder/decoder
        └── audio_vae/      # Audio VAE decoder and vocoder

Tips for Best Results

  1. Prompt Quality: Use detailed, descriptive prompts that include both visual and audio elements
  2. Frame Count: Use frame counts of the form 1 + 8*k (e.g., 33, 65, 97) for optimal quality
  3. Resolution: Higher resolutions (768x768) produce better results but require more memory
  4. Tiling: For large videos, use --tiling aggressive to reduce memory usage
  5. Audio Sync: Audio is automatically synchronized to video duration

License

MIT
