mlx-video-with-audio

Generate videos with synchronized audio on Apple Silicon using MLX. Supports text-to-video (T2V) and image-to-video (I2V) generation with the LTX-2 model.

Features

  • Text-to-Video (T2V) generation with synchronized audio
  • Image-to-Video (I2V) generation with synchronized audio
  • Video-only generation (without audio)
  • Two-stage generation pipeline for high-quality output
  • 2x spatial upscaling for images and videos
  • Optimized for Apple Silicon using MLX
  • Cross-modal attention for audio-video synchronization

Installation

Install from PyPI:

pip install mlx-video-with-audio

Or with uv:

uv pip install mlx-video-with-audio

Install from source:

pip install git+https://github.com/james-see/mlx-video-with-audio.git

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python >= 3.11
  • MLX >= 0.22.0
  • ffmpeg (for audio-video muxing)

Install ffmpeg if not already installed:

brew install ffmpeg

Supported Models

  • LTX-2 — 19B parameter video generation model from Lightricks with audio generation capabilities; two-stage pipeline with 2x spatial upscaling
  • Wan2.1 — 1.3B / 14B parameter T2V models (single-model pipeline); flow-matching diffusion with classifier-free guidance
  • Wan2.2 — T2V-14B, TI2V-5B, and I2V-14B models (dual-model pipeline); flow-matching diffusion with classifier-free guidance

Recommended: Use the unified MLX model notapalindrome/ltx2-mlx-av (~42GB). It avoids downloading the full Lightricks/LTX-2 (~150GB) by using the MLX-community Gemma model for the text encoder.

Important: The Gemma text encoder is required for the embeddings used in normal generation (video/audio conditioning), even when prompt enhancement is disabled. Prompt enhancement is a separate, optional step that rewrites the prompt text before generation; if enhancement fails, generation falls back to the original prompt automatically, as sketched below.
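
A minimal sketch of that fallback behavior (enhance_prompt and resolve_prompt are illustrative names, not the package's real API):

def enhance_prompt(prompt: str) -> str:
    # Placeholder for the optional Gemma prompt rewrite.
    raise RuntimeError("enhancement unavailable")

def resolve_prompt(prompt: str, enhance: bool = True) -> str:
    if not enhance:
        return prompt
    try:
        return enhance_prompt(prompt)
    except Exception:
        return prompt  # on any enhancement failure, keep the original prompt

print(resolve_prompt("A jazz band playing in a smoky club"))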

LTX-2

This project uses uv for dependency management and isolation.

Text-to-Video with Audio (T2V+Audio)

Generate videos with synchronized audio from text descriptions:

uv run mlx_video.generate_av --prompt "A jazz band playing in a smoky club"

With custom settings:

uv run mlx_video.generate_av \
    --prompt "Ocean waves crashing on a beach at sunset" \
    --height 768 \
    --width 768 \
    --num-frames 65 \
    --seed 123 \
    --output-path my_video.mp4

Image-to-Video with Audio (I2V+Audio)

Generate videos from an input image with synchronized audio:

uv run mlx_video.generate_av \
    --prompt "A person dancing to upbeat music" \
    --image photo.jpg \
    --image-strength 0.8

Video-Only Generation (no audio)

For video generation without audio:

uv run mlx_video.generate --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" -n 100 --width 768

[Poodles demo video]

CLI Reference

generate_av (Audio-Video)

Option              Default                      Description
--prompt, -p        (required)                   Text description of the video/audio
--height, -H        512                          Output height (must be divisible by 64)
--width, -W         512                          Output width (must be divisible by 64)
--num-frames, -n    65                           Number of frames (must be 1 + 8*k)
--seed, -s          42                           Random seed for reproducibility
--fps               24                           Frames per second
--output-path       output_av.mp4                Output video path
--output-audio      (auto)                       Output audio path (default: same as video with .wav)
--image, -i         None                         Path to conditioning image for I2V
--image-strength    1.0                          Conditioning strength (1.0 = full denoise)
--image-frame-idx   0                            Frame index to condition (0 = first frame)
--enhance-prompt    false                        Enhance prompt using Gemma
--tiling            auto                         Tiling mode for VAE (auto/none/default/aggressive/conservative)
--model-repo        notapalindrome/ltx2-mlx-av   Model repo (~42GB unified MLX, no Lightricks download)

generate (Video-Only)

Option              Default                      Description
--prompt, -p        (required)                   Text description of the video
--height, -H        512                          Output height (must be divisible by 64)
--width, -W         512                          Output width (must be divisible by 64)
--num-frames, -n    100                          Number of frames
--seed, -s          42                           Random seed for reproducibility
--fps               24                           Frames per second
--output, -o        output.mp4                   Output video path
--save-frames       false                        Save individual frames as images
--model-repo        notapalindrome/ltx2-mlx-av   Model repo (~42GB unified MLX, no Lightricks download)

convert (Model Conversion)

Convert HuggingFace models to unified MLX format for faster loading:

uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av

Option              Default                      Description
--hf-path           Lightricks/LTX-2             HuggingFace model path or repo ID
--mlx-path          mlx_model                    Output path for MLX model
--dtype             bfloat16                     Target dtype (float16/float32/bfloat16)
--no-audio          false                        Exclude audio components (video-only model)

Using Pre-Converted MLX Models

For faster loading, you can use pre-converted MLX models instead of converting them on the fly.

Option 1: Use a Pre-Converted Model from HuggingFace

# Use a community-converted MLX model (replace with actual repo)
uv run mlx_video.generate_av \
    --prompt "A jazz band playing in a smoky club" \
    --model-repo username/ltx2-mlx-av

Option 2: Convert Your Own Model

  1. Convert the model (one-time, ~42GB output):
uv run mlx_video.convert --hf-path Lightricks/LTX-2 --mlx-path ~/models/ltx2-mlx-av
  2. Use the converted model:
uv run mlx_video.generate_av \
    --prompt "A jazz band playing in a smoky club" \
    --model-repo ~/models/ltx2-mlx-av

Benefits of Unified MLX Format

  • Faster loading: Single file vs multiple scattered files
  • Pre-sanitized weights: No on-the-fly key transformation
  • Smaller footprint: Only includes necessary weights (no quantized variants)
  • Easy sharing: Upload to HuggingFace for others to use

How It Works (LTX-2)

Video Generation Pipeline

The pipeline uses a two-stage generation process (sketched in code after this list):

  1. Stage 1: Generate at half resolution (e.g., 384x384) with 8 denoising steps
  2. Upsample: 2x spatial upsampling via LatentUpsampler
  3. Stage 2: Refine at full resolution (e.g., 768x768) with 3 denoising steps
  4. Decode: VAE decoder converts latents to RGB video
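
A conceptual sketch of this schedule (denoise, upsample_2x, and vae_decode are illustrative placeholders, not the package's real API):

def generate_two_stage(text_embeddings, denoise, upsample_2x, vae_decode,
                       height=768, width=768, num_frames=65):
    # Stage 1: establish motion and global structure at half resolution.
    latents = denoise(text_embeddings, height // 2, width // 2, num_frames, steps=8)
    # 2x spatial upsampling in latent space (LatentUpsampler).
    latents = upsample_2x(latents)
    # Stage 2: short refinement pass at full resolution.
    latents = denoise(text_embeddings, height, width, num_frames, steps=3,
                      init_latents=latents)
    # VAE decode from latents to RGB frames.
    return vae_decode(latents)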

Audio Generation Pipeline

Audio is generated in sync with video through:

  1. Joint Denoising: Video and audio latents are denoised together
  2. Cross-Modal Attention: Bidirectional attention between video and audio
  3. Audio Decoding: Audio VAE converts latents to mel spectrogram
  4. Vocoder: HiFi-GAN converts mel spectrogram to waveform
  5. Muxing: ffmpeg combines video and audio (example command below)
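
The muxing step corresponds to a standard ffmpeg invocation along these lines (file names are illustrative):

ffmpeg -i video_only.mp4 -i audio.wav -c:v copy -c:a aac -shortest output_av.mp4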

Architecture

Text Prompt
    │
    ▼
┌─────────────────────────────────────────────┐
│          Text Encoder (Gemma 3 12B)         │
│  ┌─────────────┐      ┌─────────────┐       │
│  │   Video     │      │   Audio     │       │
│  │ Connector   │      │ Connector   │       │
│  │  (4096-dim) │      │  (2048-dim) │       │
│  └──────┬──────┘      └──────┬──────┘       │
└─────────┼────────────────────┼──────────────┘
          │                    │
          ▼                    ▼
┌─────────────────────────────────────────────┐
│        LTX Transformer (48 layers)          │
│  ┌─────────────┐ ◄──► ┌─────────────┐       │
│  │ Video Path  │      │ Audio Path  │       │
│  │  (4096-dim) │      │  (2048-dim) │       │
│  └──────┬──────┘      └──────┬──────┘       │
└─────────┼────────────────────┼──────────────┘
          │                    │
          ▼                    ▼
┌─────────────────┐    ┌─────────────────┐
│   Video VAE     │    │   Audio VAE     │
│   Decoder       │    │   Decoder       │
└────────┬────────┘    └────────┬────────┘
         │                      │
         ▼                      ▼
    Video Frames          ┌─────────────┐
                          │   Vocoder   │
                          │  (HiFi-GAN) │
                          └──────┬──────┘
                                 │
                                 ▼
                           Audio Waveform

Model Specifications

Video Path

  • Transformer: 48 layers, 32 attention heads, 128 dim per head (4096 total)
  • Latent channels: 128
  • Text encoder: Gemma 3 with 3840-dim features, projected to 4096-dim
  • RoPE: Split mode with double precision

Audio Path

  • Transformer: 48 layers, 32 attention heads, 64 dim per head (2048 total)
  • Latent channels: 8 (patchified to 128)
  • Mel bins: 16 (latent), 64 (decoded)
  • Sample rate: 24kHz output, 16kHz internal
  • Audio latents per second: 25 (see the worked example below)
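
A quick worked example of the audio/video timing (plain Python arithmetic):

# Audio latent count for a 65-frame clip at 24 fps.
num_frames, fps, latents_per_second = 65, 24, 25
duration_s = num_frames / fps                           # ~2.71 s of video
audio_latents = round(duration_s * latents_per_second)  # ~68 audio latents
print(f"{duration_s:.2f} s -> {audio_latents} audio latents")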

Cross-Modal Attention

  • Bidirectional attention between video and audio paths
  • Separate timestep conditioning for cross-attention
  • Gated attention output for controlled mixing (see the sketch below)
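
A minimal MLX sketch of this gating pattern (illustrative, not the package's actual implementation; dimensions follow the specs above):

import mlx.core as mx
import mlx.nn as nn

class GatedCrossModalAttention(nn.Module):
    def __init__(self, video_dim=4096, audio_dim=2048, num_heads=32):
        super().__init__()
        # Video queries attend to audio keys/values, and vice versa.
        self.video_to_audio = nn.MultiHeadAttention(
            video_dim, num_heads,
            key_input_dims=audio_dim, value_input_dims=audio_dim)
        self.audio_to_video = nn.MultiHeadAttention(
            audio_dim, num_heads,
            key_input_dims=video_dim, value_input_dims=video_dim)
        # Learned per-channel gates control how much cross-modal
        # signal is mixed into each stream.
        self.video_gate = mx.zeros((video_dim,))
        self.audio_gate = mx.zeros((audio_dim,))

    def __call__(self, video, audio):
        video = video + mx.tanh(self.video_gate) * self.video_to_audio(video, audio, audio)
        audio = audio + mx.tanh(self.audio_gate) * self.audio_to_video(audio, video, video)
        return video, audio

# Example: 64 video tokens (4096-dim) attending to 68 audio tokens (2048-dim).
block = GatedCrossModalAttention()
v, a = block(mx.zeros((1, 64, 4096)), mx.zeros((1, 68, 2048)))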

Project Structure

mlx_video/
├── generate.py             # Video-only generation pipeline
├── generate_av.py          # Audio-video generation pipeline
├── convert.py              # Weight conversion (PyTorch -> MLX)
├── postprocess.py          # Video post-processing utilities
├── utils.py                # Helper functions
├── conditioning/           # I2V conditioning utilities
└── models/
    └── ltx/
        ├── ltx.py          # Main LTXModel (DiT transformer)
        ├── config.py       # Model configuration
        ├── transformer.py  # Transformer blocks with cross-modal attention
        ├── attention.py    # Multi-head attention with RoPE
        ├── text_encoder.py # Text encoder with video/audio connectors
        ├── upsampler.py    # 2x spatial upsampler
        ├── video_vae/      # Video VAE encoder/decoder
        └── audio_vae/      # Audio VAE decoder and vocoder

Tips for Best Results

  1. Prompt Quality: Use detailed, descriptive prompts that include both visual and audio elements
  2. Frame Count: Use frame counts of the form 1 + 8*k (e.g., 33, 65, 97) for optimal quality; a quick check is sketched after this list
  3. Resolution: Higher resolutions (768x768) produce better results but require more memory
  4. Tiling: For large videos, use --tiling aggressive to reduce memory usage
  5. Audio Sync: Audio is automatically synchronized to video duration
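
A quick sanity check for tip 2 (plain Python, independent of the package):

# Frame counts must have the form 1 + 8*k.
def is_valid_frame_count(n: int) -> bool:
    return n >= 1 and (n - 1) % 8 == 0

print([1 + 8 * k for k in range(1, 13)])
# [9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97]
print(is_valid_frame_count(65), is_valid_frame_count(64))  # True False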

License

MIT
