Liquid Audio - Speech-to-Speech audio models

We present LFM2-Audio-1.5B, Liquid AI's first end-to-end audio foundation model. Built with low latency in mind, the lightweight LFM2 backbone enables real-time speech-to-speech conversations without sacrificing quality.

LFM2-Audio supports two generation modes, interleaved and sequential, to maximize performance and quality across different tasks. Interleaved generation outputs text and audio tokens in a fixed interleaved pattern. This approach minimizes both the time to first audio output and the number of tokens generated, making it ideal for naturally flowing real-time speech-to-speech interactions on resource-constrained devices. Sequential generation, where the model decides when to switch modalities via special tokens, is suitable for non-conversational tasks such as speech-to-text (ASR) or text-to-speech (TTS).

Updates

LFM2.5-Audio-1.5B is released! This model is based on the stronger LFM2.5-1.2B base, and comes with a lightning-fast LFM2-based audio detokenizer, stronger ASR, and better TTS voices. To use the new detokenizer, simply call processor.decode; see the examples below for more details. For the improved TTS voices, see the TTS section.

Installation

The package can be installed via pip:

pip install liquid-audio
pip install "liquid-audio[demo]"  # optional, to install demo dependencies
pip install flash-attn --no-build-isolation  # optional, to use flash attention 2; falls back to torch SDPA if not installed

Usage

Generation is available in two modes, interleaved and sequential, exposed as the methods LFM2AudioModel.generate_interleaved and LFM2AudioModel.generate_sequential respectively. Both are generators that yield torch.Tensor objects. A text token is a tensor with 1 entry, and an audio token is a tensor with 8 entries, one for each of the 8 Mimi codebooks.
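Because both generators yield a mixed stream, callers need to separate text tokens from audio tokens by entry count. The sketch below shows one way to do that. The helper name split_by_modality is hypothetical and not part of liquid_audio; it relies only on the yielded objects exposing numel(), as torch.Tensor does.

```python
from typing import Any, Iterable

TEXT_ENTRIES = 1   # text tokens have 1 entry
AUDIO_ENTRIES = 8  # audio tokens have 8 entries, one per Mimi codebook


def split_by_modality(tokens: Iterable[Any]) -> tuple[list[Any], list[Any]]:
    """Separate generated tokens into (text, audio) lists by entry count.

    Hypothetical helper: accepts any objects with a ``numel()`` method,
    such as the torch.Tensor values yielded by the generation methods.
    """
    text: list[Any] = []
    audio: list[Any] = []
    for token in tokens:
        n = token.numel()
        if n == TEXT_ENTRIES:
            text.append(token)
        elif n == AUDIO_ENTRIES:
            audio.append(token)
        else:
            # Anything else does not match the two documented token shapes.
            raise ValueError(f"unexpected token with {n} entries")
    return text, audio
```

Consuming the whole generator this way gives up streaming, so for live playback the per-token check used in the examples below is the better fit.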

The LFM2AudioModel class operates on tokens only. The LFM2AudioProcessor class is used to convert between tokens and data. For text, this means converting strings to tokens and back. For audio inputs, it converts waveforms to log-mel features, and for audio outputs, it detokenizes audio tokens into a waveform.

To facilitate the creation of inputs for the generation methods and to apply the correct chat templates, use the ChatState helper class. See examples below for usage instructions.
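As a compact summary of how this README pairs each task with a generation method and a system prompt, the following sketch collects them in a lookup table. The method names and prompts are the ones used in this README; TaskRecipe, RECIPES, and recipe_for are hypothetical names introduced only for illustration and are not part of liquid_audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecipe:
    method: str         # name of the LFM2AudioModel generation method
    system_prompt: str  # system prompt this README uses for the task


# Hypothetical lookup table, not shipped with liquid_audio.
RECIPES = {
    "chat": TaskRecipe("generate_interleaved", "Respond with interleaved text and audio."),
    "asr": TaskRecipe("generate_sequential", "Perform ASR."),
    # One of the four predefined TTS voices; see the TTS section.
    "tts": TaskRecipe("generate_sequential", "Perform TTS. Use the UK male voice."),
}


def recipe_for(task: str) -> TaskRecipe:
    """Return the generation method name and system prompt for a task."""
    try:
        return RECIPES[task.lower()]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(RECIPES)}"
        ) from None
```

With a loaded model, the method could then be resolved with getattr(model, recipe_for("asr").method) and the prompt passed to chat.add_text in the system turn.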

Gradio demo

To use the demo interface, make sure to install the extra dependencies in the [demo] group, e.g.

pip install "liquid-audio[demo]"

To launch the demo, run the command liquid-audio-demo in a terminal. The demo interface will be available at http://localhost:7860.

Multi-turn, multi-modal chat

For multi-turn chat with text and audio output, we use interleaved generation. The system prompt should be set to "Respond with interleaved text and audio." Here we use audio for the first user turn and text for the second.

Conversation transcript

User

https://github.com/user-attachments/assets/e2ffb8c3-c84c-4460-9cb8-f95a13b6eec6

Assistant

Sure! How about "Handcrafted Woodworking, Precision Made for You"? Another option could be "Quality Woodworking, Quality Results." If you want something more personal, you might try "Your Woodworking Needs, Our Expertise."

https://github.com/user-attachments/assets/019664b5-3480-4801-b05a-bd62ddcb8d3e

User

My business specialized in chairs, can you give me something related to that?

Assistant

Sure thing! How about “Comfortable Chairs, Crafted with Care” or “Elegant Seats, Handcrafted for You”? Let me know if you’d like a few more options.

https://github.com/user-attachments/assets/d0d054b2-6d1d-49fb-94df-4aa0b6641990

import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Set up inputs for the model
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Respond with interleaved text and audio.")
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/question.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# Generate text and audio tokens.
text_out: list[torch.Tensor] = []
audio_out: list[torch.Tensor] = []
modality_out: list[LFMModality] = []
for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
        text_out.append(t)
        modality_out.append(LFMModality.TEXT)
    else:
        audio_out.append(t)
        modality_out.append(LFMModality.AUDIO_OUT)

# output: Sure! How about "Handcrafted Woodworking, Precision Made for You"? Another option could be "Quality Woodworking, Quality Results." If you want something more personal, you might try "Your Woodworking Needs, Our Expertise."

# Detokenize audio, removing the last "end-of-audio" codes
# Mimi returns audio at 24kHz
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
torchaudio.save("answer1.wav", waveform.cpu(), 24_000)

# Append newly generated tokens to chat history
chat.append(
    text = torch.stack(text_out, 1),
    audio_out = torch.stack(audio_out, 1),
    modality_flag = torch.tensor(modality_out),
)
chat.end_turn()

# Start new turn
chat.new_turn("user")
chat.add_text("My business specialized in chairs, can you give me something related to that?")
chat.end_turn()

chat.new_turn("assistant")

# Generate second turn text and audio tokens.
audio_out: list[torch.Tensor] = []
for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperature=1.0, audio_top_k=4):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)
    else:
        audio_out.append(t)

# output: Sure thing! How about “Comfortable Chairs, Crafted with Care” or “Elegant Seats, Handcrafted for You”? Let me know if you’d like a few more options.

# Detokenize second turn audio, removing the last "end-of-audio" codes
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
torchaudio.save("answer2.wav", waveform.cpu(), 24_000)
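The expression torch.stack(audio_out[:-1], 1).unsqueeze(0) in the example above drops the final end-of-audio frame, stacks the remaining frames along a new second axis, and adds a batch dimension of size 1. The sketch below mirrors that layout with plain Python lists, so the resulting shape (1, 8, frames) is easy to inspect without loading a model. It assumes each audio token is a flat sequence of 8 codes; to_codebook_major is a hypothetical name, not a liquid_audio function.

```python
def to_codebook_major(frames: list[list[int]]) -> list[list[list[int]]]:
    """Drop the last frame and return codes indexed as [batch][codebook][frame]."""
    kept = frames[:-1]  # discard the trailing end-of-audio frame
    if not kept:
        raise ValueError("need at least one audio frame besides end-of-audio")
    # Transpose from frame-major (frames x 8) to codebook-major (8 x frames).
    codebook_major = [list(row) for row in zip(*kept)]
    return [codebook_major]  # leading batch dimension of size 1


# Four made-up frames of 8 codes each; the last one stands in for end-of-audio.
frames = [[10 * step + k for k in range(8)] for step in range(4)]
batch = to_codebook_major(frames)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 1 8 3
```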

ASR

For ASR, we use sequential generation with the fixed system prompt "Perform ASR." The output is capitalized and punctuated.

Input audio snippet

https://github.com/user-attachments/assets/b3cc017f-363d-49f3-8e7d-f6db9556900e

Model output: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.

import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Set up inputs for the model
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Perform ASR.")
chat.end_turn()

chat.new_turn("user")
wav, sampling_rate = torchaudio.load("assets/asr.wav")
chat.add_audio(wav, sampling_rate)
chat.end_turn()

chat.new_turn("assistant")

# Generate text
for t in model.generate_sequential(**chat, max_new_tokens=512):
    if t.numel() == 1:
        print(processor.text.decode(t), end="", flush=True)

# Output: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.

TTS

For TTS, we also use sequential generation. We support four predefined voices, selected by using one of the four system prompts below:

Perform TTS. Use the US male voice.
Perform TTS. Use the US female voice.
Perform TTS. Use the UK male voice.
Perform TTS. Use the UK female voice.
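Since the four prompts differ only in accent and gender, they can be built from those two values. The following is a minimal sketch; tts_system_prompt, ACCENTS, and GENDERS are hypothetical names and not part of liquid_audio. Only the four prompt strings listed above come from this README.

```python
ACCENTS = ("US", "UK")
GENDERS = ("male", "female")


def tts_system_prompt(accent: str, gender: str) -> str:
    """Build one of the four TTS system prompts listed in this README."""
    accent = accent.upper()
    gender = gender.lower()
    if accent not in ACCENTS or gender not in GENDERS:
        # Reject combinations outside the four documented voices.
        raise ValueError(
            f"unsupported voice {accent!r}/{gender!r}; "
            f"accents: {ACCENTS}, genders: {GENDERS}"
        )
    return f"Perform TTS. Use the {accent} {gender} voice."
```

The returned string would be passed to chat.add_text in the system turn, as in the TTS example below.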

TTS Sample

System prompt: Perform TTS. Use the UK male voice.

Input sentence: What is this obsession people have with books? They put them in their houses—like they're trophies. What do you need it for after you read it?

Output audio

https://github.com/user-attachments/assets/8d57c184-b92e-4e1a-983b-d1f9d16d0d92

import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality

# Load models
HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"

processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
model = LFM2AudioModel.from_pretrained(HF_REPO).eval()

# Set up inputs for the model
chat = ChatState(processor)

chat.new_turn("system")
chat.add_text("Perform TTS. Use the UK male voice.")
chat.end_turn()

chat.new_turn("user")
chat.add_text("What is this obsession people have with books? They put them in their houses—like they're trophies. What do you need it for after you read it?")
chat.end_turn()

chat.new_turn("assistant")

# Generate audio tokens
audio_out: list[torch.Tensor] = []
for t in model.generate_sequential(**chat, max_new_tokens=512, audio_temperature=0.8, audio_top_k=64):
    if t.numel() > 1:
        audio_out.append(t)

# Detokenize audio, removing the last "end-of-audio" codes
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
waveform = processor.decode(audio_codes)
torchaudio.save("tts.wav", waveform.cpu(), 24_000)

License

The code in this repository and associated weights are licensed under the LFM Open License v1.0.

The code for the audio encoder is based on Nvidia NeMo, licensed under Apache 2.0, and the canary-180m-flash checkpoint, licensed under CC-BY 4.0. To simplify dependency resolution, we also ship the Python code of Kyutai Mimi, licensed under the MIT License.

Project details

Release history

1.2.0 (May 8, 2026)
1.1.0 (Jan 6, 2026), this version
1.0.0 (Oct 1, 2025)
0.0.0a1 (Oct 1, 2025), pre-release

Download files

Source Distribution

liquid_audio-1.1.0.tar.gz (135.6 kB)

Uploaded Jan 6, 2026 Source

Built Distribution

liquid_audio-1.1.0-py3-none-any.whl (165.8 kB)

Uploaded Jan 6, 2026 Python 3

File details

Details for the file liquid_audio-1.1.0.tar.gz.

File metadata

Download URL: liquid_audio-1.1.0.tar.gz
Upload date: Jan 6, 2026
Size: 135.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for liquid_audio-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`57d633ce65c55e050bbef0671c7b8fad5cf8b75e3cc9e0d36fec38d7a04221c7`
MD5	`07e154f170058558b448200619ce04b8`
BLAKE2b-256	`e956f899e9c29481209027d9514438caace608cf0ce74cd85ca28b68dfcfd70e`

File details

Details for the file liquid_audio-1.1.0-py3-none-any.whl.

File metadata

Download URL: liquid_audio-1.1.0-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 165.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for liquid_audio-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c876dd2b481664e2464568c93eabf5447511db1f8a9200294fd19add3e36097`
MD5	`918cbb35be1c19e5ab173da77c90c627`
BLAKE2b-256	`b552b69b3c9c0a4853c1f9a9670dd22c3aea708e26da857342feca8fd922d238`

