High-quality Text-to-Speech synthesis with ONNX Runtime

Supertonic - Lightning Fast, On-Device TTS

Quick Start

pip install supertonic

CLI

# Note: First run will download the model (~260MB) from HuggingFace
supertonic tts 'Supertonic is a lightning fast, on-device TTS system.' -o output.wav

Python

from supertonic import TTS

# Note: First run downloads model automatically (~260MB)
tts = TTS(auto_download=True)

# Get a voice style
style = tts.get_voice_style(voice_name="M1")

# Generate speech
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
wav, duration = tts.synthesize(text, voice_style=style)

# Save to file
tts.save_audio(wav, "output.wav")
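When several sentences share one voice, the style can be fetched once and reused. The helper below is a minimal sketch, not part of the package: `synthesize_to_files` and the `utterance_NNN.wav` naming are hypothetical, and it relies only on the three calls shown above (`get_voice_style`, `synthesize`, `save_audio`), assuming they behave as in the Quick Start snippet.

```python
from pathlib import Path


def synthesize_to_files(tts, texts, voice_name="M1", out_dir="."):
    """Synthesize each text with one shared voice style and save it as a WAV file.

    `tts` is expected to be an object exposing the Quick Start calls
    (get_voice_style, synthesize, save_audio). This helper and its file
    naming scheme are illustrative assumptions, not part of the library.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Fetch the voice style once and reuse it for every utterance.
    style = tts.get_voice_style(voice_name=voice_name)

    results = []
    for index, text in enumerate(texts):
        wav, duration = tts.synthesize(text, voice_style=style)
        target = out_path / f"utterance_{index:03d}.wav"
        tts.save_audio(wav, str(target))
        results.append((str(target), duration))
    return results


# Usage (requires the package and the model download):
#   from supertonic import TTS
#   synthesize_to_files(TTS(auto_download=True), ["Hello.", "Goodbye."], out_dir="out")
```

The function returns a list of `(path, duration)` pairs, with `duration` passed through unchanged from `synthesize`.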

Requirements

Supertonic has minimal dependencies - just 4 core libraries:

  • onnxruntime - Fast ONNX model inference
  • numpy - Numerical operations
  • soundfile - Audio file I/O
  • huggingface-hub - Model downloads

Key Features

⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro)

🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance

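📱 On-Device Capable: Runs entirely on the local device, so text never leaves the machine and there is no network round-trip latency

🎨 Natural Text Handling: Seamlessly processes complex expressions without a separate G2P (grapheme-to-phoneme) module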

⚙️ Highly Configurable: Adjust inference steps, batch processing, and other parameters

🧩 Flexible Deployment: Deploy across servers, browsers, and edge devices

Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

Metrics:

  • Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
  • Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
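The two metrics above, plus the "times faster than real-time" figure that follows from RTF, can be sketched as plain arithmetic. The function names and the sample timings are illustrative assumptions, not the benchmark harness used for the tables in this section.

```python
def chars_per_second(num_chars, synthesis_seconds):
    """Throughput: input characters divided by the time spent generating audio."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: synthesis time relative to the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf):
    """How many seconds of audio are produced per second of compute (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical measurement: 266 characters synthesized in 0.2 s, yielding 16 s of audio.
throughput = chars_per_second(266, 0.2)   # about 1330 characters per second
rtf = real_time_factor(0.2, 16.0)         # about 0.0125
speedup = speedup_over_real_time(rtf)     # about 80x faster than real-time
```

Read this way, an RTF of 0.006 corresponds to roughly 167 seconds of audio per second of compute.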

Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| API ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| API OpenAI TTS-1 | 37 | 55 | 82 |
| API Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| API Supertone Sona speech 1 | 38 | 64 | 92 |
| Open Kokoro | 104 | 107 | 117 |
| Open NeuTTS Air | 37 | 42 | 47 |

Notes:

  • API = Cloud-based API services (measured from Seoul)
  • Open = Open-source models
  • Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
  • Supertonic (RTX4090): Tested with PyTorch model
  • Kokoro: Tested on M4 Pro CPU with ONNX
  • NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| API ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| API OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| API Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| API Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| Open Kokoro | 0.144 | 0.124 | 0.126 |
| Open NeuTTS Air | 0.390 | 0.338 | 0.343 |

Additional Performance Data (5-step inference)

Characters per Second (5-step)

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |

Real-time Factor (5-step)

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |

Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

🎧 View audio samples more easily: Check out our Interactive Demo for a better viewing experience of all audio examples.

Overview of Test Cases:

| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|---|---|---|---|---|---|---|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |

Example 1: Financial Expression

Text:

"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."

Challenges:

  • Decimal point in currency ($5.2M should be read as "five point two million")
  • Abbreviated magnitude units (M for million, K for thousand)
  • Currency symbol ($) that needs to be properly pronounced as "dollars"
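To make the target reading concrete, here is a minimal sketch of the kind of rule-based expansion a conventional text front end might apply to this sentence. It is an illustration only: Supertonic is described here as handling such text without pre-processing, and `expand_currency`, its regular expression, and the magnitude table are hypothetical.

```python
import re

# Assumed mapping from magnitude suffix to spoken word.
_MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

# A dollar sign, an integer or decimal amount, and an optional magnitude suffix.
_CURRENCY = re.compile(r"\$(\d+(?:\.\d+)?)([KMB])?\b")


def expand_currency(text):
    """Rewrite amounts such as "$5.2M" as "5.2 million dollars"."""
    def _replace(match):
        amount, suffix = match.group(1), match.group(2)
        words = [amount]
        if suffix:
            words.append(_MAGNITUDES[suffix])
        words.append("dollars")
        return " ".join(words)

    return _CURRENCY.sub(_replace, text)


sentence = (
    "The startup secured $5.2M in venture capital, "
    "a huge leap from their initial $450K seed round."
)
expanded = expand_currency(sentence)
# expanded now reads "... secured 5.2 million dollars in venture capital, ...
# ... their initial 450 thousand dollars seed round."
```

The sketch leaves the digits themselves ("5.2") for a later number-to-words step and ignores singular forms and other currencies.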

Audio Samples:

| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |

Example 2: Time and Date

Text:

"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."

Challenges:

  • Time expression with PM notation (4:45 PM)
  • Abbreviated weekday (Wed)
  • Abbreviated month (Apr)
  • Full date format (Apr 3, 2024)
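The abbreviation challenges listed above can be illustrated with a small lookup-based expansion. This is a hedged sketch of what an explicit normalization rule might look like, not how Supertonic works internally; `expand_date_abbreviations` and its tables are hypothetical, and the time expression ("4:45 PM") is deliberately left untouched.

```python
import re

# Assumed abbreviation tables; "May" is omitted because it is already a full word.
_WEEKDAYS = {
    "Mon": "Monday", "Tue": "Tuesday", "Wed": "Wednesday", "Thu": "Thursday",
    "Fri": "Friday", "Sat": "Saturday", "Sun": "Sunday",
}
_MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "Jun": "June", "Jul": "July", "Aug": "August", "Sep": "September",
    "Oct": "October", "Nov": "November", "Dec": "December",
}
_ABBREVIATIONS = {**_WEEKDAYS, **_MONTHS}

# Match an abbreviation only when it stands alone as a word (case-sensitive).
_PATTERN = re.compile(r"\b(" + "|".join(_ABBREVIATIONS) + r")\b")


def expand_date_abbreviations(text):
    """Replace abbreviated weekday and month names with their full forms."""
    return _PATTERN.sub(lambda match: _ABBREVIATIONS[match.group(1)], text)


sentence = (
    "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 "
    "due to track maintenance."
)
expanded = expand_date_abbreviations(sentence)
```

A plain lookup like this cannot tell "Sun" the weekday from "Sun" the star, which is one reason context-free rules are fragile for this category.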

Audio Samples:

| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |

Example 3: Phone Number

Text:

"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."

Challenges:

  • Area code in parentheses that should be read as separate digits
  • Phone number with hyphen separator (555-0142)
  • Abbreviated extension notation (ext.)
  • Extension number (402)
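As an illustration of the digit-by-digit reading described above, the following sketch rewrites this one phone-number layout into spoken form. It is not part of Supertonic: `expand_phone_number` is hypothetical, the pattern covers only the "(AAA) PPP-NNNN ext. X" shape, and reading the extension digit by digit is an assumption.

```python
import re

_DIGIT_WORDS = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]

# Area code in parentheses, a hyphenated local number, and an optional extension.
_PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})(?:\s*ext\.\s*(\d+))?")


def spell_digits(digits):
    """Read a digit string one digit at a time, e.g. "212" -> "two one two"."""
    return " ".join(_DIGIT_WORDS[int(char)] for char in digits)


def expand_phone_number(text):
    """Rewrite phone numbers as comma-separated groups of spoken digits."""
    def _replace(match):
        groups = [spell_digits(group) for group in match.group(1, 2, 3)]
        spoken = ", ".join(groups)
        if match.group(4):
            # Assumption: the extension is also read digit by digit.
            spoken += ", extension " + spell_digits(match.group(4))
        return spoken

    return _PHONE.sub(_replace, text)


sentence = "You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."
expanded = expand_phone_number(sentence)
```

The commas between groups stand in for the short pauses a speaker would normally insert.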

Audio Samples:

| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |

Example 4: Technical Unit

Text:

"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."

Challenges:

  • Decimal time duration with abbreviation (2.3h = two point three hours)
  • Speed unit with abbreviation (30kph = thirty kilometers per hour)
  • Technical abbreviations (h for hours, kph for kilometers per hour)
  • Technical/engineering context requiring proper pronunciation
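The unit expansions named above ("2.3h" and "30kph") can be sketched as a small suffix table. This is an illustrative assumption about what an explicit normalization rule would do, not a description of Supertonic's internals; `expand_units` and the two-entry table are hypothetical.

```python
import re

# Assumed unit table covering only the two abbreviations in this example.
_UNITS = {"h": "hours", "kph": "kilometers per hour"}

# Try longer suffixes first so that "kph" is never split at its trailing "h".
_ALTERNATION = "|".join(sorted(map(re.escape, _UNITS), key=len, reverse=True))
_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)(" + _ALTERNATION + r")\b")


def expand_units(text):
    """Rewrite number-plus-unit tokens such as "30kph" as "30 kilometers per hour"."""
    return _PATTERN.sub(
        lambda match: f"{match.group(1)} {_UNITS[match.group(2)]}", text
    )


sentence = "Our drone battery lasts 2.3h when flying at 30kph with full camera payload."
expanded = expand_units(sentence)
```

The numbers are left as digits here; turning "2.3" into "two point three" would be a separate step.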

Audio Samples:

| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |

Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.

Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

@article{kim2025supertonic,
  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025},
  url={https://arxiv.org/abs/2503.23108}
}

Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

@article{kim2025larope,
  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
  journal={arXiv preprint arXiv:2509.11084},
  year={2025},
  url={https://arxiv.org/abs/2509.11084}
}

Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

@article{kim2025spfm,
  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
  journal={arXiv preprint arXiv:2509.19091},
  year={2025},
  url={https://arxiv.org/abs/2509.19091}
}

Related Projects

🏠 Main Repository: github.com/supertone-inc/supertonic

🎧 Try it live: Hugging Face Spaces

🤗 Model Repository: Hugging Face Models

License

Code: MIT License

Model: OpenRAIL-M License

Copyright © 2025 Supertone Inc.
