Real-time voice assistant with Speech-to-Text, GPT, and Text-to-Speech (Soprano TTS)
🎙️ Real-Time Voice Assistant (STT + GPT + TTS)
This project is a local, streaming voice assistant that listens to your voice, transcribes it in real time (STT), generates an AI response using GPT, and speaks the answer back using SopranoTTS.
It's built from three key components:
- stt_server (`vocalyx-stt`) – handles real-time speech-to-text via WebSockets.
- tts_server (`vocalyx-tts`) – streams GPT responses and converts them to speech using SopranoTTS.
- client (`vocalyx`) – connects everything together: records your mic, shows live transcription, sends it to GPT, and plays back AI-generated voice.
📦 Install
pip install vocalyx
Or install from source:
git clone <repo-url>
cd Voice-to-Voice
pip install -e .
🧩 Requirements
1. Python
Make sure you have Python 3.9 through 3.12 installed.
2. System Dependencies
You'll need:
- `ffmpeg` (for audio handling)
- A working microphone and audio output
- `portaudio` (for PyAudio)
macOS
brew install portaudio ffmpeg
Ubuntu / Debian
sudo apt update
sudo apt install portaudio19-dev ffmpeg python3-pyaudio
Windows
- Install Python (make sure to add it to PATH)
- PyAudio binaries can be installed with:
pip install pipwin
pipwin install pyaudio
📦 Install Python Dependencies
Create a virtual environment (Python 3.12 recommended) and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Key Python Packages
- `RealtimeSTT` – real-time speech-to-text
- `openai` – GPT streaming responses
- `soprano-tts` – neural TTS engine
- `torch`, `numpy` – audio inference backend
- `pyaudio`, `sounddevice` – audio playback
- `python-dotenv` – environment variable loading
🔑 Environment Variables
Create a .env file in the project root with:
OPENAI_API_KEY=your_openai_api_key_here
Get your key from https://platform.openai.com/api-keys.
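As a rough illustration of what loading this file involves, here is a minimal standard-library sketch. The project itself lists `python-dotenv` for this job; `read_env_file` and `require_api_key` are hypothetical helper names used only for this example, and the parser handles just plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env_path=".env"):
    """Return OPENAI_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key and Path(env_path).is_file():
        key = read_env_file(env_path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message like this tends to be easier to debug than a silent "no response from GPT" later in the pipeline.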
⚙️ How It Works
End-to-end flow:
[Microphone Input]
↓
[client.py] → Sends audio to STT server
↓
[stt_server.py] → Transcribes speech in real time
↓
[client.py] → Sends text to TTS server
↓
[tts_server.py] → Streams GPT text + converts to speech (SopranoTTS)
↓
[client.py] → Plays AI voice audio live
Each component communicates over WebSockets:
- STT control channel: `ws://localhost:8011`
- STT data channel: `ws://localhost:8012`
- TTS channel: `ws://localhost:8013`
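Before starting the client, it can help to confirm that the servers are actually listening on those ports. The sketch below is not part of the package: it only checks that a plain TCP connection succeeds and does not perform a WebSocket handshake. The `ENDPOINTS` table and `port_open` helper are names invented for this example.

```python
import socket

# Ports taken from the list above; the labels are only for display.
ENDPOINTS = {
    "STT control": ("localhost", 8011),
    "STT data": ("localhost", 8012),
    "TTS": ("localhost", 8013),
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, (host, port) in ENDPOINTS.items():
        state = "listening" if port_open(host, port) else "not reachable"
        print(f"{name:12s} ws://{host}:{port}  {state}")
```

If a port shows as not reachable, the corresponding server has probably not been started or is bound to a different address.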
🔊 Audio Format (Important)
SopranoTTS streams raw float32 mono audio:
- Sample rate: `32000 Hz`
- Channels: `1`
- Format: `float32`
The client plays audio directly using paFloat32 without μ-law or int16 conversion. This ensures:
- Natural pitch
- Correct tempo
- No distortion
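The arithmetic behind this format can be sketched with the standard library alone. The constants come from the list above; `chunk_duration` and `decode_chunk` are hypothetical helpers for illustration, and native byte order is an assumption since the byte order is not stated here.

```python
from array import array

SAMPLE_RATE = 32000   # Hz, from the format above
CHANNELS = 1          # mono
BYTES_PER_SAMPLE = 4  # float32


def chunk_duration(num_bytes):
    """Seconds of audio held in a raw float32 mono chunk."""
    frames = num_bytes // (BYTES_PER_SAMPLE * CHANNELS)
    return frames / SAMPLE_RATE


def decode_chunk(data):
    """Interpret raw bytes as float32 samples in native byte order (an assumption)."""
    samples = array("f")
    samples.frombytes(data)
    return samples


chunk = array("f", [0.0, 0.5, -0.5, 1.0]).tobytes()
print(len(chunk))                       # 16 bytes for 4 samples
print(chunk_duration(SAMPLE_RATE * 4))  # 1.0
print(list(decode_chunk(chunk)))        # [0.0, 0.5, -0.5, 1.0]
```

One second of audio is therefore 128,000 bytes. Reading the same bytes as int16 would produce twice as many samples, which is one way playback ends up both slow and noisy.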
🚀 Running the System
You’ll need three terminals.
1️⃣ Start the STT Server
vocalyx-stt
Handles microphone audio and real-time transcription.
2️⃣ Start the TTS Server
vocalyx-tts
Streams GPT responses and converts them to speech using SopranoTTS.
3️⃣ Run the Client
vocalyx
The client:
- Captures microphone input
- Displays live transcription
- Sends prompts to GPT
- Plays streamed AI voice output
By default, it runs in continuous mode.
🗣️ Example Interaction
You:
What's a good way to stay focused today?
AI: (spoken + printed)
Try breaking your day into short focus sessions. Take a quick stretch between them.
⚙️ Optional Command-Line Arguments
You can tweak client.py behavior:
| Flag | Description | Default |
|---|---|---|
| `--tts-url` | TTS WebSocket server URL | `ws://localhost:8013` |
| `--post-silence` | Silence after each utterance | 1.0 |
| `--speech-end-detection` | Adaptive silence detection | off |
| `--debug` | Print debug logs | off |
| `--norealtime` | Disable live transcription display | off |
| `--list` | List microphone devices | off |
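For readers who want to see how flags like these map onto parsed values, here is a hypothetical `argparse` sketch that mirrors the table. It is not the package's actual parser: the argument types (a float for `--post-silence`, booleans for the on/off flags) are assumptions inferred from the defaults shown.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flag table; not the package's real code."""
    p = argparse.ArgumentParser(prog="vocalyx")
    p.add_argument("--tts-url", default="ws://localhost:8013",
                   help="TTS WebSocket server URL")
    p.add_argument("--post-silence", type=float, default=1.0,
                   help="Silence after each utterance")
    p.add_argument("--speech-end-detection", action="store_true",
                   help="Adaptive silence detection")
    p.add_argument("--debug", action="store_true", help="Print debug logs")
    p.add_argument("--norealtime", action="store_true",
                   help="Disable live transcription display")
    p.add_argument("--list", action="store_true", help="List microphone devices")
    return p


args = build_parser().parse_args(["--post-silence", "0.6", "--debug"])
print(args.tts_url, args.post_silence, args.debug)
```

Flags that are "off" by default are modelled as `store_true` switches, so passing the flag with no value turns the feature on.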
List audio devices:
vocalyx --list
Select a specific mic:
vocalyx -i 2
🧠 Notes
- SopranoTTS is initialized once and reused for all requests.
- GPT responses are streamed sentence-by-sentence to minimize latency.
- Audio is streamed and played in near real time.
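The sentence-by-sentence idea can be sketched as a small generator that buffers incoming text chunks and emits each sentence as soon as it is complete, so speech synthesis can start before the full reply has arrived. This is an illustrative sketch, not the project's implementation: `sentences_from_stream` is a hypothetical name, and splitting on `.`, `!`, or `?` followed by whitespace is a simplifying assumption that will mis-split abbreviations.

```python
import re

# Split after ., ! or ? when followed by whitespace (a simplifying assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks):
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = [
    "Try breaking your day ",
    "into short focus sessions. Take a ",
    "quick stretch between them.",
]
for s in sentences_from_stream(stream):
    print(s)
```

With the example stream, the first sentence is available after the second chunk, while the rest of the reply is still arriving.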
🧹 Troubleshooting
Audio sounds distorted or slow
- Ensure client playback uses `paFloat32` at `32000 Hz`.
- Do not apply μ-law or int16 conversion.
No response from GPT
- Verify `OPENAI_API_KEY` in `.env`.
- Check internet connectivity.
STT not transcribing
- Ensure `RealtimeSTT` is installed correctly.
- Verify microphone index using `--list`.
🧾 License
This project is for personal and educational use.
💡 Future Improvements
- VAD-based auto start/stop for more natural conversations
- Opus/WebRTC streaming for browser clients
- GUI frontend for controlling STT/TTS parameters
- Interruptible (barge-in) speech handling
🏁 Summary
# Terminal 1
vocalyx-stt
# Terminal 2
vocalyx-tts
# Terminal 3
vocalyx