Real-time voice assistant with Speech-to-Text, GPT, and Text-to-Speech (Soprano TTS)

🎙️ Real-Time Voice Assistant (STT + GPT + TTS)

This project is a local, streaming voice assistant that listens to your voice, transcribes it in real time (STT), generates an AI response using GPT, and speaks the answer back using SopranoTTS.

It's built from three key components:

  1. stt_server (vocalyx-stt) – handles real-time speech-to-text via WebSockets.
  2. tts_server (vocalyx-tts) – streams GPT responses and converts them to speech using SopranoTTS.
  3. client (vocalyx) – connects everything together: records your mic, shows live transcription, sends it to GPT, and plays back AI-generated voice.

📦 Install

pip install vocalyx

Or install from source:

git clone <repo-url>
cd Voice-to-Voice
pip install -e .

🧩 Requirements

1. Python

Make sure you have Python 3.9 through 3.12 installed.

2. System Dependencies

You'll need:

  • ffmpeg (for audio handling)
  • A working microphone and audio output
  • portaudio (for PyAudio)

macOS

brew install portaudio ffmpeg

Ubuntu / Debian

sudo apt update
sudo apt install portaudio19-dev ffmpeg python3-pyaudio

Windows

  • Install Python (make sure to add it to PATH)
  • PyAudio binaries can be installed with:
pip install pipwin
pipwin install pyaudio

📦 Install Python Dependencies

Create a virtual environment (Python 3.12 recommended) and install dependencies:

python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt

Key Python Packages

  • RealtimeSTT – real-time speech-to-text
  • openai – GPT streaming responses
  • soprano-tts – neural TTS engine
  • torch, numpy – audio inference backend
  • pyaudio, sounddevice – audio capture and playback
  • python-dotenv – environment variable loading

🔑 Environment Variables

Create a .env file in the project root with:

OPENAI_API_KEY=your_openai_api_key_here

Get your key from https://platform.openai.com/api-keys.
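
A minimal sketch of a startup check for the key, assuming the variable has already been loaded into the process environment (the project lists python-dotenv for that step). The helper name require_api_key is hypothetical and is not part of vocalyx.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from env, or raise with a readable message.

    Hypothetical helper for illustration; not part of the vocalyx package.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # Treat the placeholder from the .env template as "not set".
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing; set it in .env")
    return key
```

Failing early like this gives a clearer error than a rejected request later in the GPT call.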


⚙️ How It Works

End-to-end flow:

[Microphone Input]
        ↓
 [client.py] → Sends audio to STT server
        ↓
 [stt_server.py] → Transcribes speech in real time
        ↓
 [client.py] → Sends text to TTS server
        ↓
 [tts_server.py] → Streams GPT text + converts to speech (SopranoTTS)
        ↓
 [client.py] → Plays AI voice audio live

Each component communicates over WebSockets:

  • STT control channel: ws://localhost:8011
  • STT data channel: ws://localhost:8012
  • TTS channel: ws://localhost:8013
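
Before starting the client, it can help to confirm that both servers are listening. The sketch below only tests whether each port accepts a TCP connection; it does not speak the WebSocket protocol. The function names and channel labels are illustrative assumptions; only the port numbers come from the list above.

```python
import socket

# Ports taken from the channel list above; the labels are only for display.
CHANNELS = {
    "STT control": 8011,
    "STT data": 8012,
    "TTS": 8013,
}


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_channels(host: str = "localhost") -> dict:
    """Map each channel label to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in CHANNELS.items()}


if __name__ == "__main__":
    for name, ok in check_channels().items():
        print(f"{name}: {'listening' if ok else 'not reachable'}")
```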

🔊 Audio Format (Important)

SopranoTTS streams raw float32 mono audio:

  • Sample rate: 32000 Hz
  • Channels: 1
  • Format: float32

The client plays audio directly using paFloat32 without μ-law or int16 conversion. This ensures:

  • Natural pitch
  • Correct tempo
  • No distortion
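
To see why the format matters, here is a small standard-library sketch of the byte arithmetic for this stream. It assumes native-endian float32 samples (the byte order is not stated above), and the helper names are illustrative, not part of vocalyx.

```python
import array
import math

SAMPLE_RATE = 32000   # Hz, as stated above
CHANNELS = 1          # mono
BYTES_PER_SAMPLE = 4  # float32


def decode_float32(chunk: bytes) -> array.array:
    """Interpret raw bytes as native-endian float32 samples (assumed byte order)."""
    samples = array.array("f")
    samples.frombytes(chunk)
    return samples


def duration_seconds(chunk: bytes, rate: int = SAMPLE_RATE) -> float:
    """Playback length of a raw mono float32 chunk at the given rate."""
    return len(chunk) / (BYTES_PER_SAMPLE * CHANNELS * rate)


# One second of a 440 Hz tone, packed the way a float32 stream would carry it.
tone = array.array(
    "f",
    (0.2 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)),
)
raw = tone.tobytes()

print(len(raw))                      # 128000 bytes per second of audio
print(duration_seconds(raw))         # 1.0 at the correct rate
print(duration_seconds(raw, 16000))  # 2.0: at half the rate, playback takes twice as long
```

Opening the output device at a different rate stretches or compresses the same bytes in time, which is heard as a change in pitch and tempo.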

🚀 Running the System

You’ll need three terminals.

1️⃣ Start the STT Server

vocalyx-stt

Handles microphone audio and real-time transcription.


2️⃣ Start the TTS Server

vocalyx-tts

Streams GPT responses and converts them to speech using SopranoTTS.


3️⃣ Run the Client

vocalyx

The client:

  • Captures microphone input
  • Displays live transcription
  • Sends prompts to GPT
  • Plays streamed AI voice output

By default, it runs in continuous mode.


🗣️ Example Interaction

You:

What's a good way to stay focused today?

AI: (spoken + printed)

Try breaking your day into short focus sessions. Take a quick stretch between them.


⚙️ Optional Command-Line Arguments

You can tweak client.py behavior:

Flag                     Description                         Default
--tts-url                TTS WebSocket server URL            ws://localhost:8013
--post-silence           Silence after each utterance        1.0
--speech-end-detection   Adaptive silence detection          off
--debug                  Print debug logs                    off
--norealtime             Disable live transcription display  off
--list                   List microphone devices             off

List audio devices:

vocalyx --list

Select a specific mic by its device index (as shown by --list):

vocalyx -i 2
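
For readers who want to script around the client, the flags above could be declared with argparse roughly as follows. This is a hypothetical reconstruction from the table, not the actual client source; the dest name input_device and the int type for -i are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags; not the real client code."""
    p = argparse.ArgumentParser(prog="vocalyx")
    p.add_argument("--tts-url", default="ws://localhost:8013",
                   help="TTS WebSocket server URL")
    p.add_argument("--post-silence", type=float, default=1.0,
                   help="Silence after each utterance")
    p.add_argument("--speech-end-detection", action="store_true",
                   help="Adaptive silence detection")
    p.add_argument("--debug", action="store_true", help="Print debug logs")
    p.add_argument("--norealtime", action="store_true",
                   help="Disable live transcription display")
    p.add_argument("--list", action="store_true", help="List microphone devices")
    # -i appears in the usage example above; dest name and int type are assumptions.
    p.add_argument("-i", dest="input_device", type=int, default=None,
                   help="Microphone device index")
    return p


args = build_parser().parse_args(["-i", "2", "--debug"])
print(args.input_device, args.debug, args.tts_url)  # 2 True ws://localhost:8013
```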

🧠 Notes

  • SopranoTTS is initialized once and reused for all requests.
  • GPT responses are streamed sentence-by-sentence to minimize latency.
  • Audio is streamed and played in near real time.
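
The sentence-by-sentence streaming mentioned above can be illustrated with a small buffer that releases text as soon as a sentence boundary arrives. This is a sketch of the general technique, not the project's implementation; the function name and the boundary rule are assumptions.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Deliberately simple:
# abbreviations such as "e.g. " will split early.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


chunks = ["Try breaking your day ", "into short focus sessions. Take a quick ",
          "stretch between them."]
for sentence in sentences_from_stream(chunks):
    print(sentence)
```

Each yielded sentence can go to the TTS engine while later text is still arriving, so speech can begin before the full reply has been generated.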

🧹 Troubleshooting

Audio sounds distorted or slow

  • Ensure client playback uses paFloat32 at 32000 Hz.
  • Do not apply μ-law or int16 conversion.

No response from GPT

  • Verify OPENAI_API_KEY in .env.
  • Check internet connectivity.

STT not transcribing

  • Ensure RealtimeSTT is installed correctly.
  • Verify microphone index using --list.

🧾 License

This project is for personal and educational use.


💡 Future Improvements

  • VAD-based auto start/stop for more natural conversations
  • Opus/WebRTC streaming for browser clients
  • GUI frontend for controlling STT/TTS parameters
  • Interruptible (barge-in) speech handling

🏁 Summary

# Terminal 1
vocalyx-stt

# Terminal 2
vocalyx-tts

# Terminal 3
vocalyx
