A modular voice-conversation pipeline using MLX.
# Project Roadmap & Checklist
## ✅ Phase 0: Component Sanity Check
- LLM: Validate `llama.cpp` with a GGUF model on M4 Metal.
- STT: Validate `mlx-whisper` with a local `large-v3` model.
- TTS: Validate `mlx-audio` with the `Kokoro-82M` model.
- Methodology: Establish the R&D framework (whitepaper, roadmap, journal, README).
## ✅ Phase 1: The Linear "Dumb Pipe" (Completed)
- All tasks complete.
- Outcome: Established a baseline "dead air" latency of ~6 seconds. Proved the synchronous model is not viable for real-time conversation.
## 🔲 Phase 2: Introducing Asynchrony and Streaming (Current Phase)
- Task 2.1: Refactor to a Persistent Server Architecture. Convert `pipeline_v1.py` into a long-running application where models are loaded only once at startup.
- Task 2.2: Implement LLM Token Streaming. Modify the `LLMEngine` to `yield` tokens as they are generated, rather than returning the full response at the end.
- Task 2.3: Implement "First-Chunk" TTS. Modify the `TextToSpeechEngine` to accept a stream of text. It will buffer text until it forms a complete sentence, synthesize that chunk of audio, and immediately send it for playback.
- Task 2.4: The Asynchronous Orchestrator. Replace the linear `if __name__ == "__main__"` block with an `asyncio` event loop. Use `asyncio.Queue` to create non-blocking pipes between the STT, LLM, and TTS components.
- Task 2.5: Implement Streaming Audio I/O (the `PyAudio` task). Refactor the `AudioRecorder` to process audio in small chunks and implement basic VAD (voice activity detection) to detect the end of speech.
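Tasks 2.2–2.4 can be sketched end to end with `asyncio.Queue` pipes. Everything below is a stand-in: `fake_llm_stream` plays the role of the token-yielding `LLMEngine`, and `tts_chunker` does only the sentence buffering from Task 2.3 (no actual synthesis).

```python
import asyncio

SENTENCE_END = (".", "!", "?")

async def fake_llm_stream(out_q: asyncio.Queue) -> None:
    # Stand-in for LLMEngine (Task 2.2): emits tokens one at a time.
    for token in ["Hello", " there", ".", " How", " are", " you", "?"]:
        await out_q.put(token)
    await out_q.put(None)  # end-of-stream sentinel

async def tts_chunker(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    # Task 2.3: buffer tokens until a full sentence forms, then emit it.
    buf = ""
    while (token := await in_q.get()) is not None:
        buf += token
        if token.endswith(SENTENCE_END):
            await out_q.put(buf.strip())
            buf = ""
    if buf.strip():               # flush a trailing partial sentence
        await out_q.put(buf.strip())
    await out_q.put(None)

async def demo_pipeline() -> list:
    # Task 2.4: queues act as non-blocking pipes between the stages.
    tokens, sentences = asyncio.Queue(), asyncio.Queue()
    producers = [asyncio.create_task(fake_llm_stream(tokens)),
                 asyncio.create_task(tts_chunker(tokens, sentences))]
    spoken = []
    while (sentence := await sentences.get()) is not None:
        spoken.append(sentence)   # a real system would synthesize here
    await asyncio.gather(*producers)
    return spoken

print(asyncio.run(demo_pipeline()))  # ['Hello there.', 'How are you?']
```

The key property: the first sentence is ready for synthesis while later tokens are still being generated, which is exactly what attacks the ~6-second dead air from Phase 1.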
- Phase 3A: The Software-First State Machine. We perfect all turn-taking logic using explicit, software-based triggers.
- Phase 3B: Real-Time Audio & VAD Integration. We swap our software triggers for a real-time audio stream and a VAD, now with a robust state machine ready to handle its output.
## Phase 3A: Building the Core Conversational State Machine
Objective: To create a state-driven agent that understands states like `LISTENING`, `THINKING`, and `SPEAKING` and can handle interruptions, using simple keyboard inputs as triggers instead of a live microphone.
- Task 3A.1: Architect the State Machine.
  - Action: Formally define the agent's states (`IDLE`, `LISTENING`, `PROCESSING`, `SPEAKING`, `WAITING_FOR_USER`). We will draw this out and define the exact conditions that trigger transitions between states (e.g., transition from `SPEAKING` to `LISTENING` if a "barge-in" event occurs).
  - Tool: We can use a simple `Enum` for the states. This is a pure software design task.
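A minimal sketch of what Task 3A.1 could look like in code. The event names (`key_press`, `barge_in`, etc.) and the transition table are illustrative assumptions, not a fixed design:

```python
from enum import Enum, auto

class AgentState(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()
    WAITING_FOR_USER = auto()

# (state, event) -> next state. Event names are placeholders.
TRANSITIONS = {
    (AgentState.IDLE, "key_press"): AgentState.LISTENING,
    (AgentState.LISTENING, "key_release"): AgentState.PROCESSING,
    (AgentState.PROCESSING, "response_ready"): AgentState.SPEAKING,
    (AgentState.SPEAKING, "playback_done"): AgentState.WAITING_FOR_USER,
    (AgentState.SPEAKING, "barge_in"): AgentState.LISTENING,
    (AgentState.WAITING_FOR_USER, "key_press"): AgentState.LISTENING,
}

def next_state(state: AgentState, event: str) -> AgentState:
    # Failing loudly on an undeclared transition surfaces logic bugs early.
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} while {state}") from None

print(next_state(AgentState.SPEAKING, "barge_in"))  # AgentState.LISTENING
```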
- Task 3A.2: Implement the "Push-to-Talk" Trigger.
  - Action: Replace `input("Press Enter to start recording...")` with a more interactive loop. We'll use a library like `pynput` or `keyboard` to detect a key press and hold.
  - User experience: "Hold the spacebar to talk."
  - State transitions: `IDLE` -> `key_press` -> `LISTENING`; `LISTENING` -> `key_release` -> `PROCESSING`.
  - Benefit: This gives us fine-grained control over the start and end of an utterance, perfectly simulating what a VAD would do, but without any of the acoustic complexity.
- Task 3A.3: Implement Software-Based "Barge-In".
  - Action: While the agent is in the `SPEAKING` state (i.e., our `AudioPlayer` is active), we will listen for another key press (e.g., the spacebar again).
  - State transition: `SPEAKING` -> `key_press` -> `LISTENING`.
  - Core logic: When this transition occurs, the orchestrator must immediately:
    - Send a signal to the `AudioPlayer` to stop the current playback.
    - Cancel any pending TTS tasks.
    - Clear any buffered text from the LLM stream.
    - Begin recording the new user utterance.
  - Benefit: We will build and perfect the entire complex interruption logic in a 100% reproducible software environment.
- Task 3A.4: Refactor the `AudioPlayer` for Interruptibility.
  - Action: Our current `AudioPlayer` is not designed to be stopped mid-playback. We will modify it to support a `stop_current_playback()` method. This will likely involve using an `asyncio.Event` that the playback loop can check.
  - Benefit: This creates a crucial, reusable software primitive for controlling the agent's voice.
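One way Task 3A.4 could look, assuming the playback loop is an `asyncio` coroutine. The `played` list stands in for samples written to a real output device:

```python
import asyncio

class AudioPlayer:
    """Plays a sequence of audio chunks; playback can be cut off mid-stream."""

    def __init__(self) -> None:
        self._stop = asyncio.Event()
        self.played = []  # stand-in for writes to the output device

    async def play(self, chunks, seconds_per_chunk: float = 0.01) -> None:
        self._stop.clear()
        for chunk in chunks:
            if self._stop.is_set():
                break                               # barge-in: abandon the rest
            self.played.append(chunk)
            await asyncio.sleep(seconds_per_chunk)  # simulated device write

    def stop_current_playback(self) -> None:
        self._stop.set()

async def demo_playback() -> int:
    player = AudioPlayer()
    task = asyncio.create_task(player.play(range(100)))
    await asyncio.sleep(0.035)        # a few chunks get out...
    player.stop_current_playback()    # ...then a barge-in arrives
    await task
    return len(player.played)

print(asyncio.run(demo_playback()))  # a small count, far below 100
```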
## Phase 3B: Integrating Real-World Audio
Objective: To replace our software-based triggers (Push-to-Talk, Barge-In Key Press) with a live, continuous audio stream and a real VAD.
- Task 3B.1: Implement Continuous Audio Streaming.
  - Action: Refactor `AudioRecorder` to use a non-blocking stream (now is the time for `PyAudio` or `sounddevice`'s stream API). It will continuously read small chunks of audio from the microphone and place them into an `asyncio.Queue`.
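The subtle part of Task 3B.1 is that `sounddevice` and `PyAudio` invoke their stream callbacks on a driver thread, not on the event loop, so chunks must be handed over with `loop.call_soon_threadsafe`. The sketch below simulates the driver thread; in the real `AudioRecorder`, the same `on_chunk` callback would be fed from the stream's callback:

```python
import asyncio
import threading

def make_chunk_handler(loop: asyncio.AbstractEventLoop, queue: asyncio.Queue):
    """Build a callback that is safe to call from the audio driver's thread."""
    def on_chunk(chunk) -> None:
        # asyncio.Queue is not thread-safe; hop onto the loop's thread first.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)
    return on_chunk

async def demo_capture() -> int:
    loop = asyncio.get_running_loop()
    audio_q: asyncio.Queue = asyncio.Queue()
    on_chunk = make_chunk_handler(loop, audio_q)

    def fake_driver_thread() -> None:
        # A real stream callback would pass actual microphone chunks here.
        for i in range(5):
            on_chunk(bytes([i]) * 320)  # pretend 320-byte audio chunks
        on_chunk(None)                  # end-of-stream sentinel

    threading.Thread(target=fake_driver_thread, daemon=True).start()

    received = 0
    while await audio_q.get() is not None:
        received += 1
    return received

print(asyncio.run(demo_capture()))  # 5
```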
- Task 3B.2: Integrate a VAD Model.
  - Action: Create a new `VADProcessor` task. It will consume audio chunks from the queue created in 3B.1. We'll use a lightweight, high-performance VAD like `silero-vad`.
  - Output: The VAD task will not output audio; it will output events: `SPEECH_STARTED`, `SPEECH_ENDED`.
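Whatever VAD model is used, per-chunk speech probabilities still have to be turned into the `SPEECH_STARTED`/`SPEECH_ENDED` events the state machine consumes. A sketch of that event-izing layer, with made-up probabilities standing in for `silero-vad` output; the `hangover` parameter (how many silent chunks must pass before an utterance ends) is an assumption to be tuned:

```python
def vad_events(probs, threshold=0.5, hangover=3):
    """Turn per-chunk speech probabilities into start/end events.

    `hangover` silent chunks must pass before SPEECH_ENDED fires, so a
    short pause inside an utterance doesn't split it in two.
    """
    events, speaking, silent = [], False, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            silent = 0
            if not speaking:
                speaking = True
                events.append(("SPEECH_STARTED", i))
        elif speaking:
            silent += 1
            if silent >= hangover:
                speaking = False
                events.append(("SPEECH_ENDED", i))
    return events

# Made-up probabilities; the brief dip at index 3 does not end the utterance.
print(vad_events([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]))
# [('SPEECH_STARTED', 1), ('SPEECH_ENDED', 7)]
```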
- Task 3B.3: Connect VAD Events to the State Machine.
  - Action: This is the final, elegant step. We replace our keyboard listeners from Phase 3A.
  - State transitions: `IDLE` -> `SPEECH_STARTED` event -> `LISTENING`; `LISTENING` -> `SPEECH_ENDED` event -> `PROCESSING`; `SPEAKING` -> `SPEECH_STARTED` event -> `BARGE_IN` -> `LISTENING`.
## Phase 3C: Barge-In Interruption and Echo Cancellation
This is the final step to make the conversation feel truly natural.
The Plan:
- Track the Agent's Speech: The `VoiceAgent` needs to know what it is currently saying. When the `tts_consumer` synthesizes a sentence, we will store that sentence in a new state variable, e.g., `self.currently_speaking_text`.
- Implement Software Echo Cancellation (`is_echo`): We will create a new method `is_echo(self, user_text: str) -> bool`. This method will compare the incoming `user_text` (from the STT) with `self.currently_speaking_text`.
  - A simple, robust first version can use a normalized string similarity metric. For example, convert both strings to lowercase, remove punctuation, and check if one is a substring of the other or if their Levenshtein distance is very small.
- Implement the Interruption Handler (`handle_barge_in`): This is the core logic. When an utterance is detected during the `SPEAKING` state and our `is_echo` function returns `False`, we trigger the interruption.
  - Action 1: Silence the Agent. Immediately call a new `self.player.interrupt()` method. This method needs to clear the player's queue and stop the current playback instantly.
  - Action 2: Cancel the Cognitive Pipeline. The current `run_pipeline` task must be cancelled. We can get a handle to it (e.g., `self.processing_task`) and call `self.processing_task.cancel()`. This will stop the LLM and TTS from generating any more of the old response.
  - Action 3: Start the New Turn. Immediately start a new `run_pipeline` task with the new, interrupting user text.
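Actions 2 and 3 hinge on `asyncio` task cancellation. A stripped-down sketch (Action 1, the `player.interrupt()` call, is omitted, and `asyncio.sleep` stands in for the LLM/TTS work):

```python
import asyncio

class VoiceAgent:
    def __init__(self) -> None:
        self.processing_task = None
        self.log = []  # records events, purely for illustration

    async def run_pipeline(self, user_text: str) -> None:
        try:
            self.log.append(f"thinking about: {user_text}")
            await asyncio.sleep(10)  # stands in for LLM + TTS work
            self.log.append(f"finished: {user_text}")
        except asyncio.CancelledError:
            self.log.append(f"cancelled: {user_text}")
            raise  # always re-raise so cancellation propagates cleanly

    async def handle_barge_in(self, new_text: str) -> None:
        # Action 2: cancel the old cognitive pipeline and wait for it to die.
        if self.processing_task and not self.processing_task.done():
            self.processing_task.cancel()
            try:
                await self.processing_task
            except asyncio.CancelledError:
                pass
        # Action 3: start the new turn with the interrupting utterance.
        self.processing_task = asyncio.create_task(self.run_pipeline(new_text))

async def demo_barge_in() -> list:
    agent = VoiceAgent()
    agent.processing_task = asyncio.create_task(agent.run_pipeline("old question"))
    await asyncio.sleep(0.01)               # old turn is mid-"thought"
    await agent.handle_barge_in("new question")
    await asyncio.sleep(0.01)               # let the new turn start
    agent.processing_task.cancel()          # tidy up the demo
    try:
        await agent.processing_task
    except asyncio.CancelledError:
        pass
    return agent.log

print(asyncio.run(demo_barge_in()))
```

Re-raising `CancelledError` inside `run_pipeline` matters: swallowing it would let a half-cancelled old turn keep running alongside the new one.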
- Task 3B.4: The Acoustic Echo Problem (Now a Manageable Task).
  - Action: At this point, the system will work perfectly with a headset. Without one, it will hear itself and barge in constantly. Now we can tackle this.
  - Solutions (to be explored):
    - Software AEC: Integrate a software-based echo cancellation library (e.g., `webrtc-audio-processing`, `speexdsp`).
    - "Duck" and Mute: A simpler approach. When the agent is `SPEAKING`, we can programmatically mute the microphone input or instruct the VAD to ignore any detected speech. This is less elegant but highly effective.
## The Path Forward: Phase 4 - The Production-Grade Refactor

This architectural review gives us a crystal-clear roadmap for our next phase. We will stop adding new features and focus on hardening and optimizing the incredible system we've already built.

Phase 4 Checklist:
- Task 4.1: In-Memory Audio Streaming. Action: Refactor the `tts_consumer` and `AudioPlayer` to pass audio data as in-memory NumPy arrays, completely eliminating the TTS `.wav` file I/O. (This should be our very next task.)
- Task 4.2: Atomic State Transitions. Action: Introduce an `asyncio.Lock` to the `ConversationManager` and protect all `self.state` modifications.
- Task 4.3: Robust Error Handling. Action: Implement more granular error handling within the `run_pipeline` method, with specific verbal error messages.
- Task 4.4: Resource Cleanup. Action: If we stick with any file I/O, ensure temporary files are deleted. Implement robust signal handling for graceful shutdown.
- Task 4.5 (Research Spike): Single VAD System. Action: Create a branch to experiment with removing WebRTC VAD and using only Silero. Measure CPU impact and determine whether the simplification is worth it.
- Task 4.6 (Research Spike): Advanced Barge-In. Action: Brainstorm and prototype a more advanced `is_echo` function, potentially using acoustic features.
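For Task 4.2, the useful pattern is a compare-and-swap transition guarded by the lock, so two coroutines reacting to near-simultaneous events can't both claim the same transition. A sketch, with an illustrative three-state enum rather than the project's real state set:

```python
import asyncio
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    SPEAKING = auto()

class ConversationManager:
    def __init__(self) -> None:
        self.state = State.IDLE
        self._state_lock = asyncio.Lock()

    async def transition(self, expected: State, new: State) -> bool:
        # Compare-and-swap: move to `new` only if we are still in `expected`.
        # The lock makes the check and the write a single atomic step.
        async with self._state_lock:
            if self.state is expected:
                self.state = new
                return True
            return False

async def demo_transition():
    mgr = ConversationManager()
    # Two events race to claim the same IDLE -> LISTENING transition.
    results = await asyncio.gather(
        mgr.transition(State.IDLE, State.LISTENING),
        mgr.transition(State.IDLE, State.LISTENING),
    )
    return sorted(results), mgr.state

print(asyncio.run(demo_transition()))  # ([False, True], <State.LISTENING: 2>)
```

Exactly one caller wins the race; the loser gets `False` back and can simply drop its stale event.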
## TODO
### Speech-to-Text (STT) Engine

- Streaming Transcription: For real-time feedback, a streaming model that provides partial transcripts as the user speaks is the gold standard. This dramatically improves the perception of speed.
- Metadata Generation: This is where we can innovate. The STT shouldn't just output text. It could also output:
  - Word-level timestamps: Crucial for understanding timing and for enabling features like real-time visual feedback.
  - Acoustic embeddings/prosody features: Capturing the way something was said (tone, pitch, energy). This data is invaluable for the LLM and TTS to generate a more contextually appropriate response.
### Audio I/O

- Noise Reduction & Normalization: Cleaning the audio signal before it hits the STT engine to improve accuracy.
- Speaker Output Management: Handling the playback of the synthesized TTS audio, including managing potential overlaps or interruptions.
### The Core Orchestrator & State Manager

- State Machine Management: Tracking the system's state (e.g., LISTENING, THINKING, SPEAKING, IDLE).
- Turn-Taking and Interruption Handling: Deciding when the LLM should process input and, critically, allowing the user to interrupt the TTS playback (barge-in). This is a hallmark of a natural-feeling system.
- Context Aggregation: Gathering information from all components (the text from the STT, the emotional cues from its acoustic metadata, the conversation history) and formatting it into a coherent prompt for the LLM.
- Dispatching Commands: Directing the LLM output to the TTS engine or potentially other system functions (e.g., running a script, calling an API).
### Large Language Model (LLM) Engine

- Streaming Generation: Generating the response token by token, rather than all at once. This allows the TTS to start speaking before the LLM has finished its entire thought, drastically reducing perceived latency.
### Text-to-Speech (TTS) Engine

- Prosody Mirroring: The TTS engine should be able to take the original prosody metadata from the user's speech (via the STT and Orchestrator) and mirror it in its own output. If the user sounds inquisitive, the response should sound inquisitive. This creates an empathetic feedback loop. It could also take explicit prosody instructions from the LLM (e.g., `[start_excited] That's a great idea! [end_excited]`).
## File details

Details for the file `voicechain-0.1.0.tar.gz`.

### File metadata

- Download URL: voicechain-0.1.0.tar.gz
- Upload date:
- Size: 20.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e93445b37ea7d2401b4e7173cfd73b72e23d174e661ce382d30676ccc0ca6436` |
| MD5 | `5e65666fa49098a819c4344c94481ef9` |
| BLAKE2b-256 | `68b24587ed8c6324cb005ab61df576a296c50a9236e306032b72d228a35df2b0` |
## File details

Details for the file `voicechain-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: voicechain-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `0311b6be8911318fb383139cedeb12915476adb4dc49a589d2e6e4bacca63fd0` |
| MD5 | `f552c2eaa5ae7ba88d7c50732737ff31` |
| BLAKE2b-256 | `06f1db2d3e90f4eca680b4d082c4bb008bac318d1fc4abed28242722d7b726fa` |