
A modular, voice conversation pipeline using MLX.

Project description

Project Roadmap & Checklist

Phase 0: Component Sanity Check (Completed)

  • LLM: Validate llama.cpp with a GGUF model on M4 Metal.
  • STT: Validate mlx-whisper with a local large-v3 model.
  • TTS: Validate mlx-audio with the Kokoro-82M model.
  • Methodology: Establish the R&D Framework (Whitepaper, Roadmap, Journal, README).

Phase 1: The Linear "Dumb Pipe" (Completed)

  • All tasks complete.
  • Outcome: Established a baseline "dead air" latency of ~6 seconds. Proved the synchronous model is not viable for real-time conversation.

🔲 Phase 2: Introducing Asynchrony and Streaming (Current Phase)

  • Task 2.1: Refactor to a Persistent Server Architecture. Convert pipeline_v1.py into a long-running application where models are loaded only once at startup.

  • Task 2.2: Implement LLM Token Streaming. Modify the LLMEngine to yield tokens as they are generated, rather than returning the full response at the end.

  • Task 2.3: Implement "First-Chunk" TTS. Modify the TextToSpeechEngine to accept a stream of text. It will buffer text until it forms a complete sentence, synthesize that chunk of audio, and immediately send it for playback.

  • Task 2.4: The Asynchronous Orchestrator. Replace the linear if __name__ == "__main__" block with an asyncio event loop. Use asyncio.Queue to create non-blocking pipes between the STT, LLM, and TTS components (a minimal sketch of Tasks 2.2-2.4 follows this list).

  • Task 2.5: Implement Streaming Audio I/O. (The PyAudio task). Refactor the AudioRecorder to process audio in small chunks and implement a basic VAD (Voice Activity Detection) to detect the end of speech.
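To make Tasks 2.2-2.4 concrete, here is a minimal, runnable sketch of the streaming pipeline. The llm_stream generator and the string-based "synthesis" are stand-ins for the real LLMEngine and TextToSpeechEngine calls, and the sentence-boundary regex is deliberately naive:

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]$")  # naive boundary; breaks on "Mr." etc.

async def llm_stream(prompt: str):
    # Stand-in for a streaming LLMEngine (Task 2.2): yields tokens one by one.
    for word in f"One sentence about {prompt}. Then another one!".split():
        yield word + " "
        await asyncio.sleep(0.02)

async def llm_producer(prompt: str, text_q: asyncio.Queue) -> None:
    async for token in llm_stream(prompt):
        await text_q.put(token)
    await text_q.put(None)  # end-of-stream sentinel

async def tts_consumer(text_q: asyncio.Queue, audio_q: asyncio.Queue) -> None:
    # Task 2.3: buffer tokens until a sentence boundary, then "synthesize" that chunk.
    buffer = ""
    while (token := await text_q.get()) is not None:
        buffer += token
        if SENTENCE_END.search(buffer.strip()):
            await audio_q.put(buffer.strip())  # real code: tts.synthesize(buffer)
            buffer = ""
    if buffer.strip():
        await audio_q.put(buffer.strip())
    await audio_q.put(None)

async def playback(audio_q: asyncio.Queue) -> None:
    while (chunk := await audio_q.get()) is not None:
        print(f"[playing] {chunk}")  # real code: hand audio to the AudioPlayer

async def run_turn(prompt: str) -> None:
    # Task 2.4: asyncio.Queues act as non-blocking pipes between the stages.
    text_q, audio_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        llm_producer(prompt, text_q),
        tts_consumer(text_q, audio_q),
        playback(audio_q),
    )

asyncio.run(run_turn("streaming"))
```

The first sentence is queued for playback while the second is still being generated, which is exactly where the perceived-latency win comes from.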

Phase 3 is split into two sub-phases:

  • Phase 3A: The Software-First State Machine. We perfect all turn-taking logic using explicit, software-based triggers.

  • Phase 3B: Real-Time Audio & VAD Integration. We swap our software triggers for a real-time audio stream and a VAD, with a robust state machine now ready to handle its output.

Phase 3A: Building the Core Conversational State Machine

Objective: To create a state-driven agent that understands concepts like LISTENING, THINKING, and SPEAKING and can handle interruptions, using simple keyboard inputs as triggers instead of a live microphone.

  • Task 3A.1: Architect the State Machine.

    • Action: Formally define the agent's states (IDLE, LISTENING, PROCESSING, SPEAKING, WAITING_FOR_USER). We will draw this out and define the exact conditions that trigger transitions between states (e.g., transition from SPEAKING to LISTENING if a "barge-in" event occurs).
    • Tool: We can use a simple Enum for the states (sketched after this task list, alongside an interruptible player). This is a pure software design task.
  • Task 3A.2: Implement the "Push-to-Talk" Trigger.

    • Action: Replace input("Press Enter to start recording...") with a more interactive loop. We'll use a library like pynput or keyboard to detect a key press and hold.
    • User Experience: "Hold the spacebar to talk."
    • State Transition: IDLE -> key_press -> LISTENING. LISTENING -> key_release -> PROCESSING.
    • Benefit: This gives us fine-grained control over the start and end of an utterance, perfectly simulating what a VAD would do, but without any of the acoustic complexity.
  • Task 3A.3: Implement Software-Based "Barge-In".

    • Action: While the agent is in the SPEAKING state (i.e., our AudioPlayer is active), we will listen for another key press (e.g., the spacebar again).
    • State Transition: SPEAKING -> key_press -> LISTENING.
    • Core Logic: When this transition occurs, the orchestrator must immediately:
      1. Send a signal to the AudioPlayer to stop the current playback.
      2. Cancel any pending TTS tasks.
      3. Clear any buffered text from the LLM stream.
      4. Begin recording the new user utterance.
    • Benefit: We will build and perfect the entire complex interruption logic in a 100% reproducible software environment.
  • Task 3A.4: Refactor the AudioPlayer for Interruptibility.

    • Action: Our current AudioPlayer is not designed to be stopped mid-playback. We will modify it to support a stop_current_playback() method. This will likely involve using an asyncio.Event that the playback loop can check.
    • Benefit: This creates a crucial, reusable software primitive for controlling the agent's voice.
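A compact sketch of the primitives from Tasks 3A.1 and 3A.4: the state Enum and a playback loop that an asyncio.Event can stop mid-utterance. The InterruptiblePlayer is illustrative, not our current AudioPlayer; asyncio.sleep stands in for writing one chunk to the output device:

```python
import asyncio
from enum import Enum, auto

class AgentState(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()
    WAITING_FOR_USER = auto()

class InterruptiblePlayer:
    """Playback loop that checks a stop flag between chunks (Task 3A.4)."""

    def __init__(self) -> None:
        self._stop = asyncio.Event()

    def stop_current_playback(self) -> None:
        self._stop.set()  # barge-in: abort mid-playback

    async def play(self, chunks) -> None:
        self._stop.clear()
        for chunk in chunks:
            if self._stop.is_set():
                return  # drop the rest of the utterance
            await asyncio.sleep(0.1)  # stand-in for writing one audio chunk

async def demo() -> None:
    player = InterruptiblePlayer()
    task = asyncio.create_task(player.play(range(50)))  # agent is SPEAKING
    await asyncio.sleep(0.35)  # simulated spacebar press mid-playback
    player.stop_current_playback()  # triggers SPEAKING -> LISTENING
    await task
    print(AgentState.LISTENING)

asyncio.run(demo())
```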

Phase 3B: Integrating Real-World Audio

Objective: To replace our software-based triggers (Push-to-Talk, Barge-In Key Press) with a live, continuous audio stream and a real VAD.

  • Task 3B.1: Implement Continuous Audio Streaming.

    • Action: Refactor AudioRecorder to use a non-blocking stream (now is the time for PyAudio or sounddevice's stream API). It will continuously read small chunks of audio from the microphone and place them into an asyncio.Queue.
  • Task 3B.2: Integrate a VAD Model.

    • Action: Create a new VADProcessor task. It will consume audio chunks from the queue created in 3B.1. We'll use a lightweight, high-performance VAD like silero-vad.
    • Output: The VAD task will not output audio; it will output events: SPEECH_STARTED and SPEECH_ENDED (a sketch of this event emitter follows this task list).
  • Task 3B.3: Connect VAD Events to the State Machine.

    • Action: This is the final, elegant step. We replace our keyboard listeners from Phase 3A.
    • State Transition:
      • IDLE -> SPEECH_STARTED event -> LISTENING.
      • LISTENING -> SPEECH_ENDED event -> PROCESSING.
      • SPEAKING -> SPEECH_STARTED event -> BARGE_IN -> LISTENING.
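A sketch of the Task 3B.2 VADProcessor as an event emitter. The speech_prob stand-in is a crude RMS-energy gate; the real task would replace it with a per-chunk speech probability from a model like silero-vad:

```python
import asyncio
from enum import Enum, auto

import numpy as np

class VADEvent(Enum):
    SPEECH_STARTED = auto()
    SPEECH_ENDED = auto()

def speech_prob(chunk: np.ndarray) -> float:
    # Crude RMS-energy gate; the real task would query a model such as silero-vad.
    return float(np.sqrt(np.mean(chunk.astype(np.float32) ** 2)) > 0.01)

async def vad_processor(audio_q: asyncio.Queue, event_q: asyncio.Queue,
                        threshold: float = 0.5, hang_chunks: int = 10) -> None:
    # Consume raw chunks from Task 3B.1's queue; emit events, never audio.
    speaking, silence = False, 0
    while (chunk := await audio_q.get()) is not None:
        if speech_prob(chunk) >= threshold:
            silence = 0
            if not speaking:
                speaking = True
                await event_q.put(VADEvent.SPEECH_STARTED)
        elif speaking:
            silence += 1
            if silence >= hang_chunks:  # require sustained silence before ending
                speaking, silence = False, 0
                await event_q.put(VADEvent.SPEECH_ENDED)
```

The hang_chunks debounce keeps short pauses inside an utterance from firing a premature SPEECH_ENDED.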

Phase 3C: Barge-In Interruption and Echo Cancellation

This is the final step to make the conversation feel truly natural.

The Plan:

  1. Track Agent's Speech: The VoiceAgent needs to know what it is currently saying. When the tts_consumer synthesizes a sentence, we will store that sentence in a new state variable, e.g., self.currently_speaking_text.

  2. Implement Software Echo Cancellation (is_echo): We will create a new method is_echo(self, user_text: str) -> bool. This method will compare the incoming user_text (from the STT) with self.currently_speaking_text.

    • A simple, robust first version can use a normalized string similarity metric. For example, convert both strings to lowercase, remove punctuation, and check if one is a substring of the other or if their Levenshtein distance is very small (a first pass using the standard library is sketched after this plan).
  3. Implement the Interruption Handler (handle_barge_in): This is the core logic. When an utterance is detected during the SPEAKING state and our is_echo function returns False, we trigger the interruption.

    • Action 1: Silence the Agent. Immediately call a new self.player.interrupt() method. This method needs to clear the player's queue and stop the current playback instantly.
    • Action 2: Cancel the Cognitive Pipeline. The current run_pipeline task must be cancelled. We can get a handle to it (e.g., self.processing_task) and call self.processing_task.cancel(). This will stop the LLM and TTS from generating any more of the old response.
    • Action 3: Start the New Turn. Immediately start a new run_pipeline task with the new, interrupting user text.
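A first pass at is_echo using only the standard library (difflib's SequenceMatcher stands in for a dedicated Levenshtein package, and the 0.8 threshold is a guess to tune):

```python
import re
from difflib import SequenceMatcher

def _normalize(text: str) -> str:
    # Lowercase and strip punctuation so "Hello!" matches "hello".
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def is_echo(user_text: str, currently_speaking_text: str,
            threshold: float = 0.8) -> bool:
    """True if the 'user' utterance is probably the agent hearing itself."""
    user = _normalize(user_text)
    agent = _normalize(currently_speaking_text)
    if not user or not agent:
        return False
    if user in agent or agent in user:  # substring check
        return True
    # SequenceMatcher.ratio() approximates a normalized edit similarity.
    return SequenceMatcher(None, user, agent).ratio() >= threshold

assert is_echo("that's a great idea", "That's a great idea!")
assert not is_echo("stop, wait", "That's a great idea!")
```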
  • Task 3C.4: The Acoustic Echo Problem (Now a Manageable Task).
    • Action: At this point, the system will work perfectly with a headset. Without one, it will hear itself and barge in on itself constantly. NOW we can tackle this.
    • Solutions (to be explored):
      1. Software AEC: Integrate a software-based echo cancellation library (e.g., webrtc-audio-processing, speexdsp).
      2. "Duck" and Mute: A simpler approach. When the agent is SPEAKING, we can programmatically mute the microphone input or instruct the VAD to ignore any detected speech. This is less elegant but highly effective.

The Path Forward: Phase 4 - The Production-Grade Refactor

This architectural review gives us a crystal-clear roadmap for our next phase. We will stop adding new features and focus on hardening and optimizing the incredible system we've already built.

Phase 4 Checklist:

Task 4.1: In-Memory Audio Streaming. Action: Refactor the tts_consumer and AudioPlayer to pass audio data as in-memory NumPy arrays, completely eliminating the TTS .wav file I/O. (This should be our very next task).
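A sketch of what Task 4.1 buys us, assuming the TTS can hand back a NumPy waveform and using sounddevice for playback (Kokoro outputs 24 kHz audio, hence the default, but it is just a parameter):

```python
import numpy as np
import sounddevice as sd  # pip install sounddevice

def speak(samples: np.ndarray, sample_rate: int = 24000) -> None:
    # Play a waveform straight from memory: no temporary .wav file, no disk I/O.
    sd.play(samples.astype(np.float32), samplerate=sample_rate)
    sd.wait()  # block until playback finishes

# samples = tts.synthesize(text)  # placeholder for a TTS call returning an ndarray
speak(0.2 * np.sin(2 * np.pi * 440 * np.arange(24000) / 24000))  # 1 s test tone
```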

Task 4.2: Atomic State Transitions. Action: Introduce an asyncio.Lock to the ConversationManager and protect all self.state modifications.
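A minimal sketch of the locked check-and-set (the ConversationManager internals shown here are assumed, not our current code):

```python
import asyncio

class ConversationManager:
    def __init__(self) -> None:
        self.state = "IDLE"
        self._state_lock = asyncio.Lock()

    async def transition(self, expected: str, new: str) -> bool:
        # Check-and-set under the lock so two tasks can't both win a transition.
        async with self._state_lock:
            if self.state != expected:
                return False  # stale event: another task moved the state first
            self.state = new
            return True
```

An event handler then attempts await mgr.transition("SPEAKING", "LISTENING") and simply drops the event if it returns False.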

Task 4.3: Robust Error Handling. Action: Implement more granular error handling within the run_pipeline method with specific verbal error messages.

Task 4.4: Resource Cleanup. Action: If we stick with any file I/O, ensure temporary files are deleted. Implement robust signal handling for graceful shutdown.
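For the graceful-shutdown half of Task 4.4, asyncio's built-in signal handling is enough on macOS; a sketch:

```python
import asyncio
import signal

async def main() -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)  # Unix-only, which covers macOS
    # ... start recorder / VAD / pipeline tasks here ...
    await stop.wait()  # block until Ctrl-C or kill
    # ... cancel tasks, flush the player, delete any temp files ...

asyncio.run(main())
```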

Task 4.5 (Research Spike): Single VAD System. Action: Create a branch to experiment with removing WebRTCVAD and using only Silero. Measure CPU impact and determine if the simplification is worth it.

Task 4.6 (Research Spike): Advanced Barge-In. Action: Brainstorm and prototype a more advanced is_echo function, potentially using acoustic features.

TODO:

Speech-to-Text (STT) Engine

  • Streaming Transcription: For real-time feedback, a streaming model that provides partial transcripts as the user speaks is the gold standard. This dramatically improves the perception of speed.
  • Metadata Generation: This is where we can innovate. The STT shouldn't just output text. It could also output:
    • Word-level timestamps: Crucial for understanding timing and for enabling features like real-time visual feedback (a minimal sketch follows this list).
    • Acoustic Embeddings/Prosody Features: Capturing the way something was said (tone, pitch, energy). This data is invaluable for the LLM and TTS to generate a more contextually appropriate response.

  • Real-Time Transcription: Emit partial transcripts while the user is still speaking, rather than waiting for the full utterance, so downstream components can react sooner.
  • Noise Reduction & Normalization: Cleaning the audio signal before it hits the STT engine to improve accuracy.
  • Speaker Output Management: Handling the playback of the synthesized TTS audio, including managing potential overlaps or interruptions.
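For the word-level timestamps idea above, mlx-whisper appears to mirror OpenAI Whisper's decode options, so a sketch could look like this (the model repo name and option support are assumptions to verify):

```python
import mlx_whisper

result = mlx_whisper.transcribe(
    "utterance.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    word_timestamps=True,  # assumed to be supported, as in openai-whisper
)
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['start']:6.2f}s  {word['word']}")
```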

The Core Orchestrator & State Manager

  • State Machine Management: Tracking the system's state (e.g., LISTENING, THINKING, SPEAKING, IDLE).
  • Turn-taking and Interruption Handling: Deciding when the LLM should process input and, critically, allowing the user to interrupt the TTS playback (barge-in). This is a hallmark of a natural-feeling system.
  • Context Aggregation: Gathering information from all components (the text from STT, the emotional cues from its acoustic metadata, the conversation history) and formatting it into a coherent prompt for the LLM (a toy prompt builder is sketched below).
  • Dispatching Commands: Directing the LLM output to the TTS engine or potentially other system functions (e.g., running a script, calling an API).
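A toy version of Context Aggregation, with invented field names just to show the shape of the prompt assembly:

```python
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    # Invented container; field names are illustrative, not an existing API.
    user_text: str
    prosody: dict  # e.g. {"pitch": "rising", "energy": "high"}
    history: list = field(default_factory=list)  # prior (role, text) pairs

def build_prompt(ctx: TurnContext) -> str:
    # Fold recent history, acoustic cues, and the new utterance into one prompt.
    lines = [f"{role}: {text}" for role, text in ctx.history[-6:]]
    cues = ", ".join(f"{k}={v}" for k, v in ctx.prosody.items())
    lines.append(f"user ({cues}): {ctx.user_text}")
    lines.append("assistant:")
    return "\n".join(lines)

print(build_prompt(TurnContext("Really?", {"pitch": "rising"},
                               [("assistant", "It works end to end.")])))
```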

Large Language Model (LLM) Engine

  • Token Streaming: Generating the response token by token, rather than all at once. This allows the TTS to start speaking before the LLM has finished its entire thought, drastically reducing perceived latency.

Text-to-Speech (TTS) Engine

  • The TTS engine should be able to take the original prosody metadata from the user's speech (via the STT and Orchestrator) and mirror it in its own output. If the user sounds inquisitive, the response should sound inquisitive. This creates an empathetic feedback loop. It could also take explicit prosody instructions from the LLM (e.g., [start_excited] That's a great idea! [end_excited]).
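A toy parser for the explicit-instruction variant, using the tag scheme from the example above (the tag grammar itself is still an open design question):

```python
import re

TAG = re.compile(r"\[(start|end)_(\w+)\]")

def split_prosody(text: str):
    """Split LLM output into (style, text) spans for the TTS engine."""
    spans, style, pos = [], "neutral", 0
    for m in TAG.finditer(text):
        if chunk := text[pos:m.start()].strip():
            spans.append((style, chunk))
        style = m.group(2) if m.group(1) == "start" else "neutral"
        pos = m.end()
    if chunk := text[pos:].strip():
        spans.append((style, chunk))
    return spans

print(split_prosody("[start_excited]That's a great idea![end_excited] Let's plan it."))
# [('excited', "That's a great idea!"), ('neutral', "Let's plan it.")]
```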



Download files

Download the file for your platform.

Source Distribution

voicechain-0.1.0.tar.gz (20.8 kB)

Uploaded Source

Built Distribution


voicechain-0.1.0-py3-none-any.whl (20.0 kB)

Uploaded Python 3

File details

Details for the file voicechain-0.1.0.tar.gz.

File metadata

  • Download URL: voicechain-0.1.0.tar.gz
  • Upload date:
  • Size: 20.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for voicechain-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e93445b37ea7d2401b4e7173cfd73b72e23d174e661ce382d30676ccc0ca6436
MD5 5e65666fa49098a819c4344c94481ef9
BLAKE2b-256 68b24587ed8c6324cb005ab61df576a296c50a9236e306032b72d228a35df2b0


File details

Details for the file voicechain-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: voicechain-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for voicechain-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0311b6be8911318fb383139cedeb12915476adb4dc49a589d2e6e4bacca63fd0
MD5 f552c2eaa5ae7ba88d7c50732737ff31
BLAKE2b-256 06f1db2d3e90f4eca680b4d082c4bb008bac318d1fc4abed28242722d7b726fa

