Vaani - Private, Offline, Universal AI Speech-to-Text Desktop App 🎤
Vaani (वाणी), meaning "speech" or "voice" in Sanskrit, is an open-source, AI-powered desktop application that provides private, offline, real-time speech-to-text transcription. Use your voice to type into any application on your desktop – web browsers, text editors, email clients, chat apps, and more. Your voice data is processed entirely on your local machine, ensuring your conversations and dictations remain confidential.
Vaani leverages the efficiency of faster-whisper and the flexibility of the PySide6 (Qt) framework to create a seamless, secure, and universal dictation experience.
📽️ Demo
✨ Features
- 🌐 Universal Input: Dictate directly into virtually any application or text field that accepts keyboard input on your desktop. Works seamlessly across browsers, documents, code editors, chat clients, etc.
- 🔒 Privacy First - Offline Processing: All transcription happens locally on your computer. Your voice data is never sent to the cloud.
- Real-time Transcription: Speak and watch your words appear in almost any active application.
- High-Quality AI Model: Powered by `faster-whisper`, offering various model sizes (tiny to large) for a balance between speed and accuracy. Models are downloaded once and run locally.
- GPU Acceleration: Supports CUDA (NVIDIA GPUs) for significantly faster local transcription (CPU fallback available).
- System Tray Control: Runs conveniently in the system tray for easy access.
- Configurable Global Hotkeys: Start/stop listening, open settings, test microphone, and more with customizable keyboard shortcuts (using the `keyboard` library).
- Visual Recording Indicator: An optional, movable, always-on-top window shows when Vaani is actively listening, including a real-time audio energy meter.
- Robust Text Insertion (Windows): Uses multiple techniques (`pywin32` APIs, fallback methods) for reliable text input across different Windows applications.
- Comprehensive Settings:
  - Select audio input device.
  - Choose Whisper model size and processing device (CPU/CUDA).
  - Adjust audio parameters (silence thresholds, padding).
  - Configure hotkeys.
  - Toggle noise reduction.
- Microphone Testing Utility: Includes a tool to visually test your microphone input and perform a sample transcription.
- Optional Noise Reduction: Helps filter out background noise for clearer transcriptions (requires `noisereduce`).
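The silence thresholds and the real-time energy meter mentioned above both come down to measuring the energy of each audio chunk. Vaani's actual implementation isn't shown here; the following is a minimal illustrative sketch of RMS-based silence detection (the function names and the default threshold are assumptions, not Vaani's real API):

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of one chunk of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_silence(samples, threshold=500.0):
    """Treat a chunk as silence when its RMS energy is below the threshold.

    The threshold is an arbitrary illustrative value for 16-bit audio;
    a real app would expose it as a tunable setting.
    """
    return rms_energy(samples) < threshold
```

A dictation loop would typically buffer chunks while `is_silence` returns false, then hand the buffered audio (plus a little padding) to the transcriber once a run of silent chunks is seen.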
🔒 Privacy & Offline Operation
A core design principle of Vaani is user privacy.
- Local Processing: Speech recognition is performed entirely on your device using the `faster-whisper` library and downloaded models.
- No Cloud Dependency: Unlike many commercial speech-to-text services, Vaani does not require an internet connection for its core transcription functionality (only for the one-time initial model download) and never sends your audio data to external servers.
- Confidentiality: Your dictations, conversations, or sensitive information spoken while Vaani is active remain on your computer.
🎯 Target Audience & Platform
This project is currently aimed primarily at Python developers and technical users comfortable with installing Python packages and potentially troubleshooting environment issues.
Vaani has been primarily developed and tested on Windows. While the core components use cross-platform libraries (PySide6, faster-whisper), features like global hotkeys (keyboard) and the primary text insertion method (pywin32) have platform-specific behaviors or requirements:
- Windows: Best supported platform currently. Text insertion is most robust.
- Linux/macOS: May require additional setup (especially for `pyaudio`). Global hotkeys via the `keyboard` library require root/administrator privileges and might interfere with system settings. Text insertion relies on the `pyperclip`/`pyautogui` fallback, which may have limitations. Contributions to improve cross-platform support are welcome!
🛠️ Prerequisites
- Python: Version 3.10 or higher.
- pip: Python's package installer (usually comes with Python).
- Audio Backend (PyAudio): `pyaudio` installation can sometimes be tricky. You might need:
  - Windows: Usually works out of the box when installing from wheels; may require Microsoft Visual C++ Build Tools if building from source.
  - Linux: `portaudio19-dev` (Debian/Ubuntu) or `portaudio-devel` (Fedora).
  - macOS: `portaudio` (via Homebrew: `brew install portaudio`).
- (Optional) NVIDIA GPU & CUDA: For GPU acceleration:
  - A CUDA-compatible NVIDIA GPU.
  - NVIDIA CUDA Toolkit installed and configured correctly (Vaani attempts to find it via the path specified in Settings, falling back to the PATH environment variable on Windows).
  - `faster-whisper` requires specific CUDA versions; check its documentation for details.
- (Linux/macOS) Root/Admin Privileges: Required for the `keyboard` library to capture global hotkeys.
🚀 Installation
Install from PyPI:

    pip install vaani-speech-to-text

or install from GitHub:

    git clone https://github.com/webstruck/vaani-speech-to-text.git
    cd vaani-speech-to-text
    uv pip install -r pyproject.toml
Note: The first time you run Vaani or select a new model size, the faster-whisper library will download the required model files (this requires an internet connection). Subsequent uses of that model will be fully offline.
▶️ Usage
Once installed, run the application from your terminal:

    vaani
The application icon will appear in your system tray.
Default Hotkeys:
- Toggle Listening: `Ctrl+Alt+Z`
- Open Settings: `Ctrl+Alt+Q`
- Test Microphone: `Ctrl+Alt+T`
- Toggle Debug Mode: `Ctrl+Alt+D` (saves audio chunks locally if enabled)
- Exit Application: `Ctrl+Alt+X`
Right-click the system tray icon for menu options (Start/Stop, Settings, Test Mic, Exit).
Updates
LLM-Based Text Processing
The latest update introduces LLM-based text processing capabilities using Ollama:
- Enhanced Transcription Quality: Uses local language models to improve grammar, spelling, and overall text quality of transcriptions.
- Easy Integration: Works with locally running Ollama models such as Mistral, Llama, etc. Works best with quantization-aware trained (QAT) Gemma 3 models such as `gemma3:1b-it-qat`, which is only about 1 GB.
- Fault-Tolerant Design: Falls back to basic text processing if the LLM is not available or times out.
To use this feature:
- Install Ollama on your system
- Run a model of your choice (e.g., `ollama run gemma3:1b-it-qat`).
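The fault-tolerant design described above can be sketched with the standard library alone. This is an illustrative sketch, not Vaani's actual code: it posts to Ollama's `/api/generate` endpoint and returns the raw transcript untouched if the server is unreachable, times out, or replies with something unexpected (the function name and prompt wording are assumptions):

```python
import json
import urllib.request

def polish_transcript(text, model="gemma3:1b-it-qat",
                      url="http://localhost:11434/api/generate", timeout=10.0):
    """Ask a local Ollama model to clean up a transcript; fall back to the raw text."""
    prompt = ("Correct the grammar and spelling of this dictated text. "
              "Return only the corrected text.\n\n" + text)
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["response"].strip()
    except (OSError, ValueError, KeyError):
        # Ollama not running, request timed out, or malformed response:
        # degrade gracefully to the unprocessed transcript.
        return text
```

Because every failure path returns the original text, dictation keeps working even when no model is loaded.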
⚙️ Configuration
Settings are stored in a `settings.json` file located in:
- Windows: `%USERPROFILE%\.speech_to_text_app` (e.g., `C:\Users\YourName\.speech_to_text_app`)
- Linux/macOS: `~/.speech_to_text_app` (e.g., `/home/yourname/.speech_to_text_app`)
You can configure various options through the Settings dialog (accessible via hotkey or tray menu), including:
- Audio input device
- Whisper model size (`tiny`, `base`, `small`, `medium`, `large`)
- Processing device (`cpu` or `cuda`)
- CUDA Toolkit path (if needed)
- Silence detection parameters
- Hotkeys
- UI options (visual indicator)
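Reading a per-user settings file like this, with graceful fallback to defaults, takes only a few lines. The sketch below is illustrative: the key names in `DEFAULTS` are hypothetical, not Vaani's actual schema.

```python
import json
from pathlib import Path

# Hypothetical keys for illustration; Vaani's real schema may differ.
DEFAULTS = {"model_size": "base", "device": "cpu", "toggle_hotkey": "ctrl+alt+z"}

def load_settings():
    """Merge settings.json from the per-user config dir over built-in defaults."""
    path = Path.home() / ".speech_to_text_app" / "settings.json"
    settings = dict(DEFAULTS)
    try:
        settings.update(json.loads(path.read_text(encoding="utf-8")))
    except (OSError, json.JSONDecodeError):
        pass  # Missing or corrupt file: run with defaults.
    return settings
```

Writing settings back is the mirror image: ensure the directory exists, then `path.write_text(json.dumps(settings, indent=2))`.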
📦 Dependencies
Key dependencies include:
- `PySide6`: For the graphical user interface.
- `faster-whisper`: The core local AI transcription engine.
- `keyboard`: For global hotkey management.
- `pyaudio`: For microphone access.
- `numpy`, `scipy`: For audio data manipulation and processing.
- `matplotlib`: For the microphone test waveform display.
- `pywin32`: (Windows only) For robust text insertion.
- `pyperclip`, `pyautogui`: Fallback text insertion.
- `noisereduce`: Optional background noise reduction.

(See `pyproject.toml` for specific version requirements.)
🩺 Troubleshooting
- Hotkeys Not Working:
  - Linux/macOS: Ensure you are running the application with `sudo` or as root (required by the `keyboard` library). This has security implications; be aware!
  - All Platforms: Check for conflicts with other applications using global hotkeys. Ensure the key names in the settings match those expected by the `keyboard` library.
- Audio Device Issues:
  - Ensure the correct microphone is selected in Settings -> Audio.
  - Verify `pyaudio` installed correctly for your OS (see Prerequisites).
  - Check OS-level microphone permissions for Python or the terminal.
  - Try the "Test Microphone" utility.
- Text Not Inserting:
  - On Windows, Vaani tries multiple methods; some applications (games, remote desktops, apps running with elevated privileges) might still resist input simulation.
  - On Linux/macOS, insertion relies on `pyperclip`/`pyautogui`, which might not work in all environments (e.g., Wayland without specific setup).
- Slow Transcription:
  - If using CPU, select a smaller model size (e.g., `tiny`, `base`, `small`).
  - If you have a compatible NVIDIA GPU, ensure CUDA is set up correctly and selected as the device in Settings. Remember, all processing is local, so performance depends entirely on your hardware.
- Model Download Failed: Ensure you have an internet connection the first time you select a specific model size. Check for firewall issues blocking the download.
🗺️ Future Plans / Roadmap
- Create user-friendly installers for Windows (and potentially other platforms).
- Language selection for transcription (requires corresponding Whisper models).
- Add more post-processing text options.
- Improve cross-platform compatibility (text insertion, hotkey alternatives).
🙌 Contributing
Contributions are welcome! If you'd like to help improve Vaani, please feel free to:
- Fork the repository.
- Create a new branch (`git checkout -b feature/YourFeature` or `bugfix/YourBugfix`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/YourFeature`).
- Open a Pull Request.
Please report bugs or suggest features using the GitHub Issues tab.
📄 License
Distributed under the Apache-2.0 License. See the `LICENSE` file for more information.
🙏 Acknowledgements
- The team behind Whisper and faster-whisper for enabling high-quality, local transcription.
- The developers of Qt and the PySide6 project.
- All the creators of the dependent Python libraries.
File details
Details for the file vaani_speech_to_text-0.1.3.tar.gz.
File metadata
- Download URL: vaani_speech_to_text-0.1.3.tar.gz
- Upload date:
- Size: 57.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b9295269c58dc572c314adc0c8265c310b13831d5891418faafaa6af0d770348` |
| MD5 | `82ab210afb0f0d1667978755674a4354` |
| BLAKE2b-256 | `d05c476b525302ceb0a4132344f0051cf7159eb4eb26020d031a2ef2a4151f4e` |
File details
Details for the file vaani_speech_to_text-0.1.3-py3-none-any.whl.
File metadata
- Download URL: vaani_speech_to_text-0.1.3-py3-none-any.whl
- Upload date:
- Size: 60.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3a18f0d3684fb1a0b8e15db070ab1242a9898fc55f8fc63d1106240eece6da1f` |
| MD5 | `fc4ed0441e4637350f554d385d160083` |
| BLAKE2b-256 | `e25508b4962cb1c67b9aa9e4605db7726781ed6cfb527e7d6a60304b4135f8ff` |