
ispeak

A keyboard-centric inline speech-to-text tool that works wherever you can type: vim, emacs, firefox, and CLI/AI tools like aider, codex, claude, or whatever you fancy

ispeak logo
  • Multilingual, Local, Fast - Powered by faster-whisper
  • Transcribed Speech - As keyboard (type) or clipboard (copy) events
  • Inline UX - Recording indicator displayed in the active buffer & self-deletes
  • Hotkey-Driven & Configurable - Tune the operation/model to your liking
  • Post-Transcribe Plugin Pipeline - Replace, text2num, and num2text
  • Cross-Platform - Works on Linux/macOS/Windows with GPU or CPU

Quick Start

  1. Run: ispeak (add -b <program> to target a specific executable)
  2. Activate: Press the hotkey (default shift_l) - the 'recording indicator' is text-based (default ;)
  3. Record: Speak freely; no automatic timeout or voice activity cutoff
  4. Complete: Press the hotkey again to delete the indicator and transcribe your speech (abort via escape)
  5. Output: Your words appear as typed text at your cursor's location

IMPORTANT: The output goes to the application that currently has keyboard focus, which allows you to use the same ispeak instance between applications. This may be a feature or a bug.

▎Install

#> copy'n'paste system/global install
pip install ispeak
uv tool install ispeak
# cpu-only + plugins; it's better to simply clone & run: uv tool install ".[cpu,plugin]"
uv pip install --system "ispeak[plugin]" --torch-backend=cpu

uv is a Python package installer

#> clone'n'install
git clone https://github.com/fetchTe/ispeak && cd ispeak

# global install (extra: cpu, cu118, cu128, plugin)
uv tool install ".[plugin]"      # CUDA + plugins
uv tool install ".[cpu,plugin]"  # CPU-only (no CUDA) + plugins

# local install (extra: cpu, cu118, cu128, plugin)
uv sync --group dev                # CUDA (default) + dev (ruff, pyright, pytest)
uv sync --extra cpu --extra plugin # CPU-only (no CUDA) + plugins

# pip install + plugins
pip install RealtimeSTT pynput pyperclip num2words text2num

▎Usage

# USAGE (v0.2.5)
  ispeak [options...]

# OPTIONS
  -b, --binary      Executable to launch with voice input (default: none)
  -c, --config      Path to configuration file
  -l, --log-file    Path to voice transcription append log file
  -n, --no-output   Disables all output/actions - typing, copying, and record indicator
  -p, --copy        Use the 'clipboard' to copy instead of the 'keyboard' to type the output
  -s, --setup       Configure voice settings
  -t, --test        Test voice input functionality
  --config-show     Show current configuration

# EXAMPLES
ispeak --setup         # Interactive configuration wizard
ispeak --copy          # Start with the output mode set as 'clipboard'
ispeak -l words.log    # Log transcriptions to file

# DEV/LOCAL USAGE
uv run ispeak --setup  # via uv

Configuration

Configuration can be defined via JSON or TOML; the lookup is performed in the following order:

  1. Environment Variable: ISPEAK_CONFIG, set to the path of the config file
  2. Platform-Specific Config
    • macOS: ~/Library/Preferences/ispeak/ispeak.{json,toml}
    • Windows: %APPDATA%\ispeak\ispeak.{json,toml} (or ~/AppData/Roaming/ispeak/ispeak.{json,toml})
    • Linux: $XDG_CONFIG_HOME/ispeak/ispeak.{json,toml} (or ~/.config/ispeak/ispeak.{json,toml})
  3. Local: ./ispeak.{json,toml} in the current working directory
  4. Default: fallback
{
  "ispeak": {
    "binary": null,
    "push_to_talk_key": "shift_l",
    "push_to_talk_key_delay": 0.3,
    "escape_key": "esc",
    "log_file": null,
    "output": "keyboard",
    "recording_indicator": ";",
    "delete_keyword": ["delete", "undo"],
    "delete_key": null,
    "strip_whitespace": true
  },
  "stt": {
    "model": "tiny",
    "language": "auto",
    "beam_size": 5,
    "compute_type": "auto",
    "download_root": null,
    "enable_realtime_transcription": false,
    "ensure_sentence_ends_with_period": true,
    "ensure_sentence_starting_uppercase": true,
    "initial_prompt": null,
    "no_log_file": true,
    "normalize_audio": true,
    "spinner": false
  },
  "plugin": {}
}
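
Either format works; for example, a minimal TOML config (an illustrative sketch covering only a few of the keys shown above) could look like:

```toml
# ~/.config/ispeak/ispeak.toml (Linux path; see the lookup order above)
[ispeak]
push_to_talk_key = "shift_l"
recording_indicator = ";"
output = "keyboard"

[stt]
model = "tiny"
language = "auto"
beam_size = 5
```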

NOTE: Using ispeak --setup for initial configuration is highly recommended


ispeak

  • binary (str/null): Default executable to launch with voice input
  • delete_key (str/null): Key to trigger deletion of previous input via backspace
  • delete_keyword (list/bool): Words that trigger deletion of previous input via backspace (must be exact)
  • escape_key (str/null): Key to cancel current recording without transcription
  • log_file (str/null): Path to file for logging voice transcriptions
  • output (str/false): Mode of output; 'keyboard' (type), 'clipboard' (copy), or false for none
    • For all languages aside from English, using 'clipboard' is recommended
  • push_to_talk_key_delay (float): Brief delay after hotkey press to prevent input conflicts
  • push_to_talk_key (str/null): Hotkey to start/stop recording sessions
  • recording_indicator (str/null): Visual indicator typed when recording starts; must be a typeable character
  • strip_whitespace (bool): Remove extra whitespace from transcribed text

Hotkeys work via pynput and support:
╸ Simple characters: a, b, c, 1, etc.
╸ Special keys: end, alt_l, ctrl_l - (see pynput Key class)
╸ Key combinations: <ctrl>+<alt>+h, <shift>+<f1>


stt

A full config reference can be found in ./docs/stt-options.md.
RealtimeSTT handles the input/mic setup and processing, while
faster-whisper is the actual speech-to-text engine implementation.

  • model (str): Model size or path to local CTranslate2 model (for English variants append .en)
    • tiny: Ultra fast, workable accuracy (~39MB, CPU/GPU)
    • base: Respectable accuracy/speed (~74MB, CPU/GPU ~1GB VRAM)
    • small: Decent accuracy (~244MB, CPU+/GPU ~2GB VRAM)
    • medium: Good accuracy (~769MB, GPU ~3GB VRAM)
    • large-v1/large-v2: Superb accuracy (~1550MB, GPU ~4GB VRAM)
  • language (str): Language code (en, es, fr, de, etc) or "auto" for automatic detection
  • beam_size (int): Size to use for beam search decoding (worth bumping up)
  • download_root (str/null): Root path where the models are downloaded/loaded from
  • enable_realtime_transcription (bool): Enable continuous transcription (2x computation)
  • ensure_sentence_ends_with_period (bool): Add periods to sentences without punctuation
  • ensure_sentence_starting_uppercase (bool): Ensure sentences start with uppercase letters
  • initial_prompt (null/str): Initial prompt to be fed to the main transcription model
  • no_log_file (bool): Skip debug log file creation
  • normalize_audio (bool): Normalize audio range before processing for better transcription quality
  • spinner (bool): Show spinner animation (set to false to avoid terminal conflicts)

Apart from faster-distil-whisper-large-v3, I've had good results with the following:

{
  "model": "Systran/faster-distil-whisper-medium.en",
  "initial_prompt": "Welcome back, this discussion covers coherence, cohesion, and logical flow in programming.",
  "beam_size": 8,
  "post_speech_silence_duration": 0.4
}

NOTE: initial_prompt defines style and/or spelling, not instructions (see the cookbook/reference)


Plugin

The plugin system processes transcribed text through a configurable pipeline of text transformation plugins. Plugins are loaded and executed in order based on their configuration, and each can be configured with the following fields:

  • use (bool): Enable/disable the plugin (default: true)
  • order (int): Execution order - plugins run in ascending order (default: 999)
  • settings (dict): Plugin-specific configuration options
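
The ordering and enable semantics above can be sketched in a few lines of Python (an illustrative model, not the actual ispeak implementation; `run_pipeline` and the sample plugins are hypothetical):

```python
# Illustrative sketch: plugins with "use": false are skipped, the rest run
# in ascending "order" (default 999), each transforming the text in turn.
def run_pipeline(text, plugins):
    enabled = [(cfg.get("order", 999), fn)
               for fn, cfg in plugins
               if cfg.get("use", True)]
    for _, fn in sorted(enabled, key=lambda p: p[0]):
        text = fn(text)
    return text

# hypothetical plugin callables for demonstration
upper = lambda t: t.upper()
exclaim = lambda t: t + "!"

# upper (order 1) runs before exclaim (order 2) despite list position
print(run_pipeline("hello", [(exclaim, {"order": 2}), (upper, {"order": 1})]))
# HELLO!
```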

replace

Regex-based text replacement: handles simple string substitutions as well as regex patterns with capture groups and flags.

{
  "plugin": {
    "replace": {
      "use": true,
      "order": 1,
      "settings": {
        // simple string replacements
        "iSpeak": "ispeak",
        " one ": " 1 ",
        "read me": "README",

        // regex with capture groups
        "(\\s+)(semi)(\\s+)": ";\\g<3>",
        "(\\s+)(comma)(\\s+)": ",\\g<3>",

        // common voice transcription cleanup
        "\\s+question\\s*mark\\.?": "?",
        "\\s+exclamation\\s*mark\\.?": "!",
        
        // code-specific replacements
        "\\s+open\\s*paren\\s*": "(",
        "\\s+close\\s*paren\\s*": ")",
        "\\s+open\\s*brace\\s*": "{",
        "\\s+close\\s*brace\\s*": "}",

        // regex patterns with flags (/pattern/flags format)
        "/hello/i": "HI",           // case insensitive
        "/^start/m": "BEGIN",       // multiline
        "/comma/gmi": ","           // global, multiline, case insensitive
      }
    }
  }
}

Flags: Use /pattern/flags format (supports i, m, s, x flags)
Substitution: Use \1, \2 or \g<1>, \g<2> syntax
Tests: ./tests/test_plugin_replace.py
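
For intuition, the /pattern/flags handling can be approximated with Python's re module (an illustrative sketch; `apply_replacements` is hypothetical and the real plugin's internals may differ):

```python
import re

# Map single-letter flags to re module constants
FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def apply_replacements(text, rules):
    for key, repl in rules.items():
        m = re.fullmatch(r"/(.*)/([imsxg]*)", key)
        if m:
            # "/pattern/flags" form: compile flags and run a regex substitution
            pattern, flag_str = m.groups()
            flags = 0
            for ch in flag_str:
                flags |= FLAGS.get(ch, 0)  # 'g' is implicit: re.sub replaces all
            text = re.sub(pattern, repl, text, flags=flags)
        else:
            # plain key: literal string replacement
            text = text.replace(key, repl)
    return text

print(apply_replacements("Hello world, read me", {"/hello/i": "HI", "read me": "README"}))
# HI world, README
```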

num2text

Convert digits to text numbers, like "42" into "forty-two" via num2words

{
  "plugin": {
    "num2text": {
      "use": true,
      "order": 3,
      "settings": {
        "lang": "en",         // language code
        "to": "cardinal",     // cardinal, ordinal, ordinal_num, currency, year
        "min": null,          // minimum value to convert
        "max": null,          // maximum value to convert
        "currency": "USD",    // currency code for currency conversion
        "cents": true,        // include cents in currency
        "percent": "percent"  // suffix for percentage conversion
      }
    }
  }
}

Tests: ./tests/test_plugin_num2text.py
Dependency: num2words -> uv pip install num2words


text2num

Convert text numbers to digits, like "forty-two" into "42" via text_to_num

{
  "plugin": {
    "text2num": {
      "use": true,
      "order": 2,
      "settings": {
        "lang": "en",
        "threshold": 0
      }
    }
  }
}

Tests: ./tests/test_plugin_text2num.py
Dependency: text_to_num -> uv pip install text_to_num
IMPORTANT: the threshold may or may not apply to cardinal numbers; see the TestWishyWashyThreshold test for details


Troubleshooting

  • Hotkey/Keyboard Issues: Check/grant permissions (see the Linux, macOS, and Windows notes below)
  • Recording Indicator Misfire(s): Increase push_to_talk_key_delay (try 0.2-1.0)
  • Transcription Typing/Character Issues: Try using "output": "clipboard"
  • Transcription Issues: Try the CPU-only and/or the following minimal test code to isolate the problem:
# test_audio.py -> uv run ./test_audio.py
from RealtimeSTT import AudioToTextRecorder

def process_text(text):
    print(f"Transcribed: {text}")

if __name__ == '__main__':
    print("Testing RealtimeSTT - speak after you see 'Listening...'")
    try:
        recorder = AudioToTextRecorder()
        while True:
            recorder.text(process_text)
    except KeyboardInterrupt:
        print("\nTest completed.")
    except Exception as e:
        print(f"Error: {e}")

Platform Limitations

These limitations/quirks come from the pynput docs

▎Linux

When running under X, the following must be true:

  • An X server must be running
  • The environment variable $DISPLAY must be set

When running under uinput, the following must be true:

  • You must run your script as root, so that it has the required permissions for uinput

The latter requirement for X means that running pynput over SSH generally will not work. To work around that, make sure to set $DISPLAY:

$ DISPLAY=:0 python -c 'import pynput'

Please note that the value DISPLAY=:0 is just an example. To find the actual value, please launch a terminal application from your desktop environment and issue the command echo $DISPLAY.

When running under Wayland, the X server emulator Xwayland will usually run, providing limited functionality. Notably, you will only receive input events from applications running under this emulator.

▎macOS

Recent versions of macOS restrict monitoring of the keyboard for security reasons. For that reason, one of the following must be true:

  • The process must run as root.
  • Your application must be whitelisted under Enable access for assistive devices. Note that this might require that you package your application, since otherwise the entire Python installation must be whitelisted.
  • On versions after Mojave, you may also need to whitelist your terminal application if running your script from a terminal.

All listener classes have the additional attribute IS_TRUSTED, which is True if no permissions are lacking.

▎Windows

Virtual events sent by other processes may not be received. This library takes precautions, however, to dispatch any virtual events generated to all currently running listeners of the current process.


Development

# USAGE (ispeak)
   make [flags...] <target>

# TARGET
  -------------------
   run                   execute entry-point -> uv run main.py
   build                 build wheel/source distributions -> hatch build
   clean                 delete build artifacts, cache files, and temporary files
  -------------------
   publish               publish to pypi.org -> twine upload
   publish_test          publish to test.pypi.org -> twine upload --repository testpypi
   publish_check         check distributions -> twine check
   release               clean, format, lint, test, build, check, and optionally publish
  -------------------
   install               install dependencies -> uv sync
   install_cpu           install dependencies -> uv sync --extra cpu
   install_dev           install dev dependencies -> uv sync --group dev --extra plugin
   install_plugin        install plugin dependencies -> uv sync --extra plugin
   update                update dependencies -> uv lock --upgrade && uv sync
   update_dry            show outdated dependencies  -> uv tree --outdated
   venv                  setup virtual environment if needed -> uv venv -p 3.11
  -------------------
   check                 run all checks: lint, type, and format
   format                format check -> ruff format --check
   lint                  lint check -> ruff check
   type                  type check -> pyright
   format_fix            auto-fix format -> ruff format
   lint_fix              auto-fix lint -> ruff check --fix
  -------------------
   test                  test -> pytest
   test_fast             test & fail-fast -> pytest -x -q
  -------------------
   help                  displays (this) help screen

# FLAGS
  -------------------
   UV                    [? ] uv build flag(s) (e.g: make UV="--no-build-isolation")
  -------------------
   BAIL                  [?1] fail fast (bail) on the first test or lint error
   PUBLISH               [?0] publishes to PyPI after build (requires twine config)
  -------------------
   DEBUG                 [?0] enables verbose logging for tools (uv, pytest, ruff)
   QUIET                 [?0] disables pretty-printed/log target (INIT/DONE) info
   NO_COLOR              [?0] disables color logging/ANSI codes

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Install development dependencies: uv sync --group dev
  4. Make your changes following the existing code style
  5. Run quality checks & test:
    make format_fix  # auto-fix format -> ruff format
    make check       # run all checks: lint, type, and format
    make test        # run all tests
    
  6. Commit your changes: git commit -m 'feat: add amazing feature'
  7. Push to your branch: git push origin feature/amazing-feature
  8. Open a Pull Request with a clear description of your changes

Respects

  • RealtimeSTT - A swell wrapper around faster-whisper that powers the speech-to-text engine
  • pynput - Cross-platform keyboard controller and monitor
  • pyperclip - Cross-platform clipboard
  • whisper - The foundational speech-to-text recognition model

License

MIT License

Copyright (c) 2025 te <legal@fetchTe.com>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Download files

Download the file for your platform.

Source Distribution

ispeak-0.2.5.tar.gz (183.4 kB)

Uploaded Source

Built Distribution


ispeak-0.2.5-py3-none-any.whl (36.5 kB)

Uploaded Python 3

File details

Details for the file ispeak-0.2.5.tar.gz.

File metadata

  • Download URL: ispeak-0.2.5.tar.gz
  • Upload date:
  • Size: 183.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for ispeak-0.2.5.tar.gz
Algorithm Hash digest
SHA256 93a8b685fce899052813fadc4a7a7998788d0d505d4ce1b99379e2fd35992874
MD5 2af61fe294b630c4e8c18e3fdd962de9
BLAKE2b-256 8a8c25fa125245e3389a5f33b291e35fb07a45e62300f9e4902f180610fd7196


File details

Details for the file ispeak-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: ispeak-0.2.5-py3-none-any.whl
  • Upload date:
  • Size: 36.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for ispeak-0.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 4337fa9fb111c33141095b56afe0f2f20d49b4aaf7d28df79496fdf0f58173ac
MD5 ecbb130feb2b1bd6eb15afb5c4d350c2
BLAKE2b-256 e480fb0975ac48d4b2db4d1bddf97d217ec520cf32b3c86690d4c83fd9c90f9b

