Speak Now

A locally-hosted, low-latency speech-to-text solution with AI formatting capabilities.

Overview

Speak Now captures your speech in real time, transcribes it, and lets you paste it directly into any application with minimal latency. What sets it apart is its integration with Google's Gemini AI, which formats your dictated text before pasting without interrupting your focus.

Features

Prompt Deck Interface

Minimal (and completely hide-able) UI

  • Real-time Speech Recognition: Captures your speech continuously with low latency
  • AI-Powered Formatting: Uses Gemini 1.5 Flash to transform raw transcription into polished text
  • Multiple Formatting Styles: Choose between Natural, Formal, Concise, or custom formatting styles
  • Hotkey Controls: Use keyboard shortcuts to control all aspects of the application
  • Hide-able UI: Interface can be completely hidden to avoid workflow disruption
  • History Tracking: Access your recent transcriptions for easy reuse
  • Recording Toggle: Pause and resume speech recognition as needed
  • Customizable Configuration: Adjust settings via a TOML configuration file
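
As an illustration of the History Tracking feature, here is a minimal sketch of a bounded history that keeps only the most recent transcriptions. The class name `TranscriptionHistory` and its methods are hypothetical and are not taken from the Speak Now source; only the idea of a history capped by `max_history_items` comes from this README.

```python
from collections import deque


class TranscriptionHistory:
    """Keep only the most recent transcriptions (hypothetical helper)."""

    def __init__(self, max_items: int = 10) -> None:
        # A deque with maxlen drops the oldest entry once it is full.
        self._items = deque(maxlen=max_items)

    def add(self, text: str) -> None:
        text = text.strip()
        if text:  # skip empty transcriptions
            self._items.append(text)

    def recent(self) -> list:
        """Return stored entries, newest first."""
        return list(reversed(self._items))


history = TranscriptionHistory(max_items=3)
for phrase in ["one", "two", "three", "four"]:
    history.add(phrase)
print(history.recent())  # ['four', 'three', 'two']
```

A bounded deque keeps memory use constant no matter how long a dictation session runs.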

Setup

Install Speak Now via pip:

pip install speak-now

For optimal performance with GPU acceleration, see the RealtimeSTT documentation.

Launch the application with:

speak-now -c <config>

The application will use default settings if no configuration file is specified.

To start in hidden mode (UI remains hidden until manually toggled):

speak-now -c <config> --hidden

Alternatively, set start_hidden = true in your configuration file.
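
The interaction between the `--hidden` flag and the `start_hidden` setting could look roughly like the sketch below. This is not the project's actual argument parser: the function names are hypothetical, and treating the window as hidden when either source asks for it is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the two options shown above: -c <config> and --hidden."""
    parser = argparse.ArgumentParser(prog="speak-now")
    parser.add_argument("-c", dest="config", default=None,
                        help="path to a TOML configuration file")
    parser.add_argument("--hidden", action="store_true",
                        help="start with the UI hidden")
    return parser.parse_args(argv)


def should_start_hidden(args, config: dict) -> bool:
    # Assumption: the UI starts hidden if either the flag or the
    # [ui] start_hidden setting requests it.
    return args.hidden or bool(config.get("ui", {}).get("start_hidden", False))


args = parse_args(["-c", "stt_config.toml", "--hidden"])
print(args.config, should_start_hidden(args, {}))  # stt_config.toml True
```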

Hotkeys

Action            Default Hotkey   Description
Paste Raw         Ctrl+`           Paste unformatted transcription text
Format & Paste    Alt+`            Format transcription with Gemini and paste
Toggle Recording  Ctrl+Alt+Space   Start/pause speech recognition
Toggle Window     Ctrl+Alt+V       Show/hide the application window
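
Hotkeys are written as `+`-separated strings such as "ctrl+alt+space". The sketch below shows one way such a string could be split into modifiers and a key. The `parse_hotkey` helper and the set of recognized modifiers are assumptions for illustration; Speak Now may delegate this to a hotkey library instead.

```python
MODIFIERS = {"ctrl", "alt", "shift"}  # assumed set of modifier names


def parse_hotkey(spec: str):
    """Split a string such as "ctrl+alt+space" into (modifiers, key)."""
    parts = [p.strip().lower() for p in spec.split("+") if p.strip()]
    modifiers = frozenset(p for p in parts if p in MODIFIERS)
    keys = [p for p in parts if p not in MODIFIERS]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one non-modifier key in {spec!r}")
    return modifiers, keys[0]


modifiers, key = parse_hotkey("Ctrl+Alt+Space")
print(sorted(modifiers), key)  # ['alt', 'ctrl'] space
```

Lowercasing the parts means "Ctrl+Alt+V" and "ctrl+alt+v" are treated as the same combination.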

Formatting Options

  • Natural: Improves flow and fixes grammar while maintaining your voice
  • Formal: Transforms text into professional, business-appropriate language
  • Concise: Condenses text while preserving important information
  • Catgirl: Fun transformation to sound like a cute catgirl (example of custom style)
  • None: No formatting, equivalent to "Paste Raw"
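
Each style corresponds to a prompt in the configuration file. A minimal sketch of how a style might be turned into the text sent to Gemini follows. It assumes the prompt is simply prepended to the transcription (the prompts end in ": ", which suggests this, but the README does not confirm it); `build_prompt` is a hypothetical name, and only two of the default prompts are reproduced here.

```python
FORMATTING_PROMPTS = {
    "Formal": "Reformat this transcription into formal, professional language: ",
    "Concise": "Reformat this transcription to be more concise while preserving all important information: ",
    "None": "",  # no formatting
}


def build_prompt(style: str, transcription: str, prompts=FORMATTING_PROMPTS):
    """Return the model input, or None when the text should be pasted raw."""
    prefix = prompts.get(style, "")
    if not prefix:
        # Assumption: an empty or unknown style means "paste raw".
        return None
    return prefix + transcription.strip()


print(build_prompt("Formal", " hey can we meet tomorrow "))
print(build_prompt("None", "hey can we meet tomorrow"))  # None
```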

Configuration

Speak Now uses a TOML configuration file (stt_config.toml). Key settings include:

[api]
gemini_api_key = ""  # Set your Gemini API key or use environment variable
model = "gemini-1.5-flash"  # Choose Gemini model to use

[stt]
model = "large-v2"  # Speech recognition model
timeout = 1.0  # Recognition timeout

[hotkeys]
paste_raw = "ctrl+`"
paste_formatted = "alt+`"
toggle_recording = "ctrl+alt+space"
toggle_window = "ctrl+alt+v"

[ui]
opacity = 0.90
max_history_items = 10
default_format = "Concise"
start_hidden = false  # Set to true to start with the UI hidden

[formatting_prompts]
# Customize these prompts to change formatting behavior
Natural = "Reformat this transcription to sound more natural and fix any grammar issues: "
Formal = "Reformat this transcription into formal, professional language: "
Concise = "Reformat this transcription to be more concise while preserving all important information: "
Catgirl = "Reformat this transcription to sound like a cute catgirl talking: "
None = ""  # No formatting
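
The comment on `gemini_api_key` says the key can come from the file or from an environment variable. The sketch below shows one possible lookup order over an already-parsed configuration dictionary. The variable name `GEMINI_API_KEY`, the precedence (file first, then environment), and the `resolve_api_key` helper are all assumptions; check the project source for the actual behavior.

```python
import os


def resolve_api_key(config: dict, env=None, env_var: str = "GEMINI_API_KEY"):
    """Return the API key from the config, else from the environment, else None."""
    if env is None:
        env = os.environ
    key = config.get("api", {}).get("gemini_api_key", "").strip()
    if key:
        return key
    # Assumed fallback: the environment variable name is hypothetical.
    key = env.get(env_var, "").strip()
    return key or None


config = {"api": {"gemini_api_key": "", "model": "gemini-1.5-flash"}}
print(resolve_api_key(config, env={"GEMINI_API_KEY": "example-key"}))  # example-key
```

Keeping the key out of the file and in the environment avoids committing it to version control by accident.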

Current Status

This project is a work in progress. While the core functionality works well, you may encounter occasional bugs or limitations as development continues. The focus is on maintaining low latency and seamless integration with your existing workflow.

Key Benefits

  • Minimal Disruption: Can operate completely in the background
  • Low Latency: Designed for real-time use with minimal delay
  • Integration: Works with any application that accepts text input
  • Customizable Experience: Tailor the tool to your specific needs
  • Privacy-Focused: Speech recognition runs locally

Building from Source

To build the source distribution and wheel manually, then check and upload them, run the following commands:

python -m pip install build twine
python -m build
twine check dist/*
twine upload dist/*

License

This project is released under the MIT License. See LICENSE for details.

Download files

Source Distribution

speak_now-0.1.2.tar.gz (176.2 kB)

Uploaded Source

Built Distribution

speak_now-0.1.2-py3-none-any.whl (18.1 kB)

Uploaded Python 3

File details

Details for the file speak_now-0.1.2.tar.gz.

File metadata

  • Download URL: speak_now-0.1.2.tar.gz
  • Upload date:
  • Size: 176.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for speak_now-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       45f4b67a8dd8d5bb2bbf793a8bc3b0084f9857f575e660ccd477b8ffe1c11333
MD5          319fc508990ce162b82e6e10f716d5a4
BLAKE2b-256  83b0402f654f3193b0b235db1616d00b271716682f4679ea9246f2ba5ffcbd06

File details

Details for the file speak_now-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: speak_now-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for speak_now-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       9dfdd7a75ebd5d26cce523e621f4381f1be3e3398bfd80d966e17f3f214e5724
MD5          1da363dbb9a0ce46afcf663838fc01a4
BLAKE2b-256  3368d79643190ea8d239a2939f567d252be56579fde7fdfc9b90d406bbe64845
