listening_tool

Setup

A simple toolset for using Whisper models to transcribe audio in real-time.

listening_tool wraps the whisper library behind a simple interface for real-time audio transcription. The module is designed to be versatile: transcriptions can be piped to local or remote endpoints for further processing. All aspects of the transcription can be configured via a config file (see the Config section below).

Other Agent Tools

Thinking Tool - An Ollama-based LLM server for distributed agentic operations.
Speaking Tool - A simple text-to-speech server using Kokoro models.

Prerequisites

macOS

brew install portaudio

Linux

Ubuntu

sudo apt install portaudio19-dev -y

Quick Start

Install the package and create a config file.

pip install listening_tool

Create a config.yaml file using the configuration options described in the Config section below.

Below is a basic example of how to use the listening tool to transcribe audio in real-time.

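from listening_tool import Config, RecordingDevice, ListeningTool, TranscriptionResult

# Called with the transcribed text and the full result object.
def transcription_callback(text: str, result: TranscriptionResult) -> None:
    print("Here's what I heard: ")
    print(result)

# Load the microphone, model, and logging settings from the YAML file.
config = Config.load("config.yaml")

# Wire the configured microphone into the listening tool.
recording_device = RecordingDevice(config.mic_config)
listening_tool = ListeningTool(
    config.listening_tool,
    recording_device,
)

# Start listening; the callback receives each transcription.
listening_tool.listen(transcription_callback)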

The transcription_callback function is called each time a transcription is completed. It receives the transcribed text and the full TranscriptionResult.
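Since the callback runs once per completed transcription, one option is to keep it small and hand the results to a queue that a separate thread drains. The following is a minimal sketch using only the standard library. The helpers make_queueing_callback and worker are hypothetical and not part of listening_tool, and it is an assumption that a fast-returning callback is desirable in your setup.

```python
import queue
import threading
from typing import Any, Callable


def make_queueing_callback(q: queue.Queue) -> Callable[[str, Any], None]:
    """Return a callback with the (text, result) signature that only enqueues."""

    def transcription_callback(text: str, result: Any) -> None:
        # Do no slow work here; just hand the data to the worker thread.
        q.put((text, result))

    return transcription_callback


def worker(q: queue.Queue, handle: Callable[[str, Any], None], stop: threading.Event) -> None:
    """Drain the queue until `stop` is set and no items remain."""
    while not stop.is_set() or not q.empty():
        try:
            text, result = q.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(text, result)
        q.task_done()
```

The function returned by make_queueing_callback(q) could then be passed to listening_tool.listen(...) in place of the printing callback, with worker running in a threading.Thread.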

Documentation

Attribution

The core of this code was heavily influenced by, and includes some code from:

Huge thanks to davabase for the initial code! All I've done is wrap it up in a nice package.

Send Text to Web API

import requests
from listening_tool import Config, RecordingDevice, ListeningTool, TranscriptionResult

def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Send the transcription to a REST API
    requests.post(
        "http://localhost:5000/transcribe",
        json={"text": text, "result": result.to_dict()}
    )

config = Config.load("config.yaml")
recording_device = RecordingDevice(config.mic_config)
listening_tool = ListeningTool(
    config.listening_tool,
    recording_device,
)
listening_tool.listen(transcription_callback)

The TranscriptionResult object has a .to_dict() method that converts the object to a dictionary, which can be serialized to JSON.

{
    "text": "This is only a test of words.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 1.8,
            "text": " This is only a test of words.",
            "tokens": [50363, 770, 318, 691, 257, 1332, 286, 2456, 13, 50463],
            "temperature": 0.0,
            "avg_logprob": -0.43947878750887787,
            "compression_ratio": 0.8285714285714286,
            "no_speech_prob": 0.0012085052439942956,
            "words": [
                {"word": " This", "start": 0.0, "end": 0.36, "probability": 0.750191330909729},
                {"word": " is", "start": 0.36, "end": 0.54, "probability": 0.997636079788208},
                {"word": " only", "start": 0.54, "end": 0.78, "probability": 0.998072624206543},
                {"word": " a", "start": 0.78, "end": 1.02, "probability": 0.9984667897224426},
                {"word": " test", "start": 1.02, "end": 1.28, "probability": 0.9980781078338623},
                {"word": " of", "start": 1.28, "end": 1.48, "probability": 0.99817955493927},
                {"word": " words.", "start": 1.48, "end": 1.8, "probability": 0.9987621307373047}
            ]
        }
    ],
    "language": "en",
    "processing_secs": 5.410359,
    "local_starttime": "2025-01-31T06:19:03.322642-06:00",
    "processing_rolling_avg_secs": 22.098183908976
}
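The per-word probabilities in this dictionary lend themselves to simple downstream checks. As a minimal sketch, the hypothetical helper low_confidence_words below (not part of listening_tool) collects words whose probability falls under a threshold. The 0.9 threshold is an arbitrary assumption, and the "words" entries are only expected when word_timestamps is enabled in the config.

```python
from typing import Any


def low_confidence_words(result: dict[str, Any], threshold: float = 0.9) -> list[str]:
    """Collect words whose probability falls below `threshold`."""
    flagged = []
    for segment in result.get("segments", []):
        # Segments carry no "words" key unless word timestamps are enabled.
        for word in segment.get("words", []):
            if word["probability"] < threshold:
                flagged.append(word["word"].strip())
    return flagged


# Trimmed copy of the sample output above.
sample = {
    "segments": [
        {
            "words": [
                {"word": " This", "start": 0.0, "end": 0.36, "probability": 0.750191330909729},
                {"word": " is", "start": 0.36, "end": 0.54, "probability": 0.997636079788208},
            ]
        }
    ]
}

print(low_confidence_words(sample))  # ['This']
```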

Config

The config is a YAML file that controls all aspects of audio recording, model configuration, and transcription formatting. Below is an example of a config file.

mic_config:
  mic_name: "Jabra SPEAK 410 USB: Audio (hw:3,0)" # Linux only
  sample_rate: 16000
  energy_threshold: 3000 # 0-4000

listening_tool:
  record_timeout: 2 # 0-10
  phrase_timeout: 3 # 0-10
  in_memory: True
  transcribe_config:
    # Available models: 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
    # 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3',
    # 'large', 'large-v3-turbo', 'turbo'
    model: medium.en

    # Whether to display the text being decoded to the console.
    # If True, displays all the details. If False, displays
    # minimal details. If None, does not display anything.
    verbose: True

    # Temperature for sampling. It can be a tuple of temperatures,
    # which will be successively used upon failures according to
    # either compression_ratio_threshold or logprob_threshold.
    temperature: "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)" # "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)"

    # If the gzip compression ratio is above this value,
    # treat as failed
    compression_ratio_threshold: 2.4 # 2.4

    # If the average log probability over sampled tokens is below this value, treat as failed
    logprob_threshold: -1.0 # -1.0

    # If the no_speech probability is higher than this value AND
    # the average log probability over sampled tokens is below
    # logprob_threshold, consider the segment as silent
    no_speech_threshold: 0.6 # 0.6

    # If True, the previous output of the model is provided as a
    # prompt for the next window. Disabling may make the text
    # inconsistent across windows, but the model becomes less
    # prone to getting stuck in a failure loop, such as repetition
    # looping or timestamps going out of sync.
    condition_on_previous_text: True # True

    # Extract word-level timestamps using the cross-attention
    # pattern and dynamic time warping, and include the timestamps
    # for each word in each segment.
    # NOTE: Setting this to true also adds word level data to the
    # output, which can be useful for downstream processing.  E.g.,
    # {
    #   'word': 'test',
    #   'start': np.float64(1.0),
    #   'end': np.float64(1.6),
    #   'probability': np.float64(0.8470910787582397)
    # }
    word_timestamps: True # False

    # If word_timestamps is True, merge these punctuation symbols
    # with the next word
    prepend_punctuations: '"''“¿([{-'

    # If word_timestamps is True, merge these punctuation symbols with the previous word
    append_punctuations: '"''.。,，!！?？:：”)]}、'

    # Optional text to provide as a prompt for the first window.
    # This can be used to provide, or "prompt-engineer", a context
    # for transcription, e.g. custom vocabularies or proper nouns,
    # to make it more likely to predict those words correctly.
    initial_prompt: "" # ""

    # Comma-separated list start,end,start,end,... timestamps
    # (in seconds) of clips to process. The last end timestamp
    # defaults to the end of the file.
    clip_timestamps: "0" # "0"

    # When word_timestamps is True, skip silent periods **longer**
    # than this threshold (in seconds) when a possible
    # hallucination is detected
    hallucination_silence_threshold: None # float | None

    # Keyword arguments to construct DecodingOptions instances
    # TODO: How can DecodingOptions work?

logging_config:
  level: INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
  filepath: "talking.log"
  log_entry_format: "%(asctime)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"
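The range comments in the example (0-4000 for energy_threshold, 0-10 for the two timeouts) can be checked before starting a session. The sketch below assumes the YAML has already been loaded into a plain dictionary, for example with a YAML parser such as PyYAML's yaml.safe_load. The helpers check_ranges and parse_temperature are hypothetical, and whether listening_tool performs similar validation or parses the temperature string this way is not stated here.

```python
import ast
from typing import Any

# Ranges taken from the comments in the example config above.
RANGES = {
    ("mic_config", "energy_threshold"): (0, 4000),
    ("listening_tool", "record_timeout"): (0, 10),
    ("listening_tool", "phrase_timeout"): (0, 10),
}


def check_ranges(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for (section, key), (low, high) in RANGES.items():
        value = config.get(section, {}).get(key)
        if value is None:
            problems.append(f"{section}.{key} is missing")
        elif not low <= value <= high:
            problems.append(f"{section}.{key}={value} is outside {low}-{high}")
    return problems


def parse_temperature(raw: str) -> tuple[float, ...]:
    """Turn a string such as "(0.0, 0.2)" or "0.5" into a tuple of floats."""
    value = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    if isinstance(value, (int, float)):
        return (float(value),)
    return tuple(float(t) for t in value)
```

Collecting the problems into a list, rather than raising on the first one, lets a caller report every out-of-range value in a single pass.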

Release history

0.0.39 (this version): Mar 3, 2025
0.0.38: Mar 3, 2025
0.0.37: Mar 3, 2025

