Miscellaneous utilities for helping with audio transcription.

cjm-transcription-utils

Install

pip install cjm_transcription_utils

Project Structure

nbs/
├── chunking.ipynb            # Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.
├── formatting.ipynb          # Utilities for formatting time intervals into human-readable timestamp ranges.
├── librosa.ipynb             # Audio loading and normalization utilities using librosa.
├── numerizer.ipynb           # Text number conversion utilities with patched numerizer to preserve articles like 'a'.
├── postprocessing.ipynb      # Transcript post-processing utilities for converting numbers to words and normalizing text.
├── pydub.ipynb               # Audio segment extraction utilities using pydub.
├── silero_vad.ipynb          # Voice Activity Detection utilities using the Silero VAD model.
└── timestamp_alignment.ipynb # Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.

Total: 8 notebooks

Module Dependencies

graph LR
    chunking[chunking<br/>chunking]
    formatting[formatting<br/>formatting]
    librosa[librosa<br/>librosa]
    numerizer[numerizer<br/>numerizer]
    postprocessing[postprocessing<br/>postprocessing]
    pydub[pydub<br/>pydub]
    silero_vad[silero_vad<br/>silero vad]
    timestamp_alignment[timestamp_alignment<br/>timestamp alignment]

    silero_vad --> chunking
    silero_vad --> librosa

2 cross-module dependencies detected

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

chunking (`chunking.ipynb`)

Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.

Import

from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)

Functions

def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]], 
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."

def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]], 
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."

def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"

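The silence-based splitting described for `generate_chunks_with_vad` can be pictured with a minimal sketch. This is not the package's implementation: `chunk_by_silence` is a hypothetical helper, and it assumes a new chunk starts whenever the gap between two speech segments exceeds `max_silence_threshold`, or whenever adding a segment would push the chunk past `max_chunk_seconds`.

```python
from typing import Dict, List, Tuple


def chunk_by_silence(
    speech_timestamps: List[Dict[str, float]],
    max_chunk_seconds: float = 120,
    max_silence_threshold: float = 2.0,
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Group VAD segments into chunks (hypothetical helper, assumed rules)."""
    chunks: List[Tuple[float, float]] = []
    chunk_timestamps: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []

    for ts in speech_timestamps:
        if current:
            silence = ts["start"] - current[-1]["end"]
            too_long = ts["end"] - current[0]["start"] > max_chunk_seconds
            # Assumption: either condition closes the current chunk.
            if silence > max_silence_threshold or too_long:
                chunks.append((current[0]["start"], current[-1]["end"]))
                chunk_timestamps.append(current)
                current = []
        current.append(ts)

    # Flush the final, still-open chunk.
    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        chunk_timestamps.append(current)
    return chunks, chunk_timestamps


segments = [
    {"start": 0.0, "end": 4.0},
    {"start": 4.5, "end": 9.0},
    {"start": 12.0, "end": 15.0},  # 3.0 s of silence before this segment
]
chunks, chunk_timestamps = chunk_by_silence(segments)
print(chunks)  # [(0.0, 9.0), (12.0, 15.0)]
```

The return shape mirrors the documented one: a list of `(start, end)` chunks plus the timestamp dictionaries that belong to each chunk.
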
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[float, float]]:  # List of tuples representing time intervals for intermediate chunks
    "Generate overlapping chunks between consecutive chunk boundaries"

def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."

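As a rough illustration of the two intermediate-chunk helpers, the sketch below pairs the last timestamp of each chunk with the first timestamp of the next one, then turns each pair into an interval that straddles the chunk boundary. The names `intermediate_pairs` and `intermediate_intervals` are hypothetical, and the choice of `(last["start"], first["end"])` as the interval is an assumption; the extended-boundary option is not modeled.

```python
from typing import Dict, List, Tuple

Timestamp = Dict[str, float]


def intermediate_pairs(
    chunk_timestamps: List[List[Timestamp]],
) -> List[Tuple[Timestamp, Timestamp]]:
    """Return (last_timestamp, first_timestamp) for consecutive chunks."""
    pairs = []
    for prev, nxt in zip(chunk_timestamps, chunk_timestamps[1:]):
        if prev and nxt:  # skip chunks that carry no timestamps
            pairs.append((prev[-1], nxt[0]))
    return pairs


def intermediate_intervals(
    chunk_timestamps: List[List[Timestamp]],
) -> List[Tuple[float, float]]:
    """Assumed rule: span from the last segment's start to the next segment's end."""
    return [
        (last["start"], first["end"])
        for last, first in intermediate_pairs(chunk_timestamps)
    ]


example_chunk_timestamps = [
    [{"start": 0.0, "end": 4.0}, {"start": 4.5, "end": 9.0}],
    [{"start": 12.0, "end": 15.0}],
]
print(intermediate_intervals(example_chunk_timestamps))  # [(4.5, 15.0)]
```

Transcribing these boundary-spanning intervals gives the intermediate transcripts that a merge step can use to repair words cut at chunk edges.
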
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"

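One simple way to think about overlap correction is word-level suffix/prefix matching between two neighboring transcripts. The sketch below shows only that idea; `merge_with_overlap` is a hypothetical helper, and the package's `merge_transcripts_with_overlaps` takes different inputs and may work quite differently.

```python
def merge_with_overlap(left: str, right: str) -> str:
    """Join two transcripts, dropping words repeated at the seam.

    Illustrative only: finds the longest run of words that ends `left`
    and also starts `right` (case-insensitive) and keeps it once.
    """
    left_words, right_words = left.split(), right.split()
    max_overlap = min(len(left_words), len(right_words))
    for k in range(max_overlap, 0, -1):
        tail = [w.lower() for w in left_words[-k:]]
        head = [w.lower() for w in right_words[:k]]
        if tail == head:
            return " ".join(left_words + right_words[k:])
    # No shared words at the seam: plain concatenation.
    return " ".join(left_words + right_words)


merged = merge_with_overlap("we went to the store", "to the store and bought milk")
print(merged)  # we went to the store and bought milk
```
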
formatting (`formatting.ipynb`)

Utilities for formatting time intervals into human-readable timestamp ranges.

Import

from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)

Functions

def time_interval_to_hms_range(
    duration_tuple: tuple[float, float]  # A tuple of (start_seconds, end_seconds) as floats
) -> str:  # Formatted timestamp range string in [HH:MM:SS.ss]-[HH:MM:SS.ss] format
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."

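The `[HH:MM:SS.ss]-[HH:MM:SS.ss]` format can be reproduced with a few lines of arithmetic. The sketch below is an independent illustration rather than the package's code; `seconds_to_hms` and `hms_range_sketch` are hypothetical names, and rounding to hundredths of a second is an assumption.

```python
from typing import Tuple


def seconds_to_hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS.ss, working in whole centiseconds."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, centis = divmod(rem, 100)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{centis:02d}"


def hms_range_sketch(interval: Tuple[float, float]) -> str:
    """Format (start_seconds, end_seconds) as a bracketed timestamp range."""
    start, end = interval
    return f"[{seconds_to_hms(start)}]-[{seconds_to_hms(end)}]"


print(hms_range_sketch((0.0, 75.5)))  # [00:00:00.00]-[00:01:15.50]
```

Working in integer centiseconds avoids results such as `00:00:60.00` that can appear when the seconds field is rounded separately.
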
librosa (`librosa.ipynb`)

Audio loading and normalization utilities using librosa.

Import

from cjm_transcription_utils.librosa import (
    load_audio
)

Functions

def load_audio(
    audio_path: str,  # Path to the audio file to load
    target_sr: int = 16000  # Target sample rate for resampling
) -> Tuple[np.ndarray, int]:  # Tuple of (normalized audio array, sample rate)
    "Load and normalize audio file"

numerizer (`numerizer.ipynb`)

Text number conversion utilities with patched numerizer to preserve articles like 'a'.

Import

from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)

Functions

def patched_numerize_numerals(
    s: str,  # String to convert written numbers to digits
    ignore: list = None,  # List of words to ignore during conversion
    bias: str = None  # Conversion bias (e.g., 'ordinal')
) -> str:  # String with written numbers converted to digits
    "Patched version that doesn't convert 'a' to '1'"

def smart_numerize(
    text: str  # Text containing written numbers to convert
) -> str:  # Text with written numbers converted to digits
    "Convert written numbers to digits with special handling for compound ordinals."

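To see why the patch matters, consider a toy converter whose vocabulary treats the article "a" as the number 1. The sketch below does not use the numerizer library and is not the package's patch; `toy_numerize` and its small `NUMBER_WORDS` table are hypothetical, and the `ignore` argument only mimics the role of the documented `ignore` parameter.

```python
import re
from typing import Iterable

# Deliberately tiny vocabulary; "a" is included to mimic the unwanted behavior.
NUMBER_WORDS = {"a": "1", "one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}


def toy_numerize(text: str, ignore: Iterable[str] = ()) -> str:
    """Replace known number words with digits, skipping words in `ignore`."""
    skipped = {word.lower() for word in ignore}

    def replace(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in skipped or key not in NUMBER_WORDS:
            return word  # leave ordinary words (and ignored ones) untouched
        return NUMBER_WORDS[key]

    return re.sub(r"[A-Za-z]+", replace, text)


naive = toy_numerize("I saw a dog and two cats")
patched = toy_numerize("I saw a dog and two cats", ignore=["a"])
print(naive)    # I saw 1 dog and 2 cats
print(patched)  # I saw a dog and 2 cats
```
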
postprocessing (`postprocessing.ipynb`)

Transcript post-processing utilities for converting numbers to words and normalizing text.

Import

from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)

Functions

def replace_integers_in_string(
    text: str  # Text containing integers to convert to words
) -> str:  # Text with integers converted to their word representation
    "Replace integer numbers with their word equivalents while preserving special formats."

def transcription_post_processing(
    transcript: str  # Raw transcript text to process
) -> str:  # Processed transcript with integers converted to words and dashes normalized
    "Apply post-processing transformations to transcript text."

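The integer replacement step can be sketched with a regular expression and a small number-to-words routine. This is an illustration under stated assumptions, not the package's code: `int_to_words` and `replace_standalone_integers` are hypothetical names, and "special formats" is assumed here to mean digits that are part of decimals, times, dates, or alphanumeric tokens. Dash normalization is not shown.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out 0 <= n < 1,000,000 in English words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rem = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {int_to_words(rem)}" if rem else "")
    thousands, rem = divmod(n, 1000)
    return f"{int_to_words(thousands)} thousand" + (f" {int_to_words(rem)}" if rem else "")


# Digits not glued to letters, other digits, or separators such as . , : / -
STANDALONE_INT = re.compile(r"(?<![\w.,:/-])\d+(?!\w|[.:/,-]\d)")


def replace_standalone_integers(text: str) -> str:
    """Spell out standalone integers; leave 10:30, 4.50, 1,000 and similar as-is."""
    def replace(match: re.Match) -> str:
        value = int(match.group(0))
        return int_to_words(value) if value < 1_000_000 else match.group(0)

    return STANDALONE_INT.sub(replace, text)


print(replace_standalone_integers("I bought 3 apples at 10:30 for 4.50."))
# I bought three apples at 10:30 for 4.50.
```
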
pydub (`pydub.ipynb`)

Audio segment extraction utilities using pydub.

Import

from cjm_transcription_utils.pydub import (
    get_audio_segment
)

Functions

def get_audio_segment(
    audio: AudioSegment,  # Source audio segment to extract from
    start: float,  # Start time in seconds
    end: float,  # End time in seconds
    offset: float = 0  # Offset in milliseconds to expand segment boundaries
) -> AudioSegment:  # Extracted audio segment
    "Extract audio segment between start and end times"

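Because `start` and `end` are given in seconds while `offset` is in milliseconds, the unit conversion is worth spelling out. The sketch below computes only the slice bounds; `segment_bounds_ms` is a hypothetical helper, and it assumes the offset widens the segment on both sides and that bounds are clamped to the audio length.

```python
from typing import Optional, Tuple


def segment_bounds_ms(
    start: float,
    end: float,
    offset: float = 0,
    total_ms: Optional[int] = None,
) -> Tuple[int, int]:
    """Convert (start, end) in seconds to millisecond bounds widened by `offset`."""
    start_ms = max(0, int(round(start * 1000 - offset)))
    end_ms = int(round(end * 1000 + offset))
    if total_ms is not None:
        end_ms = min(end_ms, total_ms)  # do not run past the end of the audio
    return start_ms, end_ms


# pydub's AudioSegment objects are sliced in milliseconds, so with pydub
# installed the bounds could be applied as: audio[start_ms:end_ms]
print(segment_bounds_ms(1.5, 3.0, offset=250))  # (1250, 3250)
```
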
silero_vad (`silero_vad.ipynb`)

Voice Activity Detection utilities using the Silero VAD model.

Import

from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)

Functions

def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
) -> tuple[np.ndarray, int, float, list, list, list]:  # Tuple of (audio array, sample rate, duration, speech timestamps, chunks, chunk timestamps)
    "Load audio and prepare VAD timestamps if needed."

timestamp_alignment (`timestamp_alignment.ipynb`)

Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.

Import

from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)

Functions

def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]
    "Align timestamp segments to the final transcript."

Classes

class TranscriptAligner:
    "Aligns VAD timestamps to a corrected transcript using fuzzy matching."

    def __init__(self,
                 correct_transcript: str, # The full, correct transcript text
                 segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
                 timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
                 confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
                )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."

    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # List of alignment dictionaries with timestamp, text, and confidence info
        "Align timestamps to the correct transcript with optional corrections."

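The general idea of fuzzy alignment with a confidence threshold can be sketched using only the standard library. This is not how `TranscriptAligner` is implemented: `align_segments` is a hypothetical function, `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the package uses, and the greedy left-to-right word window and the output keys are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def align_segments(
    correct_transcript: str,
    segment_transcripts: List[str],
    timestamps: List[Dict[str, float]],
    confidence_threshold: int = 70,
) -> List[Dict]:
    """Greedily match each noisy segment to a span of the corrected transcript."""
    words = correct_transcript.split()
    cursor = 0
    alignments = []
    for seg_text, ts in zip(segment_transcripts, timestamps):
        n = len(seg_text.split())
        best_score, best_len = -1.0, n
        # Let the matched span be one word shorter or longer than the segment.
        for length in (n - 1, n, n + 1):
            if length < 1 or cursor + length > len(words):
                continue
            candidate = " ".join(words[cursor:cursor + length])
            score = 100 * SequenceMatcher(None, seg_text.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_score, best_len = score, length
        accepted = best_score >= confidence_threshold
        # Below the threshold, fall back to the segment's own (possibly wrong) text.
        text = " ".join(words[cursor:cursor + best_len]) if accepted else seg_text
        alignments.append({
            "start": ts["start"],
            "end": ts["end"],
            "text": text,
            "confidence": round(max(best_score, 0.0), 1),
            "accepted": accepted,
        })
        cursor = min(len(words), cursor + best_len)
    return alignments


aligned = align_segments(
    "The quick brown fox jumps over the lazy dog",
    ["the quik brown fox", "jumps over the lazy dog"],
    [{"start": 0.0, "end": 1.8}, {"start": 2.1, "end": 4.0}],
)
for item in aligned:
    print(item["start"], item["end"], item["text"])
```

Each timestamp ends up paired with text taken from the corrected transcript when the match clears the threshold, which is the behavior the `confidence_threshold` parameter suggests.
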
Release history

0.0.3 (this version): Oct 25, 2025

0.0.2: Sep 11, 2025

0.0.1: Sep 11, 2025

Download files

Source Distribution

cjm_transcription_utils-0.0.3.tar.gz (21.5 kB)

Uploaded Oct 25, 2025 Source

Built Distribution

cjm_transcription_utils-0.0.3-py3-none-any.whl (22.1 kB)

Uploaded Oct 25, 2025 Python 3

File details

Details for the file cjm_transcription_utils-0.0.3.tar.gz.

File metadata

Download URL: cjm_transcription_utils-0.0.3.tar.gz
Upload date: Oct 25, 2025
Size: 21.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for cjm_transcription_utils-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`fa4fdaefa0483b5f822768fa27537b163933a5474acb7870b39cf883e327f7b4`
MD5	`33f1c3bdf3133621fc9aa157edebf24a`
BLAKE2b-256	`ca6839c3306481cece5850c08cea393d0d42e06886cffeb4a481858a83c565ed`

File details

Details for the file cjm_transcription_utils-0.0.3-py3-none-any.whl.

File metadata

Download URL: cjm_transcription_utils-0.0.3-py3-none-any.whl
Upload date: Oct 25, 2025
Size: 22.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for cjm_transcription_utils-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f82996f676902a1931d85c2c359e2ad60d3d6bac08eb419cf0da43700304bd3`
MD5	`e28ede5326f5cb65b5d0aa89217147e8`
BLAKE2b-256	`efd6499ed3470e8142581115c4abd74e6e51710249ea769b12dbaf94555b0433`
