Miscellaneous utilities for helping with audio transcription.

cjm-transcription-utils

Install

pip install cjm_transcription_utils

Project Structure

nbs/
├── chunking.ipynb            # Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.
├── formatting.ipynb          # Utilities for formatting time intervals into human-readable timestamp ranges.
├── librosa.ipynb             # Audio loading and normalization utilities using librosa.
├── numerizer.ipynb           # Text number conversion utilities with patched numerizer to preserve articles like 'a'.
├── postprocessing.ipynb      # Transcript post-processing utilities for converting numbers to words and normalizing text.
├── pydub.ipynb               # Audio segment extraction utilities using pydub.
├── silero_vad.ipynb          # Voice Activity Detection utilities using the Silero VAD model.
└── timestamp_alignment.ipynb # Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.

Total: 8 notebooks

Module Dependencies

graph LR
    chunking[chunking<br/>chunking]
    formatting[formatting<br/>formatting]
    librosa[librosa<br/>librosa]
    numerizer[numerizer<br/>numerizer]
    postprocessing[postprocessing<br/>postprocessing]
    pydub[pydub<br/>pydub]
    silero_vad[silero_vad<br/>silero vad]
    timestamp_alignment[timestamp_alignment<br/>timestamp alignment]

    silero_vad --> chunking
    silero_vad --> librosa

2 cross-module dependencies detected

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

chunking (chunking.ipynb)

Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.

Import

from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)

Functions

def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]], 
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."
def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]], 
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."
def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[float, float]]:  # List of tuples representing time intervals for intermediate chunks
    "Generate overlapping chunks between consecutive chunk boundaries"
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"

formatting (formatting.ipynb)

Utilities for formatting time intervals into human-readable timestamp ranges.

Import

from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)

Functions

def time_interval_to_hms_range(
    duration_tuple: tuple[float, float]  # A tuple of (start_seconds, end_seconds) as floats
) -> str:  # Formatted timestamp range string in [HH:MM:SS.ss]-[HH:MM:SS.ss] format
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."

librosa (librosa.ipynb)

Audio loading and normalization utilities using librosa.

Import

from cjm_transcription_utils.librosa import (
    load_audio
)

Functions

def load_audio(
    audio_path: str,  # Path to the audio file to load
    target_sr: int = 16000  # Target sample rate for resampling
) -> Tuple[np.ndarray, int]:  # Tuple of (normalized audio array, sample rate)
    "Load and normalize audio file"

numerizer (numerizer.ipynb)

Text number conversion utilities with patched numerizer to preserve articles like 'a'.

Import

from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)

Functions

def patched_numerize_numerals(
    s: str,  # String to convert written numbers to digits
    ignore: list = None,  # List of words to ignore during conversion
    bias: str = None  # Conversion bias (e.g., 'ordinal')
) -> str:  # String with written numbers converted to digits
    "Patched version that doesn't convert 'a' to '1'"
def smart_numerize(
    text: str  # Text containing written numbers to convert
) -> str:  # Text with written numbers converted to digits
    "Convert written numbers to digits with special handling for compound ordinals."

postprocessing (postprocessing.ipynb)

Transcript post-processing utilities for converting numbers to words and normalizing text.

Import

from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)

Functions

def replace_integers_in_string(
    text: str  # Text containing integers to convert to words
) -> str:  # Text with integers converted to their word representation
    "Replace integer numbers with their word equivalents while preserving special formats."
def transcription_post_processing(
    transcript: str  # Raw transcript text to process
) -> str:  # Processed transcript with integers converted to words and dashes normalized
    "Apply post-processing transformations to transcript text."

pydub (pydub.ipynb)

Audio segment extraction utilities using pydub.

Import

from cjm_transcription_utils.pydub import (
    get_audio_segment
)

Functions

def get_audio_segment(
    audio: AudioSegment,  # Source audio segment to extract from
    start: float,  # Start time in seconds
    end: float,  # End time in seconds
    offset: float = 0  # Offset in milliseconds to expand segment boundaries
) -> AudioSegment:  # Extracted audio segment
    "Extract audio segment between start and end times"

silero_vad (silero_vad.ipynb)

Voice Activity Detection utilities using the Silero VAD model.

Import

from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)

Functions

def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
) -> tuple[np.ndarray, int, float, list, list, list]:  # Tuple of (audio array, sample rate, duration, speech timestamps, chunks, chunk timestamps)
    "Load audio and prepare VAD timestamps if needed."

timestamp_alignment (timestamp_alignment.ipynb)

Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.

Import

from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)

Functions

def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]
    "Align timestamp segments to the final transcript."

Classes

class TranscriptAligner:
    "Aligns VAD timestamps to a corrected transcript using fuzzy matching."
    
    def __init__(self,
                 correct_transcript: str, # The full, correct transcript text
                 segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
                 timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
                 confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
                )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."
    
    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # List of alignment dictionaries with timestamp, text, and confidence info
        "Align timestamps to the correct transcript with optional corrections."
