Miscellaneous utilities for audio transcription.

cjm-transcription-utils

Install

pip install cjm_transcription_utils

Project Structure

nbs/
├── chunking.ipynb            # VAD-based audio chunking and overlap-aware transcript merging
├── formatting.ipynb          # Time interval to HMS range formatting
├── librosa.ipynb             # Audio loading and normalization
├── numerizer.ipynb           # Number-word conversion helpers, including a patch that leaves 'a' unconverted
├── postprocessing.ipynb      # Transcript post-processing helpers
├── pydub.ipynb               # Audio segment extraction
├── silero_vad.ipynb          # Audio loading and VAD timestamp preparation
└── timestamp_alignment.ipynb # Aligning timestamp segments to a final transcript

Total: 8 notebooks

Module Dependencies

graph LR
    chunking[chunking<br/>chunking]
    formatting[formatting<br/>formatting]
    librosa[librosa<br/>librosa]
    numerizer[numerizer<br/>numerizer]
    postprocessing[postprocessing<br/>postprocessing]
    pydub[pydub<br/>pydub]
    silero_vad[silero_vad<br/>silero vad]
    timestamp_alignment[timestamp_alignment<br/>timestamp alignment]

    silero_vad --> chunking

1 cross-module dependency detected

CLI Reference

No CLI commands found in this project.

Module Overview

Detailed documentation for each module in the project:

chunking (chunking.ipynb)

VAD-based audio chunking and overlap-aware transcript merging.

Import

from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)

Functions

def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]], 
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."
def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]], 
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."
def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # TODO: Add description
) -> List[Tuple[float, float]]
    "Generate overlapping chunks between consecutive chunk boundaries"
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # TODO: Add description
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"

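The silence-based splitting described for generate_chunks_with_vad can be pictured with a small standalone sketch. This is not the package's implementation: the function name group_speech_into_chunks is hypothetical, and the rule used here (start a new chunk when the gap to the previous speech segment exceeds max_silence_threshold, or when adding the segment would push the chunk past max_chunk_seconds) is an assumption based only on the parameter comments above.

```python
from typing import Dict, List, Tuple


def group_speech_into_chunks(
    speech_timestamps: List[Dict[str, float]],
    max_chunk_seconds: float = 120.0,
    max_silence_threshold: float = 2.0,
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Group VAD speech segments into chunks (illustrative sketch only)."""
    chunks: List[Tuple[float, float]] = []
    chunk_timestamps: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []

    for ts in speech_timestamps:
        if current:
            # Silence between the previous segment and this one.
            silence = ts["start"] - current[-1]["end"]
            # Chunk length if this segment were appended.
            would_span = ts["end"] - current[0]["start"]
            if silence > max_silence_threshold or would_span > max_chunk_seconds:
                # Close the current chunk and start a new one.
                chunks.append((current[0]["start"], current[-1]["end"]))
                chunk_timestamps.append(current)
                current = []
        current.append(ts)

    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        chunk_timestamps.append(current)

    return chunks, chunk_timestamps


speech = [
    {"start": 0.0, "end": 1.5},
    {"start": 2.0, "end": 3.0},
    {"start": 6.0, "end": 7.0},  # 3.0 s of silence before this segment
]
chunks, grouped = group_speech_into_chunks(speech)
print(chunks)  # [(0.0, 3.0), (6.0, 7.0)]
```

The return shape mirrors the documented one: a list of (start, end) chunk boundaries plus the speech timestamps that fall inside each chunk.
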
formatting (formatting.ipynb)

Time interval to HMS range formatting.

Import

from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)

Functions

def time_interval_to_hms_range(
    duration_tuple  # A tuple of (start_seconds, end_seconds) as floats
)
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."

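As a rough illustration of the conversion that time_interval_to_hms_range describes, the sketch below formats a (start_seconds, end_seconds) tuple as an HH:MM:SS range. The function names, the " - " separator, and the truncation of fractional seconds are all assumptions; the package's actual output format is not documented here.

```python
from typing import Tuple


def seconds_to_hms(seconds: float) -> str:
    """Format a non-negative number of seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def interval_to_hms_range(interval: Tuple[float, float]) -> str:
    """Format a (start_seconds, end_seconds) tuple as 'HH:MM:SS - HH:MM:SS'."""
    start, end = interval
    return f"{seconds_to_hms(start)} - {seconds_to_hms(end)}"


print(interval_to_hms_range((75.0, 3725.5)))  # 00:01:15 - 01:02:05
```
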
librosa (librosa.ipynb)

Audio loading and normalization.

Import

from cjm_transcription_utils.librosa import (
    load_audio
)

Functions

def load_audio(
    audio_path: str,  # Path to audio file
    target_sr: int = 16000  # Target sample rate in Hz
) -> Tuple[np.ndarray, int]:  # Audio array and its sample rate
    "Load and normalize audio file"

numerizer (numerizer.ipynb)

Number-word conversion helpers, including a patch that leaves 'a' unconverted.

Import

from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)

Functions

def patched_numerize_numerals(
    s,  # TODO: Add type hint and description
    ignore=None,  # TODO: Add type hint and description
    bias=None  # TODO: Add type hint and description
): # TODO: Add type hint
    "Patched version that doesn't convert 'a' to '1'"
def smart_numerize(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"

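The behaviour that patched_numerize_numerals is documented to fix (the article 'a' being turned into '1') can be shown with a toy converter. This sketch does not use or reproduce the numerizer library; the word table and the function name numerize_keep_article are hypothetical and cover only 'one' through 'ten'.

```python
import re

# Toy word-to-digit table. The article "a" is deliberately absent,
# so it is never rewritten as "1".
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

# Word boundaries keep "one" inside "someone" or "done" from matching.
_NUMBER_PATTERN = re.compile(
    r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)


def numerize_keep_article(text: str) -> str:
    """Replace spelled-out numbers with digits while leaving 'a' untouched."""
    return _NUMBER_PATTERN.sub(lambda m: NUMBER_WORDS[m.group(0).lower()], text)


print(numerize_keep_article("I bought a book and two pens"))
# I bought a book and 2 pens
```
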
postprocessing (postprocessing.ipynb)

Transcript post-processing helpers.

Import

from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)

Functions

def replace_integers_in_string(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"
def transcription_post_processing(
    transcript:str  # TODO: Add description
)->str:  # TODO: Add return description
    "TODO: Add function description"

pydub (pydub.ipynb)

Audio segment extraction.

Import

from cjm_transcription_utils.pydub import (
    get_audio_segment
)

Functions

def get_audio_segment(
    audio: AudioSegment,  # TODO: Add description
    start: float,  # TODO: Add description
    end: float,  # TODO: Add description
    offset: float=0  # TODO: Add description
) -> AudioSegment:  # TODO: Add return description
    "Extract audio segment between start and end times"

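pydub slices an AudioSegment by milliseconds (audio[start_ms:end_ms]), so extracting a segment from float times usually comes down to a unit conversion. The sketch below assumes start, end, and offset are in seconds and that offset is added to both bounds; neither is confirmed by the signature above. The helper names are hypothetical, and the slicing function works on any object that supports slicing, which keeps the example runnable without pydub.

```python
from typing import Tuple


def seconds_to_ms_bounds(
    start: float, end: float, offset: float = 0.0
) -> Tuple[int, int]:
    """Convert (start, end) in seconds to clamped millisecond slice bounds."""
    start_ms = max(0, int(round((start + offset) * 1000)))
    end_ms = max(start_ms, int(round((end + offset) * 1000)))
    return start_ms, end_ms


def slice_by_seconds(audio, start: float, end: float, offset: float = 0.0):
    """Slice a millisecond-indexed object (e.g. a pydub AudioSegment)."""
    start_ms, end_ms = seconds_to_ms_bounds(start, end, offset)
    return audio[start_ms:end_ms]


# Stand-in for an AudioSegment: one list element per millisecond.
fake_audio = list(range(5000))
segment = slice_by_seconds(fake_audio, 1.0, 2.0)
print(len(segment))  # 1000
```
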
silero vad (silero_vad.ipynb)

Audio loading and VAD timestamp preparation.

Import

from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)

Functions

def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
)
    "Load audio and prepare VAD timestamps if needed."

timestamp alignment (timestamp_alignment.ipynb)

Aligning timestamp segments to a final transcript.

Import

from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)

Functions

def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]
    "Align timestamp segments to the final transcript."

Classes

class TranscriptAligner:
    "TODO: Add class description"
    
    def __init__(self,
                     correct_transcript: str, # The full, correct transcript text
                     segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
                     timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
                     confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
                    )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."
    
    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # TODO: Add return description
        "Align timestamps to the correct transcript with optional corrections."

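The alignment idea (match each possibly noisy segment transcript against the correct transcript and accept the match only above a confidence threshold) can be sketched with the standard library. This is not the TranscriptAligner implementation: it swaps in difflib.SequenceMatcher for whatever scoring the package uses, assumes a 0 to 100 score scale because of the default confidence_threshold of 70, and the function name align_segments and the output keys are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def align_segments(
    correct_transcript: str,
    segment_transcripts: List[str],
    timestamps: List[Dict[str, float]],
    confidence_threshold: int = 70,
) -> List[Dict]:
    """Attach corrected text to each timestamp via fuzzy window matching."""
    words = correct_transcript.split()
    aligned: List[Dict] = []
    cursor = 0  # Index of the first word not yet assigned to a segment.

    for segment, ts in zip(segment_transcripts, timestamps):
        n = max(1, len(segment.split()))
        best_score, best_start = -1.0, cursor
        # Slide a window of n words forward from the cursor.
        for start in range(cursor, max(cursor + 1, len(words) - n + 1)):
            window = " ".join(words[start:start + n])
            ratio = SequenceMatcher(None, segment.lower(), window.lower()).ratio()
            score = 100 * ratio
            if score > best_score:
                best_score, best_start = score, start

        if best_score >= confidence_threshold:
            # Confident match: use the correct transcript's wording.
            text = " ".join(words[best_start:best_start + n])
            cursor = best_start + n
        else:
            # Low confidence: keep the segment's own text.
            text = segment

        aligned.append(
            {"start": ts["start"], "end": ts["end"], "text": text,
             "score": round(best_score)}
        )
    return aligned


result = align_segments(
    "the quick brown fox jumps over the lazy dog",
    ["the quik brown fox", "jumps over the lazy dog"],
    [{"start": 0.0, "end": 1.2}, {"start": 1.5, "end": 3.0}],
)
print([r["text"] for r in result])
# ['the quick brown fox', 'jumps over the lazy dog']
```

Advancing the cursor after each accepted match keeps segments in order and prevents two segments from claiming the same words, at the cost of not recovering if an early segment is matched too far ahead.
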
Download files

Source Distribution

cjm_transcription_utils-0.0.1.tar.gz (19.8 kB)

Uploaded Source

Built Distribution

cjm_transcription_utils-0.0.1-py3-none-any.whl (20.8 kB)

Uploaded Python 3

File details

Details for the file cjm_transcription_utils-0.0.1.tar.gz.

File metadata

  • Download URL: cjm_transcription_utils-0.0.1.tar.gz
  • Size: 19.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for cjm_transcription_utils-0.0.1.tar.gz
Algorithm Hash digest
SHA256 5f4682768e3d5dc0fda84e26970ba365e2fdf4f31f0f4d2a3d855be81dae846c
MD5 763ddcd9441f2d4ee6358f00c9d549e3
BLAKE2b-256 5fcdf9ed007c10602cbc84d735c7d687757695160ff9d5f023f3b15830d865c7

File details

Details for the file cjm_transcription_utils-0.0.1-py3-none-any.whl.

File hashes

Hashes for cjm_transcription_utils-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0a82cd948bf1f5c74a384d14b2d9016a6503bf4dfcd9d7010e74e66ed30756aa
MD5 da1cfd69934807025f274d07eb47f5e5
BLAKE2b-256 5da67156ee4eda61cdc88efef057a6caed8545f124ca5113ac4fb474033d107b
