Miscellaneous utilities for helping with audio transcription.
cjm-transcription-utils
Install
pip install cjm_transcription_utils
Project Structure
nbs/
├── chunking.ipynb # Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.
├── formatting.ipynb # Utilities for formatting time intervals into human-readable timestamp ranges.
├── librosa.ipynb # Audio loading and normalization utilities using librosa.
├── numerizer.ipynb # Text number conversion utilities with patched numerizer to preserve articles like 'a'.
├── postprocessing.ipynb # Transcript post-processing utilities for converting numbers to words and normalizing text.
├── pydub.ipynb # Audio segment extraction utilities using pydub.
├── silero_vad.ipynb # Voice Activity Detection utilities using the Silero VAD model.
└── timestamp_alignment.ipynb # Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.
Total: 8 notebooks
Module Dependencies
graph LR
chunking[chunking<br/>chunking]
formatting[formatting<br/>formatting]
librosa[librosa<br/>librosa]
numerizer[numerizer<br/>numerizer]
postprocessing[postprocessing<br/>postprocessing]
pydub[pydub<br/>pydub]
silero_vad[silero_vad<br/>silero vad]
timestamp_alignment[timestamp_alignment<br/>timestamp alignment]
silero_vad --> chunking
silero_vad --> librosa
2 cross-module dependencies detected
CLI Reference
No CLI commands found in this project.
Module Overview
Detailed documentation for each module in the project:
chunking (chunking.ipynb)
Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.
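As a rough illustration of silence-based splitting, the sketch below groups VAD speech segments into chunks. It closes a chunk when the pause before the next segment exceeds a threshold, or when adding the segment would push the chunk past a maximum length. This is not the package's implementation: `group_speech_into_chunks` is a hypothetical helper, and only the parameter names `max_chunk_seconds` and `max_silence_threshold` and the `'start'`/`'end'` keys are borrowed from the signatures listed in this section.

```python
from typing import Dict, List, Tuple


def group_speech_into_chunks(
    speech_timestamps: List[Dict[str, float]],
    max_chunk_seconds: float = 120.0,
    max_silence_threshold: float = 2.0,
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Group VAD speech segments into chunks (illustrative only)."""
    chunks: List[Tuple[float, float]] = []
    chunk_timestamps: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []

    for ts in speech_timestamps:
        if current:
            silence = ts["start"] - current[-1]["end"]
            span = ts["end"] - current[0]["start"]
            # Close the current chunk on a long pause, or when adding this
            # segment would exceed the maximum chunk duration.
            if silence > max_silence_threshold or span > max_chunk_seconds:
                chunks.append((current[0]["start"], current[-1]["end"]))
                chunk_timestamps.append(current)
                current = []
        current.append(ts)

    # Flush whatever is left after the last segment.
    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        chunk_timestamps.append(current)
    return chunks, chunk_timestamps


speech = [
    {"start": 0.5, "end": 4.0},
    {"start": 4.6, "end": 9.0},
    {"start": 12.5, "end": 15.0},  # 3.5 s pause before this segment
    {"start": 15.4, "end": 20.0},
]
chunks, per_chunk = group_speech_into_chunks(speech, max_chunk_seconds=10.0)
print(chunks)  # [(0.5, 9.0), (12.5, 20.0)]
```

Returning the per-chunk timestamp lists alongside the chunk boundaries mirrors the documented return type of `generate_chunks_with_vad`, which makes it possible to build overlapping intermediate chunks from the last and first segments of neighbouring chunks.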
Import
from cjm_transcription_utils.chunking import (
get_extended_timestamp_boundaries,
get_extended_chunk_boundaries,
generate_chunks_with_vad,
generate_intermediate_chunks,
generate_intermediate_chunk_tuples,
merge_transcripts_with_overlaps
)
Functions
def get_extended_timestamp_boundaries(
timestamps: List[Dict[str, float]],
index: int # Index of the current timestamp
) -> Tuple[float, float]
"Get extended boundaries for a timestamp using adjacent timestamps."
def get_extended_chunk_boundaries(
chunks: List[Tuple[float, float]],
index: int # Index of the current chunk
) -> Tuple[float, float]
"Get extended boundaries for a chunk using adjacent chunks."
def generate_chunks_with_vad(
audio_array: np.ndarray, # Audio array
duration: float, # Total duration of audio in seconds
max_chunk_seconds: float = 120, # Maximum chunk duration in seconds
max_chunk_seconds_offset: float = 0, # Offset for chunk duration calculation
speech_timestamps: Optional[List[Dict]] = None, # List of speech timestamp dictionaries with 'start' and 'end' keys
max_silence_threshold: float = 2.0 # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
"Generate chunks using VAD timestamps with silence-based splitting"
def generate_intermediate_chunks(
chunks: List[Tuple[float, float]],
chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
use_extended_boundaries: bool # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[float, float]]: # List of tuples representing time intervals for intermediate chunks
"Generate overlapping chunks between consecutive chunk boundaries"
def generate_intermediate_chunk_tuples(
chunks: List[Tuple[float, float]],
chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
use_extended_boundaries: bool # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[Dict, Dict]]
"Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
def merge_transcripts_with_overlaps(
normal_transcripts: List[str], # List of transcripts for normal chunks
intermediate_transcripts: List[str], # List of transcripts for intermediate chunks
segment_transcripts: List[Tuple[str, str]],
verbose: bool = True # Whether to print debug information
) -> str
"Merge normal and intermediate transcripts with overlap correction"
formatting (formatting.ipynb)
Utilities for formatting time intervals into human-readable timestamp ranges.
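A minimal sketch of this kind of formatting, targeting the `[HH:MM:SS.ss]-[HH:MM:SS.ss]` shape mentioned in the return comment of `time_interval_to_hms_range`. The helper names `seconds_to_hms` and `hms_range` are hypothetical, and the rounding to hundredths of a second is an assumption rather than the package's confirmed behavior.

```python
from typing import Tuple


def seconds_to_hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS.ss (illustrative only)."""
    # Work in whole centiseconds so rounding can never yield "60.00" seconds.
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{cs:02d}"


def hms_range(interval: Tuple[float, float]) -> str:
    """Format a (start_seconds, end_seconds) tuple as a bracketed range."""
    start, end = interval
    return f"[{seconds_to_hms(start)}]-[{seconds_to_hms(end)}]"


print(hms_range((75.5, 3725.25)))  # [00:01:15.50]-[01:02:05.25]
```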
Import
from cjm_transcription_utils.formatting import (
time_interval_to_hms_range
)
Functions
def time_interval_to_hms_range(
duration_tuple: tuple[float, float] # A tuple of (start_seconds, end_seconds) as floats
) -> str: # Formatted timestamp range string in [HH:MM:SS.ss]-[HH:MM:SS.ss] format
"Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
librosa (librosa.ipynb)
Audio loading and normalization utilities using librosa.
Import
from cjm_transcription_utils.librosa import (
load_audio
)
Functions
def load_audio(
audio_path: str, # Path to the audio file to load
target_sr: int = 16000 # Target sample rate for resampling
) -> Tuple[np.ndarray, int]: # Tuple of (normalized audio array, sample rate)
"Load and normalize audio file"
numerizer (numerizer.ipynb)
Text number conversion utilities with patched numerizer to preserve articles like ‘a’.
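To show why preserving the article matters, here is a toy converter that turns written numbers below one hundred into digits while leaving "a" alone, simply because "a" is not in its number vocabulary. It does not use or reproduce the numerizer library: `words_to_digits` and the lookup tables are hypothetical names made up for this sketch.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
SMALL = {word: value for value, word in enumerate(ONES)}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def _alternation(words):
    # Longest words first so "sixteen" is tried before "six".
    return "|".join(sorted(words, key=len, reverse=True))


# Either a tens word with an optional unit ("twenty-one"), or a word below 20.
NUMBER_RE = re.compile(
    rf"\b(?:(?P<tens>{_alternation(TENS)})"
    rf"(?:[ -](?P<unit>{_alternation(ONES[1:10])}))?"
    rf"|(?P<small>{_alternation(SMALL)}))\b",
    re.IGNORECASE,
)


def words_to_digits(text: str) -> str:
    """Convert written numbers (0-99) to digits; the article 'a' is untouched."""
    def repl(match: re.Match) -> str:
        if match.group("tens"):
            value = TENS[match.group("tens").lower()]
            if match.group("unit"):
                value += SMALL[match.group("unit").lower()]
            return str(value)
        return str(SMALL[match.group("small").lower()])

    return NUMBER_RE.sub(repl, text)


print(words_to_digits("I waited a minute, then counted twenty-one cars and three bikes."))
# I waited a minute, then counted 21 cars and 3 bikes.
```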
Import
from cjm_transcription_utils.numerizer import (
original_numerize_numerals,
patched_numerize_numerals,
smart_numerize
)
Functions
def patched_numerize_numerals(
s: str, # String to convert written numbers to digits
ignore: list = None, # List of words to ignore during conversion
bias: str = None # Conversion bias (e.g., 'ordinal')
) -> str: # String with written numbers converted to digits
"Patched version that doesn't convert 'a' to '1'"
def smart_numerize(
text: str # Text containing written numbers to convert
) -> str: # Text with written numbers converted to digits
"Convert written numbers to digits with special handling for compound ordinals."
postprocessing (postprocessing.ipynb)
Transcript post-processing utilities for converting numbers to words and normalizing text.
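The sketch below illustrates one way to replace standalone integers with words while skipping values that belong to a larger format. Which formats the package actually preserves is not stated here, so the choices in this example (decimals, clock times, dates, and anything attached to letters) are assumptions. `int_to_words` and `integers_to_words` are hypothetical helpers that only handle 0 to 999.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Digits not glued to letters, colons, slashes, hyphens, or decimal parts.
STANDALONE_INT = re.compile(r"(?<![\w.:,/-])\d+(?![\w:/-]|[.,]\d)")


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[unit]}" if unit else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {int_to_words(rest)}" if rest else "")


def integers_to_words(text: str) -> str:
    """Replace standalone integers below 1000 with words; leave the rest as-is."""
    def repl(match: re.Match) -> str:
        n = int(match.group())
        return int_to_words(n) if n < 1000 else match.group()

    return STANDALONE_INT.sub(repl, text)


print(integers_to_words("We met 3 times at 12:30 and paid 4.50 for 42 items."))
# We met three times at 12:30 and paid 4.50 for forty-two items.
```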
Import
from cjm_transcription_utils.postprocessing import (
replace_integers_in_string,
transcription_post_processing
)
Functions
def replace_integers_in_string(
text: str # Text containing integers to convert to words
) -> str: # Text with integers converted to their word representation
"Replace integer numbers with their word equivalents while preserving special formats."
def transcription_post_processing(
transcript: str # Raw transcript text to process
) -> str: # Processed transcript with integers converted to words and dashes normalized
"Apply post-processing transformations to transcript text."
pydub (pydub.ipynb)
Audio segment extraction utilities using pydub.
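pydub slices an `AudioSegment` by milliseconds, so extracting a segment from second-based times mostly comes down to unit conversion. The sketch below computes the slice bounds, assuming (as the parameter comments for `get_audio_segment` state) that `start` and `end` are in seconds and `offset` is in milliseconds. `segment_bounds_ms` is a hypothetical helper, and clamping to the start and end of the audio is an assumption of this sketch.

```python
from typing import Optional, Tuple


def segment_bounds_ms(
    start: float,
    end: float,
    offset: float = 0.0,
    total_ms: Optional[int] = None,
) -> Tuple[int, int]:
    """Convert second-based bounds to millisecond slice indices.

    The window is widened by `offset` milliseconds on both sides and
    clamped to the audio's extent (illustrative only).
    """
    start_ms = max(0, round(start * 1000 - offset))
    end_ms = round(end * 1000 + offset)
    if total_ms is not None:
        end_ms = min(end_ms, total_ms)
    return start_ms, end_ms


print(segment_bounds_ms(1.5, 4.0, offset=250))  # (1250, 4250)

# With pydub installed, the bounds can be applied to a loaded file:
#   from pydub import AudioSegment
#   audio = AudioSegment.from_file("speech.wav")  # hypothetical path
#   start_ms, end_ms = segment_bounds_ms(1.5, 4.0, offset=250, total_ms=len(audio))
#   clip = audio[start_ms:end_ms]
```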
Import
from cjm_transcription_utils.pydub import (
get_audio_segment
)
Functions
def get_audio_segment(
audio: AudioSegment, # Source audio segment to extract from
start: float, # Start time in seconds
end: float, # End time in seconds
offset: float = 0 # Offset in milliseconds to expand segment boundaries
) -> AudioSegment: # Extracted audio segment
"Extract audio segment between start and end times"
silero vad (silero_vad.ipynb)
Voice Activity Detection utilities using the Silero VAD model.
Import
from cjm_transcription_utils.silero_vad import (
prepare_audio_and_vad
)
Functions
def prepare_audio_and_vad(
audio_path: str, # Path to audio file
max_chunk_seconds: float, # Maximum chunk duration in seconds
max_silence_threshold: float, # Maximum silence duration before creating a new chunk
include_timestamps: bool, # Whether timestamps will be needed
verbose: bool = True # Whether to print progress
) -> tuple[np.ndarray, int, float, list, list, list]: # Tuple of (audio array, sample rate, duration, speech timestamps, chunks, chunk timestamps)
"Load audio and prepare VAD timestamps if needed."
timestamp alignment (timestamp_alignment.ipynb)
Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.
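The general idea of fuzzy alignment can be sketched with the standard library's `difflib`. For each noisy segment transcript, the sketch scans a few word windows of the corrected transcript, keeps the best-scoring one, and accepts it only if the score reaches a confidence threshold. This is a simplified stand-in, not how `TranscriptAligner` is implemented: the function name `align_segments`, the output keys, and the window search are assumptions, and the package may use a different fuzzy-matching library.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def align_segments(
    correct_transcript: str,
    segment_transcripts: List[str],
    timestamps: List[Dict[str, float]],
    confidence_threshold: int = 70,
) -> List[Dict]:
    """Map noisy segment transcripts onto a corrected transcript (sketch)."""
    words = correct_transcript.split()
    cursor = 0  # index of the next unassigned word in the corrected text
    alignments = []

    for seg_text, ts in zip(segment_transcripts, timestamps):
        seg_len = max(1, len(seg_text.split()))
        best_score, best_end = -1.0, cursor
        # Try windows whose word count is close to the segment's word count.
        for length in range(max(1, seg_len - 2), seg_len + 3):
            end = min(len(words), cursor + length)
            candidate = " ".join(words[cursor:end])
            matcher = SequenceMatcher(None, seg_text.lower(), candidate.lower())
            score = 100 * matcher.ratio()
            if score > best_score:
                best_score, best_end = score, end

        accepted = best_score >= confidence_threshold
        alignments.append({
            "start": ts["start"],
            "end": ts["end"],
            # Fall back to the raw segment text when the match is too weak.
            "text": " ".join(words[cursor:best_end]) if accepted else seg_text,
            "confidence": round(best_score),
            "accepted": accepted,
        })
        # Advance even on a rejected match so later segments do not stall.
        cursor = best_end if accepted else min(len(words), cursor + seg_len)

    return alignments


alignments = align_segments(
    "The quick brown fox jumps over the lazy dog",
    ["the quik brown fox", "jumps over teh lazy dog"],
    [{"start": 0.0, "end": 1.8}, {"start": 2.1, "end": 4.0}],
)
print([a["text"] for a in alignments])
# ['The quick brown fox', 'jumps over the lazy dog']
```

Anchoring each search at a moving cursor keeps the alignment monotonic, which fits VAD segments because they arrive in time order.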
Import
from cjm_transcription_utils.timestamp_alignment import (
TranscriptAligner,
align_timestamps_to_transcript
)
Functions
def align_timestamps_to_transcript(
final_transcript: str, # The final merged transcript
timestamp_transcripts: List[str], # List of transcripts for each timestamp segment
speech_timestamps: List[Dict], # List of speech timestamp dictionaries
verbose: bool = True # Whether to print alignment details
) -> List[Dict]
"Align timestamp segments to the final transcript."
Classes
class TranscriptAligner:
"Aligns VAD timestamps to a corrected transcript using fuzzy matching."
def __init__(self,
correct_transcript: str, # The full, correct transcript text
segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
)
"Initialize the transcript aligner with complete coverage and correction mechanisms."
def align_timestamps_to_correct_transcript(
self
) -> List[Dict]: # List of alignment dictionaries with timestamp, text, and confidence info
"Align timestamps to the correct transcript with optional corrections."
File details
Details for the file cjm_transcription_utils-0.0.3.tar.gz.
File metadata
- Download URL: cjm_transcription_utils-0.0.3.tar.gz
- Upload date:
- Size: 21.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa4fdaefa0483b5f822768fa27537b163933a5474acb7870b39cf883e327f7b4 |
| MD5 | 33f1c3bdf3133621fc9aa157edebf24a |
| BLAKE2b-256 | ca6839c3306481cece5850c08cea393d0d42e06886cffeb4a481858a83c565ed |
File details
Details for the file cjm_transcription_utils-0.0.3-py3-none-any.whl.
File metadata
- Download URL: cjm_transcription_utils-0.0.3-py3-none-any.whl
- Upload date:
- Size: 22.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f82996f676902a1931d85c2c359e2ad60d3d6bac08eb419cf0da43700304bd3 |
| MD5 | e28ede5326f5cb65b5d0aa89217147e8 |
| BLAKE2b-256 | efd6499ed3470e8142581115c4abd74e6e51710249ea769b12dbaf94555b0433 |