Miscellaneous utilities for helping with audio transcription.
cjm-transcription-utils
Install
pip install cjm_transcription_utils
Project Structure
nbs/
├── chunking.ipynb # VAD-based chunking and overlap-aware transcript merging
├── formatting.ipynb # Formatting time intervals as HMS timestamp ranges
├── librosa.ipynb # Audio loading and normalization
├── numerizer.ipynb # Patched numerize_numerals that does not convert 'a' to '1'
├── postprocessing.ipynb # Post-processing of transcript text
├── pydub.ipynb # Extracting segments from a pydub AudioSegment
├── silero_vad.ipynb # Loading audio and preparing VAD timestamps
└── timestamp_alignment.ipynb # Aligning timestamp segments to a final transcript
Total: 8 notebooks
Module Dependencies
graph LR
chunking[chunking<br/>chunking]
formatting[formatting<br/>formatting]
librosa[librosa<br/>librosa]
numerizer[numerizer<br/>numerizer]
postprocessing[postprocessing<br/>postprocessing]
pydub[pydub<br/>pydub]
silero_vad[silero_vad<br/>silero vad]
timestamp_alignment[timestamp_alignment<br/>timestamp alignment]
silero_vad --> chunking
1 cross-module dependency detected
CLI Reference
No CLI commands found in this project.
Module Overview
Detailed documentation for each module in the project:
chunking (chunking.ipynb)
Utilities for splitting audio into chunks from VAD speech timestamps and merging the resulting transcripts.
Import
from cjm_transcription_utils.chunking import (
get_extended_timestamp_boundaries,
get_extended_chunk_boundaries,
generate_chunks_with_vad,
generate_intermediate_chunks,
generate_intermediate_chunk_tuples,
merge_transcripts_with_overlaps
)
Functions
def get_extended_timestamp_boundaries(
timestamps: List[Dict[str, float]],
index: int # Index of the current timestamp
) -> Tuple[float, float]
"Get extended boundaries for a timestamp using adjacent timestamps."
def get_extended_chunk_boundaries(
chunks: List[Tuple[float, float]],
index: int # Index of the current chunk
) -> Tuple[float, float]
"Get extended boundaries for a chunk using adjacent chunks."
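The two boundary helpers above are documented only by their docstrings. One plausible reading is that a span is widened so that it reaches the end of the previous span and the start of the next one. The sketch below illustrates that reading; `extend_boundaries` is a hypothetical name and is not the package's code.

```python
from typing import List, Tuple


def extend_boundaries(
    spans: List[Tuple[float, float]],  # (start, end) pairs in seconds, sorted by start
    index: int,                        # Index of the span to extend
) -> Tuple[float, float]:
    """Widen spans[index] so it touches its neighbours (hypothetical helper)."""
    start, end = spans[index]
    # Reach back to where the previous span ended, if there is one.
    if index > 0:
        start = min(start, spans[index - 1][1])
    # Reach forward to where the next span starts, if there is one.
    if index < len(spans) - 1:
        end = max(end, spans[index + 1][0])
    return start, end


spans = [(0.0, 1.5), (2.0, 3.0), (4.5, 6.0)]
first = extend_boundaries(spans, 0)   # only a right-hand neighbour
middle = extend_boundaries(spans, 1)  # neighbours on both sides
last = extend_boundaries(spans, 2)    # only a left-hand neighbour
```

Under this assumption the first and last spans are extended on one side only, because they have a single neighbour.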
def generate_chunks_with_vad(
audio_array: np.ndarray, # Audio array
duration: float, # Total duration of audio in seconds
max_chunk_seconds: float = 120, # Maximum chunk duration in seconds
max_chunk_seconds_offset: float = 0, # Offset for chunk duration calculation
speech_timestamps: Optional[List[Dict]] = None, # List of speech timestamp dictionaries with 'start' and 'end' keys
max_silence_threshold: float = 2.0 # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
"Generate chunks using VAD timestamps with silence-based splitting"
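The signature above suggests two splitting rules: a new chunk starts after a silence longer than `max_silence_threshold`, and a chunk should not grow past `max_chunk_seconds`. The following is a minimal sketch of that idea under those assumptions. `group_speech_into_chunks` is a hypothetical function; it ignores `audio_array`, `duration`, and `max_chunk_seconds_offset`, and the package's real behaviour may differ.

```python
from typing import Dict, List, Tuple

Timestamp = Dict[str, float]


def group_speech_into_chunks(
    speech_timestamps: List[Timestamp],  # Dicts with 'start' and 'end' keys, in seconds
    max_chunk_seconds: float = 120,      # Upper bound on a chunk's duration
    max_silence_threshold: float = 2.0,  # Silence (seconds) that forces a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Timestamp]]]:
    """Group speech timestamps into chunks (hypothetical sketch)."""
    chunks: List[Tuple[float, float]] = []
    chunk_timestamps: List[List[Timestamp]] = []
    current: List[Timestamp] = []
    for ts in speech_timestamps:
        if current:
            gap = ts["start"] - current[-1]["end"]
            length_if_added = ts["end"] - current[0]["start"]
            # Close the current chunk on a long silence or when it would get too long.
            if gap > max_silence_threshold or length_if_added > max_chunk_seconds:
                chunks.append((current[0]["start"], current[-1]["end"]))
                chunk_timestamps.append(current)
                current = []
        current.append(ts)
    # Flush whatever is left after the last timestamp.
    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        chunk_timestamps.append(current)
    return chunks, chunk_timestamps


speech = [
    {"start": 0.0, "end": 4.0},
    {"start": 4.5, "end": 9.0},
    {"start": 12.0, "end": 15.0},  # preceded by 3.0 s of silence
    {"start": 15.5, "end": 30.0},  # would push the chunk past 15 s
]
chunks, grouped = group_speech_into_chunks(
    speech, max_chunk_seconds=15, max_silence_threshold=2.0
)
```

In this sketch a single speech segment that is itself longer than `max_chunk_seconds` is kept as its own chunk rather than being cut in the middle of speech.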
def generate_intermediate_chunks(
chunks: List[Tuple[float, float]],
chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
use_extended_boundaries:bool # Whether to use extended boundaries
) -> List[Tuple[float, float]]: # List of (start, end) boundaries for the intermediate chunks
"Generate overlapping chunks between consecutive chunk boundaries"
def generate_intermediate_chunk_tuples(
chunks: List[Tuple[float, float]],
chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
use_extended_boundaries:bool # Whether to use extended boundaries
) -> List[Tuple[Dict, Dict]]
"Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
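The two intermediate-chunk helpers above can be pictured as pairing the last speech timestamp of one chunk with the first speech timestamp of the next, then spanning the two. The sketch below assumes exactly that and leaves out `use_extended_boundaries`; `boundary_pairs` and `intermediate_spans` are hypothetical names, not the package's functions.

```python
from typing import Dict, List, Tuple

Timestamp = Dict[str, float]


def boundary_pairs(
    chunk_timestamps: List[List[Timestamp]],  # Speech timestamps grouped per chunk
) -> List[Tuple[Timestamp, Timestamp]]:
    """Pair each chunk's last timestamp with the next chunk's first one."""
    pairs = []
    for previous, following in zip(chunk_timestamps, chunk_timestamps[1:]):
        # Skip boundaries where either side has no speech at all.
        if previous and following:
            pairs.append((previous[-1], following[0]))
    return pairs


def intermediate_spans(
    pairs: List[Tuple[Timestamp, Timestamp]],
) -> List[Tuple[float, float]]:
    """Turn each pair into one (start, end) span that straddles the chunk boundary."""
    return [(last["start"], first["end"]) for last, first in pairs]


grouped = [
    [{"start": 0.0, "end": 4.0}, {"start": 4.5, "end": 9.0}],
    [{"start": 12.0, "end": 15.0}, {"start": 15.5, "end": 30.0}],
]
pairs = boundary_pairs(grouped)
spans = intermediate_spans(pairs)
```

Transcribing such a span a second time gives text that overlaps both neighbouring chunks, which is what makes a later overlap correction possible.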
def merge_transcripts_with_overlaps(
normal_transcripts: List[str], # List of transcripts for normal chunks
intermediate_transcripts: List[str], # List of transcripts for intermediate chunks
segment_transcripts: List[Tuple[str, str]],
verbose: bool = True # Whether to print debug information
) -> str
"Merge normal and intermediate transcripts with overlap correction"
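The listing does not show how overlap correction is performed. As a simplified illustration of the general idea, the sketch below joins two transcripts by dropping the longest run of words that ends the first and also begins the second. `merge_with_overlap` is a hypothetical two-string helper, not the package's three-input function.

```python
def merge_with_overlap(left: str, right: str) -> str:
    """Join two transcripts, removing the longest shared word run at the seam."""
    left_words = left.split()
    right_words = right.split()
    max_size = min(len(left_words), len(right_words))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_size, 0, -1):
        tail = [word.lower() for word in left_words[-size:]]
        head = [word.lower() for word in right_words[:size]]
        if tail == head:
            return " ".join(left_words + right_words[size:])
    # No shared words at the seam: plain concatenation.
    return " ".join(left_words + right_words)


merged = merge_with_overlap(
    "the quick brown fox jumps",
    "fox jumps over the lazy dog",
)
unrelated = merge_with_overlap("hello world", "good morning")
```

Exact word matching is the simplest possible criterion; it would miss overlaps where the two transcriptions disagree on spelling or punctuation.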
formatting (formatting.ipynb)
Formatting helpers for time intervals.
Import
from cjm_transcription_utils.formatting import (
time_interval_to_hms_range
)
Functions
def time_interval_to_hms_range(
duration_tuple # A tuple of (start_seconds, end_seconds) as floats
)
"Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
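The exact output format of the HMS range is not shown in the listing. The sketch below assumes zero-padded `HH:MM:SS` values joined by ` - ` and truncates fractional seconds; `seconds_to_hms` and `interval_to_hms_range` are hypothetical names used only for illustration.

```python
from typing import Tuple


def seconds_to_hms(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, dropping the fractional part."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def interval_to_hms_range(interval: Tuple[float, float]) -> str:
    """Format a (start_seconds, end_seconds) tuple as 'HH:MM:SS - HH:MM:SS'."""
    start, end = interval
    return f"{seconds_to_hms(start)} - {seconds_to_hms(end)}"


label = interval_to_hms_range((75.0, 3725.5))
```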
librosa (librosa.ipynb)
Audio loading and normalization helpers.
Import
from cjm_transcription_utils.librosa import (
load_audio
)
Functions
def load_audio(
audio_path: str, # Path to audio file
target_sr: int = 16000 # Target sample rate in Hz
) -> Tuple[np.ndarray, int]: # Audio array and its sample rate
"Load and normalize audio file"
numerizer (numerizer.ipynb)
Number-word conversion helpers, including a patched numerize_numerals that does not convert 'a' to '1'.
Import
from cjm_transcription_utils.numerizer import (
original_numerize_numerals,
patched_numerize_numerals,
smart_numerize
)
Functions
def patched_numerize_numerals(
s, # TODO: Add type hint and description
ignore=None, # TODO: Add type hint and description
bias=None # TODO: Add type hint and description
): # TODO: Add type hint
"Patched version that doesn't convert 'a' to '1'"
def smart_numerize(
text # TODO: Add type hint and description
): # TODO: Add type hint
"TODO: Add function description"
postprocessing (postprocessing.ipynb)
Post-processing helpers for transcript text.
Import
from cjm_transcription_utils.postprocessing import (
replace_integers_in_string,
transcription_post_processing
)
Functions
def replace_integers_in_string(
text # TODO: Add type hint and description
): # TODO: Add type hint
"TODO: Add function description"
def transcription_post_processing(
transcript:str # TODO: Add description
)->str: # TODO: Add return description
"TODO: Add function description"
pydub (pydub.ipynb)
Helpers for extracting segments from a pydub AudioSegment.
Import
from cjm_transcription_utils.pydub import (
get_audio_segment
)
Functions
def get_audio_segment(
audio: AudioSegment, # Source audio
start: float, # Start time of the segment
end: float, # End time of the segment
offset: float=0 # TODO: Add description
) -> AudioSegment: # The extracted audio segment
"Extract audio segment between start and end times"
silero vad (silero_vad.ipynb)
Audio loading and VAD timestamp preparation.
Import
from cjm_transcription_utils.silero_vad import (
prepare_audio_and_vad
)
Functions
def prepare_audio_and_vad(
audio_path: str, # Path to audio file
max_chunk_seconds: float, # Maximum chunk duration in seconds
max_silence_threshold: float, # Maximum silence duration before creating a new chunk
include_timestamps: bool, # Whether timestamps will be needed
verbose: bool = True # Whether to print progress
)
"Load audio and prepare VAD timestamps if needed."
timestamp alignment (timestamp_alignment.ipynb)
Tools for aligning timestamp segments to a final transcript.
Import
from cjm_transcription_utils.timestamp_alignment import (
TranscriptAligner,
align_timestamps_to_transcript
)
Functions
def align_timestamps_to_transcript(
final_transcript: str, # The final merged transcript
timestamp_transcripts: List[str], # List of transcripts for each timestamp segment
speech_timestamps: List[Dict], # List of speech timestamp dictionaries
verbose: bool = True # Whether to print alignment details
) -> List[Dict]
"Align timestamp segments to the final transcript."
Classes
class TranscriptAligner:
"Align per-segment timestamps to a corrected full transcript, accepting matches at or above a confidence threshold."
def __init__(self,
correct_transcript: str, # The full, correct transcript text
segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
)
"Initialize the transcript aligner with complete coverage and correction mechanisms."
def align_timestamps_to_correct_transcript(
self
) -> List[Dict]: # List of aligned timestamp dictionaries
"Align timestamps to the correct transcript with optional corrections."
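The listing does not say how segment transcripts are matched against the correct transcript or how the confidence score is computed. The sketch below is one possible approach, assuming a 0 to 100 similarity score from the standard library's `difflib.SequenceMatcher` and a left-to-right search. `best_window` and `align_segments` are hypothetical names, and the package may use a different matcher entirely.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def best_window(words: List[str], segment: str, start_at: int = 0) -> Tuple[int, int, int]:
    """Return (score, begin, end) for the word window most similar to segment."""
    size = max(1, len(segment.split()))
    target = " ".join(segment.split()).lower()
    best = (0, start_at, start_at)
    last_begin = max(start_at, len(words) - size)
    for begin in range(start_at, last_begin + 1):
        candidate = " ".join(words[begin:begin + size]).lower()
        # Scale the 0..1 similarity ratio to a 0..100 score.
        score = round(100 * SequenceMatcher(None, candidate, target).ratio())
        if score > best[0]:
            best = (score, begin, begin + size)
    return best


def align_segments(
    correct_transcript: str,
    segment_transcripts: List[str],
    timestamps: List[Dict[str, float]],
    confidence_threshold: int = 70,
) -> List[Dict]:
    """Attach corrected text to each timestamp when the match is confident enough."""
    words = correct_transcript.split()
    aligned = []
    cursor = 0  # Only search to the right of the previous accepted match.
    for segment, ts in zip(segment_transcripts, timestamps):
        score, begin, end = best_window(words, segment, cursor)
        if score >= confidence_threshold:
            text = " ".join(words[begin:end])
            cursor = end
        else:
            text = segment  # Keep the uncorrected segment text.
        aligned.append({"start": ts["start"], "end": ts["end"], "text": text, "score": score})
    return aligned


aligned = align_segments(
    "the quick brown fox jumps over the lazy dog",
    ["the quik brown fox", "jumps over teh lazy dog"],
    [{"start": 0.0, "end": 1.8}, {"start": 2.0, "end": 4.1}],
)
```

A fixed-size window keeps the sketch short; it cannot recover from a segment that has a different word count than the matching part of the correct transcript.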
File details
Details for the file cjm_transcription_utils-0.0.1.tar.gz.
File metadata
- Download URL: cjm_transcription_utils-0.0.1.tar.gz
- Upload date:
- Size: 19.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f4682768e3d5dc0fda84e26970ba365e2fdf4f31f0f4d2a3d855be81dae846c |
| MD5 | 763ddcd9441f2d4ee6358f00c9d549e3 |
| BLAKE2b-256 | 5fcdf9ed007c10602cbc84d735c7d687757695160ff9d5f023f3b15830d865c7 |
File details
Details for the file cjm_transcription_utils-0.0.1-py3-none-any.whl.
File metadata
- Download URL: cjm_transcription_utils-0.0.1-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a82cd948bf1f5c74a384d14b2d9016a6503bf4dfcd9d7010e74e66ed30756aa |
| MD5 | da1cfd69934807025f274d07eb47f5e5 |
| BLAKE2b-256 | 5da67156ee4eda61cdc88efef057a6caed8545f124ca5113ac4fb474033d107b |