Audio denoising, VAD, speaker diarization, and transcription pipeline using Demucs, Silero VAD, pyannote, and Whisper

dinnote audio transcription

Processes audio through a four-step pipeline to produce a transcription JSON with per-speaker diarization: denoising (Demucs), voice activity detection (Silero VAD), speaker diarization (pyannote), and transcription (Whisper).

Installation

pip install dinnote

This installs CPU-only torch by default. For GPU acceleration, install the CUDA build first:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install dinnote

dinnote uses CUDA automatically when available and warns at startup if it is not.
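
The device check described above can be approximated as follows. This is a minimal sketch, not dinnote's actual startup code: `pick_device` is a hypothetical helper, and the only library call used is `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled torch build can see a GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
if device == "cpu":
    # Mirrors the startup warning mentioned above; the wording is illustrative.
    print("CUDA not available; running on CPU")
```

If this prints the warning after installing the CUDA build, the CPU-only wheel is probably still the one on the import path.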

On first run, dinnote copies default config files to your platform config directory:

  • Windows: %APPDATA%\dinnote\
  • macOS: ~/Library/Application Support/dinnote/
  • Linux: ~/.config/dinnote/

Edit config.yaml and vocab.txt to customize settings.
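
To locate those files from a script, the directories listed above can be resolved along these lines. This is a sketch under assumptions: `default_config_dir` is a hypothetical helper, not part of dinnote's API, and it only mirrors the three paths given above.

```python
import os
import sys
from pathlib import Path
from typing import Optional

APP_NAME = "dinnote"


def default_config_dir(platform: str = sys.platform, home: Optional[Path] = None) -> Path:
    """Return the per-platform config directory listed in this README."""
    home = home or Path.home()
    if platform.startswith("win"):
        # %APPDATA% normally points at <home>\AppData\Roaming.
        appdata = os.environ.get("APPDATA")
        base = Path(appdata) if appdata else home / "AppData" / "Roaming"
        return base / APP_NAME
    if platform == "darwin":
        return home / "Library" / "Application Support" / APP_NAME
    return home / ".config" / APP_NAME


config_path = default_config_dir() / "config.yaml"
vocab_path = default_config_dir() / "vocab.txt"
```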

Speaker diarization requires a HuggingFace token with access to pyannote/speaker-diarization-3.1. Set it via diarize.hf_token in config.yaml.

CLI usage

dinnote input/audio.mp3            # single file
dinnote input/                     # all audio files in a folder
dinnote input/audio.mp3 -f         # force re-run all steps
dinnote input/audio.mp3 -c path/to/config.yaml   # custom config
dinnote input/audio.mp3 -o results/              # custom output dir

Each step is skipped when its output file already exists, so an interrupted run resumes where it stopped. Use -f to force every step to re-run.
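
The skip-if-present behaviour can be pictured with a small wrapper. This is an illustrative sketch only: `run_step` and its arguments are hypothetical names, and dinnote's internal check may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step(output_path: Path, produce: Callable[[Path], object], force: bool = False) -> bool:
    """Run `produce` unless its output already exists. Returns True if it ran."""
    if output_path.exists() and not force:
        return False  # cached result is reused
    output_path.parent.mkdir(parents=True, exist_ok=True)
    produce(output_path)
    return True


# Demo with a stub step that writes an empty JSON list.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "recording" / "recording_vad.json"
    write_stub = lambda p: p.write_text("[]")
    first = run_step(out, write_stub)               # runs: no output yet
    second = run_step(out, write_stub)              # skipped: output exists
    forced = run_step(out, write_stub, force=True)  # runs again, like -f
```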

Output is written to output/<filename>/ and contains:

  • <filename>_denoised.wav (vocals isolated from background noise)
  • <filename>_vad.json (detected speech segment boundaries)
  • <filename>_diarization.json (per-speaker turn boundaries from pyannote)
  • <filename>_transcription.json (final transcription with timestamps and speaker labels)
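
For scripts that consume these files, the layout above can be computed ahead of time. This sketch assumes that <filename> means the input name without its extension; `expected_outputs` is a hypothetical helper, not a dinnote function.

```python
from pathlib import Path

# Suffixes taken from the list above.
SUFFIXES = {
    "denoised": "_denoised.wav",
    "vad": "_vad.json",
    "diarization": "_diarization.json",
    "transcription": "_transcription.json",
}


def expected_outputs(input_path: Path, output_root: Path = Path("output")) -> dict:
    """Map each pipeline step to the file it is expected to write."""
    stem = input_path.stem  # assumption: the extension is dropped
    folder = output_root / stem
    return {step: folder / f"{stem}{suffix}" for step, suffix in SUFFIXES.items()}


paths = expected_outputs(Path("input/audio.mp3"))
```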

Python API

from pathlib import Path
import dinnote
from dinnote import PipelineConfig, VadConfig, DiarizeConfig, TranscribeConfig

# Run the full pipeline with defaults
dinnote.process_file(
    input_path=Path("recording.wav"),
    output_dir=Path("output"),
)

# Custom config
config = PipelineConfig(
    vad=VadConfig(threshold=0.4, max_segment_length_sec=20),
    diarize=DiarizeConfig(num_speakers=2),
    transcribe=TranscribeConfig(model="small", language="en"),
)
dinnote.process_file(Path("recording.wav"), Path("output"), config=config)

# Or use individual stages
from dinnote import denoise, vad, diarize, transcribe

stage_dir = Path("output/recording")

# The denoised audio is the input to every later stage.
denoised = denoise.run(Path("recording.wav"), stage_dir, config={})
vad_file = vad.run(denoised, stage_dir, config={})
diarization = diarize.run(denoised, stage_dir, config={})
# Transcription takes the diarization output to attach speaker labels.
result = transcribe.run(denoised, stage_dir, config={}, diarization_path=diarization)
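
The folder mode of the CLI can be approximated from Python by looping over a directory. This is a sketch under assumptions: the extension list is a guess at what counts as an audio file, and `find_audio` and `process_folder` are hypothetical helpers. The processing function is passed in so the loop can be tried without loading any models.

```python
from pathlib import Path
from typing import Callable, List

# Assumption: dinnote's own list of supported extensions may differ.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def find_audio(folder: Path) -> List[Path]:
    """Return the audio files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def process_folder(folder: Path, output_dir: Path,
                   process: Callable[[Path, Path], object]) -> List[Path]:
    """Call `process(audio_path, output_dir)` for each audio file found."""
    done = []
    for audio in find_audio(folder):
        process(audio, output_dir)
        done.append(audio)
    return done
```

With dinnote installed, `process_folder(Path("input"), Path("output"), dinnote.process_file)` would run the full pipeline on each file in turn.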

Configuration

denoise:
  model: htdemucs        # htdemucs | htdemucs_ft | mdx | mdx_extra | htdemucs_6s

vad:
  threshold: 0.5         # 0.0–1.0, higher = requires clearer speech
  min_speech_duration_ms: 250
  min_silence_duration_ms: 100
  padding_ms: 500
  max_segment_length_sec: 30
  merge_within_sec: 1.0

diarize:
  # hf_token: hf_...
  num_speakers: null     # fix speaker count or leave null to let pyannote estimate
  min_speakers: null
  max_speakers: null
  min_turn_ms: 200       # turns shorter than this are discarded (ms)

transcribe:
  model: base            # tiny | base | small | medium | large
  language: en           # set to null to auto-detect
  temperature: null      # null = Whisper fallback sequence, 0 = greedy
  no_speech_threshold: 0.6
  logprob_threshold: -1.0
  compression_ratio_threshold: 2.4
  condition_on_previous_text: false
  vocab_file: null       # path to domain-specific vocabulary, defaults to vocab.txt in config dir
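
As a rough illustration of how merge_within_sec and max_segment_length_sec could interact, the sketch below merges neighbouring speech segments. The semantics are an assumption inferred from the option names, not dinnote's implementation, and `merge_segments` is a hypothetical function.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def merge_segments(segments: List[Segment],
                   merge_within_sec: float = 1.0,
                   max_segment_length_sec: float = 30.0) -> List[Segment]:
    """Merge segments separated by short gaps without exceeding the length cap."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            fits = end - prev_start <= max_segment_length_sec
            if gap <= merge_within_sec and fits:
                # Extend the previous segment instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


speech = [(0.0, 2.0), (2.5, 4.0), (6.0, 7.0)]
merged_default = merge_segments(speech)
merged_capped = merge_segments(speech, max_segment_length_sec=3.0)
```

With the defaults, the first two segments join because the gap is 0.5 s. With a 3-second cap they stay separate, since the merged span would be 4 seconds long.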

Add domain-specific vocabulary to vocab.txt to improve transcription accuracy on unusual words and jargon. For noisy or technical audio, set temperature: 0 to disable Whisper's fallback to higher-temperature decoding, and consider post-filtering phrases that Whisper commonly hallucinates on your dataset.
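
The two suggestions above can be sketched as follows. The fallback tuple is the default temperature sequence of openai-whisper's transcribe(). Whether dinnote passes it through unchanged is an assumption, and `resolve_temperature`, `drop_hallucinations`, and the segment dicts with a "text" key are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Default fallback sequence used by openai-whisper when decoding fails its checks.
WHISPER_FALLBACK_TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def resolve_temperature(value: Optional[float]) -> Tuple[float, ...]:
    """Map the config value to a temperature schedule (null means fallback)."""
    if value is None:
        return WHISPER_FALLBACK_TEMPERATURES
    return (float(value),)  # a single value disables the fallback


def drop_hallucinations(segments: List[dict], blocklist: Iterable[str]) -> List[dict]:
    """Remove segments whose whole text matches a known hallucinated phrase."""
    blocked = {phrase.strip().lower() for phrase in blocklist}
    return [s for s in segments if s["text"].strip().lower() not in blocked]


# Illustrative data; build the blocklist from phrases seen in your own outputs.
segments = [{"text": "Let's review the budget."}, {"text": " Thanks for watching! "}]
cleaned = drop_hallucinations(segments, ["Thanks for watching!"])
```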

If the number of speakers is known in advance, setting num_speakers gives more reliable diarization. Otherwise use min_speakers and max_speakers to constrain the range, or leave all three null to let pyannote estimate freely.
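
The precedence described above can be made explicit when building the keyword arguments for a pyannote pipeline call, which accepts num_speakers, min_speakers, and max_speakers. This is a sketch, not dinnote's code: `speaker_kwargs` is a hypothetical helper, and treating num_speakers as overriding the bounds is an assumption.

```python
from typing import Dict, Optional


def speaker_kwargs(num_speakers: Optional[int] = None,
                   min_speakers: Optional[int] = None,
                   max_speakers: Optional[int] = None) -> Dict[str, int]:
    """Build diarization keyword arguments, leaving out anything set to null."""
    if num_speakers is not None:
        # Assumption: an exact count takes precedence over the bounds.
        return {"num_speakers": num_speakers}
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers must not exceed max_speakers")
    bounds = {"min_speakers": min_speakers, "max_speakers": max_speakers}
    return {key: value for key, value in bounds.items() if value is not None}


exact = speaker_kwargs(num_speakers=2, max_speakers=5)
ranged = speaker_kwargs(min_speakers=2, max_speakers=4)
free = speaker_kwargs()
```

An empty dict corresponds to leaving every option null, so pyannote estimates the speaker count on its own.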
