
Composable lyric-audio alignment pipeline with staged execution, batch processing, and LRC/ASS export.

Project description

py-roller

py-roller is a local command-line pipeline for generating rolling lyric files from audio and plain-text lyrics.

It can split vocals, filter audio, transcribe local audio with faster-whisper or wav2vec2-style backends, parse lyrics, align lyric lines, and export LRC or ASS karaoke output.

  • Package name: py-roller
  • CLI command: py-roller
  • Python import package: pyroller

Install

Use a fresh virtual environment when possible. First install the lightweight base package from source:

pip install -e .

Then install the validated audio/transcriber runtime:

py-roller install
py-roller doctor

py-roller install installs a pinned Torch/Torchaudio/TorchVision profile first, then installs the bundled audio-core requirements with matching constraints, validates the environment, and runs py-roller doctor unless --skip-doctor is passed.

Install profiles:

  • auto (default): try the best validated profile for this machine, then fall back to CPU if validation fails.
  • cpu: force the CPU-only profile.
  • cu124: force the CUDA 12.4 profile.
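The auto fallback can be pictured as trying candidate profiles in order until one installs and validates. This is a minimal sketch, not py-roller's code: the has_cuda flag and the install_and_validate callable are hypothetical stand-ins for whatever hardware probe and validation step the installer actually uses.

```python
from typing import Callable, Optional, Sequence


def select_profile(
    requested: str,
    has_cuda: bool,
    install_and_validate: Callable[[str], bool],
) -> Optional[str]:
    """Return the first profile that installs and validates, or None.

    Hypothetical sketch: `has_cuda` and `install_and_validate` are
    assumptions, not py-roller APIs.
    """
    if requested in ("cpu", "cu124"):
        # An explicitly forced profile gets no fallback.
        candidates: Sequence[str] = [requested]
    elif requested == "auto":
        # Prefer the CUDA profile when a GPU is visible, then fall back to CPU.
        candidates = ["cu124", "cpu"] if has_cuda else ["cpu"]
    else:
        raise ValueError(f"unknown profile: {requested!r}")

    for profile in candidates:
        if install_and_validate(profile):
            return profile
    return None
```

For example, `select_profile("auto", True, lambda p: p == "cpu")` models a machine where the CUDA profile fails validation and the CPU profile is kept.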

Useful install commands:

py-roller install --profile cpu
py-roller install --profile cu124
py-roller install --dry-run
py-roller install --no-reset-torch
py-roller doctor --output-format json
py-roller install --progress-format jsonl --output-format json

Machine-readable runtime checks and install progress

doctor can print a machine-readable report for GUI frontends and automated runtime checks:

py-roller doctor --output-format json

The JSON report includes ok, the Python executable and platform, one entry per runtime check, and a suggested_next_step when the environment needs repair. The default remains the terminal checklist:

py-roller doctor --output-format human
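A frontend that consumes the JSON report might summarize it as below. The ok and suggested_next_step fields come from the description above; the "checks" key and its per-entry "name"/"ok" fields are assumptions about the report layout, so verify them against real doctor output before relying on this sketch.

```python
import json
from typing import Any, Dict, List, Tuple


def summarize_doctor_report(text: str) -> Tuple[bool, List[str], str]:
    """Return (ok, failing check names, suggested next step) from doctor JSON."""
    report: Dict[str, Any] = json.loads(text)
    ok = bool(report.get("ok", False))
    # ASSUMPTION: per-check entries live in a "checks" list of
    # {"name": ..., "ok": ...} objects; the real key names may differ.
    failing = [
        check.get("name", "?")
        for check in report.get("checks", [])
        if not check.get("ok", False)
    ]
    # suggested_next_step may be absent or null when nothing needs repair.
    next_step = report.get("suggested_next_step") or ""
    return ok, failing, next_step
```

The input text could come from `subprocess.run(["py-roller", "doctor", "--output-format", "json"], capture_output=True, text=True).stdout`.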

install separates live progress (--progress-format) from the final summary (--output-format):

py-roller install --progress-format human --output-format human   # default
py-roller install --progress-format jsonl --output-format json    # GUI-friendly
py-roller install --progress-format both --output-format json     # debug both streams

--progress-format jsonl emits PYROLLER_EVENT lines for install lifecycle events, selected profiles, subprocess starts/completions, subprocess output, validation, doctor, completion, failures, and heartbeat messages when a pip subprocess is still running without output. --output-format json prints a final install report containing the requested profile, selected profile, step results, validation results, and doctor summary when doctor is run.

Quick start

Set --language explicitly whenever the song language is known. Use zh for Chinese, en for English, and mul only when you need the multilingual fallback.

Raw audio + lyrics -> LRC

py-roller run \
  --stages s,f,t,p,a,w \
  --audio ./song.mp3 \
  --lyrics ./song.txt \
  --language zh \
  --filter-chain noise_gate,dereverb \
  --output-roller ./song.lrc

Prepared vocal track + lyrics -> ASS karaoke

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --writer-backend ass_karaoke \
  --output-roller ./song.ass

Batch process directories by filename stem

py-roller batch \
  --stages t,p,a,w \
  --audio ./audio_dir \
  --lyrics ./lyrics_dir \
  --language zh \
  --output-roller ./out_dir

Pipeline model

py-roller runs a contiguous chain of stages in this fixed order:

s -> f -> t -> p -> a -> w
splitter -> filter -> transcriber -> parser -> aligner -> writer

Valid examples:

  • s,f,t,p,a,w: full pipeline from raw audio and lyrics.
  • t,p,a,w: start from prepared vocal/filtered audio.
  • a,w: start from existing timed_units and parsed_lyrics artifacts.
  • w: rewrite from an existing alignment_result artifact.

Invalid examples:

  • s,t,w: skips the required intermediate stages f, p, and a.
  • s,p,a: skips the required intermediate stages f and t.
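The rule above amounts to "the requested stages must be a contiguous slice of s,f,t,p,a,w". A minimal sketch of that check (a hypothetical helper, not py-roller's own parser):

```python
from typing import List

STAGE_ORDER = ["s", "f", "t", "p", "a", "w"]


def parse_stages(spec: str) -> List[str]:
    """Parse a --stages value and require a contiguous chain of STAGE_ORDER."""
    stages = [part.strip() for part in spec.split(",") if part.strip()]
    if not stages:
        raise ValueError("no stages given")
    unknown = [s for s in stages if s not in STAGE_ORDER]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    start = STAGE_ORDER.index(stages[0])
    # A legal chain equals the slice of the fixed order that begins at its
    # first stage; gaps, duplicates, and reordering all fail this comparison.
    if stages != STAGE_ORDER[start:start + len(stages)]:
        raise ValueError(f"{spec!r} is not a contiguous chain of s,f,t,p,a,w")
    return stages
```

Comparing against a slice of the fixed order rejects skipped stages (s,t,w), reversed stages (w,a), and repeats (s,s) with a single test.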

Legal chain-start inputs:

  • --audio: valid when the chain starts at s, f, or t.
  • --lyrics: required when the chain includes p.
  • --timed-units and --parsed-lyrics: valid when the chain starts at a.
  • --alignment-result: valid when the chain starts at w.
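These input rules can be expressed as a function of the chain's first stage. This is a sketch under one stated assumption: a chain starting at p is not covered by the list above, so the helper only requires --lyrics for it.

```python
from typing import Iterable, List, Set


def required_inputs(stages: List[str]) -> Set[str]:
    """Input flags implied by an already-validated stage chain."""
    start = stages[0]
    required: Set[str] = set()
    if start in ("s", "f", "t"):
        required.add("--audio")
    if "p" in stages:
        required.add("--lyrics")
    if start == "a":
        required |= {"--timed-units", "--parsed-lyrics"}
    if start == "w":
        required.add("--alignment-result")
    # ASSUMPTION: a chain starting at p is not described in the docs;
    # this sketch only requires --lyrics for it.
    return required


def missing_inputs(stages: List[str], provided: Iterable[str]) -> List[str]:
    """Sorted list of required input flags that were not provided."""
    return sorted(required_inputs(stages) - set(provided))
```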

Final outputs are only the explicit --output-* paths:

  • --output-vocal-audio
  • --output-filtered-audio
  • --output-timed-units
  • --output-parsed-lyrics
  • --output-alignment-result
  • --output-roller

Intermediate files under --intermediate are temporary working state unless --cleanup never is used.
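Each --output-* flag appears to correspond to the artifact produced by one stage, so a chain can only declare outputs for stages it actually runs. The stage-to-flag mapping below is inferred from the flag names and is an assumption, not taken from py-roller's source.

```python
from typing import Iterable, List

# ASSUMPTION: mapping inferred from flag names, one output per stage.
OUTPUT_FLAG_BY_STAGE = {
    "s": "--output-vocal-audio",
    "f": "--output-filtered-audio",
    "t": "--output-timed-units",
    "p": "--output-parsed-lyrics",
    "a": "--output-alignment-result",
    "w": "--output-roller",
}


def invalid_outputs(stages: List[str], requested: Iterable[str]) -> List[str]:
    """Requested --output-* flags that the given chain cannot produce."""
    allowed = {OUTPUT_FLAG_BY_STAGE[stage] for stage in stages}
    return sorted(set(requested) - allowed)
```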

Common workflows

Start from raw audio

py-roller run \
  --stages s,f,t,p,a,w \
  --audio ./song.mp3 \
  --lyrics ./song.txt \
  --language zh \
  --output-roller ./song.lrc

Start from already-separated vocals

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --output-roller ./song.lrc

Start from aligner artifacts

py-roller run \
  --stages a,w \
  --timed-units ./song.timed_units.json \
  --parsed-lyrics ./song.parsed_lyrics.json \
  --output-roller ./song.lrc

For repeated or partially omitted lyrics, choose a repetition mode explicitly:

py-roller run \
  --stages a,w \
  --timed-units ./song.timed_units.json \
  --parsed-lyrics ./song.parsed_lyrics.json \
  --aligner-repetition few \
  --output-roller ./song.lrc

--aligner-repetition accepts:

  • none (default): standard global_dp_v1 behavior; best when repeated lyric lines are written out in full.
  • few: uses global DP as a proposal, then repairs sparse repeated or omitted regions between trusted anchors.
  • full: uses per-line candidate generation plus beam search for highly repetitive or anchorless songs.
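For intuition about what a "global DP" alignment does, here is a generic Needleman-Wunsch sketch over token sequences. It is a textbook global alignment, not py-roller's global_dp_v1 (whose units, costs, and repair logic are not documented here); the match/mismatch/gap scores are arbitrary illustrative values.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def global_align(ref: List[str], hyp: List[str],
                 match: int = 2, mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Needleman-Wunsch alignment; returns (ref_index, hyp_index) pairs, None for gaps."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # pair tokens
                              score[i - 1][j] + gap,           # lyric token unmatched
                              score[i][j - 1] + gap)           # heard token unmatched

    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs
```

Exact matches in the result, `[(i, j) for i, j in pairs if i is not None and j is not None and ref[i] == hyp[j]]`, are the kind of trusted anchors that a repair pass such as the few mode could work between.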

Rewrite only from an existing alignment result

py-roller run \
  --stages w \
  --alignment-result ./song.alignment.json \
  --writer-backend ass_karaoke \
  --output-roller ./song.ass

Backends and defaults

Backend selection is language-aware. The default language is mul for compatibility, but setting zh or en when the song language is known selects the language-specific defaults listed below.

Transcriber defaults

  • zh -> faster_whisper
  • en -> faster_whisper
  • mul -> faster_whisper

Additional transcriber backends:

  • zh also supports --transcriber-backend mms_phonetic for the Chinese phonetic CTC path.
  • mul also supports --transcriber-backend wav2vec2_phoneme for the multilingual phoneme CTC fallback.

Parser defaults

  • zh -> zh_router_pinyin
  • en -> en_arpabet
  • mul -> mul_ipa

Other defaults

  • aligner backend -> global_dp_v1
  • aligner repetition mode -> none
  • writer backend -> lrc_ms
  • writer spacing -> keep
  • cleanup policy -> on-success
  • transcriber model store -> ~/.cache/py-roller/models/transcriber
  • transcriber device -> auto-detected (CUDA GPU if available, otherwise CPU)
  • transcriber VAD filter -> enabled (skips silence to speed up transcription)
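The language-dependent defaults above can be read as a small lookup table. In this sketch, rejecting transcriber backends that are not listed for a language is an assumption about how py-roller treats unsupported combinations.

```python
from typing import Dict, Optional

DEFAULT_TRANSCRIBER = {"zh": "faster_whisper", "en": "faster_whisper", "mul": "faster_whisper"}
EXTRA_TRANSCRIBERS = {"zh": {"mms_phonetic"}, "mul": {"wav2vec2_phoneme"}}
DEFAULT_PARSER = {"zh": "zh_router_pinyin", "en": "en_arpabet", "mul": "mul_ipa"}


def resolve_backends(language: str = "mul", transcriber: Optional[str] = None) -> Dict[str, str]:
    """Return the documented default backends for a language."""
    if language not in DEFAULT_PARSER:
        raise ValueError(f"unsupported language: {language!r}")
    allowed = {DEFAULT_TRANSCRIBER[language]} | EXTRA_TRANSCRIBERS.get(language, set())
    chosen = transcriber or DEFAULT_TRANSCRIBER[language]
    # ASSUMPTION: combinations not listed in the docs are rejected.
    if chosen not in allowed:
        raise ValueError(f"{chosen!r} is not listed for language {language!r}")
    return {
        "transcriber": chosen,
        "parser": DEFAULT_PARSER[language],
        "aligner": "global_dp_v1",
        "writer": "lrc_ms",
    }
```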

Transcriber models and Hugging Face downloads

Transcriber execution is local. py-roller does not send audio to a cloud transcription API.

Model resolution order:

  1. Resolve --transcriber-model-name, or use the backend default model name.
  2. Look for the model in the py-roller transcriber model store.
  3. If not found and offline mode is not enabled, materialize/download the model into the model store.
  4. Load the resolved local model path for inference.
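The four resolution steps can be sketched as follows. The on-disk layout (one directory per model, with "/" in repo ids replaced by "--") and the injected download callable are assumptions for illustration; the real model store layout may differ.

```python
from pathlib import Path
from typing import Callable


def resolve_model(
    name: str,
    store: Path,
    local_files_only: bool,
    download: Callable[[str, Path], None],
) -> Path:
    """Resolve a model name to a local directory, downloading only if allowed."""
    explicit = Path(name).expanduser()
    if explicit.is_dir():
        # An explicit local path is used as-is.
        return explicit
    # ASSUMPTION: hypothetical store layout, one sub-directory per model.
    target = Path(store) / name.replace("/", "--")
    if target.is_dir():
        return target
    if local_files_only:
        raise FileNotFoundError(f"{name!r} is not in {store} and network access is disabled")
    download(name, target)
    return target
```

A real download callable could wrap `huggingface_hub.snapshot_download(repo_id=name, local_dir=target)`.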

Useful model options:

  • --transcriber-model-path: local model store root.
  • --transcriber-model-name: model alias, Hugging Face repo id, or explicit local path.
  • --transcriber-local-files-only: refuse network access and use only local files/cache.

For faster_whisper, aliases such as large-v2, large-v3, and turbo resolve to the corresponding Systran/faster-whisper-* snapshots.

Example with a custom model store:

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --transcriber-model-path ./models/transcriber \
  --output-roller ./song.lrc

Offline run after the model already exists locally:

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --transcriber-model-path ./models/transcriber \
  --transcriber-local-files-only \
  --output-roller ./song.lrc

Restricted or unstable networks

Hugging Face model downloads can be affected by proxies, timeouts, and XET/CAS behavior. py-roller exposes the common controls directly:

  • --transcriber-hf-xet {auto,on,off}: use off when XET/CAS hangs or fails on your network.
  • --transcriber-hf-proxy URL: use one HTTP or SOCKS proxy for model downloads.
  • --transcriber-hf-etag-timeout SECONDS: metadata/etag timeout.
  • --transcriber-hf-download-timeout SECONDS: large file download timeout.
  • --transcriber-hf-max-workers INT: snapshot download parallelism; lower values such as 1 or 2 are often better for fragile proxies.
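For comparison, huggingface_hub has its own environment variables for similar controls (HF_HUB_DISABLE_XET, HF_HUB_ETAG_TIMEOUT, HF_HUB_DOWNLOAD_TIMEOUT, plus the standard HTTP_PROXY/HTTPS_PROXY). snapshot_download also accepts a max_workers argument. The mapping below is a hypothetical sketch of how the py-roller flags could translate; it is not confirmed to be what py-roller does internally. Such variables are generally read when huggingface_hub is imported, so they should be set before that import.

```python
from typing import Dict, Optional


def hf_download_env(
    xet: str = "auto",
    proxy: Optional[str] = None,
    etag_timeout: Optional[float] = None,
    download_timeout: Optional[float] = None,
) -> Dict[str, str]:
    """Build huggingface_hub-style environment variables (hypothetical mapping)."""
    if xet not in ("auto", "on", "off"):
        raise ValueError(f"xet must be auto, on, or off, not {xet!r}")
    env: Dict[str, str] = {}
    if xet == "off":
        env["HF_HUB_DISABLE_XET"] = "1"
    if proxy:
        # One proxy URL for both schemes, e.g. socks5h://127.0.0.1:9909.
        env["HTTP_PROXY"] = proxy
        env["HTTPS_PROXY"] = proxy
    if etag_timeout is not None:
        env["HF_HUB_ETAG_TIMEOUT"] = str(etag_timeout)
    if download_timeout is not None:
        env["HF_HUB_DOWNLOAD_TIMEOUT"] = str(download_timeout)
    return env
```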

Avoid XET/CAS when it is unreliable:

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --transcriber-hf-xet off \
  --output-roller ./song.lrc

Use a local SOCKS proxy and conservative download settings:

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --transcriber-hf-proxy socks5://127.0.0.1:7890 \
  --transcriber-hf-download-timeout 120 \
  --transcriber-hf-etag-timeout 30 \
  --transcriber-hf-max-workers 2 \
  --output-roller ./song.lrc

The audio-core requirements install SOCKS support through httpx[socks]. If the environment was installed manually and SOCKS support is missing, run:

py-roller install

or install the missing dependency directly:

pip install "httpx[socks]"

VAD filtering

faster-whisper VAD (Voice Activity Detection) filtering skips silent sections during transcription, reducing processing time by 20–40% for songs with pauses or instrumental breaks. It is enabled by default.

Disable VAD filtering if you need word-level timestamps for every segment, or if the VAD model is cutting audio too aggressively:

py-roller run \
  --stages t,p,a,w \
  --audio ./vocals.wav \
  --lyrics ./song.txt \
  --language zh \
  --no-transcriber-vad-filter \
  --output-roller ./song.lrc

GPU auto-detection

When --transcriber-device is not explicitly set, py-roller automatically checks for an available CUDA GPU. If found, the transcriber defaults to device=cuda with compute_type=float16 for significantly faster inference. This can be overridden with --transcriber-device cpu or --transcriber-compute-type int8.
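The auto-detection rule can be sketched as below. The CUDA probe uses `torch.cuda.is_available()`, and the CPU compute type falls back to "default", which is faster-whisper's own default; whether py-roller picks that exact value on CPU is an assumption.

```python
from typing import Optional, Tuple


def pick_device(
    device: Optional[str] = None,
    compute_type: Optional[str] = None,
    cuda_available: Optional[bool] = None,
) -> Tuple[str, str]:
    """Return (device, compute_type), honoring explicit overrides first."""
    if cuda_available is None:
        try:
            import torch  # only needed when no explicit answer is supplied
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    if device is None:
        device = "cuda" if cuda_available else "cpu"
    if compute_type is None:
        # float16 on GPU per the docs; "default" on CPU is an assumption.
        compute_type = "float16" if device == "cuda" else "default"
    return device, compute_type
```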

Model pre-download

Pre-download a transcriber model into the local model store so that later pipeline runs can use --transcriber-local-files-only without touching the network:

py-roller cache-model --language zh
py-roller cache-model --language zh --transcriber-model-name large-v3
py-roller cache-model --language zh --transcriber-hf-xet off --transcriber-hf-proxy socks5://127.0.0.1:7890

Writer behavior

LRC

The default writer is lrc_ms, which writes LRC lines with millisecond precision.

Supported writer backends:

  • lrc_ms: millisecond precision.
  • lrc_cs: centisecond precision.
  • lrc_compressed: millisecond precision, with consecutive identical timestamps compressed.
  • ass_karaoke: ASS dialogue output with karaoke timing tags.
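The difference between lrc_ms and lrc_cs is only the fractional part of the [mm:ss.xx] tag. This is a minimal formatter for both precisions; it illustrates the LRC tag format and is not py-roller's writer.

```python
def lrc_timestamp(seconds: float, precision: str = "ms") -> str:
    """Format seconds as an LRC time tag: [mm:ss.xxx] (ms) or [mm:ss.xx] (cs)."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    if precision == "ms":
        unit, digits = 1000, 3
    elif precision == "cs":
        unit, digits = 100, 2
    else:
        raise ValueError(f"unknown precision: {precision!r}")
    # Round once in integer ticks so carries (59.9996 s -> 01:00.000) are exact.
    total = round(seconds * unit)
    minutes, rem = divmod(total, 60 * unit)
    secs, frac = divmod(rem, unit)
    return f"[{minutes:02d}:{secs:02d}.{frac:0{digits}d}]"
```

Rounding to integer ticks before splitting into minutes and seconds avoids tags such as [00:59.1000].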

ASS karaoke

ass_karaoke writes ASS dialogue lines with karaoke timing tags.

Current behavior:

  • structural/spacing line output follows --writer-spacing (keep by default).
  • display end time prefers matched unit timing instead of blindly extending to the next line.
  • unmatched lines receive a short visible-duration fallback.
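ASS karaoke timing uses \k tags measured in centiseconds inside a standard Dialogue line. This is a minimal sketch of building one line from (text, start, end) units. The layer, style name, and the choice to absorb inter-unit gaps into the preceding unit are illustrative assumptions, not a description of py-roller's writer.

```python
from typing import List, Tuple


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp H:MM:SS.cc."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def ass_karaoke_line(units: List[Tuple[str, float, float]], style: str = "Default") -> str:
    """Build one Dialogue line from the timed units of a single lyric line."""
    if not units:
        raise ValueError("a karaoke line needs at least one timed unit")
    parts = []
    for i, (text, start, end) in enumerate(units):
        # Each \k lasts until the next unit starts, so a gap is absorbed by
        # the preceding unit instead of shifting later syllables.
        next_start = units[i + 1][1] if i + 1 < len(units) else end
        duration_cs = max(0, round((next_start - start) * 100))
        parts.append(f"{{\\k{duration_cs}}}{text}")
    line_start, line_end = units[0][1], units[-1][2]
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
    return f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},{style},,0,0,0,,{''.join(parts)}"
```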

Example:

py-roller run \
  --stages w \
  --alignment-result ./song.alignment.json \
  --writer-backend ass_karaoke \
  --output-roller ./song.ass

Batch mode

batch uses the same stage semantics as run, but applies them to many tasks.

Directory pairing

Directory mode currently supports:

--pair-by stem

Default candidate globs:

  • --audio-glob "*.mp3"
  • --lyrics-glob "*.txt"

Matching is non-recursive.
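Stem pairing with non-recursive globs can be sketched with pathlib, whose `Path.glob("*.mp3")` only looks at the top level of a directory. How py-roller reports unmatched files is not documented here; returning them separately is an assumption.

```python
from pathlib import Path
from typing import List, Tuple


def pair_by_stem(
    audio_dir: Path,
    lyrics_dir: Path,
    audio_glob: str = "*.mp3",
    lyrics_glob: str = "*.txt",
) -> Tuple[List[Tuple[str, Path, Path]], List[str]]:
    """Pair files sharing a stem; return (pairs, stems found on only one side)."""
    audio = {p.stem: p for p in Path(audio_dir).glob(audio_glob) if p.is_file()}
    lyrics = {p.stem: p for p in Path(lyrics_dir).glob(lyrics_glob) if p.is_file()}
    pairs = [(stem, audio[stem], lyrics[stem]) for stem in sorted(audio.keys() & lyrics.keys())]
    # ASSUMPTION: unmatched stems are reported rather than silently dropped.
    unmatched = sorted(audio.keys() ^ lyrics.keys())
    return pairs, unmatched
```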

Example:

py-roller batch \
  --stages t,p,a,w \
  --audio ./audio_dir \
  --lyrics ./lyrics_dir \
  --language zh \
  --output-roller ./out_dir

Batch controls

  • --jobs N: maximum number of parallel workers.
  • --continue-on-error: keep processing remaining tasks after failures.
  • --skip-existing: skip tasks whose declared final outputs already exist.
  • --manifest jobs.yaml: load explicit per-task paths from YAML instead of pairing by stem.

Parallelism guidance:

  • CPU-only: start with --jobs 1 or --jobs 2.
  • Single GPU: usually start with --jobs 1.
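The interaction between --jobs and --continue-on-error can be modeled with a bounded worker pool that stops launching new tasks after the first failure. This sketch uses a thread pool for simplicity; py-roller's actual worker model (threads or processes) is not stated here.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, Hashable, List, Tuple


def run_batch(
    tasks: List[Hashable],
    worker: Callable[[Hashable], Any],
    jobs: int = 1,
    continue_on_error: bool = False,
) -> Tuple[Dict[Hashable, Any], Dict[Hashable, BaseException], List[Hashable]]:
    """Run tasks with at most `jobs` workers; return (results, errors, never_started)."""
    if jobs < 1:
        raise ValueError("jobs must be >= 1")
    queue = list(tasks)
    results: Dict[Hashable, Any] = {}
    errors: Dict[Hashable, BaseException] = {}
    running: Dict[Future, Hashable] = {}
    stop = False
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while running or (queue and not stop):
            # Launch new work only until a fail-fast stop has been triggered.
            while queue and not stop and len(running) < jobs:
                task = queue.pop(0)
                running[pool.submit(worker, task)] = task
            done, _ = wait(set(running), return_when=FIRST_COMPLETED)
            for fut in done:
                task = running.pop(fut)
                exc = fut.exception()
                if exc is None:
                    results[task] = fut.result()
                else:
                    errors[task] = exc
                    if not continue_on_error:
                        stop = True  # already-running tasks still finish
    return results, errors, queue
```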

YAML manifest format

Manifest mode is useful when filenames do not match cleanly by stem.

The manifest defines per-task input and output paths only. It does not override stage selection, language, backend choice, filter settings, jobs, or other batch-level options.

Supported top-level forms:

tasks:
  - id: song01
    audio: ./audio/song01_master.mp3
    lyrics: ./lyrics/song01_final.txt
    output_roller: ./out/song01.lrc

or:

- id: song01
  audio: ./audio/song01_master.mp3
  lyrics: ./lyrics/song01_final.txt
  output_roller: ./out/song01.lrc

Allowed manifest input keys:

  • audio
  • lyrics
  • timed_units
  • parsed_lyrics
  • alignment_result

Allowed manifest output keys:

  • output_vocal_audio
  • output_filtered_audio
  • output_timed_units
  • output_parsed_lyrics
  • output_alignment_result
  • output_roller

Optional helper key:

  • id

Validation rules:

  • each task must be a mapping.
  • unknown keys are rejected.
  • inputs must match the selected chain start.
  • outputs must be valid final outputs for the selected chain.
  • task ids/stems must be unique.
  • final output paths must not conflict across tasks.
  • relative paths are resolved relative to the manifest file location.
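Most of these rules can be checked on already-parsed YAML data (for example, the result of `yaml.safe_load`). This sketch covers the mapping, unknown-key, uniqueness, output-conflict, and relative-path rules; it skips the chain-start checks. Using the first input's stem as the id when none is given is an assumption.

```python
from pathlib import Path
from typing import Any, Dict, List

INPUT_KEYS = ("audio", "lyrics", "timed_units", "parsed_lyrics", "alignment_result")
OUTPUT_KEYS = ("output_vocal_audio", "output_filtered_audio", "output_timed_units",
               "output_parsed_lyrics", "output_alignment_result", "output_roller")
ALLOWED_KEYS = set(INPUT_KEYS) | set(OUTPUT_KEYS) | {"id"}


def load_manifest_tasks(data: Any, manifest_path: Path) -> List[Dict[str, Any]]:
    """Validate parsed manifest data and resolve paths relative to the manifest."""
    tasks = data.get("tasks") if isinstance(data, dict) else data
    if not isinstance(tasks, list):
        raise ValueError("manifest must be a task list or a mapping with a 'tasks' list")
    base = Path(manifest_path).parent
    seen_ids = set()
    output_owner: Dict[Path, str] = {}
    resolved_tasks = []
    for index, task in enumerate(tasks):
        if not isinstance(task, dict):
            raise ValueError(f"task #{index} is not a mapping")
        unknown = set(task) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"task #{index} has unknown keys: {sorted(unknown)}")
        resolved: Dict[str, Any] = {}
        for key, value in task.items():
            if key == "id":
                resolved[key] = str(value)
                continue
            path = Path(str(value)).expanduser()
            resolved[key] = path if path.is_absolute() else base / path
        inputs = [resolved[k] for k in INPUT_KEYS if k in resolved]
        if not inputs:
            raise ValueError(f"task #{index} declares no inputs")
        # ASSUMPTION: without an explicit id, the first input's stem names the task.
        task_id = resolved.get("id") or inputs[0].stem
        if task_id in seen_ids:
            raise ValueError(f"duplicate task id/stem: {task_id!r}")
        seen_ids.add(task_id)
        for key in OUTPUT_KEYS:
            if key in resolved:
                target = resolved[key].resolve()
                if target in output_owner:
                    raise ValueError(f"{key} of {task_id!r} conflicts with {output_owner[target]!r}")
                output_owner[target] = task_id
        resolved["id"] = task_id
        resolved_tasks.append(resolved)
    return resolved_tasks
```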

YAML config for CLI defaults

Use --config to load YAML defaults.

Priority order:

built-in defaults < config YAML < explicit CLI arguments

Section model:

  • shared: defaults applied to both run and batch.
  • run: currently no extra keys beyond shared.
  • batch: defaults for batch-only options such as jobs and skip_existing.
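The priority order and section model can be sketched as a layered dictionary merge. Two details are assumptions: a command section (run or batch) overrides shared, and CLI values of None mean "not given on the command line".

```python
from typing import Any, Dict


def merge_defaults(
    builtin: Dict[str, Any],
    config: Dict[str, Any],
    cli: Dict[str, Any],
    command: str,
) -> Dict[str, Any]:
    """built-in < config shared < config[command] < explicit CLI arguments."""
    merged = dict(builtin)
    merged.update(config.get("shared") or {})
    # ASSUMPTION: the command-specific section overrides shared.
    merged.update(config.get(command) or {})
    # ASSUMPTION: None marks a CLI option that was not given explicitly.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged
```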

Example:

shared:
  language: zh
  writer_spacing: keep
  writer_backend: lrc_ms
  intermediate: ./tmp/py-roller-artifacts
  cleanup: on-success
  transcriber_device: cpu
  transcriber_model_path: ~/.cache/py-roller/models/transcriber
  transcriber_local_files_only: false
  transcriber_vad_filter: true
  transcriber_hf_xet: auto
  transcriber_hf_proxy: null
  transcriber_hf_download_timeout: 120
  transcriber_hf_etag_timeout: 30
  transcriber_hf_max_workers: 2
  splitter_backend: demucs
  splitter_demucs_model: htdemucs
  splitter_demucs_device: cpu
  splitter_demucs_jobs: 0
  splitter_demucs_overlap: 0.25
  splitter_demucs_segment: 8
  filter_chain:
    - noise_gate
    - dereverb

batch:
  jobs: 2
  audio_glob: "*.mp3"
  lyrics_glob: "*.txt"
  timed_units_glob: "*.json"
  parsed_lyrics_glob: "*.json"
  alignment_result_glob: "*.json"
  continue_on_error: true

filter_chain can be written either as a comma-separated string or as a YAML list. Quote "on" and "off" for transcriber_hf_xet if your YAML parser treats them as booleans (YAML 1.1 parsers such as PyYAML do); py-roller also accepts boolean true/false there as on/off for convenience.
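Both conveniences amount to small normalization steps after YAML parsing. This is a hypothetical sketch of that normalization, not py-roller's loader.

```python
from typing import Any, List


def normalize_filter_chain(value: Any) -> List[str]:
    """Accept "noise_gate,dereverb" or ["noise_gate", "dereverb"]."""
    if value is None:
        return []
    if isinstance(value, str):
        items = value.split(",")
    elif isinstance(value, (list, tuple)):
        items = [str(v) for v in value]
    else:
        raise TypeError(f"filter_chain must be a string or list, not {type(value).__name__}")
    return [item.strip() for item in items if item.strip()]


def normalize_hf_xet(value: Any) -> str:
    """Map YAML booleans (from an unquoted on/off) back to "on"/"off"."""
    if value is True:
        return "on"
    if value is False:
        return "off"
    if isinstance(value, str) and value.lower() in ("auto", "on", "off"):
        return value.lower()
    raise ValueError(f"transcriber_hf_xet must be auto, on, or off, not {value!r}")
```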

Progress, logs, and cleanup

The project exposes progress in two layers:

  • human-readable logs for normal terminal use;
  • optional machine-readable JSONL events for GUI frontends such as lrc-roller.

Use --progress-format to choose the progress output mode:

py-roller run ... --progress-format human   # default, terminal-friendly logs
py-roller run ... --progress-format jsonl   # structured PYROLLER_EVENT lines
py-roller run ... --progress-format both    # logs plus structured events

jsonl emits one parseable event per line with the PYROLLER_EVENT prefix, for example:

PYROLLER_EVENT {"type":"download_progress","stage":"model_download","parent_stage":"preflight","repo_id":"Systran/faster-whisper-large-v2","file":"model.bin","bytes_downloaded":1534203904,"bytes_total":3086912962,"progress":0.497}

This is intended for frontends that need reliable stage and download progress instead of parsing mixed logs from tqdm, Demucs, and huggingface_hub. Human-readable mode remains the default so existing CLI workflows are unchanged.

Structured events use progress as the canonical 0.0 to 1.0 field. A percent compatibility alias is still emitted for early GUI integrations. Standard stages are preflight, model_download, splitter, filter, transcriber, parser, aligner, and writer; model download events also include parent_stage: preflight.
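A frontend can filter event lines out of mixed output by their prefix and read progress from the canonical field. This sketch assumes the legacy percent alias is on a 0 to 100 scale. It also leaves open which stream carries the events, so pass it whichever stdout or stderr lines you capture.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional

PREFIX = "PYROLLER_EVENT"


def iter_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield parsed PYROLLER_EVENT payloads, skipping ordinary log lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith(PREFIX):
            continue  # tqdm, Demucs, or human-readable log output
        try:
            yield json.loads(line[len(PREFIX):].strip())
        except json.JSONDecodeError:
            continue  # ignore a malformed or truncated event line


def progress_of(event: Dict[str, Any]) -> Optional[float]:
    """Prefer the canonical 0.0-1.0 "progress" field, then the "percent" alias."""
    if event.get("progress") is not None:
        return float(event["progress"])
    if event.get("percent") is not None:
        # ASSUMPTION: the compatibility alias is expressed on a 0-100 scale.
        return float(event["percent"]) / 100.0
    return None
```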

Current progress coverage:

  • run lifecycle events: run_started, run_completed, and run_failed;
  • model preflight and Hugging Face model download events, including cache path, proxy/XET settings, file count, largest file name, bytes downloaded, total bytes when known, and estimated speed;
  • heartbeat events during long model downloads and faster-whisper transcription periods;
  • splitter/Demucs seconds-based progress as structured splitter events;
  • filter phase progress;
  • transcriber phase progress, including faster-whisper segment count, last processed audio time, duration hints, and text previews when available;
  • parser, aligner, and writer stage events;
  • artifact write events and failure events.

In a single-task run, human-readable progress is shown as terminal logs and progress bars where supported. In batch mode, per-task progress is written to logs so that multiple workers do not compete for one terminal. GUI frontends should prefer --progress-format jsonl.

Intermediate files live under:

--intermediate/<task-id>/splitter
--intermediate/<task-id>/filter
--intermediate/<task-id>/logs

Default intermediate root:

<system temp>/py-roller-artifacts

Cleanup policy:

  • --cleanup on-success: remove per-task intermediate directories after successful tasks.
  • --cleanup never: keep intermediate audio and logs for inspection.
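The directory layout and cleanup policy can be sketched with pathlib and shutil. Only the splitter/filter/logs sub-directories and the default root name come from the text above; the helper names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional


def task_workdirs(task_id: str, root: Optional[str] = None) -> Dict[str, Path]:
    """Create <root>/<task-id>/{splitter,filter,logs} and return them by name."""
    base = Path(root) if root else Path(tempfile.gettempdir()) / "py-roller-artifacts"
    dirs = {name: base / task_id / name for name in ("splitter", "filter", "logs")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def cleanup_task(task_id: str, root: str, policy: str, succeeded: bool) -> bool:
    """Apply --cleanup; return True if the task directory was removed."""
    if policy == "never":
        return False
    if policy != "on-success":
        raise ValueError(f"unknown cleanup policy: {policy!r}")
    if not succeeded:
        return False  # keep failed-task artifacts for inspection
    shutil.rmtree(Path(root) / task_id, ignore_errors=True)
    return True
```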

Troubleshooting

Check the environment

py-roller doctor

For integrations, use JSON output:

py-roller doctor --output-format json

doctor checks Python, Torch, Torchaudio, faster-whisper, CTranslate2, transformers, SOCKS proxy support, Demucs, and librosa.

If it reports a broken audio/transcriber environment, start with:

py-roller install

Hugging Face model download progress

For restricted networks, prefer disabling HF XET/CAS and routing downloads through a SOCKS proxy with remote DNS resolution (the socks5h:// scheme):

py-roller run ... \
  --transcriber-hf-xet off \
  --transcriber-hf-proxy socks5h://127.0.0.1:9909

If a model has already been materialized into the py-roller model store, use local-only mode to avoid touching the network on later runs:

py-roller run ... \
  --transcriber-model-path ~/.cache/py-roller/models/transcriber \
  --transcriber-local-files-only

The Hugging Face file-count progress shown by huggingface_hub can appear stuck on large model files. Use --progress-format jsonl or both to get byte-level download_progress events with cache growth, speed, and total size when available.

Interruption and child process cleanup

When --continue-on-error is not set, batch mode fails fast: it stops launching new tasks after the first task failure. Worker cleanup also includes a Windows-specific branch for terminating child process trees.

For older runs or already orphaned processes, Linux/macOS cleanup examples are still useful:

pkill -TERM -f 'python .*pyroller'
pkill -TERM -f 'demucs.separate|demucs'

If anything still survives:

pkill -KILL -f 'python .*pyroller'
pkill -KILL -f 'demucs.separate|demucs'

Inspect candidates first with:

ps -ef | grep -E 'pyroller|demucs'

Dependency policy

py-roller install prefers the newest validated dependency line for this release:

  • Torch/TorchAudio/TorchVision are installed from the official 2.6.0 family for every built-in profile.
  • SOCKS proxy support is installed by default through httpx[socks], so Hugging Face downloads do not fail merely because socksio is missing.

If you upgrade or override audio/transcriber packages manually, run py-roller doctor before using transcription-heavy pipelines.


Download files


Source Distribution

py_roller-0.5.7.tar.gz (106.1 kB)

Uploaded Source

Built Distribution


py_roller-0.5.7-py3-none-any.whl (133.2 kB)

Uploaded Python 3

File details

Details for the file py_roller-0.5.7.tar.gz.

File metadata

  • Download URL: py_roller-0.5.7.tar.gz
  • Upload date:
  • Size: 106.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_roller-0.5.7.tar.gz
Algorithm Hash digest
SHA256 e3bbdfea205d0cb87d885181e420dfa3f4e02a08e4f53c5ed5e5e5c5826c10a8
MD5 b6c4447e4f3655b87d5c70db24c363f0
BLAKE2b-256 e1994b2b3993107c03220da44c5f2e4b4025681bf2a8ffefffca7306dd5f9373


Provenance

The following attestation bundles were made for py_roller-0.5.7.tar.gz:

Publisher: python-publish.yml on Harmonese/py-roller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file py_roller-0.5.7-py3-none-any.whl.

File metadata

  • Download URL: py_roller-0.5.7-py3-none-any.whl
  • Upload date:
  • Size: 133.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for py_roller-0.5.7-py3-none-any.whl
Algorithm Hash digest
SHA256 774f2affaeeede29b96e920e1740facebc80319b04ce128005a2ac76e302adb3
MD5 c2075807b24c0ff51d14dfa5e435322c
BLAKE2b-256 94a8348c0583ae70e0057d1f1a3fd1788b08ced571df977aa031778197bee643


Provenance

The following attestation bundles were made for py_roller-0.5.7-py3-none-any.whl:

Publisher: python-publish.yml on Harmonese/py-roller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
