Constraint-aware audio resynthesis and distillation pipeline.

CARD Framework

This repository is the current implementation of CARD: Constraint-aware Audio Resynthesis and Distillation, the project described in EEE_196_CARD_UCL.md.

The paper is the conceptual and academic baseline. The codebase, however, has already moved beyond parts of the manuscript's original implementation plan. This README therefore prioritizes what the repository actually does now. When the paper and the current code diverge, treat the code, config, and coder_docs as the source of truth for day-to-day development.

Paper Metadata

Authors

  • Rei Dennis Agustin, 2022-03027, BS Electronics Engineering
  • Sean Luigi P. Caranzo, 2022-05398, BS Computer Engineering
  • Johnbell R. De Leon, 2021-01437, BS Computer Engineering
  • Christian Klein C. Ramos, 2022-03126, BS Electronics Engineering

Research Adviser

  • Rowel D. Atienza

Affiliation

  • University of the Philippines Diliman
  • December 2025

Abstract

CARD addresses the long-form podcast consumption bottleneck by generating a shorter conversational audio output that retains speaker identity and prosodic character instead of collapsing everything into plain text. The project combines transcript generation, speaker-aware summarization, voice-cloned resynthesis, and conversational overlap handling so a multi-speaker recording can be compressed toward a user-defined duration without discarding the listening experience that makes the original medium valuable.

High-Level Architecture

flowchart LR
    A[Source Audio] --> B[Stage 1<br/>Audio Ingestion]
    B --> C[Transcript JSON<br/>Speaker Metadata]
    C --> D[Stage 2<br/>Summarizer + Critic Loop]
    D --> E[Summary XML<br/>Speaker-Tagged Turns]
    E --> F[Stage 3<br/>Voice Clone Resynthesis]
    F --> G[Cloned Summary Audio]
    G --> H[Stage 4<br/>Interjector / Backchannels]
    H --> I[Final Conversational Audio]

    C -. Optional evaluation input .-> J[Benchmarks]
    E -. Optional evaluation input .-> J

    K[Hydra Config + Provider Adapters] -. controls .-> B
    K -. controls .-> D
    K -. controls .-> F
    K -. controls .-> H

What CARD Does

CARD is a multi-stage pipeline for converting long-form multi-speaker audio into a shorter, speaker-aware, resynthesized conversational output.

At a high level, the repository currently supports:

  • Stage 1: Audio ingestion and transcript generation
    • Source separation
    • ASR, diarization, and alignment
    • Transcript JSON generation with speaker metadata
  • Stage 2: Constraint-aware summarization
    • Summarizer and critic agent loop
    • Duration-first summary generation with speaker-tagged XML output
    • Retrieval-backed or full-transcript summarization paths
  • Stage 3: Voice cloning and resynthesis
    • Speaker sample generation
    • Voice-cloned rendering of summary turns
    • Live-draft voice cloning during summarizer edits
  • Stage 4: Conversational interjection
    • Optional overlap and backchannel synthesis on top of the cloned summary
  • Benchmarking and evaluation
    • Summarization benchmark workflows
    • Source-grounded QA benchmark workflows
    • Diarization benchmark workflows

Paper vs. Current Repository

EEE_196_CARD_UCL.md contains the original CARD paper: its problem framing and proposed module design. The repository now reflects a more developed engineering system than that initial write-up.

Important differences from the manuscript-level description include:

  • The repo is now configuration-driven through Hydra instead of being tied to one fixed experimental path.
  • The runtime is now duration-first, centered on target_seconds and tolerance checks, rather than a simple word-budget-only workflow.
  • The summary output contract is now speaker-tagged XML, which feeds the downstream voice-clone and interjector stages.
  • The default stage-2/stage-3 flow can use live-draft voice cloning, where turn audio is rendered during summary editing instead of only after the final draft is approved.
  • The repository includes substantial benchmarking, evaluation, and operator tooling that goes beyond the initial paper narrative.
  • Provider support has expanded: the codebase is organized around adapters and config-selected backends rather than a single hardcoded model stack.

In short: the paper explains why CARD exists; this repository captures how CARD currently works.
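
Because stage 2 emits speaker-tagged XML, downstream stages can consume it with standard XML tooling. The element and attribute names below are illustrative assumptions, not the repository's actual summary.xml schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of a speaker-tagged summary; the real schema is
# defined by the repository's stage-2 output contract.
SAMPLE = """
<summary target_seconds="300">
  <turn speaker="SPEAKER_00">Welcome back to the show.</turn>
  <turn speaker="SPEAKER_01">Thanks, glad to be here.</turn>
</summary>
"""

def turns_by_speaker(xml_text: str) -> dict[str, list[str]]:
    """Group turn texts by their speaker tag."""
    root = ET.fromstring(xml_text)
    out: dict[str, list[str]] = {}
    for turn in root.iter("turn"):
        out.setdefault(turn.get("speaker", "?"), []).append(turn.text or "")
    return out
```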

Repository Layout

src/card_framework/
  agents/           A2A executors, DTOs, tool loops, client transport
  audio_pipeline/   Audio ingestion, speaker samples, voice cloning, interjector
  benchmark/        Summarization, QA, and diarization benchmarks
  cli/              Runtime, setup, calibration, matrix, and eval entrypoints
  config/           Hydra configuration
  orchestration/    Transcript DTOs and stage orchestration
  prompts/          Jinja2 prompt templates
  providers/        LLM and embedding provider adapters
  retrieval/        Transcript indexing and retrieval
  runtime/          Runtime planning and execution support
  shared/           Shared utilities, events, and logging
  _vendor/index_tts/ Vendored IndexTTS runtime

Other important locations:

  • artifacts/: generated transcripts, cloned audio, benchmark outputs, and other runtime artifacts
  • checkpoints/: local model/runtime checkpoints
  • coder_docs/: repository-specific architecture, workflow, and maintenance guidance

Common Commands

uv sync --dev
uv run python -m card_framework.cli.main --help
uv run python -m card_framework.cli.setup_and_run --help
uv run python -m card_framework.cli.calibrate --help
uv run python -m card_framework.cli.run_summary_matrix --help
uv run python -m card_framework.benchmark.run --help
uv run python -m card_framework.benchmark.diarization --help
uv run python -m card_framework.benchmark.qa --help
uv run ruff check .
uv run pytest

Common execution entrypoints:

uv run python -m card_framework.cli.setup_and_run --audio-path <path-to-audio>
uv run python -m card_framework.cli.main
uv run python -m card_framework.cli.calibrate

Package Usage

The repository now exposes a library entrypoint for installed-package use:

pip install card-framework

from card_framework import infer

result = infer(
    "audio.wav",
    "outputs/run_001",
    300,
    device="cpu",
    vllm_url="http://localhost:8000/v1",
)
print(result.summary_xml_path)
print(result.final_audio_path)

infer(audio_wav, output_dir, target_duration_seconds, *, device, ...) runs the full stage-1 to stage-4 pipeline and returns an InferenceResult with the main emitted artifact paths.

  • target_duration_seconds is required for every call and overrides any duration target declared in the loaded config file.
  • device is also required and must be either cpu or cuda.
  • vllm_url is the first-class packaged-runtime override for OpenAI-compatible endpoints; it forces the shared summarizer, critic, and interjector LLM path onto the provided vLLM-compatible server for that call.

The call writes into output_dir using this high-level layout:

outputs/run_001/
  transcript.json
  summary.xml
  agent_interactions.log
  audio_stage/
    voice_clone/
    interjector/
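
A finished run can be sanity-checked against that layout with plain pathlib. This helper is a sketch based only on the directory tree shown above:

```python
from pathlib import Path

# Relative paths from the documented output layout above.
EXPECTED = [
    "transcript.json",
    "summary.xml",
    "agent_interactions.log",
    "audio_stage/voice_clone",
    "audio_stage/interjector",
]

def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the documented artifact paths that are absent from run_dir."""
    return [rel for rel in EXPECTED if not (run_dir / rel).exists()]
```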

Installed-package runtime notes:

  • Supported public packaged-runtime platform as of March 9, 2026: Windows only. macOS and Linux are not yet validated for the public pip install card-framework whole-pipeline path, and infer(...) now fails fast on those platforms instead of attempting a partial run.
  • CARD_FRAMEWORK_CONFIG: optional path to a full YAML config file when you need to override the default packaged provider/runtime config for infer(...).
  • CARD_FRAMEWORK_HOME: optional writable runtime home used for extracted IndexTTS assets, checkpoints, and bootstrap state. If unset, the package uses the platform-appropriate user data directory.
  • CARD_FRAMEWORK_VLLM_URL: optional environment-variable equivalent of the vllm_url= argument.
  • CARD_FRAMEWORK_VLLM_API_KEY: optional environment-variable equivalent of the vllm_api_key= argument. If omitted for vLLM, the packaged runtime uses EMPTY, which matches the common local keyless vLLM setup.
  • If you choose device="cuda", the packaged runtime currently supports only CUDA 12.6. infer(...) now validates that the installed PyTorch build reports CUDA 12.6 before it proceeds.
  • The packaged default is now vLLM-first. If the effective config selects another provider, infer(...) resolves required credentials before it starts the subprocess runtime:
    • interactive terminals: infer(...) securely prompts for missing API keys or access tokens without echoing them and without placing them on the subprocess command line
    • non-interactive runs: infer(...) fails fast with an actionable error that names the missing config field and the supported environment variable
  • Supported credential environment variables for the packaged path include DEEPSEEK_API_KEY, GEMINI_API_KEY or GOOGLE_API_KEY, ZAI_API_KEY, HUGGINGFACE_TOKEN or HF_TOKEN, and the configured audio.diarization.pyannote.auth_token_env value.
  • CARD_FRAMEWORK_FFMPEG_EXECUTABLE: optional path to a custom ffmpeg binary. When unset, packaged infer(...) falls back to the bundled imageio-ffmpeg executable and prepends its directory to PATH for nested subprocesses.
  • CARD_FRAMEWORK_UV_EXECUTABLE: optional path to a custom uv binary. When unset, packaged infer(...) resolves the installed uv console script from the active environment before bootstrapping the vendored IndexTTS runtime.
  • Packaged infer(...) no longer publishes ctc-forced-aligner in Requires-Dist. It first tries to install the pinned upstream source on demand when stage-1 forced alignment needs it. If that bootstrap cannot complete, packaged inference falls back to approximate segment-derived timing instead of failing the whole run.

Public PyPI Release

This repository now includes a GitHub Actions trusted-publishing workflow at .github/workflows/publish-pypi.yml that publishes tags matching v* to PyPI.

The public PyPI project already exists. As of March 9, 2026:

  • 1.0.1 is the first public release, but its metadata declared the wrong bare ctc-forced-aligner dependency name, breaking installs for downstream pip users.
  • v1.0.2 was tagged but never published because PyPI rejected the direct Git dependency metadata.
  • 1.0.3 is the current public recovery release.
  • The next install-path fix must ship under a new version such as 1.0.4; do not reuse a failed or already-published version number.

Repository-side release steps:

  1. Run the release preflight in coder_docs/github_actions_release_spec.md, including build, targeted tests, and artifact-scoped uv publish --dry-run.

  2. Tag the release from main and push it, for example:

    git tag -a v1.0.4 -m v1.0.4
    git push origin v1.0.4
    
  3. Do not assume the release is complete just because the tag push succeeded. Watch the GitHub Actions run to completion and inspect failures directly if needed:

    gh run list --workflow "Publish PyPI Package" --limit 1
    gh run watch <run-id> --exit-status
    gh run view <run-id> --log-failed
    
  4. After the workflow succeeds, verify the public release:

    python -m pip install --no-cache-dir card-framework
    python -c "from card_framework import infer; print(infer)"
    

For the repo-specific release build standards and post-tag verification rules, see coder_docs/github_actions_release_spec.md.

Documentation

If you are changing behavior, prompts, workflows, or commands, start with coder_docs/codebase_guide.md.

License

This repository is source-available under LICENSE.md, using the PolyForm Noncommercial 1.0.0 license. Noncommercial use is allowed; commercial use requires separate permission from the licensors.
