Constraint-aware audio resynthesis and distillation pipeline.
CARD Framework
This repository is the current implementation of CARD: Constraint-aware Audio
Resynthesis and Distillation, the project described in
EEE_196_CARD_UCL.md.
The paper is the conceptual and academic baseline. The codebase, however, has
already moved beyond parts of the manuscript's original implementation plan.
This README therefore prioritizes what the repository actually does now.
When the paper and the current code diverge, treat the code, config, and
coder_docs as the source of truth for day-to-day development.
Paper Metadata
Authors
- Rei Dennis Agustin, 2022-03027, BS Electronics Engineering
- Sean Luigi P. Caranzo, 2022-05398, BS Computer Engineering
- Johnbell R. De Leon, 2021-01437, BS Computer Engineering
- Christian Klein C. Ramos, 2022-03126, BS Electronics Engineering
Research Adviser
- Rowel D. Atienza
Affiliation
- University of the Philippines Diliman
- December 2025
Abstract
CARD addresses the long-form podcast consumption bottleneck by generating a shorter conversational audio output that retains speaker identity and prosodic character instead of collapsing everything into plain text. The project combines transcript generation, speaker-aware summarization, voice-cloned resynthesis, and conversational overlap handling so a multi-speaker recording can be compressed toward a user-defined duration without discarding the listening experience that makes the original medium valuable.
High-Level Architecture
flowchart LR
A[Source Audio] --> B[Stage 1<br/>Audio Ingestion]
B --> C[Transcript JSON<br/>Speaker Metadata]
C --> D[Stage 2<br/>Summarizer + Critic Loop]
D --> E[Summary XML<br/>Speaker-Tagged Turns]
E --> F[Stage 3<br/>Voice Clone Resynthesis]
F --> G[Cloned Summary Audio]
G --> H[Stage 4<br/>Interjector / Backchannels]
H --> I[Final Conversational Audio]
C -. Optional evaluation input .-> J[Benchmarks]
E -. Optional evaluation input .-> J
K[Hydra Config + Provider Adapters] -. controls .-> B
K -. controls .-> D
K -. controls .-> F
K -. controls .-> H
What CARD Does
CARD is a multi-stage pipeline for converting long-form multi-speaker audio into a shorter, speaker-aware, resynthesized conversational output.
At a high level, the repository currently supports:
- Stage 1: Audio ingestion and transcript generation
- Source separation
- ASR, diarization, and alignment
- Transcript JSON generation with speaker metadata
- Stage 2: Constraint-aware summarization
- Summarizer and critic agent loop
- Duration-first summary generation with speaker-tagged XML output
- Retrieval-backed or full-transcript summarization paths
- Stage 3: Voice cloning and resynthesis
- Speaker sample generation
- Voice-cloned rendering of summary turns
- Live-draft voice cloning during summarizer edits
- Stage 4: Conversational interjection
- Optional overlap and backchannel synthesis on top of the cloned summary
- Benchmarking and evaluation
- Summarization benchmark workflows
- Source-grounded QA benchmark workflows
- Diarization benchmark workflows
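The speaker-tagged XML produced by Stage 2 is what Stages 3 and 4 consume. As a rough sketch of how such a document might be read turn by turn, the example below uses Python's standard XML parser. The element name `turn`, the attribute `speaker`, and the sample content are all assumptions made for illustration; the real schema is defined by the repository and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema for illustration only: the tag and attribute names
# below are assumptions, not the repository's actual summary.xml contract.
SAMPLE_SUMMARY_XML = """
<summary>
  <turn speaker="SPEAKER_00">Welcome back to the show.</turn>
  <turn speaker="SPEAKER_01">Thanks, glad to be here.</turn>
</summary>
"""


def parse_turns(xml_text: str) -> list:
    """Return (speaker, text) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    turns = []
    for turn in root.iter("turn"):
        # Fall back to a placeholder label when the attribute is absent.
        speaker = turn.get("speaker", "UNKNOWN")
        text = (turn.text or "").strip()
        turns.append((speaker, text))
    return turns


turns = parse_turns(SAMPLE_SUMMARY_XML)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Keeping the speaker label on every turn is what would let a downstream stage pick the matching cloned voice for each line.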
Paper vs. Current Repository
EEE_196_CARD_UCL.md explains the original CARD paper,
problem framing, and proposed module design. The repository now reflects a more
developed engineering system than that initial write-up.
Important differences from the manuscript-level description include:
- The repo is now configuration-driven through Hydra instead of being tied to one fixed experimental path.
- The runtime is now duration-first, centered on target_seconds and tolerance checks, rather than a simple word-budget-only workflow.
- The summary output contract is now speaker-tagged XML, which feeds the downstream voice-clone and interjector stages.
- The default stage-2/stage-3 flow can use live-draft voice cloning, where turn audio is rendered during summary editing instead of only after the final draft is approved.
- The repository includes substantial benchmarking, evaluation, and operator tooling that goes beyond the initial paper narrative.
- Provider support has expanded: the codebase is organized around adapters and config-selected backends rather than a single hardcoded model stack.
In short: the paper explains why CARD exists; this repository captures how CARD currently works.
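To make the duration-first idea in the list above concrete, here is a minimal sketch of a tolerance check against a target duration. The function name, the `tolerance_ratio` parameter, and its default value are hypothetical; the repository's actual check and its configuration keys may look different.

```python
def within_tolerance(
    estimated_seconds: float,
    target_seconds: float,
    tolerance_ratio: float = 0.05,  # assumed default, not taken from the repo
) -> bool:
    """Return True when the estimate is within +/- tolerance of the target."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    if tolerance_ratio < 0:
        raise ValueError("tolerance_ratio must be non-negative")
    allowed_error = target_seconds * tolerance_ratio
    return abs(estimated_seconds - target_seconds) <= allowed_error


# A 300-second target with a 5% band accepts roughly 285 to 315 seconds.
print(within_tolerance(312.0, 300.0))  # inside the band
print(within_tolerance(330.0, 300.0))  # outside the band
```

Measuring the draft in seconds rather than words means that speaking rate and pauses count toward the budget, which a pure word budget cannot capture.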
Repository Layout
src/card_framework/
agents/ A2A executors, DTOs, tool loops, client transport
audio_pipeline/ Audio ingestion, speaker samples, voice cloning, interjector
benchmark/ Summarization, QA, and diarization benchmarks
cli/ Runtime, setup, calibration, matrix, and eval entrypoints
config/ Hydra configuration
orchestration/ Transcript DTOs and stage orchestration
prompts/ Jinja2 prompt templates
providers/ LLM and embedding provider adapters
retrieval/ Transcript indexing and retrieval
runtime/ Runtime planning and execution support
shared/ Shared utilities, events, and logging
_vendor/index_tts/
Other important locations:
- artifacts/: generated transcripts, cloned audio, benchmark outputs, and other runtime artifacts
- checkpoints/: local model/runtime checkpoints
- coder_docs/: repository-specific architecture, workflow, and maintenance guidance
Common Commands
uv sync --dev
uv run python -m card_framework.cli.main --help
uv run python -m card_framework.cli.setup_and_run --help
uv run python -m card_framework.cli.calibrate --help
uv run python -m card_framework.cli.run_summary_matrix --help
uv run python -m card_framework.benchmark.run --help
uv run python -m card_framework.benchmark.diarization --help
uv run python -m card_framework.benchmark.qa --help
uv run ruff check .
uv run pytest
Common execution entrypoints:
uv run python -m card_framework.cli.setup_and_run --audio-path <path-to-audio>
uv run python -m card_framework.cli.main
uv run python -m card_framework.cli.calibrate
Package Usage
The repository now exposes a library entrypoint for installed-package use:
pip install card-framework

from card_framework import infer

result = infer(
    "audio.wav",
    "outputs/run_001",
    300,
    device="cpu",
    vllm_url="http://localhost:8000/v1",
)
print(result.summary_xml_path)
print(result.final_audio_path)
infer(audio_wav, output_dir, target_duration_seconds, *, device, ...) runs
the full stage-1 to stage-4 pipeline and returns an InferenceResult with the
main emitted artifact paths. target_duration_seconds is required for every
call and overrides any duration target declared in the loaded config file.
device is also required and must be either cpu or cuda. vllm_url is the
first-class packaged-runtime override for OpenAI-compatible endpoints, and it
forces the shared summarizer, critic, and interjector LLM path onto the
provided vLLM-compatible server for that call. The call writes into output_dir
using this high-level layout:
outputs/run_001/
transcript.json
summary.xml
agent_interactions.log
audio_stage/
voice_clone/
interjector/
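After a run, it can be useful to confirm that the expected artifacts exist. The sketch below checks only the entries named in the layout above that sit directly under the output directory; it is not part of the package, the helper name `missing_artifacts` is hypothetical, and nothing is assumed about how `voice_clone/` and `interjector/` are nested.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Entries from the documented layout that sit directly under output_dir.
# voice_clone/ and interjector/ are not checked here.
EXPECTED_ENTRIES = (
    "transcript.json",
    "summary.xml",
    "agent_interactions.log",
    "audio_stage",
)


def missing_artifacts(output_dir: str | Path) -> list[str]:
    """Return the expected entry names that do not exist in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]


# Demo against a throwaway directory that holds only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "transcript.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "audio_stage").mkdir()
    missing = missing_artifacts(tmp)

print(missing)
```

For a real run you would pass the same directory you gave to infer(...), for example `missing_artifacts("outputs/run_001")`.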
Installed-package runtime notes:
- Supported public packaged-runtime platform as of March 9, 2026: Windows only. macOS and Linux are not yet validated for the public pip install card-framework whole-pipeline path, and infer(...) now fails fast on those platforms instead of attempting a partial run.
- CARD_FRAMEWORK_CONFIG: optional path to a full YAML config file when you need to override the default packaged provider/runtime config for infer(...).
- CARD_FRAMEWORK_HOME: optional writable runtime home used for extracted IndexTTS assets, checkpoints, and bootstrap state. If unset, the package uses the platform-appropriate user data directory.
- CARD_FRAMEWORK_VLLM_URL: optional environment-variable equivalent of the vllm_url= argument.
- CARD_FRAMEWORK_VLLM_API_KEY: optional environment-variable equivalent of the vllm_api_key= argument. If omitted for vLLM, the packaged runtime uses EMPTY, which matches the common local keyless vLLM setup.
- If you choose device="cuda", the packaged runtime currently supports only CUDA 12.6. infer(...) now inspects the installed PyTorch build first and, when the host itself reports CUDA 12.6, automatically replaces CPU-only or mismatched torch and torchaudio wheels with the CUDA 12.6 build before it proceeds. In a uv-managed project it uses uv pip; otherwise it falls back to python -m pip.
- The packaged default is now vLLM-first. If the effective config selects another provider, infer(...) resolves required credentials before it starts the subprocess runtime:
  - interactive terminals: infer(...) securely prompts for missing API keys or access tokens without echoing them and without placing them on the subprocess command line
  - non-interactive runs: infer(...) fails fast with an actionable error that names the missing config field and the supported environment variable
- Supported credential environment variables for the packaged path include DEEPSEEK_API_KEY, GEMINI_API_KEY or GOOGLE_API_KEY, ZAI_API_KEY, HUGGINGFACE_TOKEN or HF_TOKEN, and the configured audio.diarization.pyannote.auth_token_env value.
- CARD_FRAMEWORK_FFMPEG_EXECUTABLE: optional path to a custom ffmpeg binary. When unset, packaged infer(...) falls back to the bundled imageio-ffmpeg executable and prepends its directory to PATH for nested subprocesses.
- CARD_FRAMEWORK_UV_EXECUTABLE: optional path to a custom uv binary. When unset, packaged infer(...) resolves the installed uv console script from the active environment before bootstrapping the vendored IndexTTS runtime.
- Packaged infer(...) no longer publishes ctc-forced-aligner in Requires-Dist. It first tries to install the pinned upstream source on demand when stage-1 forced alignment needs it. If that bootstrap cannot complete, packaged inference falls back to approximate segment-derived timing instead of failing the whole run.
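The credential rules above (alternative variable names, a keyless default for vLLM, and a fail-fast error for non-interactive runs) can be sketched as follows. This is an illustration only, not the package's code: `resolve_credential` is a hypothetical helper, and checking the names in the listed order is an assumption.

```python
from __future__ import annotations

import os
from typing import Mapping, Sequence


def resolve_credential(
    names: Sequence[str],
    environ: Mapping[str, str] | None = None,
    default: str | None = None,
) -> str:
    """Return the first non-empty value among the given variable names."""
    env = os.environ if environ is None else environ
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    if default is not None:
        return default
    # Fail fast with a message that names the supported variables.
    raise RuntimeError("Missing credential; set one of: " + ", ".join(names))


# Demo with an explicit mapping so the real environment is not touched.
demo_env = {"GOOGLE_API_KEY": "demo-key"}
gemini_key = resolve_credential(["GEMINI_API_KEY", "GOOGLE_API_KEY"], demo_env)
vllm_key = resolve_credential(
    ["CARD_FRAMEWORK_VLLM_API_KEY"], demo_env, default="EMPTY"
)
print(gemini_key, vllm_key)
```

Passing the environment as a mapping keeps the helper easy to test and avoids reading secrets from the process environment during examples.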
Public PyPI Release
This repository now includes a GitHub Actions trusted-publishing workflow at
.github/workflows/publish-pypi.yml that publishes tags matching v* to PyPI.
The public PyPI project already exists. As of March 9, 2026:
- 1.0.1 is the first public release, but it published the wrong bare ctc-forced-aligner dependency name for downstream pip users.
- v1.0.2 was tagged but never published because PyPI rejected the direct Git dependency metadata.
- 1.0.4 is the current public release.
- The next install-path fix must ship under a new version such as 1.0.5; do not reuse a failed or already-published version number.
Repository-side release steps:
1. Create a dedicated release-preparation branch such as release/v1.0.5 from the target integration branch, then run the release preflight in coder_docs/github_actions_release_spec.md, including build, targeted tests, and artifact-scoped uv publish --dry-run.
2. Merge the reviewed release branch, then tag the merged integration-branch commit and push it, for example:
   git tag -a v1.0.5 -m v1.0.5
   git push origin v1.0.5
3. Do not assume the release is complete just because the tag push succeeded. Watch the GitHub Actions run to completion and inspect failures directly if needed:
   gh run list --workflow "Publish PyPI Package" --limit 1
   gh run watch <run-id> --exit-status
   gh run view <run-id> --log-failed
4. After the workflow succeeds, verify the public release:
   python -m pip install --no-cache-dir card-framework
   python -c "from card_framework import infer; print(infer)"
For the repo-specific release build standards and post-tag verification rules,
see coder_docs/github_actions_release_spec.md.
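The rule that a failed or already-published version number must never be reused can be checked mechanically before tagging. The sketch below is a hypothetical helper, not part of the repository's release tooling; it assumes plain MAJOR.MINOR.PATCH versions with an optional leading "v" and simply picks the patch release after the highest version already used.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse '1.0.4' or 'v1.0.4' into a comparable integer tuple."""
    cleaned = text[1:] if text.startswith("v") else text
    major, minor, patch = cleaned.split(".")
    return int(major), int(minor), int(patch)


def next_patch_version(used: set[str]) -> str:
    """Return the patch version after the highest version already used."""
    major, minor, patch = max(parse_version(v) for v in used)
    return f"{major}.{minor}.{patch + 1}"


# Versions named in this README as published or tagged.
used_versions = {"1.0.1", "v1.0.2", "1.0.4"}
candidate = next_patch_version(used_versions)
print(candidate)
```

Comparing integer tuples rather than strings keeps the ordering correct once a component reaches two digits, for example 1.0.10 sorting after 1.0.9.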
Documentation
- EEE_196_CARD_UCL.md: the CARD paper and project manuscript
- coder_docs/codebase_guide.md: current architecture, runtime flow, commands, and maintenance expectations
- coder_docs/memory/errors_and_notes.md: repository memory for recurring pitfalls and prior fixes
- coder_docs/fault_localization_workflow.md: bug triage and failing-test workflow
If you are changing behavior, prompts, workflows, or commands, start with
coder_docs/codebase_guide.md.
License
This repository is source-available under
LICENSE.md, using the PolyForm Noncommercial 1.0.0
license. Noncommercial use is allowed; commercial use requires separate
permission from the licensors.
File details
Details for the file card_framework-1.0.5.tar.gz.
File metadata
- Download URL: card_framework-1.0.5.tar.gz
- Upload date:
- Size: 982.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c953a1f3d462f693ee796219194cfbfd00db5a2d8af6e8ad92a680ccebb81ecd |
| MD5 | 0c93dfcd2de7d42e00ae1806263c22f0 |
| BLAKE2b-256 | 4b4379b088cffe0e04f3d2f07a5280b8e924051056f37d738705ba582b90ad22 |
File details
Details for the file card_framework-1.0.5-py3-none-any.whl.
File metadata
- Download URL: card_framework-1.0.5-py3-none-any.whl
- Upload date:
- Size: 1.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca5194ea19656c0559bc1c1cb18e089e03adc44de1e8e2fb3e391c31e6f9c9df |
| MD5 | 8c5d049ccbe4102d14bdd80387b535d7 |
| BLAKE2b-256 | 9f1fee07486b9f83657c501df280a86263dcfb6eb05c65d15e989408492e6d80 |
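The SHA256 digests listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the demo hashes a small temporary file rather than a real distribution.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; the payload is arbitrary illustration data.
payload = b"card-framework demo payload"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(payload)
    demo_ok = matches_digest(sample, hashlib.sha256(payload).hexdigest())
    demo_bad = matches_digest(sample, "0" * 64)

print(demo_ok, demo_bad)
```

For a real check, pass the path of the downloaded card_framework-1.0.5.tar.gz or wheel together with the corresponding SHA256 value from the tables above.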