# CARD Framework

Constraint-aware audio resynthesis and distillation pipeline.
This repository is the current implementation of CARD (Constraint-Aware Audio
Resynthesis and Distillation), the project described in
`EEE_196_CARD_UCL.md`.

The paper is the conceptual and academic baseline. The codebase, however, has
already moved beyond parts of the manuscript's original implementation plan.
This README therefore prioritizes what the repository actually does now.
When the paper and the current code diverge, treat the code, config, and
`coder_docs/` as the source of truth for day-to-day development.
## Paper Metadata

**Authors**

- Rei Dennis Agustin, 2022-03027, BS Electronics Engineering
- Sean Luigi P. Caranzo, 2022-05398, BS Computer Engineering
- Johnbell R. De Leon, 2021-01437, BS Computer Engineering
- Christian Klein C. Ramos, 2022-03126, BS Electronics Engineering

**Research Adviser**

- Rowel D. Atienza

**Affiliation**

- University of the Philippines Diliman
- December 2025
## Abstract
CARD addresses the long-form podcast consumption bottleneck by generating a shorter conversational audio output that retains speaker identity and prosodic character instead of collapsing everything into plain text. The project combines transcript generation, speaker-aware summarization, voice-cloned resynthesis, and conversational overlap handling so a multi-speaker recording can be compressed toward a user-defined duration without discarding the listening experience that makes the original medium valuable.
## High-Level Architecture

```mermaid
flowchart LR
    A[Source Audio] --> B[Stage 1<br/>Audio Ingestion]
    B --> C[Transcript JSON<br/>Speaker Metadata]
    C --> D[Stage 2<br/>Summarizer + Critic Loop]
    D --> E[Summary XML<br/>Speaker-Tagged Turns]
    E --> F[Stage 3<br/>Voice Clone Resynthesis]
    F --> G[Cloned Summary Audio]
    G --> H[Stage 4<br/>Interjector / Backchannels]
    H --> I[Final Conversational Audio]
    C -. Optional evaluation input .-> J[Benchmarks]
    E -. Optional evaluation input .-> J
    K[Hydra Config + Provider Adapters] -. controls .-> B
    K -. controls .-> D
    K -. controls .-> F
    K -. controls .-> H
```
## What CARD Does
CARD is a multi-stage pipeline for converting long-form multi-speaker audio into a shorter, speaker-aware, resynthesized conversational output.
At a high level, the repository currently supports:
- Stage 1: Audio ingestion and transcript generation
  - Source separation
  - ASR, diarization, and alignment
  - Transcript JSON generation with speaker metadata
- Stage 2: Constraint-aware summarization
  - Summarizer and critic agent loop
  - Duration-first summary generation with speaker-tagged XML output
  - Retrieval-backed or full-transcript summarization paths
- Stage 3: Voice cloning and resynthesis
  - Speaker sample generation
  - Voice-cloned rendering of summary turns
  - Live-draft voice cloning during summarizer edits
- Stage 4: Conversational interjection
  - Optional overlap and backchannel synthesis on top of the cloned summary
- Benchmarking and evaluation
  - Summarization benchmark workflows
  - Source-grounded QA benchmark workflows
  - Diarization benchmark workflows
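Stage 1 hands the later stages a transcript JSON with speaker metadata. The exact schema is not reproduced in this README, so the field names below (`segments`, `speaker`, `start`, `end`, `text`) are illustrative assumptions, not the repository's contract; the sketch only shows the kind of speaker-aware bookkeeping that such a payload enables:

```python
import json

# Hypothetical Stage 1 transcript payload; CARD's real schema may differ.
TRANSCRIPT_JSON = """
{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0,  "end": 4.2,  "text": "Welcome to the show."},
    {"speaker": "SPEAKER_01", "start": 4.2,  "end": 9.8,  "text": "Glad to be here."},
    {"speaker": "SPEAKER_00", "start": 9.8,  "end": 15.0, "text": "Let's dive in."}
  ]
}
"""

transcript = json.loads(TRANSCRIPT_JSON)

# Accumulate per-speaker talk time from the segment boundaries.
talk_time: dict[str, float] = {}
for seg in transcript["segments"]:
    talk_time[seg["speaker"]] = talk_time.get(seg["speaker"], 0.0) + (seg["end"] - seg["start"])

print(sorted(talk_time))                  # ['SPEAKER_00', 'SPEAKER_01']
print(round(talk_time["SPEAKER_00"], 1))  # 9.4
```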
## Paper vs. Current Repository
`EEE_196_CARD_UCL.md` contains the original CARD paper: its problem framing
and proposed module design. The repository now reflects a more developed
engineering system than that initial write-up.
Important differences from the manuscript-level description include:
- The repo is now configuration-driven through Hydra instead of being tied to one fixed experimental path.
- The runtime is now duration-first, centered on `target_seconds` and tolerance checks, rather than a simple word-budget-only workflow.
- The summary output contract is now speaker-tagged XML, which feeds the downstream voice-clone and interjector stages.
- The default stage-2/stage-3 flow can use live-draft voice cloning, where turn audio is rendered during summary editing instead of only after the final draft is approved.
- The repository includes substantial benchmarking, evaluation, and operator tooling that goes beyond the initial paper narrative.
- Provider support has expanded: the codebase is organized around adapters and config-selected backends rather than a single hardcoded model stack.
In short: the paper explains why CARD exists; this repository captures how CARD currently works.
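The duration-first contract can be pictured as a simple tolerance check on rendered output length. The function name, signature, and default tolerance below are illustrative assumptions, not the repository's actual API:

```python
def within_duration_tolerance(rendered_seconds: float,
                              target_seconds: float,
                              tolerance_ratio: float = 0.10) -> bool:
    """Accept a rendered summary whose duration falls inside a symmetric
    tolerance band around the target (illustrative sketch only)."""
    lower = target_seconds * (1.0 - tolerance_ratio)
    upper = target_seconds * (1.0 + tolerance_ratio)
    return lower <= rendered_seconds <= upper

# A 300-second target with 10% tolerance accepts 270..330 seconds.
print(within_duration_tolerance(285.0, 300.0))  # True
print(within_duration_tolerance(350.0, 300.0))  # False
```

A critic loop like Stage 2's can use a check of this shape to decide whether a draft summary is accepted or sent back for another edit.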
## Repository Layout

```text
src/card_framework/
  agents/           A2A executors, DTOs, tool loops, client transport
  audio_pipeline/   Audio ingestion, speaker samples, voice cloning, interjector
  benchmark/        Summarization, QA, and diarization benchmarks
  cli/              Runtime, setup, calibration, matrix, and eval entrypoints
  config/           Hydra configuration
  orchestration/    Transcript DTOs and stage orchestration
  prompts/          Jinja2 prompt templates
  providers/        LLM and embedding provider adapters
  retrieval/        Transcript indexing and retrieval
  runtime/          Runtime planning and execution support
  shared/           Shared utilities, events, and logging
_vendor/index_tts/  Vendored IndexTTS runtime
```
Other important locations:
- `artifacts/`: generated transcripts, cloned audio, benchmark outputs, and other runtime artifacts
- `checkpoints/`: local model/runtime checkpoints
- `coder_docs/`: repository-specific architecture, workflow, and maintenance guidance
## Common Commands

```bash
uv sync --dev
uv run python -m card_framework.cli.main --help
uv run python -m card_framework.cli.setup_and_run --help
uv run python -m card_framework.cli.calibrate --help
uv run python -m card_framework.cli.run_summary_matrix --help
uv run python -m card_framework.benchmark.run --help
uv run python -m card_framework.benchmark.diarization --help
uv run python -m card_framework.benchmark.qa --help
uv run ruff check .
uv run pytest
```
Common execution entrypoints:

```bash
uv run python -m card_framework.cli.setup_and_run --audio-path <path-to-audio>
uv run python -m card_framework.cli.main
uv run python -m card_framework.cli.calibrate
```
## Package Usage

The repository now exposes a library entrypoint for installed-package use:

```bash
pip install card-framework
```

```python
from card_framework import infer

result = infer(
    "audio.wav",
    "outputs/run_001",
    300,
    device="cpu",
    vllm_url="http://localhost:8000/v1",
)
print(result.summary_xml_path)
print(result.final_audio_path)
```
`infer(audio_wav, output_dir, target_duration_seconds, *, device, ...)` runs
the full stage-1 to stage-4 pipeline and returns an `InferenceResult` with the
main emitted artifact paths. `target_duration_seconds` is required for every
call and overrides any duration target declared in the loaded config file.
`device` is also required and must be either `cpu` or `cuda`. `vllm_url` is the
first-class packaged-runtime override for OpenAI-compatible endpoints, and it
forces the shared summarizer, critic, and interjector LLM path onto the
provided vLLM-compatible server for that call. The call writes into `output_dir`
using this high-level layout:
```text
outputs/run_001/
  transcript.json
  summary.xml
  agent_interactions.log
  audio_stage/
  voice_clone/
  interjector/
```
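`summary.xml` carries the speaker-tagged turns that the voice-clone and interjector stages consume. The element and attribute names below (`<summary>`, `<turn speaker="...">`) are assumptions for illustration, since the real schema is defined by the repository, not this README:

```python
import xml.etree.ElementTree as ET

# Hypothetical speaker-tagged summary XML; CARD's actual output schema
# may use different element and attribute names.
SAMPLE = """\
<summary target_seconds="300">
  <turn speaker="SPEAKER_00">Welcome back to the show.</turn>
  <turn speaker="SPEAKER_01">Thanks, great to be here.</turn>
  <turn speaker="SPEAKER_00">Let's recap the main argument.</turn>
</summary>
"""

root = ET.fromstring(SAMPLE)
turns = [(t.get("speaker"), t.text.strip()) for t in root.findall("turn")]
speakers = sorted({speaker for speaker, _ in turns})

print(len(turns))  # 3
print(speakers)    # ['SPEAKER_00', 'SPEAKER_01']
```

Keeping the speaker label on every turn is what lets the downstream stages route each line to the correct cloned voice.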
Installed-package runtime notes:

- Supported public packaged-runtime platform as of March 9, 2026: Windows only.
  macOS and Linux are not yet validated for the public `pip install card-framework`
  whole-pipeline path, and `infer(...)` now fails fast on those platforms instead
  of attempting a partial run.
- `CARD_FRAMEWORK_CONFIG`: optional path to a full YAML config file when you need
  to override the default packaged provider/runtime config for `infer(...)`.
- `CARD_FRAMEWORK_HOME`: optional writable runtime home used for extracted
  IndexTTS assets, checkpoints, and bootstrap state. If unset, the package uses
  the platform-appropriate user data directory.
- `CARD_FRAMEWORK_VLLM_URL`: optional environment-variable equivalent of the
  `vllm_url=` argument.
- `CARD_FRAMEWORK_VLLM_API_KEY`: optional environment-variable equivalent of the
  `vllm_api_key=` argument. If omitted for vLLM, the packaged runtime uses
  `EMPTY`, which matches the common local keyless vLLM setup.
- If you choose `device="cuda"`, the packaged runtime currently supports only
  CUDA 12.6. `infer(...)` now validates that the installed PyTorch build reports
  CUDA 12.6 before it proceeds.
- The packaged default is now vLLM-first. If the effective config selects another
  provider, `infer(...)` resolves required credentials before it starts the
  subprocess runtime:
  - Interactive terminals: `infer(...)` securely prompts for missing API keys or
    access tokens without echoing them and without placing them on the subprocess
    command line.
  - Non-interactive runs: `infer(...)` fails fast with an actionable error that
    names the missing config field and the supported environment variable.
- Supported credential environment variables for the packaged path include
  `DEEPSEEK_API_KEY`, `GEMINI_API_KEY` or `GOOGLE_API_KEY`, `ZAI_API_KEY`,
  `HUGGINGFACE_TOKEN` or `HF_TOKEN`, and the configured
  `audio.diarization.pyannote.auth_token_env` value.
- Whole-pipeline inference still requires external tools such as `ffmpeg`. When
  voice cloning or calibration paths are active, the package also expects `uv` so
  it can bootstrap the vendored IndexTTS runtime in the writable runtime home on
  first use.
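The vLLM precedence described in these notes (explicit argument, then `CARD_FRAMEWORK_VLLM_*` environment variables, then the keyless default `EMPTY`) can be sketched as below. The helper function is illustrative and not the package's real resolution code:

```python
import os

def resolve_vllm_settings(vllm_url=None, vllm_api_key=None, environ=os.environ):
    """Illustrative sketch of the documented precedence: explicit keyword
    arguments win, then CARD_FRAMEWORK_* environment variables, then the
    keyless-vLLM default API key "EMPTY". Not the package's actual code."""
    url = vllm_url or environ.get("CARD_FRAMEWORK_VLLM_URL")
    key = vllm_api_key or environ.get("CARD_FRAMEWORK_VLLM_API_KEY") or "EMPTY"
    return url, key

# Explicit argument beats the environment; a missing key falls back to "EMPTY".
env = {"CARD_FRAMEWORK_VLLM_URL": "http://localhost:8000/v1"}
print(resolve_vllm_settings(environ=env))
# ('http://localhost:8000/v1', 'EMPTY')
print(resolve_vllm_settings(vllm_url="http://gpu-box:8000/v1", environ=env))
# ('http://gpu-box:8000/v1', 'EMPTY')
```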
## Public PyPI Release

This repository now includes a GitHub Actions trusted-publishing workflow at
`.github/workflows/publish-pypi.yml` that publishes tags matching `v*` to PyPI.

For the first public release of `card-framework`, use PyPI's pending
publisher flow, because the project does not exist on PyPI yet. Configure:
- PyPI project name: `card-framework`
- GitHub owner: `Lolfaceftw`
- GitHub repository: `card-framework`
- Workflow filename: `publish-pypi.yml`
- Environment name: `pypi`
Repository-side release steps:

1. Merge the publishing workflow to `main`.
2. In GitHub repository settings, create the `pypi` environment.
3. In PyPI account settings, add the pending trusted publisher with the fields above.
4. Tag the release from `main` and push it, for example:

   ```bash
   git tag -a v0.1.0 -m v0.1.0
   git push origin v0.1.0
   ```

5. After the workflow succeeds, verify the public release:

   ```bash
   python -m pip install --no-cache-dir card-framework
   python -c "from card_framework import infer; print(infer)"
   ```
## Documentation

- `EEE_196_CARD_UCL.md`: the CARD paper and project manuscript
- `coder_docs/codebase_guide.md`: current architecture, runtime flow, commands, and maintenance expectations
- `coder_docs/memory/errors_and_notes.md`: repository memory for recurring pitfalls and prior fixes
- `coder_docs/fault_localization_workflow.md`: bug triage and failing-test workflow

If you are changing behavior, prompts, workflows, or commands, start with
`coder_docs/codebase_guide.md`.
## License

This repository is source-available under `LICENSE.md`, using the PolyForm
Noncommercial 1.0.0 license. Noncommercial use is allowed; commercial use
requires separate permission from the licensors.