Prosody Modification Network
Project description
Prosody and Pronunciation Modification Network (ProMoNet)
Table of contents
- Installation
- Usage
- Application programming interface (API)
- Command-line interface (CLI)
- Training
- Citation
Installation
pip install promonet
We are working on adding torbi
, our fast Viterbi decoding implementation to PyTorch. Until then, you must manually download and install torbi
. You can track the progress of incorporation into PyTorch here.
Usage
Our included model checkpoint allows speech editing and synthesis for VCTK speakers.
To use promonet
with other speakers, you must first perform speaker
adaptation on a dataset of recordings of the target speaker. You can then use
the resulting model checkpoint to perform speech editing in the target
speaker's voice. All of this can be done using either the API or CLI.
import promonet
###############################################################################
# Speaker adaptation
###############################################################################
# Speaker's name
name = 'max'
# Audio files for adaptation
files = [...]
# GPU index to perform adaptation and editing on
gpu = 0
# Perform speaker adaptation
checkpoint = promonet.adapt.speaker(name, files, gpu=gpu)
###############################################################################
# Speech editing
###############################################################################
# Load speech to edit
audio = promonet.load.audio('test.wav')
# Get features to edit
loudness, pitch, periodicity, ppg = promonet.preprocess.from_audio(
audio,
promonet.SAMPLE_RATE,
gpu)
# We'll use a ratio of 2.0 for all editing examples
ratio = 2.0
# Perform pitch-shifting
shifted = promonet.synthesize.from_features(
*promonet.edit.from_features(
loudness,
pitch,
periodicity,
ppg,
pitch_shift_cents=promonet.convert.ratio_to_cents(ratio)),
checkpoint=checkpoint,
gpu=gpu)
# Perform time-stretching
stretched = promonet.synthesize.from_features(
*promonet.edit.from_features(
loudness,
pitch,
periodicity,
ppg,
time_stretch_ratio=ratio),
checkpoint=checkpoint,
gpu=gpu)
# Perform loudness editing
scaled = promonet.synthesize.from_features(
*promonet.edit.from_features(
loudness,
pitch,
periodicity,
ppg,
loudness_scale_db=promonet.convert.ratio_to_db(ratio)),
checkpoint=checkpoint,
gpu=gpu)
# Edit spectral balance (> 1 for Alvin and the Chipmunks; < 1 for Patrick Star)
alvin = promonet.synthesize.from_features(
loudness,
pitch,
periodicity,
ppg,
spectral_balance_ratio=ratio,
checkpoint=checkpoint,
gpu=gpu)
See the ppgs.edit
submodule documentation for the pronunciation (PPG) editing API.
Application programming interface (API)
Adaptation API
promonet.adapt.speaker
def speaker(
name: str,
files: List[Path],
checkpoint: Path = None,
gpu: Optional[int] = None
) -> Path:
"""Perform speaker adaptation
Args:
name: The name of the speaker
files: The audio files to use for adaptation
checkpoint: The model checkpoint directory
gpu: The gpu to run adaptation on
Returns:
checkpoint: The file containing the trained generator checkpoint
"""
Preprocessing API
promonet.preprocess.from_audio
def from_audio(
audio: torch.Tensor,
sample_rate: int = promonet.SAMPLE_RATE,
gpu: Optional[int] = None,
features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
"""Preprocess audio
Arguments
audio: Audio to preprocess
sample_rate: Audio sample rate
gpu: The GPU index
features: The features to preprocess.
Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
Returns
loudness: The loudness contour
periodicity: The periodicity contour
pitch: The pitch contour
ppg: The phonetic posteriorgram
text: The text transcript
"""
promonet.preprocess.from_file
def from_file(
file: Union[str, bytes, os.PathLike],
gpu: Optional[int] = None,
features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
"""Preprocess audio on disk
Arguments
file: Audio file to preprocess
gpu: The GPU index
features: The features to preprocess.
Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
Returns
loudness: The loudness contour
pitch: The pitch contour
periodicity: The periodicity contour
ppg: The phonetic posteriorgram
text: The text transcript
"""
promonet.preprocess.from_file_to_file
def from_file_to_file(
file: Union[str, bytes, os.PathLike],
output_prefix: Optional[Union[str, os.PathLike]] = None,
gpu: Optional[int] = None,
features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
"""Preprocess audio on disk and save
Arguments
file: Audio file to preprocess
output_prefix: File to save features, minus extension
gpu: The GPU index
features: The features to preprocess.
Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
"""
promonet.preprocess.from_files_to_files
def from_files_to_files(
files: List[Union[str, bytes, os.PathLike]],
output_prefixes: Optional[List[Union[str, os.PathLike]]] = None,
gpu: Optional[int] = None,
features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
"""Preprocess multiple audio files on disk and save
Arguments
files: Audio files to preprocess
output_prefixes: Files to save features, minus extension
gpu: The GPU index
features: The features to preprocess.
Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
"""
Editing API
promonet.edit.from_features
def from_features(
loudness: torch.Tensor,
pitch: torch.Tensor,
periodicity: torch.Tensor,
ppg: torch.Tensor,
pitch_shift_cents: Optional[float] = None,
time_stretch_ratio: Optional[float] = None,
loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
"""Edit speech representation
Arguments
loudness: Loudness contour to edit
pitch: Pitch contour to edit
periodicity: Periodicity contour to edit
ppg: PPG to edit
pitch_shift_cents: Amount of pitch-shifting in cents
time_stretch_ratio: Amount of time-stretching. Faster when above one.
loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
Returns
edited_loudness, edited_pitch, edited_periodicity, edited_ppg
"""
promonet.edit.from_file
def from_file(
loudness_file: Union[str, bytes, os.PathLike],
pitch_file: Union[str, bytes, os.PathLike],
periodicity_file: Union[str, bytes, os.PathLike],
ppg_file: Union[str, bytes, os.PathLike],
pitch_shift_cents: Optional[float] = None,
time_stretch_ratio: Optional[float] = None,
loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
"""Edit speech representation on disk
Arguments
loudness_file: Loudness file to edit
pitch_file: Pitch file to edit
periodicity_file: Periodicity file to edit
ppg_file: PPG file to edit
pitch_shift_cents: Amount of pitch-shifting in cents
time_stretch_ratio: Amount of time-stretching. Faster when above one.
loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
Returns
edited_loudness, edited_pitch, edited_periodicity, edited_ppg
"""
promonet.edit.from_file_to_file
def from_file_to_file(
loudness_file: Union[str, bytes, os.PathLike],
pitch_file: Union[str, bytes, os.PathLike],
periodicity_file: Union[str, bytes, os.PathLike],
ppg_file: Union[str, bytes, os.PathLike],
output_prefix: Union[str, bytes, os.PathLike],
pitch_shift_cents: Optional[float] = None,
time_stretch_ratio: Optional[float] = None,
loudness_scale_db: Optional[float] = None
) -> None:
"""Edit speech representation on disk and save to disk
Arguments
loudness_file: Loudness file to edit
pitch_file: Pitch file to edit
periodicity_file: Periodicity file to edit
ppg_file: PPG file to edit
output_prefix: File to save output, minus extension
pitch_shift_cents: Amount of pitch-shifting in cents
time_stretch_ratio: Amount of time-stretching. Faster when above one.
loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
"""
promonet.edit.from_files_to_files
def from_files_to_files(
loudness_files: List[Union[str, bytes, os.PathLike]],
pitch_files: List[Union[str, bytes, os.PathLike]],
periodicity_files: List[Union[str, bytes, os.PathLike]],
ppg_files: List[Union[str, bytes, os.PathLike]],
output_prefixes: List[Union[str, bytes, os.PathLike]],
pitch_shift_cents: Optional[float] = None,
time_stretch_ratio: Optional[float] = None,
loudness_scale_db: Optional[float] = None
) -> None:
"""Edit speech representations on disk and save to disk
Arguments
loudness_files: Loudness files to edit
pitch_files: Pitch files to edit
periodicity_files: Periodicity files to edit
ppg_files: Phonetic posteriorgram files to edit
output_prefixes: Files to save output, minus extension
pitch_shift_cents: Amount of pitch-shifting in cents
time_stretch_ratio: Amount of time-stretching. Faster when above one.
loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
"""
Synthesis API
promonet.synthesize.from_features
def from_features(
loudness: torch.Tensor,
pitch: torch.Tensor,
periodicity: torch.Tensor,
ppg: torch.Tensor,
speaker: Union[int, torch.Tensor] = 0,
spectral_balance_ratio: float = 1.,
checkpoint: Optional[Union[str, os.PathLike]] = None,
gpu: Optional[int] = None) -> torch.Tensor:
"""Perform speech synthesis
Args:
loudness: The loudness contour
pitch: The pitch contour
periodicity: The periodicity contour
ppg: The phonetic posteriorgram
speaker: The speaker index
spectral_balance_ratio: > 1 for Alvin and the Chipmunks; < 1 for Patrick Star
checkpoint: The generator checkpoint
gpu: The GPU index
Returns
generated: The generated speech
"""
promonet.synthesize.from_file
def from_file(
loudness_file: Union[str, os.PathLike],
pitch_file: Union[str, os.PathLike],
periodicity_file: Union[str, os.PathLike],
ppg_file: Union[str, os.PathLike],
speaker: Union[int, torch.Tensor] = 0,
checkpoint: Optional[Union[str, os.PathLike]] = None,
gpu: Optional[int] = None
) -> torch.Tensor:
"""Perform speech synthesis from features on disk
Args:
loudness_file: The loudness file
pitch_file: The pitch file
periodicity_file: The periodicity file
ppg_file: The phonetic posteriorgram file
speaker: The speaker index
checkpoint: The generator checkpoint
gpu: The GPU index
Returns
generated: The generated speech
"""
promonet.synthesize.from_file_to_file
def from_file_to_file(
loudness_file: Union[str, os.PathLike],
pitch_file: Union[str, os.PathLike],
periodicity_file: Union[str, os.PathLike],
ppg_file: Union[str, os.PathLike],
output_file: Union[str, os.PathLike],
speaker: Union[int, torch.Tensor] = 0,
checkpoint: Optional[Union[str, os.PathLike]] = None,
gpu: Optional[int] = None
) -> None:
"""Perform speech synthesis from features on disk and save
Args:
loudness_file: The loudness file
pitch_file: The pitch file
periodicity_file: The periodicity file
ppg_file: The phonetic posteriorgram file
output_file: The file to save generated speech audio
speaker: The speaker index
checkpoint: The generator checkpoint
gpu: The GPU index
"""
promonet.synthesize.from_files_to_files
def from_files_to_files(
loudness_files: List[Union[str, os.PathLike]],
pitch_files: List[Union[str, os.PathLike]],
periodicity_files: List[Union[str, os.PathLike]],
ppg_files: List[Union[str, os.PathLike]],
output_files: List[Union[str, os.PathLike]],
speakers: Optional[Union[List[int], torch.Tensor]] = None,
checkpoint: Optional[Union[str, os.PathLike]] = None,
gpu: Optional[int] = None
) -> None:
"""Perform batched speech synthesis from features on disk and save
Args:
loudness_files: The loudness files
pitch_files: The pitch files
periodicity_files: The periodicity files
ppg_files: The phonetic posteriorgram files
output_files: The files to save generated speech audio
speakers: The speaker indices
checkpoint: The generator checkpoint
gpu: The GPU index
"""
Command-line interface (CLI)
Adaptation CLI
promonet.adapt
python -m promonet.adapt \
--name NAME \
--files FILES [FILES ...] \
[--checkpoint CHECKPOINT] \
[--gpu GPU]
Perform speaker adaptation
optional arguments:
-h, --help
show this help message and exit
--name NAME
The name of the speaker
--files FILES [FILES ...]
The audio files to use for adaptation
--checkpoint CHECKPOINT
The model checkpoint directory
--gpu GPU
The gpu to run adaptation on
Preprocessing CLI
promonet.preprocess
python -m promonet.preprocess \
[-h] \
--files FILES [FILES ...] \
[--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]] \
[--features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]] \
[--gpu GPU]
Preprocess
arguments:
--files FILES [FILES ...]
Audio files to preprocess
optional arguments:
-h, --help
show this help message and exit
--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
Files to save features, minus extension
--features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]
The features to preprocess
--gpu GPU
The index of the gpu to use
Editing CLI
promonet.edit
python -m promonet.edit \
[-h] \
--loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
--pitch_files PITCH_FILES [PITCH_FILES ...] \
--periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
--ppg_files PPG_FILES [PPG_FILES ...] \
--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...] \
[--pitch_shift_cents PITCH_SHIFT_CENTS] \
[--time_stretch_ratio TIME_STRETCH_RATIO] \
[--loudness_scale_db LOUDNESS_SCALE_DB]
Edit speech representation
arguments:
--loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
The loudness files to edit
--pitch_files PITCH_FILES [PITCH_FILES ...]
The pitch files to edit
--periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
The periodicity files to edit
--ppg_files PPG_FILES [PPG_FILES ...]
The ppg files to edit
--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
The locations to save output files, minus extension
optional arguments:
-h, --help
show this help message and exit
--pitch_shift_cents PITCH_SHIFT_CENTS
Amount of pitch-shifting in cents
--time_stretch_ratio TIME_STRETCH_RATIO
Amount of time-stretching. Faster when above one.
--loudness_scale_db LOUDNESS_SCALE_DB
Loudness ratio editing in dB (not recommended; use loudness)
Synthesis CLI
promonet.synthesize
python -m promonet.synthesize \
--loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
--pitch_files PITCH_FILES [PITCH_FILES ...] \
--periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
--ppg_files PPG_FILES [PPG_FILES ...] \
--output_files OUTPUT_FILES [OUTPUT_FILES ...] \
[--speakers SPEAKERS [SPEAKERS ...]] \
[--checkpoint CHECKPOINT] \
[--gpu GPU]
Synthesize speech from features
arguments:
--loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
The loudness files
--pitch_files PITCH_FILES [PITCH_FILES ...]
The pitch files
--periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
The periodicity files
--ppg_files PPG_FILES [PPG_FILES ...]
The phonetic posteriorgram files
--output_files OUTPUT_FILES [OUTPUT_FILES ...]
The files to save the edited audio
optional arguments:
-h, --help
show this help message and exit
--speakers SPEAKERS [SPEAKERS ...]
The IDs of the speakers for voice conversion
--checkpoint CHECKPOINT
The generator checkpoint
--gpu GPU
The GPU index
Training
Download
Downloads, unzips, and formats datasets. Stores datasets in data/datasets/
.
Stores formatted datasets in data/cache/
.
python -m promonet.data.download --datasets <datasets>
Preprocess
Prepares features for training. Features are stored in data/cache/
.
python -m promonet.data.preprocess \
--datasets <datasets> \
--features <features> \
--gpu <gpu>
Partition
Partitions a dataset. You should not need to run this, as the partitions
used in our work are provided for each dataset in
promonet/assets/partitions/
.
python -m promonet.partition --datasets <datasets>
Train
Trains a model. Checkpoints and logs are stored in runs/
.
python -m promonet.train \
--config <config> \
--dataset <dataset> \
--gpu <gpu>
If the config file has been previously run, the most recent checkpoint will automatically be loaded and training will resume from that checkpoint.
Monitor
You can monitor training via tensorboard
.
tensorboard --logdir runs/ --port <port> --load_fast true
To use the torchutil
notification system to receive notifications for long
jobs (download, preprocess, train, and evaluate), set the
PYTORCH_NOTIFICATION_URL
environment variable to a supported webhook as
explained in the Apprise documentation.
Evaluate
Performs objective evaluation and generates examples for subjective evaluation.
Also performs benchmarking of generation speed. Results are stored in eval/
.
python -m promonet.evaluate \
--config <name> \
--datasets <datasets> \
--gpu <gpu>
Citation
IEEE
M. Morrison, C. Churchwell, N. Pruyne, and B. Pardo, "Fine-Grained and Interpretable Neural Speech Editing," Interspeech, September 2024.
BibTex
@inproceedings{morrison2024adaptive,
title={Fine-Grained and Interpretable Neural Speech Editing},
author={Morrison, Max and Churchwell, Cameron and Pruyne, Nathan and Pardo, Bryan},
booktitle={Interspeech},
month={September},
year={2024}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file promonet-0.0.1.tar.gz
.
File metadata
- Download URL: promonet-0.0.1.tar.gz
- Upload date:
- Size: 3.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7864f980270abbbecb8d3f6192037941f8bf51c141ee5ec74f69d04c51feb594 |
|
MD5 | 5c38e036cb980668e32419f08b248c08 |
|
BLAKE2b-256 | 85836685c97d2b171d3d78585a96962edefb5252bc36b95f5ee0706c69f67bbb |
File details
Details for the file promonet-0.0.1-py3-none-any.whl
.
File metadata
- Download URL: promonet-0.0.1-py3-none-any.whl
- Upload date:
- Size: 2.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fa22dc2ae34c05fe121687e6717b00e56b67e41796761f64f457e219b8413eb7 |
|
MD5 | fbbfeca73fbdf1c4b02178221db42af1 |
|
BLAKE2b-256 | a821c330d06735c9810bfc6cefbb67840f450cecbe346e2312814caa4b9dec83 |