Python forced aligner

Forced alignment suite. Includes English grapheme-to-phoneme (G2P) conversion and phoneme alignment via the following forced alignment tools.

  • RAD-TTS [1]
  • Montreal Forced Aligner (MFA) [3]
  • Penn Phonetic Forced Aligner (P2FA) [2]

RAD-TTS is used by default. Alignments can be saved to disk or accessed via the pypar.Alignment phoneme alignment representation. See pypar for more details.

pyfoal also includes the following:

  • Converting alignments to and from a categorical representation suitable for training machine learning models (pyfoal.convert)
  • Natural interpolation of forced alignments for time-stretching speech (pyfoal.interpolate)
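
For intuition, here is a minimal sketch of what a categorical representation looks like. The function and names below are hypothetical, not pyfoal's actual API (see pyfoal.convert for that): each audio frame is mapped to the index of the phoneme active at that frame.

```python
# Sketch: convert a phoneme alignment to a frame-wise categorical
# representation. Illustrative only; see pyfoal.convert for the
# actual API.

def alignment_to_indices(alignment, hopsize, phoneme_map):
    """Map (phoneme, start, end) intervals to one index per frame"""
    duration = alignment[-1][2]
    frames = int(round(duration / hopsize))
    indices = []
    for frame in range(frames):
        time = frame * hopsize
        for phoneme, start, end in alignment:
            if start <= time < end:
                indices.append(phoneme_map[phoneme])
                break
    return indices

# Example: two phonemes over 0.4 seconds at a 0.1 second hop
alignment = [('HH', 0.0, 0.2), ('AY', 0.2, 0.4)]
indices = alignment_to_indices(alignment, 0.1, {'HH': 0, 'AY': 1})
# indices == [0, 0, 1, 1]
```

A representation like this is suitable as a framewise training target for machine learning models.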

Installation

pip install pyfoal

MFA and P2FA both require additional installation steps found below.

Montreal Forced Aligner (MFA)

conda install -c conda-forge montreal-forced-aligner

Penn Phonetic Forced Aligner (P2FA)

P2FA depends on the Hidden Markov Model Toolkit (HTK). P2FA has been tested on macOS and Linux with HTK version 3.4.0; there are known issues with version 3.4.1 on Linux. HTK is released under a license that prohibits redistribution, so you must install HTK yourself and verify that the HCopy and HVite commands are available as system-wide binaries. After downloading HTK, I use the following to install it on Linux.

sudo apt-get install -y gcc-multilib libx11-dev
sudo chmod +x configure
./configure --disable-hslab
make all
sudo make install

For more help with HTK installation, see notes by Jaekoo Kang and Steve Rubin.

Inference

Force-align text and audio

import pyfoal

# Load text
text = pyfoal.load.text(text_file)

# Load and resample audio
audio = pyfoal.load.audio(audio_file)

# Select an aligner. One of ['mfa', 'p2fa', 'radtts']. Default is 'radtts'.
aligner = 'radtts'

# For RAD-TTS, select a model checkpoint
checkpoint = pyfoal.DEFAULT_CHECKPOINT

# Select a GPU to run inference on
gpu = 0

alignment = pyfoal.from_text_and_audio(
    text,
    audio,
    pyfoal.SAMPLE_RATE,
    aligner=aligner,
    checkpoint=checkpoint,
    gpu=gpu)

Application programming interface

pyfoal.from_text_and_audio

"""Phoneme-level forced alignment

Arguments
    text : string
        The speech transcript
    audio : torch.tensor(shape=(1, samples))
        The speech signal to process
    sample_rate : int
        The audio sampling rate
    aligner : str
        The alignment method to use
    checkpoint : Path
        The checkpoint to use for neural methods
    gpu : int
        The index of the gpu to perform alignment on for neural methods

Returns
    alignment : pypar.Alignment
        The forced alignment
"""

pyfoal.from_file

"""Phoneme alignment from audio and text files

Arguments
    text_file : Path
        The corresponding transcript file
    audio_file : Path
        The audio file to process
    aligner : str
        The alignment method to use
    checkpoint : Path
        The checkpoint to use for neural methods
    gpu : int
        The index of the gpu to perform alignment on for neural methods

Returns
    alignment : pypar.Alignment
        The forced alignment
"""

pyfoal.from_file_to_file

"""Perform phoneme alignment from files and save to disk

Arguments
    text_file : Path
        The corresponding transcript file
    audio_file : Path
        The audio file to process
    output_file : Path
        The file to save the alignment
    aligner : str
        The alignment method to use
    checkpoint : Path
        The checkpoint to use for neural methods
    gpu : int
        The index of the gpu to perform alignment on for neural methods
"""

pyfoal.from_files_to_files

"""Perform parallel phoneme alignment from many files and save to disk

Arguments
    text_files : list
        The transcript files
    audio_files : list
        The corresponding speech audio files
    output_files : list
        The files to save the alignments
    aligner : str
        The alignment method to use
    num_workers : int
        Number of CPU cores to utilize. Defaults to all cores.
    checkpoint : Path
        The checkpoint to use for neural methods
    gpu : int
        The index of the gpu to perform alignment on for neural methods
"""
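
The fan-out pattern that from_files_to_files performs can be sketched as follows: pair each transcript with its audio and output files, then dispatch one alignment job per triple to a pool of workers. Everything here is illustrative, not pyfoal's internals; align_one is a hypothetical stand-in for the per-file alignment call, and the real implementation distributes jobs across CPU cores rather than threads.

```python
# Sketch of parallel per-file dispatch. align_one is a hypothetical
# stand-in; real code would run the aligner and write the result.
import concurrent.futures

def align_one(pair):
    text_file, audio_file, output_file = pair
    # A real worker would align text_file/audio_file and save the
    # alignment to output_file; this stand-in reports the mapping.
    return f'{text_file} + {audio_file} -> {output_file}'

def align_many(text_files, audio_files, output_files, num_workers=2):
    pairs = zip(text_files, audio_files, output_files)
    with concurrent.futures.ThreadPoolExecutor(num_workers) as pool:
        # map preserves input order across workers
        return list(pool.map(align_one, pairs))

results = align_many(
    ['a.txt', 'b.txt'], ['a.wav', 'b.wav'], ['a.json', 'b.json'])
# → ['a.txt + a.wav -> a.json', 'b.txt + b.wav -> b.json']
```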

Command-line interface

python -m pyfoal
    [-h]
    --text_files TEXT_FILES [TEXT_FILES ...]
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
    [--aligner ALIGNER]
    [--num_workers NUM_WORKERS]
    [--checkpoint CHECKPOINT]
    [--gpu GPU]

Arguments:
    -h, --help
        show this help message and exit
    --text_files TEXT_FILES [TEXT_FILES ...]
        The speech transcript files
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
        The speech audio files
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
        The files to save the alignments
    --aligner ALIGNER
        The alignment method to use
    --num_workers NUM_WORKERS
        Number of CPU cores to utilize. Defaults to all cores.
    --checkpoint CHECKPOINT
        The checkpoint to use for neural methods
    --gpu GPU
        The index of the GPU to use for inference. Defaults to CPU.

Training

Download

python -m pyfoal.data.download

Downloads and uncompresses the arctic and libritts datasets used for training.

Preprocess

python -m pyfoal.data.preprocess

Converts each dataset to a common format on disk ready for training.

Partition

python -m pyfoal.partition

Generates train, valid, and test partitions for arctic and libritts. Partitioning is deterministic given the same random seed. You do not need to run this step, as the original partitions are saved in pyfoal/assets/partitions.
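
The deterministic seeded split can be illustrated with a short sketch. The function and ratios below are hypothetical, not pyfoal's exact code: shuffle the utterance stems with a fixed seed, then slice into partitions.

```python
# Sketch of deterministic partitioning: seeded shuffle, then split.
# Illustrative only; the actual partitions ship in
# pyfoal/assets/partitions.
import random

def partition(stems, seed=0, train=0.8, valid=0.1):
    stems = sorted(stems)
    random.Random(seed).shuffle(stems)
    n_train = int(len(stems) * train)
    n_valid = int(len(stems) * valid)
    return {
        'train': stems[:n_train],
        'valid': stems[n_train:n_train + n_valid],
        'test': stems[n_train + n_valid:]}

# The same seed always yields the same split
split_a = partition([f'utt-{i:03d}' for i in range(100)], seed=0)
split_b = partition([f'utt-{i:03d}' for i in range(100)], seed=0)
assert split_a == split_b
```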

Train

python -m pyfoal.train --config <config> --gpus <gpus>

Trains a model according to a given configuration on the libritts dataset. Uses a list of GPU indices as an argument, and uses distributed data parallelism (DDP) if more than one index is given. For example, --gpus 0 3 will train using DDP on GPUs 0 and 3.

Monitor

Run tensorboard --logdir runs/. If you are running training remotely, you must create an SSH connection with port forwarding to view TensorBoard. This can be done with ssh -L 6006:localhost:6006 <user>@<server-ip-address>. Then, open localhost:6006 in your browser.

Evaluate

python -m pyfoal.evaluate \
    --config <config> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

Evaluate a model. <checkpoint> is the checkpoint file to evaluate and <gpu> is the GPU index.

References

[1] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B. Catanzaro, "One TTS Alignment to Rule Them All," International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[2] J. Yuan and M. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, vol. 123, p. 3878, 2008.

[3] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi," Interspeech, pp. 498-502, 2017.
