Crowdsourced and Automatic Speech Prominence Estimation

Annotation, training, evaluation and inference of speech prominence

Paper Website Dataset

Installation
Inference
- Application programming interface
- Command-line interface
Training
- Download data
- Annotate data
- Partition data
- Preprocess
- Train
Evaluation
- Evaluate
- Monitor
Citation

Installation

pip install emphases

By default, we use the Penn Phonetic Forced Aligner (P2FA) via the pyfoal repo to perform word alignments. This requires installing HTK. See the HTK installation instructions provided by pyfoal. Alternatively, you can use a different forced aligner and either pass the alignment as a pypar.Alignment object or save the alignment as a .TextGrid file.

Inference

Perform automatic emphasis annotation using our best pretrained model

import emphases

# Text and audio of speech
text_file = 'example.txt'
audio_file = 'example.wav'

# Detect emphases
alignment, prominence = emphases.from_file(text_file, audio_file)

# Check which words were emphasized
for word, score in zip(alignment, prominence[0]):
    print(f'{word} has a prominence of {score}')

The alignment is a pypar.Alignment object.
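As a minimal sketch of post-processing, the per-word scores can be thresholded to list the emphasized words. The helper name `emphasized_words` and the 0.5 cutoff are assumptions for illustration and are not part of the emphases API. The function accepts any iterables of words and scores, such as `alignment` and `prominence[0]` from the example above.

```python
from typing import Iterable, List, Tuple


def emphasized_words(
    words: Iterable,
    scores: Iterable,
    threshold: float = 0.5  # Hypothetical cutoff; tune for your data
) -> List[Tuple[str, float]]:
    """Return (word, score) pairs whose score meets the threshold"""
    selected = []
    for word, score in zip(words, scores):
        # float() also accepts zero-dimensional torch tensors
        value = float(score)
        if value >= threshold:
            selected.append((str(word), value))
    return selected


# Stand-in values; in practice pass alignment and prominence[0]
print(emphasized_words(['I', 'never', 'said', 'that'], [0.1, 0.8, 0.3, 0.6]))
```

With the stand-in values, only 'never' and 'that' pass the cutoff.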

Application programming interface

`emphases.from_alignment_and_audio`

def from_alignment_and_audio(
    alignment: pypar.Alignment,
    audio: torch.Tensor,
    sample_rate: int,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    batch_size: Optional[int] = None,
    gpu: Optional[int] = None
) -> Tuple[Type[pypar.Alignment], torch.Tensor]:
    """Produce emphasis scores for each word

    Args:
        alignment: The forced phoneme alignment
        audio: The speech waveform
        sample_rate: The audio sampling rate
        checkpoint: The model checkpoint to use for inference
        batch_size: The maximum number of frames per batch
        gpu: The index of the gpu to run inference on

    Returns:
        scores: The float-valued emphasis scores for each word
    """

`emphases.from_text_and_audio`

def from_text_and_audio(
    text: str,
    audio: torch.Tensor,
    sample_rate: int,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    batch_size: Optional[int] = None,
    gpu: Optional[int] = None
) -> Tuple[Type[pypar.Alignment], torch.Tensor]:
    """Produce emphasis scores for each word

    Args:
        text: The speech transcript
        audio: The speech waveform
        sample_rate: The audio sampling rate
        checkpoint: The model checkpoint to use for inference
        batch_size: The maximum number of frames per batch
        gpu: The index of the gpu to run inference on

    Returns:
        alignment: The forced phoneme alignment
        scores: The float-valued emphasis scores for each word
    """

`emphases.from_file`

def from_file(
    text_file: Union[str, bytes, os.PathLike],
    audio_file: Union[str, bytes, os.PathLike],
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    batch_size: Optional[int] = None,
    gpu: Optional[int] = None
) -> Tuple[Type[pypar.Alignment], torch.Tensor]:
    """Produce emphasis scores for each word for files on disk

    Args:
        text_file: The speech transcript (.txt) or alignment (.TextGrid) file
        audio_file: The speech waveform audio file
        checkpoint: The model checkpoint to use for inference
        batch_size: The maximum number of frames per batch
        gpu: The index of the gpu to run inference on

    Returns:
        alignment: The forced phoneme alignment
        scores: The float-valued emphasis scores for each word
    """

`emphases.from_file_to_file`

def from_file_to_file(
    text_file: Union[str, bytes, os.PathLike],
    audio_file: Union[str, bytes, os.PathLike],
    output_prefix: Optional[Union[str, bytes, os.PathLike]] = None,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    batch_size: Optional[int] = None,
    gpu: Optional[int] = None
) -> None:
    """Produce emphasis scores for each word for files on disk and save to disk

    Args:
        text_file: The speech transcript (.txt) or alignment (.TextGrid) file
        audio_file: The speech waveform audio file
        output_prefix: The output prefix. Defaults to text file stem.
        checkpoint: The model checkpoint to use for inference
        batch_size: The maximum number of frames per batch
        gpu: The index of the gpu to run inference on
    """

Emphases are saved as a list of five-tuples containing the word, start time, end time, a float-valued emphasis score, and a boolean that is true if the word is emphasized.
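A sketch of reading saved results back, assuming the output is a JSON file holding a list of five-element entries in the order described above (the command-line help mentions a json suffix, but the exact layout is an assumption). The names `Emphasis` and `parse_emphases` are hypothetical and not part of the emphases API.

```python
import json
from typing import List, NamedTuple


class Emphasis(NamedTuple):
    """One saved entry, in the order described above"""
    word: str
    start: float
    end: float
    score: float
    emphasized: bool


def parse_emphases(content: str) -> List[Emphasis]:
    """Parse saved emphasis results from a JSON string"""
    return [Emphasis(*entry) for entry in json.loads(content)]


# Made-up sample in the assumed layout, not real model output
sample = '[["hello", 0.0, 0.4, 0.2, false], ["world", 0.4, 0.9, 0.7, true]]'
for item in parse_emphases(sample):
    if item.emphasized:
        print(f'{item.word}: {item.score}')
```

To read a file from disk, pass the result of `pathlib.Path(output_file).read_text()` to `parse_emphases`.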

`emphases.from_files_to_files`

def from_files_to_files(
    text_files: List[Union[str, bytes, os.PathLike]],
    audio_files: List[Union[str, bytes, os.PathLike]],
    output_prefixes: Optional[List[Union[str, bytes, os.PathLike]]] = None,
    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,
    batch_size: Optional[int] = None,
    gpu: Optional[int] = None
) -> None:
    """Produce emphasis scores for each word for many files and save to disk

    Args:
        text_files: The speech transcript (.txt) or alignment (.TextGrid) files
        audio_files: The corresponding speech audio files
        output_prefixes: The output files. Defaults to text file stems.
        checkpoint: The model checkpoint to use for inference
        batch_size: The maximum number of frames per batch
        gpu: The index of the gpu to run inference on
    """

Command-line interface

python -m emphases
    [-h]
    --text_files TEXT_FILES [TEXT_FILES ...]
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
    [--output_files OUTPUT_FILES [OUTPUT_FILES ...]]
    [--checkpoint CHECKPOINT]
    [--batch_size BATCH_SIZE]
    [--gpu GPU]

Determine which words in a speech file are emphasized

options:
  -h, --help            show this help message and exit
  --text_files TEXT_FILES [TEXT_FILES ...]
                        The speech transcript text files
  --audio_files AUDIO_FILES [AUDIO_FILES ...]
                        The corresponding speech audio files
  --output_files OUTPUT_FILES [OUTPUT_FILES ...]
                        The output files. Default is text files with json suffix.
  --checkpoint CHECKPOINT
                        The model checkpoint to use for inference
  --batch_size BATCH_SIZE
                        The maximum number of frames per batch
  --gpu GPU             The index of the gpu to run inference on
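When the command-line interface is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the flags listed in the help text above. The helper `build_command` is hypothetical, and the length check assumes each text file pairs with exactly one audio file.

```python
import sys
from typing import List, Optional


def build_command(
    text_files: List[str],
    audio_files: List[str],
    output_files: Optional[List[str]] = None,
    checkpoint: Optional[str] = None,
    gpu: Optional[int] = None
) -> List[str]:
    """Assemble the argument list for `python -m emphases`"""
    # Assumption: files are matched one-to-one by position
    if len(text_files) != len(audio_files):
        raise ValueError('Each text file needs a corresponding audio file')
    command = [sys.executable, '-m', 'emphases']
    command += ['--text_files', *text_files]
    command += ['--audio_files', *audio_files]
    if output_files is not None:
        command += ['--output_files', *output_files]
    if checkpoint is not None:
        command += ['--checkpoint', checkpoint]
    if gpu is not None:
        command += ['--gpu', str(gpu)]
    return command


command = build_command(['a.txt', 'b.txt'], ['a.wav', 'b.wav'], gpu=0)
print(' '.join(command[1:]))
# To execute: subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with file names that contain spaces.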

Training

Download data

python -m emphases.download --datasets <datasets>

Downloads and uncompresses datasets.

N.B. We omit Buckeye from the public release. This evaluation dataset can be reconstructed by downloading Buckeye and matching its files to the annotations. That matching was done for us and is tricky to replicate exactly. However, due to licensing restrictions on Buckeye, we cannot legally distribute our private, aligned annotations.

Annotate data

Performing annotation requires first installing Reproducible Subjective Evaluation (ReSEval).

python -m emphases.annotate --datasets <datasets>

Launches a local web application to perform emphasis annotation, according to the ReSEval configuration file emphases/assets/configs/annotate.yaml.

python -m emphases.annotate --datasets <datasets> --remote --production

Launches a crowdsourced emphasis annotation task, according to the ReSEval configuration file emphases/assets/configs/annotate.yaml.

Partition data

python -m emphases.partition

Generates train, valid, and test partitions for all datasets. Partitioning is deterministic given the same random seed. You do not need to run this step, as the original partitions are saved in emphases/assets/partitions.
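To illustrate what "deterministic given the same random seed" means, here is a minimal sketch of a seeded split. It is not the partitioning code used by emphases: the function name `partition`, the 80/10/10 ratios, and the stem names are all assumptions.

```python
import random
from typing import Dict, List


def partition(
    stems: List[str],
    seed: int = 0,
    train: float = 0.8,
    valid: float = 0.1
) -> Dict[str, List[str]]:
    """Split stems into train, valid, and test using a seeded shuffle"""
    # Sort first so the result does not depend on input order
    shuffled = sorted(stems)
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    left = int(train * len(shuffled))
    right = left + int(valid * len(shuffled))
    return {
        'train': shuffled[:left],
        'valid': shuffled[left:right],
        'test': shuffled[right:]}


stems = [f'utterance-{i:03d}' for i in range(100)]
splits = partition(stems, seed=0)
print({name: len(items) for name, items in splits.items()})
```

Calling `partition` again with the same stems and seed returns identical lists, which is what makes saved partitions reproducible.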

Preprocess

python -m emphases.preprocess

Train

python -m emphases.train --config <config> --dataset <dataset> --gpus <gpus>

Trains a model according to a given configuration. Takes a list of GPU indices and uses distributed data parallelism (DDP) if more than one index is given. For example, --gpus 0 3 trains using DDP on GPUs 0 and 3.

Evaluation

Evaluate

python -m emphases.evaluate --config <config> --checkpoint <checkpoint> --gpu <gpu>

Monitor

Run tensorboard --logdir runs/. If you are running training remotely, you must create an SSH connection with port forwarding to view TensorBoard. This can be done with ssh -L 6006:localhost:6006 <user>@<server-ip-address>. Then, open localhost:6006 in your browser.
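If the page does not load, it can help to check whether anything is listening on the forwarded port. The sketch below uses only the Python standard library. The helper `port_is_open` is hypothetical, and 6006 is the port from the ssh command above.

```python
import socket


def port_is_open(
    host: str = 'localhost',
    port: int = 6006,
    timeout: float = 1.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts
        return False


if __name__ == '__main__':
    print('TensorBoard port reachable:', port_is_open())
```

A result of False usually means either TensorBoard is not running on the server or the SSH tunnel is not active.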

Citation

IEEE

M. Morrison, P. Pawar, N. Pruyne, J. Cole, and B. Pardo, "Crowdsourced and Automatic Speech Prominence Estimation," International Conference on Acoustics, Speech, & Signal Processing, 2024.

BibTeX

@inproceedings{morrison2024crowdsourced,
    title={Crowdsourced and Automatic Speech Prominence Estimation},
    author={Morrison, Max and Pawar, Pranav and Pruyne, Nathan and Cole, Jennifer and Pardo, Bryan},
    booktitle={International Conference on Acoustics, Speech, \& Signal Processing},
    year={2024}
}
