
A Python package implementing Paraboth with some improvements: https://aclanthology.org/2023.swisstext-1.3.pdf.


# Paraboth

Implementation of [Paraboth: Evaluating Paraphrastic Similarity in ASR Output](https://aclanthology.org/2023.swisstext-1.3.pdf). Unlike the original approach, this version:
- Handles entire corpora by splitting into sentences and aligning them.
- Uses a Large Language Model API for paraphrasing instead of the [PRISM](https://github.com/thompsonb/prism) model.

Paraboth is especially useful for ASR systems where transcription doubles as translation, such as transcribing Swiss German into Standard German. Instead of comparing prediction and reference 1:1, it compares paraphrases of the transcriptions against paraphrases of the ground truth.

## Installation

```bash
pip install paraboth
```

## How Alignment Works

The alignment process leverages a dynamic programming approach, similar to sequence alignment techniques, ensuring that sentences are matched optimally based on their embedding similarities. By computing embeddings for both ground truth and predicted sentences, we construct a similarity matrix that quantifies how closely each predicted sentence corresponds to each ground truth sentence. With these similarity scores, we apply a dynamic programming algorithm to find the sequence of matches (pairs of sentence indices) that yields the highest total similarity score.

Key steps:

1. **Embedding:** Each sentence in both ground truth and predictions is turned into an embedding vector using the configured `Embedder`.
2. **Similarity Matrix:** We compute a similarity matrix where each cell quantifies the similarity between a predicted sentence embedding and a ground truth sentence embedding.
3. **Dynamic Programming Alignment:** Using a `SentenceAligner` (from `dtwsa`), the algorithm:
   - iterates through the similarity matrix to find the best possible alignment,
   - matches sentences only if their similarity score meets a minimum threshold, so low-quality matches are avoided, and
   - finds the highest-scoring path through the matrix, yielding optimal sentence-to-sentence alignments.
4. **Sliding Window (Corpus-Level):** For corpora where sentences differ in length or segmentation, a sliding window mechanism is applied before alignment to allow flexible matching. This mitigates heavy penalties when sentence counts differ, enabling more robust, context-aware alignment.

This combined approach ensures that the final alignment between predicted and ground truth texts is both meaningful and robust, providing an accurate foundation for evaluating paraphrastic similarity.
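To make the alignment concrete, below is a minimal, self-contained sketch of threshold-gated monotonic alignment over a similarity matrix. It is illustrative only: in Paraboth the actual alignment is performed by `dtwsa`'s `SentenceAligner`, and the function below is a hypothetical stand-in, not the library's API.

```python
import numpy as np

def align_sentences(sim: np.ndarray, min_matching_value: float = 0.5):
    """Find a monotonic set of (pred, gt) index pairs maximizing total
    similarity, skipping pairs below the threshold. Illustrative sketch."""
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))                    # best score for each prefix pair
    move = np.zeros((n + 1, m + 1), dtype=np.int8)   # 0=skip pred, 1=skip gt, 2=match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arg = dp[i - 1, j], 0              # leave predicted sentence i-1 unmatched
            if dp[i, j - 1] > best:
                best, arg = dp[i, j - 1], 1          # leave ground-truth sentence j-1 unmatched
            s = sim[i - 1, j - 1]
            if s >= min_matching_value and dp[i - 1, j - 1] + s > best:
                best, arg = dp[i - 1, j - 1] + s, 2  # match the pair
            dp[i, j], move[i, j] = best, arg
    pairs, i, j = [], n, m                           # backtrack to recover the matches
    while i > 0 and j > 0:
        if move[i, j] == 2:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            j -= 1
        else:
            i -= 1
    return pairs[::-1], dp[n, m]
```

The threshold here plays the same role as the `--min_matching_value` flag in the corpus-level example below.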

## Usage Examples

### Corpus-level comparison

```bash
python corpus_example.py \
  --gt path/to/gt_corpus.txt \
  --pred path/to/pred_corpus.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --window_size 3 \
  --base_output_dir results \
  --min_matching_value 0.5
```

- `--gt`, `--pred`: paths to the ground-truth and predicted corpora.
- `--n_paraphrases`: number of paraphrases to generate for each sentence.
- `--paraphrase_gt`, `--paraphrase_pred`: whether to paraphrase the ground-truth / predicted sentences.
- `--window_size`: size of the sliding window for corpus-level comparison.
- `--base_output_dir`: directory to save the results.
- `--min_matching_value`: minimum similarity score for an alignment to happen at all (see the dtwsa documentation).

### Sentence-level comparison

```bash
python sentence_example.py \
  --gt path/to/gt_sentences.txt \
  --pred path/to/pred_sentences.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --base_output_dir results
```

The flags mirror the corpus-level example, minus the alignment-specific options (`--window_size` and `--min_matching_value`).

## Creating Custom Embedder and Paraphraser Classes

By default, Embedder and Paraphraser use Azure OpenAI and a given paraphrasing prompt. If you want to integrate your own embeddings or paraphrasing logic, extend the following base classes and implement the required methods.

### Base Embedder Example

```python
class BaseEmbedder:
    def embed_chunks(self, text_chunks, batch_size=100):
        """Compute embeddings for a list of text chunks."""
        raise NotImplementedError("Implement this method.")

    def get_embeddings(self):
        """Return the embeddings computed so far."""
        raise NotImplementedError("Implement this method.")

    def save_embeddings(self, file_path):
        """Persist the embeddings to disk."""
        raise NotImplementedError("Implement this method.")

    def load_embeddings(self, file_path):
        """Load previously saved embeddings from disk."""
        raise NotImplementedError("Implement this method.")
```
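
For illustration, here is one way a custom embedder could look, backed by the `sentence-transformers` library instead of Azure OpenAI. This is a sketch under the assumption that the base-class interface above is all Paraboth requires; the class and model names are examples, not part of the package.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class LocalEmbedder(BaseEmbedder):
    """Hypothetical embedder using a local sentence-transformers model."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.embeddings = None

    def embed_chunks(self, text_chunks, batch_size=100):
        self.embeddings = self.model.encode(text_chunks, batch_size=batch_size)
        return self.embeddings

    def get_embeddings(self):
        return self.embeddings

    def save_embeddings(self, file_path):
        np.save(file_path, self.embeddings)

    def load_embeddings(self, file_path):
        self.embeddings = np.load(file_path)
        return self.embeddings
```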

### Base Paraphraser Example

```python
class BaseParaphraser:
    def paraphrase(self, sentence: str, n_sentences: int = 6):
        """Return n_sentences paraphrases of a single sentence."""
        raise NotImplementedError("Implement this method.")

    def paraphrase_list(self, sentences: list, n_sentences=6, min_words=3):
        """Paraphrase every sentence in a list; min_words can be used to
        skip sentences too short to paraphrase meaningfully."""
        raise NotImplementedError("Implement this method.")
```
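
A custom paraphraser might wrap any LLM endpoint in the same way. The sketch below uses the `openai` Python client; the prompt, model name, and short-sentence handling are illustrative assumptions, not Paraboth's built-in behavior.

```python
from openai import OpenAI

class LLMParaphraser(BaseParaphraser):
    """Hypothetical paraphraser backed by a chat-completion endpoint."""

    def __init__(self, model: str = "gpt-4o-mini"):  # example model name
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def paraphrase(self, sentence: str, n_sentences: int = 6):
        response = self.client.chat.completions.create(
            model=self.model,
            n=n_sentences,  # one paraphrase per returned choice
            messages=[
                {"role": "system",
                 "content": "Paraphrase the user's sentence. Reply with the paraphrase only."},
                {"role": "user", "content": sentence},
            ],
        )
        return [choice.message.content for choice in response.choices]

    def paraphrase_list(self, sentences: list, n_sentences=6, min_words=3):
        # Pass very short sentences through unchanged (assumed role of min_words).
        return [
            self.paraphrase(s, n_sentences) if len(s.split()) >= min_words else [s]
            for s in sentences
        ]
```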

## Download files


### Source Distribution

`paraboth-0.1.3.tar.gz` (17.4 kB)

### Built Distribution


`paraboth-0.1.3-py3-none-any.whl` (18.9 kB)

### File details

Details for the file `paraboth-0.1.3.tar.gz`.

#### File metadata

- Download URL: paraboth-0.1.3.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

#### File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `31be96b9b122dea55422cdc132489f0631133e9b141fc34973deec76602ff479` |
| MD5 | `86340b05a1e90861d48dd2fed68ecfd8` |
| BLAKE2b-256 | `1a80812ee6b15cb109a4c5701dd7ada290851302a3b61184e189b8a45d2fe9d2` |


### File details

Details for the file `paraboth-0.1.3-py3-none-any.whl`.

#### File metadata

- Download URL: paraboth-0.1.3-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

#### File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `fd07e7e19ad755fae566cdcb28f259dc4750bcf83adaec82ce1e68dee57f6e7d` |
| MD5 | `f923f46cc8095430f77aa9900fcdbcca` |
| BLAKE2b-256 | `708e4956d42b8a62bc5790c29215ba586c912dc1aea5d59fa00764d85b3d3830` |

