
A Python package implementing Paraboth with some improvements: https://aclanthology.org/2023.swisstext-1.3.pdf.


# Paraboth

Implementation of [Paraboth: Evaluating Paraphrastic Similarity in ASR Output](https://aclanthology.org/2023.swisstext-1.3.pdf). Unlike the original approach, this version:
- Handles entire corpora by splitting into sentences and aligning them.
- Uses a Large Language Model API for paraphrasing instead of the [PRISM](https://github.com/thompsonb/prism) model.

Paraboth is especially useful for ASR systems where transcription doubles as translation, such as transcribing Swiss German into Standard German. Instead of comparing prediction and reference 1:1, it compares paraphrases of the transcriptions against paraphrases of the ground truth.

## Installation

```bash
pip install paraboth
```

## How Alignment Works

The alignment process leverages a dynamic programming approach, similar to sequence alignment techniques, ensuring that sentences are matched optimally based on their embedding similarities. By computing embeddings for both ground truth and predicted sentences, we construct a similarity matrix that quantifies how closely each predicted sentence corresponds to each ground truth sentence. With these similarity scores, we apply a dynamic programming algorithm to find the sequence of matches (pairs of sentence indices) that yields the highest total similarity score.

Key steps:

1. **Embedding:** Each sentence in both ground truth and predictions is turned into an embedding vector using the configured `Embedder`.
2. **Similarity Matrix:** We compute a similarity matrix where each cell quantifies the similarity between a predicted sentence embedding and a ground truth sentence embedding.
3. **Dynamic Programming Alignment:** Using a `SentenceAligner` (from `dtwsa`), the algorithm:
   - iterates through the similarity matrix to find the best possible alignment,
   - matches sentences only if their similarity score meets a minimum threshold, so low-quality matches are avoided, and
   - finds the highest-scoring path through the matrix, yielding optimal sentence-to-sentence alignments.
4. **Sliding Window (Corpus-Level):** For corpora where sentences differ in length or segmentation, a sliding window mechanism is applied before alignment to allow flexible matching. This mitigates heavy penalties when sentence counts differ, enabling more robust, context-aware alignment.

This combined approach ensures that the final alignment between predicted and ground truth texts is both meaningful and robust, providing an accurate foundation for evaluating paraphrastic similarity.
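To make the alignment concrete, below is a minimal, self-contained sketch of threshold-gated monotonic alignment over a similarity matrix. It is illustrative only: in Paraboth the actual alignment is performed by `dtwsa`'s `SentenceAligner`, and the function below is a hypothetical stand-in, not the library's API.

```python
import numpy as np

def align_sentences(sim: np.ndarray, min_matching_value: float = 0.5):
    """Find a monotonic set of (pred, gt) index pairs maximizing total
    similarity, skipping pairs below the threshold. Illustrative sketch."""
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))                    # best score for each prefix pair
    move = np.zeros((n + 1, m + 1), dtype=np.int8)   # 0=skip pred, 1=skip gt, 2=match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arg = dp[i - 1, j], 0              # leave predicted sentence i-1 unmatched
            if dp[i, j - 1] > best:
                best, arg = dp[i, j - 1], 1          # leave ground-truth sentence j-1 unmatched
            s = sim[i - 1, j - 1]
            if s >= min_matching_value and dp[i - 1, j - 1] + s > best:
                best, arg = dp[i - 1, j - 1] + s, 2  # match the pair
            dp[i, j], move[i, j] = best, arg
    pairs, i, j = [], n, m                           # backtrack to recover the matches
    while i > 0 and j > 0:
        if move[i, j] == 2:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            j -= 1
        else:
            i -= 1
    return pairs[::-1], dp[n, m]
```

The threshold here plays the same role as the `--min_matching_value` flag in the corpus-level example below.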

## Usage Examples

### Corpus-level comparison

```bash
python corpus_example.py \
  --gt path/to/gt_corpus.txt \
  --pred path/to/pred_corpus.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --window_size 3 \
  --base_output_dir results \
  --min_matching_value 0.5
```

- `--gt`, `--pred`: paths to the ground-truth and predicted corpora.
- `--n_paraphrases`: number of paraphrases to generate for each sentence.
- `--paraphrase_gt`, `--paraphrase_pred`: whether to paraphrase the ground-truth / predicted sentences.
- `--window_size`: size of the sliding window for corpus-level comparison.
- `--base_output_dir`: directory to save the results.
- `--min_matching_value`: minimum similarity score for an alignment to happen at all (see the dtwsa documentation).

### Sentence-level comparison

```bash
python sentence_example.py \
  --gt path/to/gt_sentences.txt \
  --pred path/to/pred_sentences.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --base_output_dir results
```

The flags mirror the corpus-level example, minus the alignment-specific options (`--window_size` and `--min_matching_value`).

## Creating Custom Embedder and Paraphraser Classes

By default, Embedder and Paraphraser use Azure OpenAI and a given paraphrasing prompt. If you want to integrate your own embeddings or paraphrasing logic, extend the following base classes and implement the required methods.

### Base Embedder Example

```python
class BaseEmbedder:
    def embed_chunks(self, text_chunks, batch_size=100):
        """Compute embeddings for a list of text chunks."""
        raise NotImplementedError("Implement this method.")

    def get_embeddings(self):
        """Return the embeddings computed so far."""
        raise NotImplementedError("Implement this method.")

    def save_embeddings(self, file_path):
        """Persist the embeddings to disk."""
        raise NotImplementedError("Implement this method.")

    def load_embeddings(self, file_path):
        """Load previously saved embeddings from disk."""
        raise NotImplementedError("Implement this method.")
```
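
For illustration, here is one way a custom embedder could look, backed by the `sentence-transformers` library instead of Azure OpenAI. This is a sketch under the assumption that the base-class interface above is all Paraboth requires; the class and model names are examples, not part of the package.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class LocalEmbedder(BaseEmbedder):
    """Hypothetical embedder using a local sentence-transformers model."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.embeddings = None

    def embed_chunks(self, text_chunks, batch_size=100):
        self.embeddings = self.model.encode(text_chunks, batch_size=batch_size)
        return self.embeddings

    def get_embeddings(self):
        return self.embeddings

    def save_embeddings(self, file_path):
        np.save(file_path, self.embeddings)

    def load_embeddings(self, file_path):
        self.embeddings = np.load(file_path)
        return self.embeddings
```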

### Base Paraphraser Example

```python
class BaseParaphraser:
    def paraphrase(self, sentence: str, n_sentences: int = 6):
        """Return n_sentences paraphrases of a single sentence."""
        raise NotImplementedError("Implement this method.")

    def paraphrase_list(self, sentences: list, n_sentences=6, min_words=3):
        """Paraphrase every sentence in a list; min_words can be used to
        skip sentences too short to paraphrase meaningfully."""
        raise NotImplementedError("Implement this method.")
```
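
A custom paraphraser might wrap any LLM endpoint in the same way. The sketch below uses the `openai` Python client; the prompt, model name, and short-sentence handling are illustrative assumptions, not Paraboth's built-in behavior.

```python
from openai import OpenAI

class LLMParaphraser(BaseParaphraser):
    """Hypothetical paraphraser backed by a chat-completion endpoint."""

    def __init__(self, model: str = "gpt-4o-mini"):  # example model name
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def paraphrase(self, sentence: str, n_sentences: int = 6):
        response = self.client.chat.completions.create(
            model=self.model,
            n=n_sentences,  # one paraphrase per returned choice
            messages=[
                {"role": "system",
                 "content": "Paraphrase the user's sentence. Reply with the paraphrase only."},
                {"role": "user", "content": sentence},
            ],
        )
        return [choice.message.content for choice in response.choices]

    def paraphrase_list(self, sentences: list, n_sentences=6, min_words=3):
        # Pass very short sentences through unchanged (assumed role of min_words).
        return [
            self.paraphrase(s, n_sentences) if len(s.split()) >= min_words else [s]
            for s in sentences
        ]
```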

## Download files


### Source Distribution

`paraboth-0.1.3.tar.gz` (17.4 kB)

### Built Distribution


`paraboth-0.1.3-py3-none-any.whl` (18.9 kB)

### File details

Details for the file `paraboth-0.1.3.tar.gz`.

#### File metadata

- Download URL: paraboth-0.1.3.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

#### File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `31be96b9b122dea55422cdc132489f0631133e9b141fc34973deec76602ff479` |
| MD5 | `86340b05a1e90861d48dd2fed68ecfd8` |
| BLAKE2b-256 | `1a80812ee6b15cb109a4c5701dd7ada290851302a3b61184e189b8a45d2fe9d2` |


### File details

Details for the file `paraboth-0.1.3-py3-none-any.whl`.

#### File metadata

- Download URL: paraboth-0.1.3-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

#### File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `fd07e7e19ad755fae566cdcb28f259dc4750bcf83adaec82ce1e68dee57f6e7d` |
| MD5 | `f923f46cc8095430f77aa9900fcdbcca` |
| BLAKE2b-256 | `708e4956d42b8a62bc5790c29215ba586c912dc1aea5d59fa00764d85b3d3830` |

