A Python package implementing Paraboth with some improvements: https://aclanthology.org/2023.swisstext-1.3.pdf.
# Paraboth
Implementation of [Paraboth: Evaluating Paraphrastic Similarity in ASR Output](https://aclanthology.org/2023.swisstext-1.3.pdf). Unlike the original approach, this version:
- Handles entire corpora by splitting into sentences and aligning them.
- Uses a Large Language Model API for paraphrasing instead of the [PRISM](https://github.com/thompsonb/prism) model.
Paraboth is especially useful for ASR systems where transcription doubles as translation, such as transcribing Swiss German into Standard German. Instead of comparing transcription and reference 1:1, it compares paraphrases of the transcriptions against paraphrases of the ground truth.
## Installation
```bash
pip install paraboth
```

## How Alignment Works
The alignment process leverages a dynamic programming approach, similar to sequence alignment techniques, ensuring that sentences are matched optimally based on their embedding similarities. By computing embeddings for both ground truth and predicted sentences, we construct a similarity matrix that quantifies how closely each predicted sentence corresponds to each ground truth sentence. With these similarity scores, we apply a dynamic programming algorithm to find the sequence of matches (pairs of sentence indices) that yields the highest total similarity score.
Key steps:
- Embedding: Each sentence in both ground truth and predictions is turned into an embedding vector using the configured `Embedder`.
- Similarity Matrix: We compute a similarity matrix where each cell represents the similarity between a predicted sentence embedding and a ground truth sentence embedding.
- Dynamic Programming Alignment: Using a `SentenceAligner` (from `dtwsa`), the algorithm:
  - Iterates through the similarity matrix to find the best possible alignment.
  - Matches sentences only if their similarity score meets a minimum threshold, ensuring low-quality matches are avoided.
  - Employs a dynamic programming approach to find the highest-scoring path through the matrix, resulting in optimal sentence-to-sentence alignments.
- Sliding Window (Corpus-Level): For corpora where sentences differ in length or segmentation, a sliding window mechanism is used before alignment to allow flexible matching. This approach mitigates heavy penalties when sentence counts differ, enabling more robust, context-aware alignment.
This combined approach ensures that the final alignment between predicted and ground truth texts is both meaningful and robust, providing an accurate foundation for evaluating paraphrastic similarity.
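The dynamic-programming alignment described above can be sketched as follows. This is a simplified illustration under our own assumptions (function and variable names are ours), not the actual `dtwsa` implementation:

```python
import numpy as np

def align_sentences(sim, min_match=0.5):
    """Find a monotonic pred/gt alignment that maximizes total similarity.

    sim[i, j] is the similarity between predicted sentence i and ground-truth
    sentence j. Pairs scoring below `min_match` are never matched.
    """
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))               # dp[i, j]: best score over prefixes
    move = np.zeros((n + 1, m + 1), dtype=int)  # 0 = skip pred, 1 = skip gt, 2 = match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [dp[i - 1, j], dp[i, j - 1]]
            if sim[i - 1, j - 1] >= min_match:
                options.append(dp[i - 1, j - 1] + sim[i - 1, j - 1])
            move[i, j] = int(np.argmax(options))
            dp[i, j] = options[move[i, j]]
    # Backtrack to recover the matched (pred_idx, gt_idx) pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i, j] == 2:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return dp[n, m], list(reversed(pairs))
```

With a 2x2 matrix where the diagonal entries are strong matches, the function pairs sentence 0 with 0 and 1 with 1; off-diagonal entries below the threshold are simply skipped rather than forced into a bad match.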
## Usage Examples

### Corpus-level comparison
```bash
python corpus_example.py \
  --gt path/to/gt_corpus.txt \
  --pred path/to/pred_corpus.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --window_size 3 \
  --base_output_dir results \
  --min_matching_value 0.5
```

- `--gt` / `--pred`: paths to the ground-truth and predicted corpora.
- `--n_paraphrases`: number of paraphrases to generate for each sentence.
- `--paraphrase_gt` / `--paraphrase_pred`: whether to paraphrase the ground-truth / predicted sentences.
- `--window_size`: size of the sliding window for corpus-level comparison.
- `--base_output_dir`: directory to save the results.
- `--min_matching_value`: minimum similarity score for an alignment to happen at all (see the DTWSA documentation).
### Sentence-level comparison

```bash
python sentence_example.py \
  --gt path/to/gt_sentences.txt \
  --pred path/to/pred_sentences.txt \
  --n_paraphrases 6 \
  --paraphrase_gt True \
  --paraphrase_pred True \
  --base_output_dir results
```

The flags match the corpus-level example; `--window_size` and `--min_matching_value` do not apply at sentence level.
## Creating Custom Embedder and Paraphraser Classes

By default, `Embedder` and `Paraphraser` use Azure OpenAI with a built-in paraphrasing prompt. To integrate your own embeddings or paraphrasing logic, extend the following base classes and implement the required methods.

### Base Embedder Example
```python
class BaseEmbedder:
    def embed_chunks(self, text_chunks, batch_size=100):
        raise NotImplementedError("Implement this method.")

    def get_embeddings(self):
        raise NotImplementedError("Implement this method.")

    def save_embeddings(self, file_path):
        raise NotImplementedError("Implement this method.")

    def load_embeddings(self, file_path):
        raise NotImplementedError("Implement this method.")
```
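A toy subclass might look like this. `HashEmbedder` is our own illustrative name, and the word-hashing "embedding" is purely a placeholder; a real implementation would call a sentence-embedding model. The base-class stub is inlined here so the example is self-contained:

```python
import numpy as np

# Stub of the base class shown above; in practice, import it from the package.
class BaseEmbedder:
    def embed_chunks(self, text_chunks, batch_size=100):
        raise NotImplementedError

    def get_embeddings(self):
        raise NotImplementedError

    def save_embeddings(self, file_path):
        raise NotImplementedError

    def load_embeddings(self, file_path):
        raise NotImplementedError

class HashEmbedder(BaseEmbedder):
    """Toy deterministic embedder: hashes words into a fixed-size vector."""

    def __init__(self, dim=64):
        self.dim = dim
        self._embeddings = None

    def embed_chunks(self, text_chunks, batch_size=100):
        vectors = np.zeros((len(text_chunks), self.dim))
        for row, chunk in enumerate(text_chunks):
            for word in chunk.lower().split():
                vectors[row, hash(word) % self.dim] += 1.0
        # L2-normalize so dot products act as cosine similarities.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self._embeddings = vectors / np.maximum(norms, 1e-12)
        return self._embeddings

    def get_embeddings(self):
        return self._embeddings

    def save_embeddings(self, file_path):
        np.save(file_path, self._embeddings)

    def load_embeddings(self, file_path):
        self._embeddings = np.load(file_path)
```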
### Base Paraphraser Example
```python
class BaseParaphraser:
    def paraphrase(self, sentence: str, n_sentences: int = 6):
        raise NotImplementedError("Implement this method.")

    def paraphrase_list(self, sentences: list, n_sentences=6, min_words=3):
        raise NotImplementedError("Implement this method.")
```
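As a minimal sketch, here is a rule-based subclass using a fixed synonym table. `SynonymParaphraser` and its tiny synonym dictionary are our own illustrative assumptions; the default implementation calls an LLM API instead. The base-class stub is inlined so the example runs standalone:

```python
# Stub of the base class shown above; in practice, import it from the package.
class BaseParaphraser:
    def paraphrase(self, sentence, n_sentences=6):
        raise NotImplementedError

    def paraphrase_list(self, sentences, n_sentences=6, min_words=3):
        raise NotImplementedError

class SynonymParaphraser(BaseParaphraser):
    """Toy paraphraser: substitutes words from a fixed synonym table."""

    SYNONYMS = {"car": ["automobile", "vehicle"], "big": ["large", "huge"]}

    def paraphrase(self, sentence, n_sentences=6):
        words = sentence.split()
        variants = []
        for word, alts in self.SYNONYMS.items():
            if word in words:
                for alt in alts:
                    variants.append(" ".join(alt if w == word else w for w in words))
        # Always include the original first, then cap at n_sentences.
        return ([sentence] + variants)[:n_sentences]

    def paraphrase_list(self, sentences, n_sentences=6, min_words=3):
        # Sentences shorter than min_words are kept verbatim.
        return [self.paraphrase(s, n_sentences) if len(s.split()) >= min_words else [s]
                for s in sentences]
```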
## File details

### paraboth-0.1.3.tar.gz

- Download URL: paraboth-0.1.3.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `31be96b9b122dea55422cdc132489f0631133e9b141fc34973deec76602ff479` |
| MD5 | `86340b05a1e90861d48dd2fed68ecfd8` |
| BLAKE2b-256 | `1a80812ee6b15cb109a4c5701dd7ada290851302a3b61184e189b8a45d2fe9d2` |

### paraboth-0.1.3-py3-none-any.whl

- Download URL: paraboth-0.1.3-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fd07e7e19ad755fae566cdcb28f259dc4750bcf83adaec82ce1e68dee57f6e7d` |
| MD5 | `f923f46cc8095430f77aa9900fcdbcca` |
| BLAKE2b-256 | `708e4956d42b8a62bc5790c29215ba586c912dc1aea5d59fa00764d85b3d3830` |