A simple tool to compare textual data against validation sets.

Comparisonframe

Comparison Frame automates and streamlines the comparison of textual data, using metrics such as character and word count, punctuation usage, and semantic similarity. It is useful wherever consistent text analysis is required, for example when evaluating the performance of natural language processing models, monitoring content quality, or tracking changes in textual data over time with manual evaluation.

from comparisonframe import ComparisonFrame

Usage examples

The examples contain:

  1. creating a validation set and saving it for reuse
  2. comparing newly generated data with expected results
  3. recording test statuses
  4. resetting statuses, flushing records and comparison results

1. Creating validation set

1.1 Initialize comparison class

comparer = ComparisonFrame(
    # optionally
    ## provide the name of a model from the sentence_transformers package
    model_name = "all-mpnet-base-v2",
    ## provide filenames to persist state
    record_file = "record_file.csv",  # file where queries and expected results are stored
    results_file = "comparison_results.csv", # file where comparison results will be stored
    embeddings_file = "embeddings.dill", # file where embeddings are stored
    ## provide an embedder if one was already defined externally
    embedder = None
)
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
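
The model loaded here turns text into embedding vectors, and the semantic_similarity score reported in section 2.4 is presumably a cosine similarity between such vectors. That is an assumption, since the source does not show how the score is computed. The sketch below illustrates cosine similarity on plain Python lists; the function name cosine_similarity is hypothetical and not part of comparisonframe.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration only; it is not part of
    comparisonframe and may differ from what the library does internally.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, identical texts yield identical embeddings and a score of 1, while unrelated texts score noticeably lower.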

1.2 Recording queries and expected responses (validation set)

comparer.record_query(query = "Black metal",
                      expected_text = "Black metal is an extreme subgenre of heavy metal music.")
comparer.record_query(query = "Tribulation",
                      expected_text = "Tribulation are a Swedish heavy metal band from Arvika that formed in 2005.")

2. Comparing with expected results

2.1 Initialize new comparison class

comparer = ComparisonFrame()
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu

2.2 Show validation set

untested_queries = comparer.get_all_queries(
    ## optionally
    untested_only=True)
print(untested_queries)
['Black metal', 'Tribulation']
comparer.get_all_records()
   id  timestamp            query        expected_text                                      tested  test_status
0  1   2023-11-04 03:28:48  Black metal  Black metal is an extreme subgenre of heavy me...  no      NaN
1  2   2023-11-04 03:28:48  Tribulation  Tribulation are a Swedish heavy metal band fro...  no      NaN

2.3 Compare newly generated text with recorded

valid_answer_query_1 = "Black metal is an extreme subgenre of heavy metal music."
very_similar_answer_query_1 = "Black metal is a subgenre of heavy metal music."
unexpected_answer_query_1 = "Black metals are beautiful and are often used in jewelry design."
# compare without marking the record as tested
comparer.compare_with_record(query = "Black metal",
                             provided_text = valid_answer_query_1,
                             mark_as_tested=False)
comparer.compare_with_record(query = "Black metal",
                             provided_text = very_similar_answer_query_1,
                             mark_as_tested=False)
comparer.compare_with_record(query = "Black metal",
                             provided_text = unexpected_answer_query_1,
                             mark_as_tested=False)

2.4 Check comparison results

comparer.get_comparison_results()
   query        char_count_diff  word_count_diff  line_count_diff  punctuation_diff  semantic_similarity  expected_text                                      provided_text                                      id
0  Black metal  0                0                0                0                 1.000000             Black metal is an extreme subgenre of heavy me...  Black metal is an extreme subgenre of heavy me...  1
1  Black metal  9                1                0                0                 0.974236             Black metal is an extreme subgenre of heavy me...  Black metal is a subgenre of heavy metal music.    1
2  Black metal  8                1                0                0                 0.499244             Black metal is an extreme subgenre of heavy me...  Black metals are beautiful and are often used ...  1
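
The count-based columns above are consistent with absolute differences between simple counts taken on the expected and provided texts. The sketch below is one way such values could be derived using only the standard library. The function name text_diffs and the counting rules (whitespace-separated words, string.punctuation characters) are assumptions made for illustration, not the library's confirmed implementation.

```python
import string
from typing import Dict


def text_diffs(expected_text: str, provided_text: str) -> Dict[str, int]:
    """Absolute differences in basic text counts.

    Hypothetical helper; the counting rules are assumptions and may not
    match what comparisonframe does internally.
    """

    def punctuation_count(text: str) -> int:
        # Count characters listed in string.punctuation (ASCII only).
        return sum(ch in string.punctuation for ch in text)

    return {
        "char_count_diff": abs(len(expected_text) - len(provided_text)),
        "word_count_diff": abs(
            len(expected_text.split()) - len(provided_text.split())
        ),
        "line_count_diff": abs(
            len(expected_text.splitlines()) - len(provided_text.splitlines())
        ),
        "punctuation_diff": abs(
            punctuation_count(expected_text) - punctuation_count(provided_text)
        ),
    }


expected = "Black metal is an extreme subgenre of heavy metal music."
similar = "Black metal is a subgenre of heavy metal music."
diffs = text_diffs(expected, similar)
```

For the expected text and the very similar answer, this gives a character difference of 9 and a word difference of 1, which agrees with the second row of the table.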

3. Record test statuses

comparer.compare_with_record(query = "Black metal",
                             provided_text = very_similar_answer_query_1,
                             mark_as_tested=True)
comparer.get_all_records()
   id  timestamp            query        expected_text                                      tested  test_status
0  1   2023-11-04 03:28:48  Black metal  Black metal is an extreme subgenre of heavy me...  yes     pass
1  2   2023-11-04 03:28:48  Tribulation  Tribulation are a Swedish heavy metal band fro...  no      NaN
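
Marking records one by one can be wrapped in a small loop over the untested queries. The sketch below relies only on the two methods shown in the examples above, get_all_queries and compare_with_record, with the same keyword arguments. The function run_validation and the generate callable (whatever produces the new text for a query, such as a model call) are hypothetical names introduced for illustration.

```python
from typing import Callable, List


def run_validation(comparer, generate: Callable[[str], str]) -> List[str]:
    """Compare freshly generated text against every untested record.

    `comparer` is expected to behave like a ComparisonFrame instance;
    `generate` is a hypothetical callable mapping a query to new text.
    Returns the list of queries that were compared and marked as tested.
    """
    tested_queries = []
    for query in comparer.get_all_queries(untested_only=True):
        comparer.compare_with_record(
            query=query,
            provided_text=generate(query),
            mark_as_tested=True,
        )
        tested_queries.append(query)
    return tested_queries
```

Calling run_validation(comparer, my_generate_function) would then leave the comparison metrics in the results file and update the tested column for each processed record.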

4. Resetting and flushing results

4.1 Reset test statuses

comparer.reset_record_statuses(
    # optionally
    record_ids = [1]
)
comparer.get_all_records()
   id  timestamp            query        expected_text                                      tested  test_status
0  1   2023-11-04 03:28:48  Black metal  Black metal is an extreme subgenre of heavy me...  no      NaN
1  2   2023-11-04 03:28:48  Tribulation  Tribulation are a Swedish heavy metal band fro...  no      NaN

4.2 Flush comparison results

comparer.flush_comparison_results()
comparer.get_comparison_results()
ERROR:ComparisonFrame:No results file found. Please perform some comparisons first.

4.3 Flush records

comparer.flush_records()
comparer.get_all_records()
id  timestamp  query  expected_text  tested  test_status
