Compare texts within a pandas DataFrame, highlighting changes and computing similarity ratios
Project description
A library designed to compare texts within a pandas DataFrame, highlighting changes and computing similarity ratios. It utilizes difflib.SequenceMatcher to perform detailed comparisons and generate results that can be easily interpreted and displayed in HTML format.
Features
Medium.com - How to compare texts in Pandas
- HTML Output: Generate an HTML representation of the comparison results that highlights additions, deletions, and modifications in the text.
- Similarity Assessment: Compute a similarity ratio for each pair of compared texts, allowing quick assessment of text changes.
- Flexible Integration: Designed to work directly with pandas DataFrames, making it easy to integrate into existing data processing pipelines.
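To give a feel for the kind of comparison difflib.SequenceMatcher performs, here is a minimal standalone sketch that classifies word-level edits as insertions, deletions, or replacements. It uses only the standard library; the describe_changes helper and the word-level tokenization are illustrative assumptions, not the library's actual implementation.

```python
from difflib import SequenceMatcher


def describe_changes(text_a: str, text_b: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_fragment, new_fragment) for every changed span."""
    # Assumption: compare whitespace-separated words rather than characters
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    changes = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words are not reported
        old = " ".join(words_a[a_start:a_end])
        new = " ".join(words_b[b_start:b_end])
        changes.append((tag, old, new))
    return changes


changes = describe_changes(
    "Thank you for your kind review",
    "Thank you very much for your review",
)
for change in changes:
    print(change)
```

Each tuple starts with one of difflib's opcode tags ("insert", "delete" or "replace"), which map naturally onto the additions, deletions, and modifications highlighted in the HTML output.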
Usage
Installation:
pip install pandas-text-comparer
Necessary imports:
from IPython import display
import pandas as pd
from pandas_text_comparer import TextComparer
Nice-to-haves (not required): pandarallel for multi-core processing and tqdm for a progress bar.
from pandarallel import pandarallel # multi-core processing
from tqdm.auto import tqdm # progress bar
tqdm.pandas()
pandarallel.initialize(progress_bar=True)
1. Run the comparison
Specify the names of your columns and run the computation.
# A toy GPT-4 generated dataset. Replace with your data
df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz")
comparer = TextComparer(df, column_a="llm_response", column_b="human_response")
comparer.run()
2. Explore the difference
Generate an HTML table. In Jupyter, it can be viewed with IPython.display.
Alternatively, you can write it to a file and open it in any web browser.
html = comparer.get_html()
display.HTML(html)
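If you are working outside Jupyter, writing the markup to a file could look like the sketch below. It assumes that get_html returns a plain string; the save_report helper and the file name are hypothetical, and the placeholder markup stands in for the real output.

```python
import tempfile
from pathlib import Path


def save_report(html: str, path: Path) -> Path:
    """Write an HTML string to disk and return the absolute path of the file."""
    target = Path(path).resolve()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


# Stand-in markup; in practice pass the value returned by comparer.get_html()
demo_html = "<table><tr><td>demo</td></tr></table>"
report_path = save_report(demo_html, Path(tempfile.gettempdir()) / "comparison.html")
print(report_path.as_uri())  # open this URI in a browser, e.g. with webbrowser.open
```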
3. Sort by the severity of edits
Sort the result by ratio, difflib.SequenceMatcher's measure of similarity between two texts on a scale from 0 to 1. Higher values mean that the texts are more similar.
html = comparer.get_html(sort_by_ratio="desc") # or "asc"
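For intuition about the ratio scale, the standard library can be called directly. SequenceMatcher.ratio() returns 2 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The similarity helper below is a hypothetical wrapper that compares raw characters; how the library tokenizes texts before comparing them is not stated here.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return 2 * M / T: M matched characters, T total characters in both texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()


print(similarity("abcd", "abcd"))  # 1.0: identical texts
print(similarity("abcd", "bcde"))  # 0.75: "bcd" matches, so 2 * 3 / 8
print(similarity("abcd", "wxyz"))  # 0.0: nothing in common
```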
4. Add columns to the view
Add any columns from the original data to the HTML by passing a slice of the DataFrame to the get_html method.
columns = ["review_id", "company_name"]
html = comparer.get_html(df[columns])
5. Filter rows to display
When you provide any pandas object with an index (i.e. pd.DataFrame, pd.Series or pd.Index) as an argument to get_html, it is also used to filter the rows.
filt = df.company_name == "FitFusion"
# Filter rows & add columns
html = comparer.get_html(df[filt])
# Just filter rows
html = comparer.get_html(df[filt].index)
6. Save and load the results
A comparer stores its results in a DataFrame, comparer.result. This data can be persisted and used later to create a new comparer, which avoids re-computation:
result_filepath = "data/comparer_result.csv"
comparer.result.to_csv(result_filepath)
# Don't forget to specify the index column
loaded_result = pd.read_csv(result_filepath, index_col=0)
new_comparer = TextComparer.from_result(loaded_result)
Also, if you need to further process your data based on the computed similarities of texts, just grab this column from the result:
df["similarity_ratio"] = comparer.result.ratio
Hashes for pandas_text_comparer-0.1.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 0461f3941ca9667b5902ea4b76f15848d3215cff46e995f28cf05acbf61e8879
MD5 | c9e744fa6a6e494216c49ff4fd88dfd4
BLAKE2b-256 | e324e23bc78336411562ccfd3068360ec36baeea5a9ba2d50e13b41ed7e89c42
Hashes for pandas_text_comparer-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e7470d4ca47576ae7ab1933dadd59900cb5db2ed8d6d5081b7417aa4e0b29a33
MD5 | f53ff13938faeade933684a14dcb8a1d
BLAKE2b-256 | bf20e299240a34d17e9747915062dec1fe1a28965b9ac782fd4e73df62afdc58