Compare texts within a pandas DataFrame, highlighting changes and computing similarity ratios

A library designed to compare texts within a pandas DataFrame, highlighting changes and computing similarity ratios. It utilizes difflib.SequenceMatcher to perform detailed comparisons and generate results that can be easily interpreted and displayed in HTML format.

Features

Medium.com - How to compare texts in Pandas

  • HTML Output: Generate an HTML representation of the comparison results, which highlights additions, deletions, and modifications in the text.
  • Similarity Assessment: Compute a similarity ratio for each pair of compared texts, allowing quick assessment of text changes.
  • Flexible Integration: Designed to work directly with pandas DataFrames, making it easy to integrate into existing data processing pipelines.
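To make the highlighting idea concrete, here is a minimal sketch of how difflib.SequenceMatcher opcodes can be turned into HTML. It is not the library's actual implementation: the function name highlight_diff and the choice of <del>/<ins> tags are assumptions made for illustration, and the comparison is done character by character.

```python
import difflib
import html


def highlight_diff(a: str, b: str) -> str:
    """Return an HTML fragment marking how text `a` was changed into `b`.

    Hypothetical helper for illustration, not part of pandas-text-comparer.
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old = html.escape(a[i1:i2])  # slice of the first text
        new = html.escape(b[j1:j2])  # slice of the second text
        if tag == "equal":
            parts.append(old)
        elif tag == "delete":
            parts.append(f"<del>{old}</del>")
        elif tag == "insert":
            parts.append(f"<ins>{new}</ins>")
        else:  # "replace": show the removed text, then the added text
            parts.append(f"<del>{old}</del><ins>{new}</ins>")
    return "".join(parts)


fragment = highlight_diff("Thanks for your review", "Thank you for your review")
print(fragment)
```

Escaping each slice with html.escape keeps characters such as < and & in the compared texts from being interpreted as markup.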

Usage

Installation:

pip install pandas-text-comparer

Necessary imports:

from IPython import display
import pandas as pd
from pandas_text_comparer import TextComparer

Nice-to-haves (not required): pandarallel for multi-core processing and tqdm for a progress bar.

from pandarallel import pandarallel  # multi-core processing
from tqdm.auto import tqdm  # progress bar

tqdm.pandas()
pandarallel.initialize(progress_bar=True)

1. Run the comparison

Specify the names of your columns and run the computation.

# A toy GPT-4 generated dataset. Replace with your data
df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz")

comparer = TextComparer(df, column_a="llm_response", column_b="human_response")
comparer.run()

2. Explore the difference

Generate an HTML table. It can be viewed with IPython.display in Jupyter. Alternatively, you can write it to a file and open it in any web browser.

html = comparer.get_html()
display.HTML(html)
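For the file-based route, a minimal sketch follows. It assumes that get_html returns a plain string (as the snippet above suggests); the helper name save_html, the output file name, and the placeholder markup are hypothetical.

```python
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> Path:
    """Write an HTML string to `path` as UTF-8 and return the path."""
    path.write_text(html, encoding="utf-8")
    return path


# Placeholder markup; in practice this would be the string from comparer.get_html()
html = "<table><tr><td>example</td></tr></table>"

output_path = save_html(html, Path(tempfile.gettempdir()) / "comparison.html")
print(f"Saved to {output_path}")

# To open the file in the default browser, uncomment:
# import webbrowser; webbrowser.open(output_path.as_uri())
```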

3. Sort by the severity of edits

Sort the result by ratio - difflib.SequenceMatcher's metric of similarity between two texts, on the scale from 0 to 1. Higher values mean that the texts are more similar.

html = comparer.get_html(sort_by_ratio="desc")  # or "asc"
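To get a feel for the metric itself, the sketch below calls difflib.SequenceMatcher directly on a few toy strings. The ratio is defined as 2 * M / T, where M is the number of matched characters and T is the total length of both strings. The helper name similarity_ratio and the sample pairs are illustrative only, and the sort merely mimics the idea of ordering by ratio, not how the library does it internally.

```python
import difflib


def similarity_ratio(a: str, b: str) -> float:
    """Similarity of two strings as defined by difflib: 2 * M / T."""
    return difflib.SequenceMatcher(None, a, b).ratio()


pairs = [
    ("abcd", "wxyz"),  # nothing in common      -> 0.0
    ("abcd", "abcd"),  # identical              -> 1.0
    ("abcd", "bcde"),  # share "bcd": 2 * 3 / 8 -> 0.75
]

# Order the pairs from most to least similar
ranked = sorted(pairs, key=lambda pair: similarity_ratio(*pair), reverse=True)

for a, b in ranked:
    print(f"{a!r} vs {b!r}: {similarity_ratio(a, b):.2f}")
```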

4. Add columns to the view

Add any columns from the original data to the HTML by passing a slice of the DataFrame to the get_html method.

columns = ["review_id", "company_name"]
html = comparer.get_html(df[columns])

5. Filter rows to display

When you provide any pandas object with an index (i.e. pd.DataFrame, pd.Series or pd.Index) as an argument to get_html, it is also used to filter the rows.

filt = df.company_name == "FitFusion"

# Filter rows & add columns
html = comparer.get_html(df[filt])

# Just filter rows
html = comparer.get_html(df[filt].index)

6. Save and load the results

A comparer stores its results in a DataFrame - comparer.result. This data can be persisted and used later to create a new comparer, so you avoid recomputing the comparison:

result_filepath = "data/comparer_result.csv"
comparer.result.to_csv(result_filepath)

# Don't forget to specify the index column
loaded_result = pd.read_csv(result_filepath, index_col=0)
new_comparer = TextComparer.from_result(loaded_result)

Also, if you need to further process your data based on the computed similarities of texts, just grab this column from the result:

df["similarity_ratio"] = comparer.result.ratio
