
SemHash: Fast Semantic Text Deduplication


SemHash is a lightweight tool for efficiently deduplicating datasets based on semantic similarity. It combines lightning-fast embeddings through model2vec with ANN-based similarity search through vicinity.

Quickstart

Install the package with:

pip install semhash

Deduplicate a single dataset with the following code:

from model2vec import StaticModel
from semhash import SemHash

# Load an embedding model
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Initialize a SemHash with the model
semhash = SemHash(model=model)

# Create some texts to deduplicate
texts = [
    "It's dangerous to go alone!",
    "It's dangerous to go alone!",  # Exact duplicate
    "It's not safe to go by yourself!",  # Semantically similar
]

# Deduplicate the texts
deduplicated_texts = semhash.fit_deduplicate(records=texts, threshold=0.5)

For more advanced usage, you can also deduplicate across multiple datasets, or deduplicate multi-column datasets. Examples are provided in the usage section.

Main Features

  • Fast: SemHash uses model2vec to embed texts and vicinity to perform similarity search, making it extremely fast.
  • Scalable: SemHash can deduplicate large datasets with millions of records thanks to the ANN backends in Vicinity.
  • Flexible: SemHash can be used to deduplicate a single dataset or across two datasets, and can also be used to deduplicate multi-column datasets (such as QA datasets).
  • Lightweight: SemHash is a lightweight package with minimal dependencies, making it easy to install and use.
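To make the similarity-threshold idea behind these features concrete, here is a minimal greedy deduplication sketch in plain NumPy. It illustrates the principle only: SemHash itself embeds texts with model2vec and replaces the brute-force similarity scan below with an ANN index from vicinity.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Keep a record only if its cosine similarity to every
    previously kept record stays below the threshold."""
    # Normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vector in enumerate(normed):
        if not kept or np.max(normed[kept] @ vector) < threshold:
            kept.append(i)
    return kept

# Toy vectors: the first two are identical, the third is orthogonal
vectors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(deduplicate(vectors, threshold=0.5))  # [0, 2]
```

An ANN backend avoids this quadratic pairwise scan, which is what lets the approach scale to millions of records.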

Usage

Deduplicate a single dataset

The following code snippet shows how to deduplicate a single dataset using SemHash:

from model2vec import StaticModel
from semhash import SemHash

# Load an embedding model
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Initialize a SemHash with the model
semhash = SemHash(model=model)

# Create some texts to deduplicate
texts = [
    "It's dangerous to go alone!",
    "It's dangerous to go alone!",  # Exact duplicate
    "It's not safe to go by yourself!",  # Semantically similar
]

# Deduplicate the texts
deduplicated_texts = semhash.fit_deduplicate(records=texts, threshold=0.5)

Deduplicate across two datasets

The following code snippet shows how to deduplicate across two datasets using SemHash (in this example, a training and test dataset):

from model2vec import StaticModel
from semhash import SemHash

# Load an embedding model
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Initialize a SemHash with the model
semhash = SemHash(model=model)

# Create some texts to deduplicate
train = [
    "It's dangerous to go alone!",
    "It's a secret to everybody.",
    "Ganondorf has invaded Hyrule!",
]
test = [
    "It's dangerous to go alone!",  # Exact duplicate
    "It's not safe to go by yourself!",  # Semantically similar
    "The master sword seals the darkness",
]

# Fit on the training data
semhash.fit(records=train)
# Deduplicate the test data against the training data
deduplicated_texts = semhash.deduplicate(records=test, threshold=0.5)

Deduplicate multi-column datasets

The following code snippet shows how to deduplicate multi-column datasets using SemHash (in this example, a QA dataset with questions, contexts, and answers):

from model2vec import StaticModel
from semhash import SemHash

# Load an embedding model
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Initialize a SemHash with the model and columns to deduplicate
semhash = SemHash(model=model, columns=["question", "context", "answer"])

# Create some texts to deduplicate
records = [
    {"question": "What is the hero's name?", "context": "The hero is Link", "answer": "Link"},
    {"question": "What is the hero's name?", "context": "The hero is Link", "answer": "Link"},  # Exact duplicate
    {
        "question": "Who is the protagonist?",
        "context": "In this story, Link is the hero",
        "answer": "Link",
    },  # Semantically similar
    {"question": "Who is the princess?", "context": "The princess is Zelda", "answer": "Zelda"},
]

# Deduplicate the records
deduplicated_records = semhash.fit_deduplicate(records=records, threshold=0.5)
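How SemHash combines the configured columns internally isn't documented in this description, so the sketch below is a hypothetical illustration of one common approach: embed each column separately and average the per-column vectors into a single record vector. The names embed_record and toy_encode are made up for this example and are not part of the SemHash API.

```python
import numpy as np

def embed_record(record: dict[str, str], columns: list[str], encode) -> np.ndarray:
    """Embed each configured column and average the vectors
    into a single representation for the whole record."""
    column_vectors = np.stack([encode(record[col]) for col in columns])
    return column_vectors.mean(axis=0)

# Toy stand-in encoder: maps text to a 2-d vector by length parity
def toy_encode(text: str) -> np.ndarray:
    return np.array([1.0, 0.0]) if len(text) % 2 == 0 else np.array([0.0, 1.0])

record = {"question": "Who is the hero?", "context": "The hero is Link", "answer": "Link"}
vector = embed_record(record, ["question", "context", "answer"], toy_encode)
```

With a real embedding model in place of toy_encode, two records are then compared by the similarity of their averaged vectors, exactly as in the single-column case.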

Benchmarks

We've benchmarked SemHash on a variety of datasets to measure deduplication performance and speed. The benchmarks were run with the following setup:

  • All benchmarks were run on CPU
  • All benchmarks were run with ann=True
  • The timings include encoding time, index building time, and deduplication time

Train Deduplication Benchmark

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- |
| bbc | 1225 | 1144 | 6.61 | 0.23 |
| senteval_cr | 3012 | 2990 | 0.73 | 0.13 |
| tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.64 |
| emotion | 16000 | 15696 | 1.90 | 0.66 |
| amazon_counterfactual | 5000 | 4992 | 0.16 | 0.31 |
| ag_news | 120000 | 106921 | 10.90 | 4.21 |
| enron_spam | 31716 | 20539 | 35.24 | 1.57 |
| subj | 8000 | 7990 | 0.12 | 0.57 |
| sst5 | 8544 | 8526 | 0.21 | 0.55 |
| 20_newgroups | 11314 | 10685 | 5.56 | 0.69 |
| hatespeech_offensive | 22783 | 22090 | 3.04 | 0.84 |
| ade | 17637 | 15718 | 10.88 | 0.68 |
| imdb | 25000 | 24832 | 0.67 | 1.65 |
| massive_scenario | 11514 | 9366 | 18.66 | 0.43 |
| student | 117519 | 63865 | 45.66 | 4.18 |
| squad_v2 | 130319 | 115546 | 11.34 | 11.20 |
| wikitext | 1801350 | 884586 | 50.89 | 56.11 |

Train/Test Deduplication Benchmark

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- | --- |
| bbc | 1225 | 1000 | 875 | 12.50 | 0.39 |
| senteval_cr | 3012 | 753 | 750 | 0.40 | 0.11 |
| tweet_sentiment_extraction | 27481 | 3534 | 3411 | 3.48 | 0.86 |
| emotion | 16000 | 2000 | 1926 | 3.70 | 0.56 |
| amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.48 |
| ag_news | 120000 | 7600 | 6197 | 18.46 | 3.38 |
| enron_spam | 31716 | 2000 | 1063 | 46.85 | 1.98 |
| subj | 8000 | 2000 | 1999 | 0.05 | 0.58 |
| sst5 | 8544 | 2210 | 2205 | 0.23 | 0.56 |
| 20_newgroups | 11314 | 7532 | 7310 | 2.95 | 2.26 |
| hatespeech_offensive | 22783 | 2000 | 1926 | 3.70 | 0.72 |
| ade | 17637 | 5879 | 4954 | 15.73 | 0.82 |
| imdb | 25000 | 25000 | 24805 | 0.78 | 2.65 |
| massive_scenario | 11514 | 2974 | 2188 | 26.43 | 0.43 |
| student | 117519 | 5000 | 2395 | 52.10 | 3.02 |
| squad_v2 | 130319 | 11873 | 11869 | 0.03 | 9.11 |
| wikitext | 1801350 | 4358 | 3610 | 17.16 | 36.10 |

As the results show, SemHash is extremely fast and scales to datasets with millions of records. The benchmarks also surface notable train/test leakage: for datasets such as enron_spam and student, the test set has substantial semantic overlap with the training set.
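The "% Removed" column is derived directly from the size columns: it is the fraction of test records flagged as duplicates of training records. For example, for enron_spam:

```python
# Recompute "% Removed" for enron_spam from the table above
test_size = 2000
deduplicated_size = 1063
removed_pct = (test_size - deduplicated_size) / test_size * 100
print(f"{removed_pct:.2f}%")  # prints "46.85%"
```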

Reproducing the Benchmarks

To run the benchmarks yourself, you can use the following command:

python -m benchmarks.run_benchmarks

Optionally, the datasets can be updated in the datasets.py file.

