
D3lta


This repository is the official implementation of D3lta, a library for detecting duplicate verbatim content within large collections of documents.

It distinguishes three types of duplicated content: copypasta (near-exact duplicates), rewording, and translation. It runs on CPU.


💻 Installation

# PyPI is case insensitive, so d3lta == D3lta
pip install d3lta

🚀 Quick start

The semantic_faiss function can be called directly on a DataFrame containing texts. By default, embeddings are computed with the Universal Sentence Encoder, but other embedding models can be used.

import pandas as pd
from d3lta.faissd3lta import *

examples_dataset = [
    "Je m'apelle Mimie et je fais du stop",
    "Je m'apelle Giselle et toi ?",
    "Les chats sont gris",
    "Cat's are grey, aren't they ?",
    "Cats are grey",
    "Les chats ne sont pas gris",
]
df = pd.DataFrame(examples_dataset, columns=["text_language_detect"])
df.index = df.index.astype(str)

matches, df_clusters = semantic_faiss(
    df=df.rename(columns={"text_language_detect": "original"}),
    min_size_txt=10,
    embeddings_to_save='myembeddings',
    threshold_grapheme=0.693,
    threshold_language=0.715,
    threshold_semantic=0.85,
)

>>> matches

  source target     score duplicates language_source           text_to_embed_source  text_grapheme_source language_target           text_to_embed_target   text_grapheme_target     dup_type  score_lev
0      2      3  0.745741        2-3              fr            Les chats sont gris      leschatssontgris              en  Cat's are grey, aren't they ?   catsaregreyarentthey  translation        NaN
1      2      4  0.955517        2-4              fr            Les chats sont gris      leschatssontgris              en                  Cats are grey            catsaregrey  translation        NaN
2      2      5  0.808805        2-5              fr            Les chats sont gris      leschatssontgris              fr     Les chats ne sont pas gris  leschatsnesontpasgris   copy-pasta   0.761905
5      3      5  0.833525        3-5              en  Cat's are grey, aren't they ?  catsaregreyarentthey              fr     Les chats ne sont pas gris  leschatsnesontpasgris  translation        NaN
8      4      5  0.767601        4-5              en                  Cats are grey           catsaregrey              fr     Les chats ne sont pas gris  leschatsnesontpasgris  translation        NaN

>>> df_clusters
                               original language                 text_grapheme                         text_to_embed                  text_language_detect  cluster
0  Je m'apelle Mimie et je fais du stop       fr  jemapellemimieetjefaisdustop  Je m'apelle Mimie et je fais du stop  Je m'apelle Mimie et je fais du stop      NaN
1          Je m'apelle Giselle et toi ?       fr         jemapellegiselleettoi          Je m'apelle Giselle et toi ?          Je m'apelle Giselle et toi ?      NaN
2                   Les chats sont gris       fr              leschatssontgris                   Les chats sont gris                   Les chats sont gris      0.0
3         Cat's are grey, aren't they ?       en          catsaregreyarentthey         Cat's are grey, aren't they ?         Cat's are grey, aren't they ?      0.0
4                         Cats are grey       en                   catsaregrey                         Cats are grey                         Cats are grey      0.0
5            Les chats ne sont pas gris       fr         leschatsnesontpasgris            Les chats ne sont pas gris            Les chats ne sont pas gris      0.0
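The score_lev column in the matches table above appears consistent with a Levenshtein similarity normalized by the length of the longer grapheme string; this is our own reconstruction, not necessarily the library's exact formula. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# the copy-pasta pair (2, 5) from the table above
print(round(lev_similarity("leschatssontgris", "leschatsnesontpasgris"), 6))  # → 0.761905
```

This reproduces the 0.761905 shown for the pair (2, 5), which supports the normalized-Levenshtein reading.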

It is also possible to use Faiss directly to find similar embeddings.

import pandas as pd
from d3lta.faissd3lta import *

examples_dataset = [
    "Je m'apelle Mimie et je fais du stop",
    "Je m'apelle Giselle et toi ?",
    "Les chats sont gris",
    "Cat's are grey, aren't they ?",
    "Cats are grey",
    "Les chats ne sont pas gris",
]

df_test = pd.DataFrame(
    examples_dataset,
    columns=["text_to_embed"],
    index=range(len(examples_dataset)),  # explicit index to check that ids are preserved
)
df_emb = compute_embeddings(df_test)
index_t = create_index_cosine(df_emb)

test_dataset = pd.DataFrame([{"text_to_embed": "I gatti sono grigi"}])
df_emb_test = compute_embeddings(test_dataset)

limits, distances, indices = index_t.range_search(
    x=df_emb_test.to_numpy().reshape(1, -1), thresh=0.7
)

>>> df_test.loc[indices]["text_to_embed"]

2              Les chats sont gris
3    Cat's are grey, aren't they ?
4                    Cats are grey
5       Les chats ne sont pas gris
Name: text_to_embed, dtype: object
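A cosine-similarity range search like the one above reduces to an inner-product search over L2-normalized vectors (presumably what create_index_cosine sets up; we have not inspected its internals). A Faiss-free numpy sketch of the same kind of query:

```python
import numpy as np

def range_search_cosine(corpus: np.ndarray, query: np.ndarray, thresh: float):
    """Return (similarities, indices) of corpus rows with cosine similarity >= thresh."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n          # cosine similarity = dot product of unit vectors
    idx = np.where(sims >= thresh)[0]
    return sims[idx], idx

# toy 2-D "embeddings" for illustration
corpus = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
sims, idx = range_search_cosine(corpus, np.array([1.0, 0.0]), thresh=0.7)
print(idx)  # → [0 1]
```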

It is also possible to use your own embedding (other than Universal Sentence Encoder). For example:

import pandas as pd
from sentence_transformers import SentenceTransformer
from d3lta.faissd3lta import *

examples_dataset = [
    "Je m'apelle Mimie et je fais du stop",
    "Je m'apelle Giselle et toi ?",
    "Les chats sont gris",
    "Cat's are grey, aren't they ?",
    "Cats are grey",
    "Les chats ne sont pas gris",
]
df = pd.DataFrame(examples_dataset, columns=["text_language_detect"])
df.index = df.index.astype(str)

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
new_emb = model.encode(df['text_language_detect'].values.tolist())
df_emb = pd.DataFrame(new_emb, index=df.index)

matches, df_clusters = semantic_faiss(
    df=df.rename(columns={"text_language_detect": "original"}),
    min_size_txt=10,
    df_embeddings_use=df_emb,
    threshold_grapheme=0.693,
    threshold_language=0.715,
    threshold_semantic=0.85,
)

matches

📚 Synthetic dataset

The dataset is available in the 1.0.0 release. It contains the following files:

synthetic_dataset_documents.csv:

This file contains all seeds (real, original texts) and their generated variations (copypasta, rewording, or translation). The corpus contains 2,985 documents generated with a large language model.

Columns details:

  • doc_id (int): unique number associated with each text. Seed ids are multiples of 10, each followed by its 9 transformations.
  • original (str): real or transformed text
  • text_type (str): dataset where the seed was extracted (books, news, tweets)
  • language (str): language of the text
  • prompt (str): prompt given to ChatGPT for "copypasta" and "rewording"
  • seed (bool): True if the text is one of the 300 initial texts from which the variations were generated
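Given the doc_id convention above (seed ids are multiples of 10, followed by their 9 transformations), each document can be mapped back to its seed with integer arithmetic. A small pandas sketch on a toy frame mimicking the schema (the data here is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "doc_id": [0, 1, 2, 10, 11],
    "seed":   [True, False, False, True, False],
})
# map every document back to the doc_id of its seed
toy["seed_doc_id"] = (toy["doc_id"] // 10) * 10
print(toy["seed_doc_id"].tolist())  # → [0, 0, 0, 10, 10]
```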

The 300 initial texts (seeds) were taken from three Kaggle datasets (for more information, please refer to the paper).

synthetic_dataset_pairs_unbalanced.csv:

This file contains the 1,497,547 annotated text pairs of the synthetic dataset: 4,500 pairs of translation, 4,030 pairs of copypasta, 4,017 pairs of rewording, and 1,485,000 pairs of non-duplicated content labelled "nomatch".

Column details:

  • source_target (str): unique id for the pair.
  • source (int): index of the first text of the pair in the synthetic_dataset_documents.csv
  • target (int): index of the second text of the pair in the synthetic_dataset_documents.csv
  • original_source (str): text of the source index
  • original_target (str): text of the target index
  • language_source (str): language of original_source
  • language_target (str): language of original_target
  • true_label (str): transformation relating the two texts of the pair, i.e. the source and target texts are {true_label} of each other. The true_label can be "copypasta", "rewording", "translation", or "nomatch" (non-duplicated)
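With the schema above, the class balance of the pairs file can be checked with a simple value_counts. A sketch on a toy frame mimicking the columns (the rows here are invented; on the real file the counts would match the figures stated above):

```python
import pandas as pd

pairs = pd.DataFrame({
    "source_target": ["0-1", "0-2", "1-2", "0-20"],
    "true_label": ["copypasta", "rewording", "translation", "nomatch"],
})
# distribution of pair labels
counts = pairs["true_label"].value_counts()
print(counts.sum())  # → 4
```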

Notebooks

In the notebooks directory, you can find:

👩‍💻 Developing

Clone the repository

git clone https://github.com/VIGINUM-FR/D3lta

Navigate to the project

cd D3lta

Install the package

pip install -e .

Citation

If you find our paper and code useful in your research, please consider giving a star 🌟 and a citation 📝:

@misc{richard2023unmasking,
      title={Unmasking information manipulation: A quantitative approach to detecting Copy-pasta, Rewording, and Translation on Social Media}, 
      author={Manon Richard and Lisa Giordani and Cristian Brokate and Jean Liénard},
      year={2023},
      eprint={2312.17338},
      archivePrefix={arXiv},
      primaryClass={cs.SI},
      url={https://arxiv.org/abs/2312.17338}, 
}
