Fast contamination detection for ML training data - Python bindings for decon

Reason this release was yanked:

Please use 0.3.0.post4

Project description

Contamination Detection

Decon identifies documents contaminated with eval instances.

It uses simple token-based sampling and counting methods, which makes it suitable for large datasets. It is deterministic, and its results are interpretable.

Decon can produce contamination reports and cleaned datasets.

๐Ÿ This fork adds Python bindings โ€” the core Rust functionality is unchanged. Skip to Python Quick Start to get started, or see the Architecture section to understand how bindings are structured. For the full Python API signature, see crates/decon-py/src/lib.rs.

How Decon Works

Consider a 30GB web dataset in ~/sample-data that includes documents containing evaluation question text.

TRAINING DOC:

"... for θ 30 c i θ i0 4 for θ 90 d i θ is constant for all values of θ the plane face of plano convex lens of focal length 20 cm is silvered this combination is equivalent to the type of mirror and its focal length is a convex f 20 c m b concave f 20 cm in a displacement method using convex lens two images are obtained for a separation of d between ..."

EVAL PROMPT: the plane face of plano convex lens of focal length 20 cm is silvered this combination is equivalent to the type of mirror and its focal length is

EVAL ANSWER: concave f 10 cm
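
To make the match concrete, here is a toy sketch of n-gram overlap between the eval prompt and the training document. It is not Decon's implementation: it splits on whitespace instead of using a tokenizer, and the function names and the scoring rule (the fraction of eval 5-grams found in the training text) are assumptions chosen for illustration only.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-word windows in whitespace-split text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(eval_text: str, training_text: str, n: int = 5) -> float:
    """Fraction of the eval text's n-grams that also occur in the training text."""
    eval_grams = word_ngrams(eval_text, n)
    if not eval_grams:
        # Eval text shorter than n words: nothing to match.
        return 0.0
    return len(eval_grams & word_ngrams(training_text, n)) / len(eval_grams)


# Excerpt of the training document above, starting at the matching span.
training_excerpt = (
    "the plane face of plano convex lens of focal length 20 cm is silvered "
    "this combination is equivalent to the type of mirror and its focal length is "
    "a convex f 20 c m b concave f 20 cm in a displacement method using convex lens"
)
eval_prompt = (
    "the plane face of plano convex lens of focal length 20 cm is silvered "
    "this combination is equivalent to the type of mirror and its focal length is"
)

score = overlap_score(eval_prompt, training_excerpt)
print(f"overlap score: {score:.2f}")
```

Every 5-gram of the prompt appears in the excerpt, so the toy score is 1.0; an unrelated sentence would score 0.0.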

We can identify the contamination locations by running decon.

$ decon detect --training-dir ~/sample-data --evals-dir ~/references

Training files 4,487/4,487 [00:02:55/00:00:00] [████████████████████████████████████████]

┌───────────────────────────────────────────┐
│     Contamination Detection Results       │
├───────────────────────────────────────────┤
│ Training lines                  5,162,084 │
│ Processing rate                 34 μs/doc │
├───────────────────────────────────────────┤
│ Index building time                38.59s │
│ Detection time                    175.69s │
│ Total time                        214.28s │
├───────────────────────────────────────────┤
│ Contaminated matches                7,699 │
│ Contaminated documents              1,851 │
└───────────────────────────────────────────┘

$ decon review --stats /tmp/decon-295c0cbd

=== TRAINING DOCUMENTS CONTAMINATED BY EVAL SUITE ===
(Each count represents unique training documents that need removal)

  sciq                                  652 │██████████████████████████████████████████████████│
  mmlu                                  278 │█████████████████████                             │
  mmlu_pro                              211 │████████████████                                  │
  ai2_arc_easy                           83 │██████                                            │
  super_gpqa                             65 │████                                              │

  ...

Quick Start

Python

Install via pip:

pip install decontaminate

Run contamination detection in Python:

import decon

# Configure detection
config = decon.Config(
    training_dir="/path/to/training/data",
    evals_dir="/path/to/eval/references",
    report_output_dir="/path/to/output",
)

# Run detection (automatically parallelized using all CPU cores)
report_dir = decon.detect(config)
print(f"Results written to: {report_dir}")

Additional Python API

import decon

# Tokenizer utilities
tokenizer = decon.Tokenizer("cl100k")  # Options: r50k, p50k, cl100k, o200k, uniseg
tokens = tokenizer.encode("hello world")  # [15339, 1917]
text = tokenizer.decode(tokens)           # "hello world"

# Text cleaning (normalizes punctuation/whitespace, lowercases)
cleaned = decon.clean_text("Hello,  World!")  # "hello world"

# All Config options
config = decon.Config(
    training_dir="/path/to/training",
    evals_dir="/path/to/evals",
    report_output_dir="/path/to/reports",
    ngram_size=5,                          # N-gram size for matching
    tokenizer="cl100k",                    # Tokenizer to use
    contamination_score_threshold=0.8,     # Detection threshold
    content_key="text",                    # JSON field containing text
    verbose=False,                         # Enable verbose output
    purify=False,                          # Create cleaned dataset
)

📖 Full API: See crates/decon-py/src/lib.rs for complete function signatures.

📚 Python Guide: See doc/python.md for detailed examples with CLI equivalents.


CLI (Rust)

# Clone and build. Requires Rust 1.88+.
git clone https://github.com/allenai/decon
cd decon

# For full set of commands and options, help is available.
cargo run --release -- --help

# List current eval datasets in reference (small default set initially).
cargo run --release -- evals

# Run contamination detection.
cargo run --release -- detect --training-dir tests/fixtures/training/

# Create a clean copy (contaminated documents removed) of your dataset.
cargo run --release -- detect --training-dir tests/fixtures/training/ --purify

# Review report output. A decon detect run will report an output directory.
cargo run --release -- review /tmp/decon-output-directory

Sensible defaults are provided for decon parameters, with a single contamination_score_threshold that can be adjusted to desired sensitivity. Experimenting with these parameters on your own dataset and eval reference set is recommended.

Advanced Usage

Preparing Datasets

Training Documents

Decon operates on a directory containing jsonl files.

Each JSON object in the files must contain a field with a string value representing a training document [example].
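
As a rough illustration of that layout, the sketch below writes and reads a small jsonl shard using only the standard library. The "text" field name matches the content_key default shown in the Config options above; the helper names and the shard file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, documents: list[str], content_key: str = "text") -> None:
    """Write one JSON object per line, storing each document under content_key."""
    with path.open("w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps({content_key: doc}, ensure_ascii=False) + "\n")


def read_documents(path: Path, content_key: str = "text") -> list[str]:
    """Read the documents back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line)[content_key] for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-0000.jsonl"  # hypothetical file name
    write_jsonl(shard, ["first training document", "second training document"])
    docs = read_documents(shard)

print(docs)
```

If your data stores the document under a different field, the content_key option is the setting to adjust.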

Eval Suites

Decon runs against a reference set of eval suites, which is also expected to be a directory containing jsonl files [example].

Decon eval reference files have a normalized format that includes passage, question, and answer keys as well as metadata for reporting. Decon includes tooling to generate reference files from Hugging Face datasets.

Eval Reference Set Curation

Three eval suites are included in the eval reference dataset by default: gsm8k, mmlu, and agi_eval.

It's likely you will want to build your own reference set with your evals of interest.

The decon evals command can process an extensible, declarative yaml file to normalize Hugging Face datasets.

To download all the pre-configured evals included in the configuration file, run the following command. This requires python3 with the datasets library installed.

# Review current set of evals in reference
cargo run --release -- evals

# Download and normalize all evals configured in a config file
cargo run --release -- evals --download --config config/evals.yaml

See the Evaluation Dataset Guide for more information on preparing evaluation datasets.

Server

Decon can also be run as a server to facilitate distributing workloads.

# Launch a server
decon server --port 8080

An example orchestration script is provided which demonstrates one approach to batch retrieve a partition of documents, submit documents to the server, poll for job status, and upload reports and clean documents to a new location.

See deployment guide for details.

Reviewing Results

Decon includes tools for qualitative review and basic stats which can be filtered to analyze contamination.

# To qualitatively review individual matches
cargo run --release -- review /my-results-directory

# To see statistics
cargo run --release -- review --stats /my-results-directory

# To review with filters, e.g. specific eval with minimum score
cargo run --release -- review /my-results-directory --eval mmlu --min-score 0.9

# Compare results between different decontamination runs
cargo run --release -- compare /tmp/results-a /tmp/results-b

Decon reports are jsonl files which are ready for analysis beyond the provided tooling.
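
For example, a few lines of Python can tally matches per eval suite above a minimum score. This is only a sketch: the record field names "eval" and "score" are assumptions made for illustration, so check an actual report file for the real keys before adapting it.

```python
import json
from collections import Counter
from typing import Iterable


def count_matches_by_eval(
    lines: Iterable[str],
    min_score: float = 0.0,
    eval_key: str = "eval",    # assumed field name, not confirmed by the source
    score_key: str = "score",  # assumed field name, not confirmed by the source
) -> Counter:
    """Count report records per eval suite whose score is at least min_score."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record[score_key] >= min_score:
            counts[record[eval_key]] += 1
    return counts


# Synthetic records standing in for lines of a report file.
sample_lines = [
    json.dumps({"eval": "mmlu", "score": 0.95}),
    json.dumps({"eval": "mmlu", "score": 0.82}),
    json.dumps({"eval": "sciq", "score": 0.91}),
]
counts = count_matches_by_eval(sample_lines, min_score=0.9)
print(dict(counts))
```

In practice the lines would come from iterating over an open report file rather than an in-memory list.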

Architecture

This fork restructures decon as a Rust workspace with three crates:

Crate       Source              Description
decon-core  crates/decon-core/  Core detection engine: pure Rust library (unchanged from upstream)
decon-cli   crates/decon-cli/   Command-line interface built on decon-core
decon-py    crates/decon-py/    Python bindings via PyO3

How Python Bindings Work

The Python bindings are a thin wrapper around decon-core; no detection logic is reimplemented in Python. Key files:

File                                      Purpose
crates/decon-py/src/lib.rs                PyO3 wrapper classes (PyConfig, PyTokenizer) and functions (detect, clean_text)
crates/decon-py/python/decon/__init__.py  Python module re-exports
crates/decon-py/tests/test_parity.py      Parity tests ensuring Python ↔ Rust equivalence

The detect() function releases the GIL via py.allow_threads(), enabling full utilization of Rayon's parallel processing on all CPU cores.

Building from Source

Rust CLI:

cargo build --release
# Binary at: target/release/decon

Python bindings (requires maturin):

cd crates/decon-py
maturin develop --release
# Or build wheels: maturin build --release

📦 Detailed guide: See doc/building.md for cross-platform builds, troubleshooting, and CI/CD.

Requirements

  • Rust: 1.88+ (edition 2024)
  • Python: 3.12+ (for bindings)

Download files

Source Distribution

decontaminate-0.3.0.post3.tar.gz (141.8 kB)
Uploaded: Source

Built Distributions

decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Uploaded: CPython 3.14, manylinux: glibc 2.17+ x86-64

decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.8 MB)
Uploaded: CPython 3.14, manylinux: glibc 2.17+ ARM64

decontaminate-0.3.0.post3-cp314-cp314-macosx_11_0_arm64.whl (5.5 MB)
Uploaded: CPython 3.14, macOS 11.0+ ARM64

decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Uploaded: CPython 3.13, manylinux: glibc 2.17+ x86-64

decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.8 MB)
Uploaded: CPython 3.13, manylinux: glibc 2.17+ ARM64

decontaminate-0.3.0.post3-cp313-cp313-macosx_11_0_arm64.whl (5.5 MB)
Uploaded: CPython 3.13, macOS 11.0+ ARM64

decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Uploaded: CPython 3.12, manylinux: glibc 2.17+ x86-64

decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.8 MB)
Uploaded: CPython 3.12, manylinux: glibc 2.17+ ARM64

decontaminate-0.3.0.post3-cp312-cp312-macosx_11_0_arm64.whl (5.5 MB)
Uploaded: CPython 3.12, macOS 11.0+ ARM64

File details

Details for the file decontaminate-0.3.0.post3.tar.gz.

File metadata

  • Download URL: decontaminate-0.3.0.post3.tar.gz
  • Upload date:
  • Size: 141.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for decontaminate-0.3.0.post3.tar.gz
Algorithm Hash digest
SHA256 95a901f635b7566cd5f78698974c84f0d924d8e1ae42f9bbabfe9bb341fea12a
MD5 335a025d7d41b12be4e33f62f19e2c78
BLAKE2b-256 7f43303b88a695aa6b1dae63da739db23503fa241dc82a2a40258f9968614f8d
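
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it; the helper names are hypothetical, and the commented-out call assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for decontaminate-0.3.0.post3.tar.gz.
EXPECTED_SHA256 = "95a901f635b7566cd5f78698974c84f0d924d8e1ae42f9bbabfe9bb341fea12a"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the archive was downloaded to the current directory:
# matches_digest(Path("decontaminate-0.3.0.post3.tar.gz"), EXPECTED_SHA256)
```

pip can perform the same check at install time when a requirements file pins hashes and is installed with --require-hashes.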

See more details on using hashes here.

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3.tar.gz:

Publisher: release.yml on vincentzed/decon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 42393181b0da12bd839b926561a0243b01ec2b6d12fd2b96fb9cac708099df7b
MD5 e265b8f80c431ea40be213b710f8f666
BLAKE2b-256 f9127403d4ee3b44da36176687086ec062794884ec0be0b75c4fe611f2f61f0f

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f9830cfe04a76b2e22d2cdc41cbbfab2d5363a3c48c4b37a56c297ab0239f2ec
MD5 3e056e16d59c588e2692905f25dafb80
BLAKE2b-256 a4a475e257549da6d8631d126e76392ba782dbda8bd41df867a9c3f6f59efc46

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 defee30f8e077e73ff5523b40915fcb761a7ba88ed55a9b3bd001acad8719ff5
MD5 aeb913696ce569885a4c37e469889074
BLAKE2b-256 6a898333cd648bc2bd545cb7d938aba96acd8f5e939d2bd8f9db2c53be3ef82a

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d39cb88c9b4df1b2a9b0a20c790c83249f1374f3022909152c6c2119f70662e2
MD5 d49c753b562d29487e39c8e85e0dff25
BLAKE2b-256 ca409bc3ada6251a5dc81301067c1df7ce4a4a203b747daece4b1f372eb58f93

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 0b9eba91795a46dd801739a3c028187de7ad7b8421308e3ec45314d73d453826
MD5 10d40d22aaa01ffccefede3e86e789eb
BLAKE2b-256 8b08a5a71078d9f1192a07a26ac8b227cf1084bf58dc190e570589740871e78f

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 69ed33e2388cf23ef1c47b45a026b0b65f08c1623632292533648719ab07df18
MD5 62001280a43130e5f5771db4f42d656f
BLAKE2b-256 1ca50dc939a5a0f9d56e07f39a5d03270e52409401de5109a6a2c328618381d9

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp313-cp313-macosx_11_0_arm64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 47030e108bc735085cecfaac53c0d912475e218d803ce5fefdc7128c411a50ab
MD5 aceaacc9514abbff9d8c755122d3519d
BLAKE2b-256 de69b08be1db935898c6fb79b3bf6c09c524cf0ea9b88d5e4a138f690e37cd3a

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8624548b821f5c7bb63f6114b467be320ed6361ba23527a25e62c76a8f25066b
MD5 f0643a60ae315476348b9d6e12ef4327
BLAKE2b-256 25bb33025b2585ee4ab8e28be5cef96787f0f2f377425eced5d24621504768c6

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on vincentzed/decon

File details

Details for the file decontaminate-0.3.0.post3-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for decontaminate-0.3.0.post3-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7945e7a1d976557202d8dd4bb7f9a0faebb33d5f213e781268122c293e5ec746
MD5 6d28cc8a5666b21cb8574bbe8ffeff56
BLAKE2b-256 89543dac1b96be53d3d81acbdfa6679314c34013ceeff7467f05ac2cdc5cc55d

Provenance

The following attestation bundles were made for decontaminate-0.3.0.post3-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: release.yml on vincentzed/decon
