Detect row overlap and leakage between dataset splits.

splitcheck

Detect rows that leak between dataset splits, and fail CI when they do.

A row that appears in both train and test inflates evaluation metrics, and such a leak is easy to introduce: a careless concat, a re-export, a duplicated record. splitcheck compares your splits and reports how much of one appears in another, both as exact matches and after normalization, so cosmetic differences do not hide a leak.

$ splitcheck check train.csv test.csv --on text --max-leakage 0.0
target  in source  exact  normalized  leakage
test    train          1           3     2.1%
splitcheck: worst leakage 2.1%

Install

$ pip install splitcheck                                     # from PyPI
$ pip install git+https://github.com/jmweb-org/splitcheck   # latest from GitHub

Reads CSV, Parquet, JSON Lines, and plain text (one row per line) through polars.

Usage

$ splitcheck check train.csv test.csv                 # compare whole rows
$ splitcheck check train.csv val.csv test.csv         # all pairs at once
$ splitcheck check train.csv test.csv --on text       # compare one column
$ splitcheck check train.csv test.csv --json          # machine-readable
$ splitcheck check train.csv test.csv --max-leakage 0.01   # allow 1%
$ splitcheck check train.csv test.csv --no-check      # report without failing

By default any leakage fails the command (--max-leakage 0.0). Raise the limit to tolerate a known, small overlap.

In CI

- run: splitcheck check data/train.parquet data/test.parquet --on text

How matching works

Each row is reduced to a string: the value of a single column when --on is given, otherwise the whole row with its fields joined by tabs. Two kinds of match are reported:

  • exact: identical strings.
  • normalized: equal after case folding, punctuation removal and whitespace collapsing, which catches the same example re-saved with cosmetic changes.
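
The reduction and normalization described above can be sketched in a few lines of Python. This is an illustration only, not splitcheck's actual code: the helper names row_key and normalize are hypothetical, rows are modelled as plain dicts rather than polars frames, and "punctuation" is assumed to mean ASCII punctuation, which may differ from the rules the tool applies.

```python
import string

# Translation table that deletes ASCII punctuation (an assumption; the
# real tool may treat Unicode punctuation differently).
_PUNCT = str.maketrans("", "", string.punctuation)


def row_key(row, on=None):
    """Reduce a row (dict of column -> value) to a single string.

    With `on`, use that column only; otherwise join all fields with tabs.
    """
    if on is not None:
        return str(row[on])
    return "\t".join(str(value) for value in row.values())


def normalize(text):
    """Case-fold, remove punctuation, and collapse whitespace runs."""
    text = text.casefold().translate(_PUNCT)
    return " ".join(text.split())


a = row_key({"id": 1, "text": "Hello,  World!"}, on="text")
b = row_key({"id": 2, "text": "hello world"}, on="text")
print(a == b)                        # exact match: False
print(normalize(a) == normalize(b))  # normalized match: True
```

The two rows differ as raw strings, so they are not an exact match, but they collapse to the same key once case, punctuation and spacing are ignored.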

Leakage is the fraction of the target split (for example test) whose rows also appear in a source split (for example train). Pairs are checked in both directions and sorted worst-first.
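
As a rough sketch of that definition, the Python below scores every ordered pair of splits and sorts the results worst-first. The function names leakage and check_all are hypothetical, each split is assumed to be a list of already-reduced row strings, and every target row is counted individually (so duplicates within the target each count); splitcheck itself may count differently.

```python
from itertools import permutations


def leakage(target, source):
    """Fraction of target rows whose key also appears in source."""
    if not target:
        return 0.0
    seen = set(source)
    return sum(1 for key in target if key in seen) / len(target)


def check_all(splits, max_leakage=0.0):
    """Score every ordered (target, source) pair, sorted worst-first.

    Returns the sorted results and whether the worst pair exceeds the limit.
    """
    results = [
        (target, source, leakage(splits[target], splits[source]))
        for target, source in permutations(splits, 2)
    ]
    results.sort(key=lambda item: item[2], reverse=True)
    failed = bool(results) and results[0][2] > max_leakage
    return results, failed


splits = {
    "train": ["a", "b", "c", "d"],
    "test": ["d", "e"],
}
results, failed = check_all(splits)
for target, source, fraction in results:
    print(f"{target} in {source}: {fraction:.1%}")
print("failed:", failed)
```

The same shared row gives a different score in each direction because the denominator is the size of the target split: one of two test rows is 50%, while one of four train rows is 25%.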

Exit codes

Code  Meaning
0     Checked; leakage within the limit (or --no-check)
1     Leakage exceeded --max-leakage
2     Fewer than two files, or a file was missing or unsupported
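
If you call splitcheck from a Python script rather than directly in CI, the exit codes above can be mapped to outcomes as in the sketch below. The wrapper names describe_exit and run_check are hypothetical, and the sketch assumes the splitcheck executable is on PATH; only the exit codes and the check subcommand come from this README.

```python
import subprocess

# Meanings taken from the exit-code table in this README.
EXIT_MEANINGS = {
    0: "ok: leakage within the limit (or --no-check)",
    1: "leak: leakage exceeded --max-leakage",
    2: "usage: fewer than two files, or a file was missing or unsupported",
}


def describe_exit(code):
    """Translate a splitcheck exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(*paths, extra_args=()):
    """Run `splitcheck check` on the given files and describe the result.

    Assumes the `splitcheck` executable is installed and on PATH.
    """
    cmd = ["splitcheck", "check", *paths, *extra_args]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, describe_exit(proc.returncode)


# Example call (not executed here):
#   code, meaning = run_check("train.csv", "test.csv", extra_args=["--on", "text"])
print(describe_exit(1))
```

Keeping code 1 separate from code 2 lets a pipeline tell a real leak apart from a misconfigured invocation.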

Scope

Matching is exact or normalized-exact, not fuzzy. Near-duplicates that differ in wording (paraphrases) are not caught; see the issues for planned MinHash-based fuzzy matching.

License

MIT. See LICENSE.

