Detect row overlap and leakage between dataset splits.
splitcheck
Detect rows that leak between dataset splits, and fail CI when they do.
A row that appears in both train and test inflates every metric and is easy to
introduce: a careless concat, a re-export, a duplicated record. splitcheck
compares your splits and reports how much of one appears in another, both as
exact matches and after normalization, so cosmetic differences do not hide a
leak.
$ splitcheck check train.csv test.csv --on text --max-leakage 0.0
target  in source  exact  normalized  leakage
test    train          1           3     2.1%
splitcheck: worst leakage 2.1%
Install
$ pip install splitcheck # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/splitcheck # latest, available now
Reads CSV, Parquet, JSON Lines, and plain text (one row per line) through polars.
Usage
$ splitcheck check train.csv test.csv # compare whole rows
$ splitcheck check train.csv val.csv test.csv # all pairs at once
$ splitcheck check train.csv test.csv --on text # compare one column
$ splitcheck check train.csv test.csv --json # machine-readable
$ splitcheck check train.csv test.csv --max-leakage 0.01 # allow 1%
$ splitcheck check train.csv test.csv --no-check # report without failing
By default any leakage fails the command (--max-leakage 0.0). Raise the limit
to tolerate a known, small overlap.
In CI
- run: splitcheck check data/train.parquet data/test.parquet --on text
How matching works
Each row is reduced to a string: a single column with --on, or the whole row
joined tab-separated. Two matches are reported:
- exact: identical strings.
- normalized: equal after case folding, punctuation removal and whitespace collapsing, which catches the same example re-saved with cosmetic changes.
Leakage is the fraction of the target split (for example test) whose rows also appear in a source split (for example train). Pairs are checked in both directions and sorted worst-first.
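The matching rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not splitcheck's actual implementation: the helper names `row_key`, `normalize`, and `leakage` are hypothetical, and punctuation removal here covers ASCII punctuation only.

```python
import string

# Assumption for this sketch: only ASCII punctuation is stripped.
_PUNCT = str.maketrans("", "", string.punctuation)


def row_key(row):
    """Reduce a whole row to one string by joining its values with tabs."""
    return "\t".join(str(value) for value in row)


def normalize(text):
    """Case-fold, drop punctuation, and collapse runs of whitespace."""
    text = text.casefold().translate(_PUNCT)
    return " ".join(text.split())


def leakage(target, source, key=lambda s: s):
    """Fraction of target strings whose key also appears in source."""
    if not target:
        return 0.0
    source_keys = {key(s) for s in source}
    hits = sum(1 for t in target if key(t) in source_keys)
    return hits / len(target)


train = ["Hello, world!", "The cat sat.", "A third row"]
test = ["hello   world", "The cat sat.", "Unseen example", "Another one"]

# Exact matching: only "The cat sat." is shared.
exact = leakage(test, train)
# Normalized matching: "hello   world" now matches "Hello, world!" as well.
normalized = leakage(test, train, key=normalize)

print(f"exact={exact:.2f} normalized={normalized:.2f}")
# prints: exact=0.25 normalized=0.50
```

Swapping the arguments, `leakage(train, test)`, gives the other direction of the same pair, which is why each pair of splits produces two figures.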
Exit codes
| Code | Meaning |
|---|---|
| 0 | Checked; leakage within the limit (or --no-check) |
| 1 | Leakage exceeded --max-leakage |
| 2 | Fewer than two files, or a file was missing or unsupported |
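As a rough sketch, the table corresponds to a decision like the one below. The function name `exit_code` and its parameters are hypothetical and chosen for illustration; the strict `>` comparison is an assumption based on the word "exceeded".

```python
def exit_code(worst_leakage, max_leakage=0.0, no_check=False, input_error=False):
    """Map a check result to the exit codes listed in the table.

    worst_leakage and max_leakage are fractions (0.01 means 1%).
    input_error stands for "fewer than two files, or a file was
    missing or unsupported".
    """
    if input_error:
        return 2
    if no_check:
        # Report only: never fail on leakage.
        return 0
    # Assumption: leakage equal to the limit still passes.
    return 1 if worst_leakage > max_leakage else 0


print(exit_code(0.021))                    # 1: any leakage fails by default
print(exit_code(0.021, max_leakage=0.05))  # 0: within the raised limit
print(exit_code(0.021, no_check=True))     # 0: reporting only
```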
Scope
Matching is exact or normalized-exact, not fuzzy. Near-duplicates that differ in wording (paraphrases) are not caught; see the issues for planned MinHash-based fuzzy matching.
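To make the limit concrete, the short sketch below compares a cosmetic re-save with a paraphrase. The `normalize` helper is a hypothetical stand-in for the normalization described earlier (ASCII punctuation only, by assumption), not the library's own function.

```python
import string

# Assumption for this sketch: only ASCII punctuation is stripped.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Case-fold, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.casefold().translate(_PUNCT).split())


original = "The movie was great!"
resaved = "the movie was  GREAT"
paraphrase = "I really enjoyed the film."

# Cosmetic differences disappear after normalization.
print(normalize(original) == normalize(resaved))     # True
# Different wording still yields a different string, so no match.
print(normalize(original) == normalize(paraphrase))  # False
```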
License
MIT. See LICENSE.
File details
Details for the file splitcheck-0.2.0.tar.gz.
File metadata
- Download URL: splitcheck-0.2.0.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ccb3a8dfbe6f918e0fbfa62889a0b277c53eabe869673dbecad7b8da1bd3d3bd |
| MD5 | 9dc78e383426aa9b3d975bf862524a9a |
| BLAKE2b-256 | 4688bf230303bcbd714c16bd524d9519b8ca5eedb8a43c783982b9c9bd36ed5b |
File details
Details for the file splitcheck-0.2.0-py3-none-any.whl.
File metadata
- Download URL: splitcheck-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8cc1d49e0bda5a6a8b9a8c23c4592ffc430309745b5f43a149262521f96b5160 |
| MD5 | 0d44e65caec2b872e58f86a85c6b06e3 |
| BLAKE2b-256 | 23918bfbcbe421647a1078ae45b7725f586d0433285e86c4efcb2e877faa1033 |