Safe, FPR-controlled, reproducible deduplication pipeline for bibliographic reference files, designed for systematic review workflows.
srdedupe — Safe Bibliographic Deduplication
Safe, reproducible deduplication for systematic reviews and bibliographic databases.
Parses and deduplicates bibliographic reference files (RIS, NBIB, BibTeX, WoS, EndNote) with FPR-controlled decision making, full audit trails, and deterministic outputs.
Installation
pip install srdedupe
Quick Start
Parse and export
from srdedupe import parse_file, parse_folder, write_jsonl
# Single file (format auto-detected)
records = parse_file("references.ris")
# Entire folder
records = parse_folder("data/", recursive=True)
# Export to JSONL
write_jsonl(records, "output.jsonl")
Deduplicate
from srdedupe import dedupe
result = dedupe("references.ris", output_dir="out", fpr_alpha=0.01)
print(f"Records: {result.total_records}")
print(f"Auto-merged clusters: {result.total_duplicates_auto}")
print(f"Review records: {result.total_review_records}")
print(f"Unique records: {result.total_unique_records}")
print(f"Dedup rate: {result.dedup_rate:.1%}")
print(f"Output: {result.output_files['deduplicated_ris']}")
CLI
# Parse to JSONL
srdedupe parse references.ris -o output.jsonl
srdedupe parse data/ -o records.jsonl --recursive
# Full deduplication pipeline
srdedupe deduplicate references.ris
srdedupe deduplicate data/ -o results --fpr-alpha 0.005 --verbose
How It Works
A 6-stage pipeline controlled by false positive rate (FPR):
- Parse & Normalize — Multi-format ingestion, field normalization
- Candidate Generation — High-recall blocking (DOI, PMID, year+title, LSH)
- Probabilistic Scoring — Fellegi-Sunter model with field-level comparisons
- Three-Way Decision — AUTO_DUP / REVIEW / AUTO_KEEP with Neyman-Pearson FPR control
- Global Clustering — Connected components with anti-transitivity checks
- Canonical Merge — Deterministic survivor selection and field merging
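To make the candidate-generation stage concrete, here is a minimal sketch of key-based blocking: records that share a DOI or a normalized year+title key become candidate pairs. The plain-dict records, the field names, and the helpers `blocking_keys` and `candidate_pairs` are assumptions made for illustration and are not srdedupe's internal API; PMID and LSH blocking are left out.

```python
import re
from collections import defaultdict
from itertools import combinations


def blocking_keys(record):
    """Yield blocking keys for one record (hypothetical field names)."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        yield ("doi", doi)
    # Collapse punctuation and case so trivially different titles share a key.
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    year = record.get("year")
    if title and year:
        yield ("year_title", f"{year}|{title}")


def candidate_pairs(records):
    """Return the set of index pairs that share at least one blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        for key in blocking_keys(record):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"doi": "10.1000/ABC", "title": "Deep learning for triage", "year": 2020},
    {"doi": "10.1000/abc", "title": "Deep Learning for Triage.", "year": 2020},
    {"doi": None, "title": "Deep learning for triage", "year": 2020},
    {"doi": "10.1000/xyz", "title": "An unrelated study", "year": 2019},
]
print(sorted(candidate_pairs(records)))
```

Only pairs inside a block are compared later, which keeps the number of scored pairs far below the full quadratic count while still catching records that agree on any one key.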
Pairs classified as REVIEW are preserved in output artifacts for manual inspection.
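The scoring and decision stages can be sketched as follows, under stated assumptions. The m/u probabilities, the field names, the log2 score scale, and the helpers `fs_score`, `calibrate_t_high`, and `decide` are all hypothetical; srdedupe's own model parameters and threshold scale may differ. The upper threshold is chosen from labelled non-match scores so that the empirical false positive rate stays at or below alpha.

```python
import math

# Hypothetical parameters: (P(field agrees | match), P(field agrees | non-match)).
FIELD_PARAMS = {
    "title": (0.95, 0.01),
    "year": (0.98, 0.10),
    "first_author": (0.90, 0.02),
}


def fs_score(agreements):
    """Sum Fellegi-Sunter log2 likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def calibrate_t_high(non_match_scores, alpha):
    """Return t such that at most a fraction alpha of labelled non-matches score above t."""
    ordered = sorted(non_match_scores)
    allowed = int(alpha * len(ordered))  # false positives tolerated (requires alpha < 1)
    return ordered[len(ordered) - allowed - 1]


def decide(score, t_low, t_high):
    """Three-way decision: merge automatically, keep automatically, or defer to a human."""
    if score > t_high:
        return "AUTO_DUP"
    if score < t_low:
        return "AUTO_KEEP"
    return "REVIEW"


# Illustrative non-match scores and a hand-picked lower threshold.
non_match_scores = [-9.0, -7.0, -5.0, -3.0, -1.0, 1.0, 3.0, 5.0]
t_high = calibrate_t_high(non_match_scores, alpha=0.25)
t_low = -4.0

all_agree = {"title": True, "year": True, "first_author": True}
title_only = {"title": True, "year": False, "first_author": False}
none_agree = {"title": False, "year": False, "first_author": False}

for name, pattern in [("all_agree", all_agree), ("title_only", title_only), ("none_agree", none_agree)]:
    print(name, decide(fs_score(pattern), t_low, t_high))
```

Widening the gap between `t_low` and `t_high` sends more pairs to REVIEW, trading manual effort for fewer automatic mistakes.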
API Reference
parse_file(path, *, strict=True) -> list[CanonicalRecord]
- Parse a single bibliographic file. Format is auto-detected from file content.
parse_folder(path, *, pattern=None, recursive=False, strict=False) -> list[CanonicalRecord]
- Parse all supported files in a folder. Optional glob pattern (e.g. "*.ris").
write_jsonl(records, path, *, sort_keys=True) -> None
- Write records to JSONL file with deterministic field ordering.
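Deterministic field ordering means the same records always serialize to the same bytes, so diffs and checksums stay meaningful across runs. The standard-library sketch below shows the general idea with `json.dumps(sort_keys=True)`. It works on plain dicts, and `write_jsonl_sketch` is a hypothetical name; it is an illustration and not srdedupe's actual `write_jsonl` implementation.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(records, path):
    """Write one JSON object per line, with keys sorted for stable output."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True, ensure_ascii=False))
            handle.write("\n")


# The same record built with a different key order serializes identically.
a = [{"title": "A study", "year": 2020, "doi": "10.1000/abc"}]
b = [{"doi": "10.1000/abc", "year": 2020, "title": "A study"}]

with tempfile.TemporaryDirectory() as tmp:
    path_a, path_b = Path(tmp, "a.jsonl"), Path(tmp, "b.jsonl")
    write_jsonl_sketch(a, path_a)
    write_jsonl_sketch(b, path_b)
    identical = path_a.read_bytes() == path_b.read_bytes()

print(identical)
```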
dedupe(input_path, *, output_dir="out", fpr_alpha=0.01, t_low=0.3, t_high=None) -> PipelineResult
Run the full deduplication pipeline. Returns a PipelineResult with:
- success, total_records, total_candidates, total_duplicates_auto, total_review_records, total_unique_records, dedup_rate
- output_files: dict mapping artifact names to file paths
- error_message: error details if success is False
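A caller would typically check `success` before reading the counters. The sketch below shows one way to do that. To stay runnable without srdedupe installed, it uses `SimpleNamespace` stand-ins with made-up values in place of a real `PipelineResult`; `summarize` is a hypothetical helper, and only the attribute names are taken from the fields above.

```python
from types import SimpleNamespace


def summarize(result):
    """Return a one-line summary, or the error message if the run failed."""
    if not result.success:
        return f"failed: {result.error_message}"
    return (
        f"{result.total_records} records, "
        f"{result.total_unique_records} unique, "
        f"{result.total_review_records} need review "
        f"({result.dedup_rate:.1%} dedup rate)"
    )


# Stand-in objects with illustrative values (not real pipeline output).
ok = SimpleNamespace(
    success=True,
    total_records=200,
    total_unique_records=150,
    total_review_records=10,
    dedup_rate=0.25,
    error_message=None,
)
bad = SimpleNamespace(success=False, error_message="input file not found")

print(summarize(ok))
print(summarize(bad))
```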
Advanced: PipelineConfig + run_pipeline
For full control (custom blockers, Fellegi-Sunter (FS) model path, audit logger):
from pathlib import Path
from srdedupe.engine import PipelineConfig, run_pipeline
config = PipelineConfig(
    fpr_alpha=0.01,
    t_low=0.3,
    t_high=None,
    candidate_blockers=["doi", "pmid", "year_title"],
    output_dir=Path("out"),
)
result = run_pipeline(input_path=Path("references.ris"), config=config)
Supported Formats
| Format | Extensions |
|---|---|
| RIS | .ris |
| PubMed/NBIB | .nbib, .txt |
| BibTeX | .bib |
| Web of Science | .ciw |
| EndNote Tagged | .enw |
Pipeline Output Structure
out/
├── stage1/canonical_records.jsonl
├── stage2/candidate_pairs.jsonl
├── stage3/scored_pairs.jsonl
├── stage4/pair_decisions.jsonl
├── stage5/clusters.jsonl
├── artifacts/
│   ├── deduped_auto.ris
│   ├── merged_records.jsonl
│   ├── clusters_enriched.jsonl
│   ├── review_pending.ris (if review pairs exist)
│   └── singletons.ris (if singletons exist)
└── reports/
    ├── ingestion_report.json (folder input only)
    └── merge_summary.json
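Because each stage writes JSONL, a quick sanity check after a run is to count the records per file. The sketch below does this with the standard library only. `count_jsonl_records` is a hypothetical helper, and the demo runs against a throwaway directory that mimics one file of the layout above, not against real pipeline output.

```python
import tempfile
from pathlib import Path


def count_jsonl_records(output_dir):
    """Map each .jsonl file under output_dir (relative path) to its non-empty line count."""
    counts = {}
    for path in sorted(Path(output_dir).rglob("*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            n_lines = sum(1 for line in handle if line.strip())
        counts[path.relative_to(output_dir).as_posix()] = n_lines
    return counts


# Demo on a temporary directory with a single stage file.
with tempfile.TemporaryDirectory() as tmp:
    stage1 = Path(tmp, "stage1")
    stage1.mkdir()
    (stage1 / "canonical_records.jsonl").write_text(
        '{"id": 1}\n{"id": 2}\n', encoding="utf-8"
    )
    counts = count_jsonl_records(tmp)

print(counts)
```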
Development
make dev # Install dependencies + pre-commit hooks
make test-fast # Quick validation while coding
make check # Lint + type check + format (before committing)
make test # Full test suite (417 tests, ≥80% coverage)
Documentation
- CONTRIBUTING.md — Code style, testing, contribution guidelines
License
MIT — see LICENSE.
Citation
@software{srdedupe2026,
author = {Lopes, Ennio Politi},
title = {srdedupe: Safe Bibliographic Deduplication},
year = {2026},
url = {https://github.com/enniolopes/srdedupe}
}