

Project description

srdedupe — Safe Bibliographic Deduplication

Python 3.11+ · License: MIT

Safe, reproducible deduplication for systematic reviews and bibliographic databases.

Parses and deduplicates bibliographic reference files (RIS, NBIB, BibTeX, WoS, EndNote) with FPR-controlled decision making, full audit trails, and deterministic outputs.

Installation

pip install srdedupe

Quick Start

Parse and export

from srdedupe import parse_file, parse_folder, write_jsonl

# Single file (format auto-detected)
records = parse_file("references.ris")

# Entire folder
records = parse_folder("data/", recursive=True)

# Export to JSONL
write_jsonl(records, "output.jsonl")

Deduplicate

from srdedupe import dedupe

result = dedupe("references.ris", output_dir="out", fpr_alpha=0.01)

print(f"Records: {result.total_records}")
print(f"Auto-merged clusters: {result.total_duplicates_auto}")
print(f"Review records: {result.total_review_records}")
print(f"Unique records: {result.total_unique_records}")
print(f"Dedup rate: {result.dedup_rate:.1%}")
print(f"Output: {result.output_files['deduplicated_ris']}")

CLI

# Parse to JSONL
srdedupe parse references.ris -o output.jsonl
srdedupe parse data/ -o records.jsonl --recursive

# Full deduplication pipeline
srdedupe deduplicate references.ris
srdedupe deduplicate data/ -o results --fpr-alpha 0.005 --verbose

How It Works

A six-stage pipeline whose merge decisions are controlled by a target false positive rate (FPR):

  1. Parse & Normalize — Multi-format ingestion, field normalization
  2. Candidate Generation — High-recall blocking (DOI, PMID, year+title, LSH)
  3. Probabilistic Scoring — Fellegi-Sunter model with field-level comparisons
  4. Three-Way Decision — AUTO_DUP / REVIEW / AUTO_KEEP with Neyman-Pearson FPR control
  5. Global Clustering — Connected components with anti-transitivity checks
  6. Canonical Merge — Deterministic survivor selection and field merging
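The blocking idea behind stage 2 can be illustrated with a single exact-key blocker. This is a hypothetical sketch, not srdedupe's internal API; `exact_key_pairs` and the dict-shaped records are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def exact_key_pairs(records, key):
    """Yield candidate index pairs for records sharing a non-empty `key`
    (e.g. a normalized DOI or PMID). Records lacking the key never pair here,
    which is why several blockers (year+title, LSH) are combined for recall."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        value = (rec.get(key) or "").strip().lower()
        if value:
            blocks[value].append(i)
    for indices in blocks.values():
        yield from combinations(indices, 2)
```

Only pairs that fall into the same block are scored in stage 3, which keeps the comparison count far below the quadratic all-pairs total.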

Pairs classified as REVIEW are preserved in output artifacts for manual inspection.
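The stage-4 three-way split can be sketched as two cut points over the match probability. The function and argument names below are illustrative, not srdedupe's internals; how `t_high` is derived from `fpr_alpha` when left as `None` is an inference from the API defaults, not documented behavior:

```python
def classify(match_probability: float, t_low: float, t_high: float) -> str:
    """Neyman-Pearson-style thresholding: two cut points split scores into
    auto-merge, manual-review, and auto-keep regions."""
    if match_probability >= t_high:
        return "AUTO_DUP"   # confident duplicate: merge automatically
    if match_probability >= t_low:
        return "REVIEW"     # ambiguous: defer to a human
    return "AUTO_KEEP"      # confident non-duplicate: keep both records
```

Raising `t_high` (or tightening `fpr_alpha`) shrinks the AUTO_DUP region and pushes borderline pairs into REVIEW rather than risking a false merge.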

API Reference

parse_file(path, *, strict=True) -> list[CanonicalRecord]

  • Parse a single bibliographic file. Format is auto-detected from file content.

parse_folder(path, *, pattern=None, recursive=False, strict=False) -> list[CanonicalRecord]

  • Parse all supported files in a folder. Optional glob pattern (e.g. "*.ris").

write_jsonl(records, path, *, sort_keys=True) -> None

  • Write records to JSONL file with deterministic field ordering.
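What "deterministic field ordering" buys can be shown with a stdlib-only stand-in. This is a minimal sketch of the behavior, not srdedupe's actual implementation:

```python
import json

def write_jsonl_sketch(records, path):
    """One JSON object per line with keys sorted, so re-running the export on
    the same records yields byte-identical output (diff- and hash-friendly)."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, sort_keys=True, ensure_ascii=False) + "\n")
```

Deterministic bytes are what make the stage outputs auditable: two runs over the same input can be compared with a plain `diff` or checksum.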

dedupe(input_path, *, output_dir="out", fpr_alpha=0.01, t_low=0.3, t_high=None) -> PipelineResult

Run the full deduplication pipeline. Returns a PipelineResult with:

  • success, total_records, total_candidates, total_duplicates_auto, total_review_records, total_unique_records, dedup_rate
  • output_files — dict mapping artifact names to file paths
  • error_message — error details if success is False

Advanced: PipelineConfig + run_pipeline

For full control (custom blockers, FS model path, audit logger):

from pathlib import Path
from srdedupe.engine import PipelineConfig, run_pipeline

config = PipelineConfig(
    fpr_alpha=0.01,
    t_low=0.3,
    t_high=None,
    candidate_blockers=["doi", "pmid", "year_title"],
    output_dir=Path("out"),
)

result = run_pipeline(input_path=Path("references.ris"), config=config)

Supported Formats

Format            Extensions
RIS               .ris
PubMed/NBIB       .nbib, .txt
BibTeX            .bib
Web of Science    .ciw
EndNote Tagged    .enw

Pipeline Output Structure

out/
├── stage1/canonical_records.jsonl
├── stage2/candidate_pairs.jsonl
├── stage3/scored_pairs.jsonl
├── stage4/pair_decisions.jsonl
├── stage5/clusters.jsonl
├── artifacts/
│   ├── deduped_auto.ris
│   ├── merged_records.jsonl
│   ├── clusters_enriched.jsonl
│   ├── review_pending.ris  (if review pairs exist)
│   └── singletons.ris      (if singletons exist)
└── reports/
    ├── ingestion_report.json  (folder input only)
    └── merge_summary.json

Development

make dev           # Install dependencies + pre-commit hooks
make test-fast     # Quick validation while coding
make check         # Lint + type check + format (before committing)
make test          # Full test suite (417 tests, ≥80% coverage)


License

MIT — see LICENSE.

Citation

@software{srdedupe2026,
  author = {Lopes, Ennio Politi},
  title = {srdedupe: Safe Bibliographic Deduplication},
  year = {2026},
  url = {https://github.com/enniolopes/srdedupe}
}



Download files


Source Distribution

srdedupe-0.1.1.tar.gz (86.3 kB)


Built Distribution


srdedupe-0.1.1-py3-none-any.whl (119.1 kB)


File details

Details for the file srdedupe-0.1.1.tar.gz.

File metadata

  • Download URL: srdedupe-0.1.1.tar.gz
  • Size: 86.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for srdedupe-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e209890e5246c5843271f07e97e2128ee0121075b55311836fe5aa5e58e9c81d
MD5 68fc5b0bab39faa235180e71d2d039cb
BLAKE2b-256 18abebe2c274f8943473b141e53dbba0915ce2e307868fced317e2b1ea040ca8
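A downloaded archive can be checked against the published SHA256 digest with the standard library; the streaming helper below is illustrative and not part of srdedupe:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file in chunks and return its hex SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

# sha256_of("srdedupe-0.1.1.tar.gz") should equal the SHA256 value in the table above.
```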


Provenance

The following attestation bundles were made for srdedupe-0.1.1.tar.gz:

Publisher: publish.yml on enniolopes/srdedupe


File details

Details for the file srdedupe-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: srdedupe-0.1.1-py3-none-any.whl
  • Size: 119.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for srdedupe-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5414b001328feea3f5df9dabc91b5b8dcea10ff0cb9af86405ed48fcd6965462
MD5 1f497798f96792de5c6f13de254bdd6a
BLAKE2b-256 a8da012740faf6d7a124943b7db24c381031365107b19fd5ecd83fed0c8e4336


Provenance

The following attestation bundles were made for srdedupe-0.1.1-py3-none-any.whl:

Publisher: publish.yml on enniolopes/srdedupe

