
srdedupe — Safe Bibliographic Deduplication

Requires Python 3.11+. Licensed under MIT.

Safe, reproducible deduplication for systematic reviews and bibliographic databases.

Parses and deduplicates bibliographic reference files (RIS, NBIB, BibTeX, WoS, EndNote) with FPR-controlled decision making, full audit trails, and deterministic outputs.

Installation

pip install srdedupe

Quick Start

Parse and export

from srdedupe import parse_file, parse_folder, write_jsonl

# Single file (format auto-detected)
records = parse_file("references.ris")

# Entire folder
records = parse_folder("data/", recursive=True)

# Export to JSONL
write_jsonl(records, "output.jsonl")

Deduplicate

from srdedupe import dedupe

result = dedupe("references.ris", output_dir="out", fpr_alpha=0.01)

print(f"Records: {result.total_records}")
print(f"Auto-merged: {result.total_duplicates_auto}")
print(f"Review required: {result.total_review_pairs}")
print(f"Output: {result.output_files['deduplicated_ris']}")

CLI

# Parse to JSONL
srdedupe parse references.ris -o output.jsonl
srdedupe parse data/ -o records.jsonl --recursive

# Full deduplication pipeline
srdedupe deduplicate references.ris
srdedupe deduplicate data/ -o results --fpr-alpha 0.005 --verbose

How It Works

Deduplication runs as a six-stage pipeline governed by an explicit false positive rate (FPR) target:

  1. Parse & Normalize — Multi-format ingestion, field normalization
  2. Candidate Generation — High-recall blocking (DOI, PMID, year+title, LSH)
  3. Probabilistic Scoring — Fellegi-Sunter model with field-level comparisons
  4. Three-Way Decision — AUTO_DUP / REVIEW / AUTO_KEEP with Neyman-Pearson FPR control
  5. Global Clustering — Connected components with anti-transitivity checks
  6. Canonical Merge — Deterministic survivor selection and field merging

Pairs classified as REVIEW are preserved in output artifacts for manual inspection.
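The three-way decision in stage 4 amounts to thresholding each pair's match probability: confident matches auto-merge, confident non-matches are kept, and the ambiguous middle goes to review. A minimal illustrative sketch (the function name and the fixed `t_high` are assumptions; in the real pipeline the upper threshold would be calibrated from `fpr_alpha`, not hard-coded):

```python
def classify_pair(match_prob: float, t_low: float = 0.3, t_high: float = 0.95) -> str:
    """Three-way decision on a scored candidate pair.

    In an FPR-controlled setup, t_high is chosen so that the expected
    false positive rate among AUTO_DUP decisions stays below the alpha
    budget (Neyman-Pearson style); here it is a fixed stand-in.
    """
    if match_prob >= t_high:
        return "AUTO_DUP"    # confident duplicate: merge automatically
    if match_prob <= t_low:
        return "AUTO_KEEP"   # confident non-duplicate: keep both records
    return "REVIEW"          # ambiguous: preserve for manual inspection

print(classify_pair(0.99))  # AUTO_DUP
print(classify_pair(0.10))  # AUTO_KEEP
print(classify_pair(0.60))  # REVIEW
```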

API Reference

parse_file(path, *, strict=True) -> list[CanonicalRecord]

  • Parse a single bibliographic file. Format is auto-detected from file content.

parse_folder(path, *, pattern=None, recursive=False, strict=False) -> list[CanonicalRecord]

  • Parse all supported files in a folder. Optional glob pattern (e.g. "*.ris").

write_jsonl(records, path, *, sort_keys=True) -> None

  • Write records to JSONL file with deterministic field ordering.
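The determinism that `sort_keys=True` buys can be reproduced with the standard library alone; a rough sketch of what such a writer might look like (illustrative only, not srdedupe's actual implementation):

```python
import json
from pathlib import Path

def write_jsonl_sketch(records: list[dict], path: str) -> None:
    """Write one JSON object per line with stable key ordering.

    Sorting keys makes the output byte-identical for identical input,
    so runs can be diffed or hashed to verify reproducibility.
    """
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True, ensure_ascii=False) + "\n")
```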

dedupe(input_path, *, output_dir="out", fpr_alpha=0.01, t_low=0.3, t_high=None) -> PipelineResult

Run the full deduplication pipeline. Returns a PipelineResult with:

  • success, total_records, total_candidates, total_duplicates_auto, total_review_pairs
  • output_files — dict mapping artifact names to file paths
  • error_message — error details if success is False

Advanced: PipelineConfig + run_pipeline

For full control (custom blockers, FS model path, audit logger):

from pathlib import Path
from srdedupe.engine import PipelineConfig, run_pipeline

config = PipelineConfig(
    fpr_alpha=0.01,
    t_low=0.3,
    t_high=None,
    candidate_blockers=["doi", "pmid", "year_title"],
    output_dir=Path("out"),
)

result = run_pipeline(input_path=Path("references.ris"), config=config)

Supported Formats

Format            Extensions
RIS               .ris
PubMed/NBIB       .nbib, .txt
BibTeX            .bib
Web of Science    .ciw
EndNote Tagged    .enw
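Because extensions like .txt are ambiguous, detection works better on content than on file names. A toy content sniffer in the spirit of the auto-detection described above, using markers common to these formats (the exact heuristics srdedupe uses are not documented here, so this is an assumption-laden sketch):

```python
def sniff_format(text: str) -> str:
    """Guess a bibliographic format from the leading file content."""
    head = text.lstrip()
    if head.startswith("%0 "):
        return "endnote"              # EndNote tagged reference type line
    if head.startswith("@"):
        return "bibtex"               # @article{...}, @book{...}
    if "PMID- " in head[:2000]:
        return "nbib"                 # PubMed/MEDLINE tag
    if head.startswith("FN "):
        return "wos"                  # Web of Science export header
    if "TY  - " in head[:2000]:
        return "ris"                  # RIS reference-type tag
    return "unknown"
```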

Pipeline Output Structure

out/
├── stage1/canonical_records.jsonl
├── stage2/candidate_pairs.jsonl
├── stage3/scored_pairs.jsonl
├── stage4/pair_decisions.jsonl
├── stage5/clusters.jsonl
└── artifacts/
    ├── deduped_auto.ris
    ├── merged_records.jsonl
    └── clusters_enriched.jsonl
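Every stage emits line-delimited JSON, so intermediate results can be inspected with a few lines of standard-library code; a sketch (the field names inside each stage's records are not assumed here):

```python
import json
from pathlib import Path

def read_jsonl(path: str) -> list[dict]:
    """Load a JSONL artifact, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                records.append(json.loads(line))
    return records

# e.g. inspect the stage 4 decisions from a previous run:
# decisions = read_jsonl("out/stage4/pair_decisions.jsonl")
```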

Development

make dev           # Install dependencies + pre-commit hooks
make test-fast     # Quick validation while coding
make check         # Lint + type check + format (before committing)
make test          # Full test suite (417 tests, ≥80% coverage)

Documentation

License

MIT — see LICENSE.

Citation

@software{srdedupe2026,
  author = {Lopes, Ennio Politi},
  title = {srdedupe: Safe Bibliographic Deduplication},
  year = {2026},
  url = {https://github.com/enniolopes/srdedupe}
}
