Incremental mirror for PubMed, PMC, FDA, and ClinicalTrials.gov

litsync — incremental PubMed + PMC + FDA + ClinicalTrials.gov mirror

A modern, daily-runnable CLI for mirroring bulk biomedical datasets. It tracks every file in a SQLite state DB so re-runs do the minimum work: already-verified immutable files are skipped with no network request beyond the directory/manifest listing.
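The skip decision described above can be sketched as a lookup against a small ledger table. This is only an illustration: the table name, columns, and the helper names `open_ledger`, `record_verified`, and `plan` are assumptions, not the actual schema or API of litsync.

```python
import sqlite3

def open_ledger(path=":memory:"):
    # Hypothetical schema; the real state.sqlite layout may differ.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " url TEXT PRIMARY KEY, status TEXT, size INTEGER, md5 TEXT)"
    )
    return db

def record_verified(db, url, size, md5):
    # Mark a file as downloaded and verified.
    db.execute(
        "INSERT OR REPLACE INTO files (url, status, size, md5) "
        "VALUES (?, 'verified', ?, ?)",
        (url, size, md5),
    )
    db.commit()

def plan(db, listing):
    """Return the URLs from a remote listing that still need downloading.

    `listing` is an iterable of (url, size) pairs taken from the
    directory or manifest listing.
    """
    todo = []
    for url, size in listing:
        row = db.execute(
            "SELECT status, size FROM files WHERE url = ?", (url,)
        ).fetchone()
        if row is not None and row[0] == "verified" and row[1] == size:
            continue  # already verified: no further request needed
        todo.append(url)
    return todo
```

With a ledger like this, a re-run only touches files that are new or whose listed size no longer matches the recorded one.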

Install

pip install -e .

Or use the Makefile:

make install
make dev

Quick start

litsync --data-root /data/literature --email you@institute.org

Common options:

litsync --data-root /data/literature --email you@institute.org \
  --sources pubmed pmc fda clinicaltrials \
  --fda-endpoints drug/event drug/label

--sources pubmed pmc fda clinicaltrials   # which corpora (default: all four)
--fda-endpoints drug/event drug/label     # default: all openFDA endpoints
--pmc-groups oa_comm oa_noncomm oa_other
--pmc-formats xml txt                     # default: xml
--workers 4                               # concurrent downloads (keep modest; be polite)
--dry-run                                 # plan only, download nothing
--reverify                                # re-download local files (integrity audit)
--prune                                   # delete local files no longer on the server
--count-articles                          # count articles in already-downloaded files (no network)
--no-rich                                 # disable Rich progress bars / tables

On-disk layout

/data/literature/
  pubmed/baseline/                    pubmed26nXXXX.xml.gz (+ .md5 verified)
  pubmed/updatefiles/                 daily citation deltas
  pmc/oa_bulk/<group>/<fmt>/          baseline + dated incremental .tar.gz
  pmc/oa_file_list.csv                PMCID <-> PMID id map
  fda/<category>/<endpoint>/          openFDA bulk snapshot zips + extracted JSON
  clinicaltrials/ctg-public-xml.zip   ClinicalTrials.gov full XML dump
  clinicaltrials/ctg-public-xml/      extracted study XML files
  _state/state.sqlite                 file ledger (status, size, mtime, md5, etag, attempts)
  _state/logs/                        dated run logs
  _state/litsync.lock                 run lock (prevents overlapping cron runs)

Cron (daily 02:30)

30 2 * * *  /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1
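The run lock listed in the on-disk layout is what keeps a slow run from colliding with the next cron invocation. One common way to get that behaviour on POSIX systems is an advisory `flock`; the sketch below assumes that approach and a hypothetical helper name `try_lock`, and may not match how litsync actually implements its lock.

```python
import fcntl

def try_lock(path):
    """Return an open file holding an exclusive lock, or None if the
    lock is already held by another run. POSIX only (uses fcntl)."""
    fh = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh  # keep this object alive for the whole run
```

Because the kernel releases the lock when the file is closed or the process exits, a crashed run does not leave a stale lock behind.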

Extract corpus to sharded JSONL

litsync-extract --data-root /data/literature --out /data/corpus \
  --sources pubmed pmc fda clinicaltrials

Or with Make:

make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
make extract-test DATA_ROOT=/data/literature
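For readers unfamiliar with sharded JSONL, the idea is one JSON object per line, split across numbered files of bounded size. The following is a minimal sketch under assumed names (`write_shards`, the `shard-NNNNN.jsonl` naming, and the default shard size are all hypothetical); the real output layout of litsync-extract may differ.

```python
import json
from pathlib import Path

def write_shards(records, out_dir, shard_size=1000, prefix="shard"):
    """Write an iterable of dicts as JSONL, starting a new file every
    `shard_size` records. Returns the list of shard paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    fh = None
    try:
        for i, rec in enumerate(records):
            if i % shard_size == 0:
                if fh is not None:
                    fh.close()
                path = out / f"{prefix}-{i // shard_size:05d}.jsonl"
                fh = path.open("w", encoding="utf-8")
                paths.append(path)
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    finally:
        if fh is not None:
            fh.close()
    return paths
```

Sharding keeps individual files small enough to process in parallel or to re-generate selectively.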

Integrity model

  • PubMed: every .xml.gz is verified against its NCBI .md5 sidecar.
  • PMC: bulk packages have no md5 sidecar, so they are verified by Content-Length and an ETag is recorded for change detection.
  • openFDA / ClinicalTrials.gov: these sources publish full snapshots. The downloader detects changed snapshots via ETag / Last-Modified / Content-Length and only re-downloads when the snapshot changes. When a snapshot changes it is extracted again next to the zip file.
  • Downloads are atomic (.part -> rename) and resumable via HTTP Range.
  • Exit code is non-zero if any file failed, so cron/monitoring can alert.
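The md5 check and the atomic `.part` rename can be combined in a few lines. The sketch below is an illustration, not litsync's code: `parse_md5_sidecar` and `save_verified` are hypothetical names, the sidecar is assumed to contain a single 32-character hex digest somewhere in its text, and HTTP Range resumption is left out for brevity.

```python
import hashlib
import os
import re
from pathlib import Path

def parse_md5_sidecar(text):
    """Extract the first 32-hex-digit token from a .md5 sidecar."""
    match = re.search(r"\b[0-9a-fA-F]{32}\b", text)
    if match is None:
        raise ValueError("no md5 digest found in sidecar")
    return match.group(0).lower()

def save_verified(chunks, dest, expected_md5):
    """Stream chunks to dest.part, verify the md5, then rename into place."""
    dest = Path(dest)
    part = dest.with_name(dest.name + ".part")
    digest = hashlib.md5()
    with part.open("wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        part.unlink()  # never leave a corrupt file behind
        raise ValueError(f"md5 mismatch for {dest.name}")
    os.replace(part, dest)  # atomic on the same filesystem
    return dest
```

Since the final name only appears after verification succeeds, readers of the data directory never see a half-written or corrupt file.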

Notes on sources

  • openFDA bulk data is zipped JSON. The manifest is fetched from https://api.fda.gov/download.json. Each endpoint partition becomes one downloaded/extracted unit.
  • ClinicalTrials.gov bulk data is the full public XML dump from https://clinicaltrials.gov/api/legacy/public-xml?format=zip. One XML file per study.
  • Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged snapshots are skipped; changed snapshots are replaced in full.
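Deciding whether a snapshot changed comes down to comparing the validators stored in the ledger with those on a fresh response. A minimal sketch, assuming the stored values are kept under lowercase header names and that `snapshot_changed` is a hypothetical helper rather than part of litsync:

```python
VALIDATORS = ("etag", "last-modified", "content-length")

def snapshot_changed(stored, headers):
    """Compare stored validators with current response headers.

    `stored` maps lowercase header names to the values recorded at the
    last successful download; `headers` is the current header mapping.
    """
    current = {k.lower(): v for k, v in headers.items()}
    comparable = [
        k for k in VALIDATORS
        if stored.get(k) is not None and current.get(k) is not None
    ]
    if not comparable:
        return True  # nothing to compare against: re-download to be safe
    return any(stored[k] != current[k] for k in comparable)
```

Treating "no comparable validator" as changed errs on the side of a redundant download instead of a stale mirror.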

Download files

Source Distribution

litsync-0.0.3.tar.gz (26.0 kB)

Uploaded Source

Built Distribution

litsync-0.0.3-py3-none-any.whl (30.1 kB)

Uploaded Python 3

File details

Details for the file litsync-0.0.3.tar.gz.

File metadata

  • Download URL: litsync-0.0.3.tar.gz
  • Size: 26.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.3.tar.gz
Algorithm    Hash digest
SHA256       8eff1c481b4008fb0d4d7fe3225563f5e9cf9b4a5ea4d0e5efa13ed97258c7fa
MD5          bfe25bf34873134f333682766ce4430a
BLAKE2b-256  549effd87ede445fabc58bc8d893bb0c84861d589b9e5621d93120a16b35c55e

Provenance

The following attestation bundles were made for litsync-0.0.3.tar.gz:

Publisher: publish.yml on Takshan/LitSync

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file litsync-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: litsync-0.0.3-py3-none-any.whl
  • Size: 30.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       b28dfc0e47c73f49ce3611989b561d0ee2f2cfca3ab8ef4bf1da6d9288ca5a51
MD5          dbdb61009a950f052c0f4a54eebbb567
BLAKE2b-256  d4c2fd39e4450866456ab13d1cca2b58df5eedbd8b40b76d276a3f0eec628905

Provenance

The following attestation bundles were made for litsync-0.0.3-py3-none-any.whl:

Publisher: publish.yml on Takshan/LitSync
