Incremental mirror for PubMed, PMC, FDA, and ClinicalTrials.gov

litsync — incremental PubMed + PMC + FDA + ClinicalTrials.gov mirror

A modern, daily-runnable CLI for mirroring bulk biomedical datasets. It tracks every file in a SQLite state DB so re-runs do the minimum work: already-verified immutable files are skipped with no network request beyond the directory/manifest listing.
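The skip decision described above can be sketched with a small SQLite ledger. This is only an illustration: the table name, column names, and the helper functions `open_ledger`, `record_verified`, and `needs_download` are assumptions, not litsync's actual schema or API.

```python
import sqlite3

# Hypothetical schema; litsync's real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path   TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    size   INTEGER,
    md5    TEXT
)
"""


def open_ledger(db_path=":memory:"):
    """Open (or create) the ledger database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record_verified(conn, path, size, md5):
    """Mark a file as downloaded and verified."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, status, size, md5) "
        "VALUES (?, 'verified', ?, ?)",
        (path, size, md5),
    )
    conn.commit()


def needs_download(conn, path, remote_size):
    """Return True unless the ledger holds a verified entry of the same size."""
    row = conn.execute(
        "SELECT status, size FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return True
    status, size = row
    return not (status == "verified" and size == remote_size)
```

With a ledger like this, a re-run only needs the remote listing (name and size) to decide which files to skip.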

Install

pip install -e .

Or use the Makefile:

make install
make dev

Quick start

litsync --data-root /data/literature --email you@institute.org

Common options:

litsync --data-root /data/literature --email you@institute.org \
  --sources pubmed pmc fda clinicaltrials \
  --fda-endpoints drug/event drug/label

--sources pubmed pmc fda clinicaltrials   # which corpora (default: all four)
--fda-endpoints drug/event drug/label     # default: all openFDA endpoints
--pmc-groups oa_comm oa_noncomm oa_other
--pmc-formats xml txt                     # default: xml
--workers 4                               # concurrent downloads (keep modest; be polite)
--dry-run                                 # plan only, download nothing
--reverify                                # re-download local files (integrity audit)
--prune                                   # delete local files no longer on the server
--count-articles                          # count articles in already-downloaded files (no network)
--no-rich                                 # disable Rich progress bars / tables

On-disk layout

/data/literature/
  pubmed/baseline/                    pubmed26nXXXX.xml.gz (+ .md5 verified)
  pubmed/updatefiles/                 daily citation deltas
  pmc/oa_bulk/<group>/<fmt>/          baseline + dated incremental .tar.gz
  pmc/oa_file_list.csv                PMCID <-> PMID id map
  fda/<category>/<endpoint>/          openFDA bulk snapshot zips + extracted JSON
  clinicaltrials/ctg-public-xml.zip   ClinicalTrials.gov full XML dump
  clinicaltrials/ctg-public-xml/      extracted study XML files
  _state/state.sqlite                 file ledger (status, size, mtime, md5, etag, attempts)
  _state/logs/                        dated run logs
  _state/litsync.lock                 run lock (prevents overlapping cron runs)

Cron (daily 02:30)

30 2 * * *  /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1
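The run lock listed in the on-disk layout is what keeps a slow run from colliding with the next cron invocation. A minimal sketch of one way such a lock could work, assuming exclusive file creation (the `run_lock` helper is hypothetical, and litsync may use a different mechanism such as `flock`):

```python
import contextlib
import os


@contextlib.contextmanager
def run_lock(lock_path):
    """Hold an exclusive lock file for the duration of a run.

    O_CREAT | O_EXCL makes creation fail if the file already exists,
    so a second process cannot acquire the lock.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another run holds {lock_path}") from None
    try:
        # Record the owner's PID to help diagnose stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

One trade-off of this approach is that a crashed process leaves the lock file behind, so a real tool would also need stale-lock handling.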

Extract corpus to sharded JSONL

litsync-extract --data-root /data/literature --out /data/corpus \
  --sources pubmed pmc fda clinicaltrials

Or with Make:

make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
make extract-test DATA_ROOT=/data/literature
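For readers unfamiliar with the output format, sharded JSONL means one JSON object per line, split across numbered files of bounded size. The sketch below shows the general idea; the function name, the shard naming pattern, and the `shard_size` default are assumptions and do not describe what `litsync-extract` actually writes.

```python
import json
from pathlib import Path


def write_sharded_jsonl(records, out_dir, shard_size=1000, prefix="shard"):
    """Write records as JSON lines, starting a new file every shard_size records."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    try:
        for i, record in enumerate(records):
            if i % shard_size == 0:
                # Close the previous shard and open the next one.
                if handle is not None:
                    handle.close()
                path = out / f"{prefix}-{len(paths):05d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Because `records` can be any iterable, a generator over parsed articles can be streamed to disk without holding the corpus in memory.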

Integrity model

  • PubMed: every .xml.gz is verified against its NCBI .md5 sidecar.
  • PMC: bulk packages have no md5 sidecar, so they are verified by Content-Length, and the ETag is recorded for change detection.
  • openFDA / ClinicalTrials.gov: these sources publish full snapshots. The downloader detects a changed snapshot via ETag / Last-Modified / Content-Length and re-downloads only when one of them changes. A changed snapshot is extracted again next to its zip file.
  • Downloads are atomic (.part -> rename) and resumable via HTTP Range.
  • Exit code is non-zero if any file failed, so cron/monitoring can alert.
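The PubMed check and the atomic rename from the list above can be combined into one step: verify the `.part` file against the sidecar, and only then move it into place. This is a sketch under assumptions, not litsync's code. The helper names are hypothetical, and the sidecar parser simply looks for the first 32-character hex string rather than relying on an exact sidecar layout.

```python
import hashlib
import os
import re


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def expected_md5(sidecar_text):
    """Extract the first 32-hex-digit token from the sidecar contents."""
    match = re.search(r"\b[0-9a-fA-F]{32}\b", sidecar_text)
    if match is None:
        raise ValueError("no MD5 digest found in sidecar")
    return match.group(0).lower()


def commit_if_valid(part_path, final_path, sidecar_text):
    """Rename part_path to final_path only if its MD5 matches the sidecar."""
    if md5_of(part_path) != expected_md5(sidecar_text):
        return False
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(part_path, final_path)
    return True
```

Keeping the `.part` file in the same directory as the final file is what makes the rename atomic, so readers never observe a half-written download.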

Notes on sources

  • openFDA bulk data is zipped JSON. The manifest is fetched from https://api.fda.gov/download.json. Each endpoint partition becomes one downloaded/extracted unit.
  • ClinicalTrials.gov bulk data is the full public XML dump from https://clinicaltrials.gov/api/legacy/public-xml?format=zip. One XML file per study.
  • Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged snapshots are skipped; changed snapshots are replaced in full.
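Skipping an unchanged snapshot comes down to comparing the response headers from a cheap request against the values stored from the last run. A minimal sketch of that comparison, assuming the stored state is a tuple of the three header values (the `fingerprint` and `snapshot_changed` helpers are hypothetical):

```python
def fingerprint(headers):
    """Reduce response headers to (etag, last_modified, content_length)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return (
        lowered.get("etag"),
        lowered.get("last-modified"),
        lowered.get("content-length"),
    )


def snapshot_changed(stored, headers):
    """Return True if the snapshot should be downloaded again.

    Only fields present both in the stored fingerprint and in the new
    headers are compared; with nothing to compare, assume a change.
    """
    if stored is None:
        return True
    current = fingerprint(headers)
    comparable = [
        (old, new)
        for old, new in zip(stored, current)
        if old is not None and new is not None
    ]
    if not comparable:
        return True
    return any(old != new for old, new in comparable)
```

Treating "nothing comparable" as a change errs on the side of re-downloading, which costs bandwidth but never leaves a stale snapshot in place.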

Source Distribution

litsync-0.0.2.tar.gz (20.9 kB)

Built Distribution

litsync-0.0.2-py3-none-any.whl (24.3 kB)

File details

Details for the file litsync-0.0.2.tar.gz.

File metadata

  • Download URL: litsync-0.0.2.tar.gz
  • Size: 20.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.2.tar.gz
Algorithm Hash digest
SHA256 7bd26e29fcb3bf0f12a083b070a1675cf5fe65f63190828273c0beb26fe61a10
MD5 e4d5ddbd1b1040787e47cc1882c3b48a
BLAKE2b-256 aa247114398f60eb74b61e0bd20ac04be9cf8de748c51fd5b6645b6a022564a2

Provenance

The following attestation bundles were made for litsync-0.0.2.tar.gz:

Publisher: publish.yml on Takshan/LitSync

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file litsync-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: litsync-0.0.2-py3-none-any.whl
  • Size: 24.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 dfe938b8a2cd65da91bcaba7dea54526573349943bacd1bef86ea39cfa63c2f2
MD5 1efb0018777e28188e35d15d889d2c9a
BLAKE2b-256 e0c1a71d9f57e49053a7d1bd715a787c4e4b78b43a200a7935a66c13dd305554

Provenance

The following attestation bundles were made for litsync-0.0.2-py3-none-any.whl:

Publisher: publish.yml on Takshan/LitSync
