Incremental mirror for PubMed, PMC, FDA, and ClinicalTrials.gov
litsync — incremental PubMed + PMC + FDA + ClinicalTrials.gov mirror
A modern, daily-runnable CLI for mirroring bulk biomedical datasets. It tracks every file in a SQLite state DB so re-runs do the minimum work: already-verified immutable files are skipped with no network request beyond the directory/manifest listing.
Install
pip install -e .
Or use the Makefile:
make install
make dev
Quick start
litsync --data-root /data/literature --email you@institute.org
Common options:
litsync --data-root /data/literature --email you@institute.org \
    --sources pubmed pmc fda clinicaltrials \
    --fda-endpoints drug/event drug/label

--sources pubmed pmc fda clinicaltrials    # which corpora (default: all four)
--fda-endpoints drug/event drug/label      # default: all openFDA endpoints
--pmc-groups oa_comm oa_noncomm oa_other
--pmc-formats xml txt                      # default: xml
--workers 4                                # concurrent downloads (keep modest; be polite)
--dry-run                                  # plan only, download nothing
--reverify                                 # re-download local files (integrity audit)
--prune                                    # delete local files no longer on the server
--count-articles                           # count articles in already-downloaded files (no network)
--no-rich                                  # disable Rich progress bars / tables
On-disk layout
/data/literature/
  pubmed/baseline/                     pubmed26nXXXX.xml.gz (+ .md5 verified)
  pubmed/updatefiles/                  daily citation deltas
  pmc/oa_bulk/<group>/<fmt>/           baseline + dated incremental .tar.gz
  pmc/oa_file_list.csv                 PMCID <-> PMID id map
  fda/<category>/<endpoint>/           openFDA bulk snapshot zips + extracted JSON
  clinicaltrials/ctg-public-xml.zip    ClinicalTrials.gov full XML dump
  clinicaltrials/ctg-public-xml/       extracted study XML files
  _state/state.sqlite                  file ledger (status, size, mtime, md5, etag, attempts)
  _state/logs/                         dated run logs
  _state/litsync.lock                  run lock (prevents overlapping cron runs)
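To make the role of the file ledger concrete, here is a minimal sketch of how such a table might be used to skip files that are already verified. It is not litsync's actual schema: the table name `files`, the helpers `open_ledger`, `record_verified`, and `can_skip`, and the status value `'verified'` are assumptions made for illustration. Only the column names come from the layout listing.

```python
import sqlite3

# Hypothetical ledger table; column names follow the layout listing,
# everything else (table name, types, status values) is assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path     TEXT PRIMARY KEY,
    status   TEXT NOT NULL,
    size     INTEGER,
    mtime    REAL,
    md5      TEXT,
    etag     TEXT,
    attempts INTEGER NOT NULL DEFAULT 0
)
"""


def open_ledger(db_path):
    """Open (or create) the ledger database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record_verified(conn, path, size, md5=None, etag=None):
    """Mark a file as downloaded and verified, resetting its attempt count."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, status, size, md5, etag, attempts) "
        "VALUES (?, 'verified', ?, ?, ?, 0)",
        (path, size, md5, etag),
    )
    conn.commit()


def can_skip(conn, path, remote_size):
    """Return True when the ledger already holds a verified copy of this size."""
    row = conn.execute(
        "SELECT status, size FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row is not None and row[0] == "verified" and row[1] == remote_size
```

With a check like `can_skip`, a re-run only needs the remote listing (name and size) to decide that an immutable file requires no further request.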
Cron (daily 02:30)
30 2 * * * /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1
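A run lock is what keeps a slow run from colliding with the next cron invocation. The sketch below shows one common way to build such a lock on POSIX systems, using an advisory `flock` on the lock file. The `RunLock` class is hypothetical, and whether litsync uses `flock` or another mechanism is an assumption.

```python
import fcntl
import os


class RunLock:
    """Hypothetical advisory lock on a file such as _state/litsync.lock (POSIX only)."""

    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self):
        """Try to take the lock without blocking; return False if another run holds it."""
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            return False
        self.fd = fd
        return True

    def release(self):
        """Drop the lock and close the file descriptor."""
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None
```

An advisory lock of this kind is released by the kernel when the process exits, so a crashed run does not leave a stale lock that blocks later cron runs.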
Extract corpus to sharded JSONL
litsync-extract --data-root /data/literature --out /data/corpus \
    --sources pubmed pmc fda clinicaltrials
Or with Make:
make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
make extract-test DATA_ROOT=/data/literature
Integrity model
- PubMed: every .xml.gz is verified against its NCBI .md5 sidecar.
- PMC: bulk packages have no md5 sidecar, so they are verified by Content-Length, and an ETag is recorded for change detection.
- openFDA / ClinicalTrials.gov: these sources publish full snapshots. The downloader detects changed snapshots via ETag / Last-Modified / Content-Length and only re-downloads when the snapshot changes. When a snapshot changes it is extracted again next to the zip file.
- Downloads are atomic (.part -> rename) and resumable via HTTP Range.
- Exit code is non-zero if any file failed, so cron/monitoring can alert.
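The PubMed check and the atomic rename can be combined into one step: hash the `.part` file, compare it with the sidecar, and only then move it into place. The following is a minimal sketch under assumptions, not litsync's code. The function names are hypothetical, and the sidecar parser simply looks for the first 32-character hex token rather than relying on an exact sidecar format.

```python
import hashlib
import os
import re


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_sidecar(text):
    """Extract the first 32-hex-digit token from sidecar text (assumed format-agnostic)."""
    match = re.search(r"\b[0-9a-fA-F]{32}\b", text)
    if match is None:
        raise ValueError("no MD5 digest found in sidecar")
    return match.group(0).lower()


def finalize_download(part_path, final_path, expected_md5):
    """Promote a .part file to its final name only if its MD5 matches.

    Returns True on success. On mismatch the .part file is left in place
    so the caller can decide whether to retry or discard it.
    """
    if md5_of(part_path) != expected_md5.lower():
        return False
    os.replace(part_path, final_path)  # atomic when both paths share a filesystem
    return True
```

Because the final name only ever appears after verification, a reader of the mirror never sees a half-written or corrupt file under that name.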
Notes on sources
- openFDA bulk data is zipped JSON. The manifest is fetched from https://api.fda.gov/download.json. Each endpoint partition becomes one downloaded/extracted unit.
- ClinicalTrials.gov bulk data is the full public XML dump from https://clinicaltrials.gov/api/legacy/public-xml?format=zip. One XML file per study.
- Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged snapshots are skipped; changed snapshots are replaced in full.
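Skipping an unchanged snapshot comes down to comparing the validators recorded on the last run with the headers the server reports now. The sketch below illustrates that comparison as a pure function. The name `snapshot_changed`, the dictionary shapes, and the rule of re-downloading when nothing can be compared are assumptions, not litsync's documented behavior.

```python
# Header names compared, in lower case; taken from the integrity model.
VALIDATORS = ("etag", "last-modified", "content-length")


def snapshot_changed(stored, headers):
    """Return True when the remote snapshot appears to differ from the stored one.

    `stored` maps lower-case validator names to the values recorded on the
    previous run; `headers` is the header mapping from a fresh HEAD response.
    If no validator is available on both sides, report a change so the
    snapshot is fetched again rather than silently trusted.
    """
    remote = {key.lower(): value for key, value in headers.items()}
    compared = False
    for key in VALIDATORS:
        old, new = stored.get(key), remote.get(key)
        if old is None or new is None:
            continue
        compared = True
        if str(old) != str(new):
            return True
    return not compared
```

For example, a stored record of `{"etag": '"abc"'}` compared against headers carrying the same `ETag` reports no change, so the daily run costs one lightweight request per snapshot.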