Incremental mirror for PubMed, PMC, FDA, and ClinicalTrials.gov

litsync — incremental PubMed + PMC + FDA + ClinicalTrials.gov mirror

A command-line tool for mirroring bulk biomedical datasets, designed to be run daily. It records every file in a SQLite state database so that re-runs do minimal work: immutable files that have already been verified are skipped, and the only network requests made for them are the directory or manifest listings.
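
The skip decision can be pictured with a small ledger lookup. This is a minimal sketch, not litsync's actual schema or code: the `files` table, its columns, and the `needs_download` and `mark_verified` helpers are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical ledger schema; the real state DB may use different columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path   TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    size   INTEGER,
    md5    TEXT
)
"""


def needs_download(conn, path, remote_size):
    """Return True unless the ledger holds a verified entry of the same size."""
    row = conn.execute(
        "SELECT status, size FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return True  # never seen before
    status, size = row
    return not (status == "verified" and size == remote_size)


def mark_verified(conn, path, size, md5):
    """Record a file as downloaded and verified."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, status, size, md5) "
        "VALUES (?, 'verified', ?, ?)",
        (path, size, md5),
    )
    conn.commit()
```

With a ledger like this, a daily run only needs the remote listing (names and sizes) to decide which files to fetch.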

Install

From a source checkout:

pip install -e .

Or use the Makefile:

make install
make dev

Quick start

litsync --data-root /data/literature --email you@institute.org

Common options:

litsync --data-root /data/literature --email you@institute.org \
  --sources pubmed pmc fda clinicaltrials \
  --fda-endpoints drug/event drug/label

--sources pubmed pmc fda clinicaltrials   # which corpora (default: all four)
--fda-endpoints drug/event drug/label     # default: all openFDA endpoints
--pmc-groups oa_comm oa_noncomm oa_other
--pmc-formats xml txt                     # default: xml
--workers 4                               # concurrent downloads (keep modest; be polite)
--dry-run                                 # plan only, download nothing
--reverify                                # re-download local files (integrity audit)
--prune                                   # delete local files no longer on the server
--count-articles                          # count articles in already-downloaded files (no network)
--no-rich                                 # disable Rich progress bars / tables

On-disk layout

/data/literature/
  pubmed/baseline/                    pubmed26nXXXX.xml.gz (+ .md5 verified)
  pubmed/updatefiles/                 daily citation deltas
  pmc/oa_bulk/<group>/<fmt>/          baseline + dated incremental .tar.gz
  pmc/oa_file_list.csv                PMCID <-> PMID id map
  fda/<category>/<endpoint>/          openFDA bulk snapshot zips + extracted JSON
  clinicaltrials/ctg-public-xml.zip   ClinicalTrials.gov full XML dump
  clinicaltrials/ctg-public-xml/      extracted study XML files
  _state/state.sqlite                 file ledger (status, size, mtime, md5, etag, attempts)
  _state/logs/                        dated run logs
  _state/litsync.lock                 run lock (prevents overlapping cron runs)
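
One common way to implement a run lock like `_state/litsync.lock` is exclusive file creation. The sketch below illustrates that general technique and is an assumption, not a description of how litsync does it; `run_lock` and `AlreadyRunning` are hypothetical names.

```python
import os
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another run already holds the lock file."""


@contextmanager
def run_lock(lock_path):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(lock_path) from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A lock of this kind is left behind if the process is killed, so a real tool would also need some way to detect and clear a stale lock.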

Cron (daily 02:30)

30 2 * * *  /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1

Extract corpus to sharded JSONL

litsync-extract --data-root /data/literature --out /data/corpus \
  --sources pubmed pmc fda clinicaltrials

Or with Make:

make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
make extract-test DATA_ROOT=/data/literature

Integrity model

PubMed: every .xml.gz is verified against its NCBI .md5 sidecar.
PMC: bulk packages have no md5 sidecar, so their size is checked against Content-Length, and the ETag is recorded for change detection.
openFDA / ClinicalTrials.gov: these sources publish full snapshots. The downloader detects a changed snapshot via ETag / Last-Modified / Content-Length and re-downloads only when it has changed. A changed snapshot is then re-extracted next to its zip file.
Downloads are atomic (.part -> rename) and resumable via HTTP Range.
Exit code is non-zero if any file failed, so cron/monitoring can alert.
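
The PubMed check and the atomic rename described above can be combined roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the sidecar is parsed by looking for any 32-character hex digest rather than relying on one exact layout.

```python
import hashlib
import os
import re


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def expected_md5(sidecar_text):
    """Extract the first 32-hex-digit digest from the sidecar's text."""
    m = re.search(r"\b[0-9a-fA-F]{32}\b", sidecar_text)
    if m is None:
        raise ValueError("no MD5 digest found in sidecar")
    return m.group(0).lower()


def commit_if_valid(part_path, final_path, sidecar_text):
    """Rename the .part file into place only if its digest matches."""
    if md5_of(part_path) != expected_md5(sidecar_text):
        return False  # leave the .part file for a retry or inspection
    os.replace(part_path, final_path)  # atomic when both are on one filesystem
    return True
```

Because the final name only appears after verification, a reader of the mirror never sees a partially written or corrupt file under its real name.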

Notes on sources

openFDA bulk data is zipped JSON. The manifest is fetched from https://api.fda.gov/download.json. Each endpoint partition becomes one downloaded/extracted unit.
ClinicalTrials.gov bulk data is the full public XML dump from https://clinicaltrials.gov/api/legacy/public-xml?format=zip. One XML file per study.
Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged snapshots are skipped; changed snapshots are replaced in full.
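
Snapshot change detection of this kind can be sketched as a comparison of stored and freshly fetched HTTP validators. The dictionary keys and the `snapshot_changed` function are hypothetical, and the fallback of treating "nothing comparable" as changed is a conservative assumption rather than documented litsync behaviour.

```python
# Validators derived from the ETag, Last-Modified and Content-Length headers.
VALIDATORS = ("etag", "last_modified", "content_length")


def snapshot_changed(stored, remote):
    """Decide whether a snapshot must be re-downloaded.

    `stored` is the validator dict saved at the last download (or None),
    `remote` is the dict built from the current response headers.
    """
    if stored is None:
        return True  # never downloaded
    comparable = [
        k for k in VALIDATORS
        if stored.get(k) is not None and remote.get(k) is not None
    ]
    if not comparable:
        return True  # nothing to compare, so re-download to be safe
    return any(stored[k] != remote[k] for k in comparable)
```

Only validators present on both sides are compared, so a server that stops sending one header does not by itself force a full re-download.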

Project details

Maintainers

Takshan

Release history

0.0.3 (this version): Jun 21, 2026
0.0.2: Jun 12, 2026

Download files

Source Distribution

litsync-0.0.3.tar.gz (26.0 kB)

Uploaded Jun 21, 2026 Source

Built Distribution

litsync-0.0.3-py3-none-any.whl (30.1 kB)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file litsync-0.0.3.tar.gz.

File metadata

Download URL: litsync-0.0.3.tar.gz
Upload date: Jun 21, 2026
Size: 26.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`8eff1c481b4008fb0d4d7fe3225563f5e9cf9b4a5ea4d0e5efa13ed97258c7fa`
MD5	`bfe25bf34873134f333682766ce4430a`
BLAKE2b-256	`549effd87ede445fabc58bc8d893bb0c84861d589b9e5621d93120a16b35c55e`

Provenance

The following attestation bundles were made for litsync-0.0.3.tar.gz:

Publisher: publish.yml on Takshan/LitSync

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litsync-0.0.3.tar.gz
- Subject digest: 8eff1c481b4008fb0d4d7fe3225563f5e9cf9b4a5ea4d0e5efa13ed97258c7fa
- Sigstore transparency entry: 1895034774
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: Takshan/LitSync@be7c63f61366d13a37556945491f82aeb6cf511e
- Branch / Tag: refs/tags/v0.0.3
- Owner: https://github.com/Takshan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@be7c63f61366d13a37556945491f82aeb6cf511e
- Trigger Event: push

File details

Details for the file litsync-0.0.3-py3-none-any.whl.

File metadata

Download URL: litsync-0.0.3-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 30.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litsync-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b28dfc0e47c73f49ce3611989b561d0ee2f2cfca3ab8ef4bf1da6d9288ca5a51`
MD5	`dbdb61009a950f052c0f4a54eebbb567`
BLAKE2b-256	`d4c2fd39e4450866456ab13d1cca2b58df5eedbd8b40b76d276a3f0eec628905`

Provenance

The following attestation bundles were made for litsync-0.0.3-py3-none-any.whl:

Publisher: publish.yml on Takshan/LitSync

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litsync-0.0.3-py3-none-any.whl
- Subject digest: b28dfc0e47c73f49ce3611989b561d0ee2f2cfca3ab8ef4bf1da6d9288ca5a51
- Sigstore transparency entry: 1895034927
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: Takshan/LitSync@be7c63f61366d13a37556945491f82aeb6cf511e
- Branch / Tag: refs/tags/v0.0.3
- Owner: https://github.com/Takshan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@be7c63f61366d13a37556945491f82aeb6cf511e
- Trigger Event: push
