Bulk, parallel, resumable harvester for the Europe PMC corpus

Project description

europepmc-bulk

europepmc-bulk complements the existing pyeuropepmc package. pyeuropepmc is well suited to ad-hoc search and per-article analysis; europepmc-bulk is built for harvesting the entire corpus of roughly 40M articles, using cursor pagination, atomic file writes, persistent resume state, and threaded parallelism.
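
To make the combination of cursor pagination, atomic writes, and resume state concrete, here is a minimal standard-library sketch. It is not europepmc-bulk's implementation: the names atomic_write_text, harvest, and fetch_page are hypothetical, and the sketch assumes each page is a dict with the resultList.result and nextCursorMark fields found in Europe PMC search responses. The fetch function is injected so the loop can be exercised without network access.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the target directory, then rename over `path`.

    A crash mid-write leaves either the old file or no file, never a partial one.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def harvest(fetch_page: Callable[[str], dict], out_dir: Path) -> int:
    """Page through results, saving the cursor after every page.

    `fetch_page(cursor)` is a hypothetical callable returning one decoded page.
    Returns the number of page files written during this call.
    """
    state_path = out_dir / "state.json"
    if state_path.exists():
        # Resume: continue from the last cursor that was persisted.
        state = json.loads(state_path.read_text(encoding="utf-8"))
    else:
        state = {"cursor": "*", "page": 0, "done": False}

    written = 0
    while not state["done"]:
        payload = fetch_page(state["cursor"])
        records = payload["resultList"]["result"]
        next_cursor = payload.get("nextCursorMark")
        if records:
            page_file = out_dir / f"page-{state['page']:06d}.json"
            atomic_write_text(page_file, json.dumps(records))
            written += 1
            state["page"] += 1
        # Stop on an empty page or when the cursor no longer advances.
        if not records or not next_cursor or next_cursor == state["cursor"]:
            state["done"] = True
        else:
            state["cursor"] = next_cursor
        atomic_write_text(state_path, json.dumps(state))
    return written
```

Because the state file is only updated after the page file has been renamed into place, an interruption between the two steps causes that page to be fetched and rewritten under the same file name on the next run, rather than being skipped.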

Features

  • REST search with cursor-mark pagination
  • Bulk FTP/HTTPS downloads of full-text archives, text-mined CSVs, ID mappings
  • Annotations API batch collection
  • OAI-PMH incremental updates
  • JATS XML parsing
  • Atomic file writes for crash safety
  • Persistent resume state (interrupt and resume any harvest)
  • Token-bucket rate limiter (default 10 req/s, configurable)
  • Threaded parallel harvest with shared rate limiter
  • Optional async HTTP client (pip install "europepmc-bulk[async]")
  • Click CLI mirror of the Python API
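
The rate-limiting and threading features listed above can be illustrated with a small standard-library sketch of a token bucket shared by a thread pool. This is only an illustration of the technique, not the package's own classes: TokenBucket and harvest_years are hypothetical names, and the work callable stands in for whatever request a worker would make.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


class TokenBucket:
    """Thread-safe token bucket: refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: Optional[float] = None) -> None:
        if rate <= 0:
            raise ValueError("rate must be positive")
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self._tokens = self.capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                self._tokens = min(
                    self.capacity, self._tokens + (now - self._last) * self.rate
                )
                self._last = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                wait = (1 - self._tokens) / self.rate
            # Sleep outside the lock so other threads are not blocked.
            time.sleep(wait)


def harvest_years(
    years: Iterable[int],
    work: Callable[[int], object],
    limiter: TokenBucket,
    max_workers: int = 4,
) -> List[object]:
    """Run `work(year)` in a thread pool; every call first takes a token."""

    def task(year: int) -> object:
        limiter.acquire()
        return work(year)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, years))
```

A limiter created as TokenBucket(rate=10) would mirror the 10 req/s default mentioned in the list. Sharing one limiter instance across all workers keeps the combined request rate bounded no matter how many threads are running.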

Install

pip install europepmc-bulk
# or with async client
pip install "europepmc-bulk[async]"

Quick start

from europepmc_bulk import Config, AbstractHarvester

config = Config(base_dir="./epmc-data")
harvester = AbstractHarvester(config)
harvester.harvest_year(2024, output_format="json")

CLI equivalent:

europepmc-bulk harvest-abstracts --start-year 2024 --end-year 2024 --format json

See docs for full usage.

License

MIT — see LICENSE.

Citing Europe PMC

If you use this package to collect data from Europe PMC, please cite:

The Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Research, 2014.

Download files

Source Distribution

europepmc_bulk-0.1.0.tar.gz (22.5 kB)

Uploaded Source

Built Distribution

europepmc_bulk-0.1.0-py3-none-any.whl (21.3 kB)

Uploaded Python 3

File details

Details for the file europepmc_bulk-0.1.0.tar.gz.

File metadata

  • Download URL: europepmc_bulk-0.1.0.tar.gz
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for europepmc_bulk-0.1.0.tar.gz
Algorithm Hash digest
SHA256 698ea3e40dc6294d8a15b04494d42f17ec653ad561b2a3d5389dff50e93490e1
MD5 14922a8e967ce079a8000f8c8c8875dd
BLAKE2b-256 8bf8f9efcb46b04935095f84597bdc0b3ce070e94aa8e8c36778470e3827c93c

Provenance

The following attestation bundles were made for europepmc_bulk-0.1.0.tar.gz:

Publisher: publish.yml on Tianyi-Billy-Ma/europepmc-bulk

File details

Details for the file europepmc_bulk-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: europepmc_bulk-0.1.0-py3-none-any.whl
  • Size: 21.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for europepmc_bulk-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cb2fd7f3c8487d8816d5ac3a0d1b254e2fdbbbc6cd9c1098d5c79b07237cebf2
MD5 5246b53f2a0cbce5c99a8aa0de82ffdf
BLAKE2b-256 14258a4bd7bcc69a719fcec0c059b89a27a31dfa6bfa7cc95700aa38284e44a8

Provenance

The following attestation bundles were made for europepmc_bulk-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Tianyi-Billy-Ma/europepmc-bulk
