Bulk, parallel, resumable harvester for the Europe PMC corpus

europepmc-bulk

europepmc-bulk complements the existing pyeuropepmc package: pyeuropepmc is well suited to ad-hoc search and per-article analysis, while europepmc-bulk is built for harvesting the entire corpus of roughly 40M articles with cursor pagination, atomic file writes, resume state, and threaded parallelism.
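
To make the combination of cursor pagination, atomic writes, and resume state concrete, here is a minimal sketch of the general technique. It is not the package's implementation: `atomic_write_text`, `harvest`, the `fetch_page` callable, the `state.json` file name, and the `results` / `next_cursor` keys are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def harvest(
    fetch_page: Callable[[str], dict],  # hypothetical: cursor -> one page of results
    out_dir: Path,
    max_pages: Optional[int] = None,
) -> int:
    """Follow cursors page by page, persisting progress after every page."""
    state_path = out_dir / "state.json"
    state = {"cursor": "*", "page": 0, "done": False}
    if state_path.exists():
        # Resume from the last cursor that was saved before an interruption.
        state = json.loads(state_path.read_text(encoding="utf-8"))

    written = 0
    while not state["done"] and (max_pages is None or written < max_pages):
        payload = fetch_page(state["cursor"])
        results = payload.get("results", [])
        next_cursor = payload.get("next_cursor")
        if results:
            page_file = out_dir / f"page-{state['page']:06d}.json"
            atomic_write_text(page_file, json.dumps(results))
            written += 1
            state["page"] += 1
        if not results or not next_cursor or next_cursor == state["cursor"]:
            state["done"] = True
        else:
            state["cursor"] = next_cursor
        # State is saved only after the page file is safely on disk.
        atomic_write_text(state_path, json.dumps(state))
    return written
```

Because the state file is updated only after the page file has been renamed into place, a crash between the two writes causes the same page to be fetched again and written under the same file name on the next run, rather than being skipped.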

Features

  • REST search with cursor-mark pagination
  • Bulk FTP/HTTPS downloads of full-text archives, text-mined CSVs, ID mappings
  • Annotations API batch collection
  • OAI-PMH incremental updates
  • JATS XML parsing
  • Atomic file writes for crash safety
  • Persistent resume state (interrupt and resume any harvest)
  • Token-bucket rate limiter (default 10 req/s, configurable)
  • Threaded parallel harvest with shared rate limiter
  • Optional async HTTP client (pip install "europepmc-bulk[async]")
  • Click CLI mirror of the Python API
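
The rate-limiter and threading bullets above describe a common pattern: several worker threads draw from one token bucket so that their combined request rate stays under a limit. The sketch below illustrates that pattern under stated assumptions. `TokenBucket`, `fetch`, and `run_parallel` are hypothetical names, not the package's API, and `fetch` stands in for a real HTTP request; only the 10 req/s default comes from the feature list.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TokenBucket:
    """Thread-safe token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                refill = (now - self._last) * self.rate
                self._tokens = min(self.capacity, self._tokens + refill)
                self._last = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                wait = (1 - self._tokens) / self.rate
            # Sleep outside the lock so other threads are not blocked.
            time.sleep(wait)


def fetch(bucket: TokenBucket, item: int) -> int:
    """Placeholder for one rate-limited request (hypothetical)."""
    bucket.acquire()
    return item * 2


def run_parallel(items, rate: float = 10.0, workers: int = 4):
    """Process items on a thread pool; all workers share a single bucket."""
    bucket = TokenBucket(rate=rate, capacity=rate)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: fetch(bucket, item), items))
```

Sharing one bucket across the pool means that adding workers increases concurrency without raising the overall request rate.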

Install

pip install europepmc-bulk
# or with async client
pip install "europepmc-bulk[async]"

Quick start

from europepmc_bulk import Config, AbstractHarvester

config = Config(base_dir="./epmc-data")
harvester = AbstractHarvester(config)
harvester.harvest_year(2024, output_format="json")

# CLI equivalent
europepmc-bulk harvest-abstracts --start-year 2024 --end-year 2024 --format json

See docs for full usage.

License

MIT — see LICENSE.

Citing Europe PMC

If you use this package to collect data from Europe PMC, please cite:

The Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Research, 2014.

Download files

Source Distribution

europepmc_bulk-0.1.1.tar.gz (22.6 kB)

Uploaded Source

Built Distribution

europepmc_bulk-0.1.1-py3-none-any.whl (20.6 kB)

Uploaded Python 3

File details

Details for the file europepmc_bulk-0.1.1.tar.gz.

File metadata

  • Download URL: europepmc_bulk-0.1.1.tar.gz
  • Size: 22.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for europepmc_bulk-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       09ff78f02d90e339f8b1f6c2f39bfffb80f71c6f945174ff583563cda9341c8a
MD5          b6a44a67ebaab8825764cab89b8405c5
BLAKE2b-256  6c09403df8e2ddf20bfbf61d6a56c0a6b4e26fb56d017ccc33b041d3d7ee84b6

Provenance

The following attestation bundles were made for europepmc_bulk-0.1.1.tar.gz:

Publisher: publish.yml on Tianyi-Billy-Ma/europepmc-bulk

File details

Details for the file europepmc_bulk-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: europepmc_bulk-0.1.1-py3-none-any.whl
  • Size: 20.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for europepmc_bulk-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       66ef431618f6f51f4f62385c56a7abb19a62c412ff83a0443918f22c7433a5bc
MD5          4e5c94fce66af1392dfb203e6e46d52d
BLAKE2b-256  fbf2f4f6a443327629f188c7bc7097a1982bfa08404b809804fdc674d92f2c8a

Provenance

The following attestation bundles were made for europepmc_bulk-0.1.1-py3-none-any.whl:

Publisher: publish.yml on Tianyi-Billy-Ma/europepmc-bulk
