Bulk, parallel, resumable harvester for the Europe PMC corpus
europepmc-bulk
europepmc-bulk complements the existing pyeuropepmc package. pyeuropepmc is great for ad-hoc search and per-article analysis; europepmc-bulk is built for harvesting the entire 40M-article corpus, with cursor pagination, atomic file writes, resume state, and threaded parallelism.
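To make the combination of cursor pagination, atomic writes, and resume state concrete, here is a minimal standalone sketch using only the standard library. It is not europepmc-bulk's actual implementation: `harvest`, `atomic_write_json`, `fetch_page`, and the `state.json` layout are hypothetical names chosen for this example. It assumes a caller-supplied `fetch_page(cursor)` that returns a batch of results plus the next cursor (in the style of the `cursorMark` / `nextCursorMark` parameters of the Europe PMC REST search), and it assumes the harvest is finished when a page is empty or the cursor stops advancing.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: cursor in, (results, next_cursor) out.
FetchPage = Callable[[str], Tuple[List, Optional[str]]]


def atomic_write_json(path: Path, payload) -> None:
    """Write JSON to a temp file in the target directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def harvest(fetch_page: FetchPage, out_dir: Path, max_pages: Optional[int] = None) -> int:
    """Fetch pages from the saved cursor onward; return the number of pages written."""
    state_path = out_dir / "state.json"
    state = {"cursor": "*", "page": 0, "done": False}  # "*" marks the first page
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))

    written = 0
    while not state["done"] and (max_pages is None or written < max_pages):
        results, next_cursor = fetch_page(state["cursor"])
        if results:
            # The page is written before the state, so a crash between the two
            # only causes this page to be fetched and overwritten again.
            atomic_write_json(out_dir / f"page-{state['page']:06d}.json", results)
            written += 1
        done = not results or next_cursor is None or next_cursor == state["cursor"]
        state = {
            "cursor": state["cursor"] if done else next_cursor,
            "page": state["page"] + (1 if results else 0),
            "done": done,
        }
        atomic_write_json(state_path, state)
    return written
```

Calling `harvest` again with the same `out_dir` after an interruption picks up from the cursor stored in `state.json` rather than starting over from `"*"`.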
Features
- REST search with cursor-mark pagination
- Bulk FTP/HTTPS downloads of full-text archives, text-mined CSVs, ID mappings
- Annotations API batch collection
- OAI-PMH incremental updates
- JATS XML parsing
- Atomic file writes for crash safety
- Persistent resume state (interrupt and resume any harvest)
- Token-bucket rate limiter (default 10 req/s, configurable)
- Threaded parallel harvest with shared rate limiter
- Optional async HTTP client (pip install "europepmc-bulk[async]")
- Click CLI mirror of the Python API
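The token-bucket limiter and the shared-limiter threading model listed above can be illustrated with a short, generic sketch. This is not the package's own code: `TokenBucket`, `harvest_years`, and the `fetch` callable are hypothetical names for this example, and the default of 10 tokens per second simply mirrors the default rate quoted in the feature list. Every worker thread calls `acquire()` on the same limiter before doing its work, so the combined request rate stays bounded regardless of the number of threads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TokenBucket:
    """Thread-safe token bucket: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate=10.0, capacity=None):
        if rate <= 0:
            raise ValueError("rate must be positive")
        self.rate = float(rate)
        self.capacity = float(rate if capacity is None else capacity)
        self._tokens = self.capacity
        self._updated = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                refill = (now - self._updated) * self.rate
                self._tokens = min(self.capacity, self._tokens + refill)
                self._updated = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                wait = (1.0 - self._tokens) / self.rate
            # Sleep outside the lock so other threads are not blocked meanwhile.
            time.sleep(wait)


def harvest_years(years, limiter, fetch, workers=4):
    """Run fetch(year) in a thread pool, taking one token per call."""

    def task(year):
        limiter.acquire()
        return year, fetch(year)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, years))
```

With `capacity` left at its default, the bucket starts full and allows a burst of up to `rate` calls before throttling; passing `capacity=1` removes the burst and spaces calls evenly.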
Install
pip install europepmc-bulk
# or with async client
pip install "europepmc-bulk[async]"
Quick start
from europepmc_bulk import Config, AbstractHarvester
config = Config(base_dir="./epmc-data")
harvester = AbstractHarvester(config)
harvester.harvest_year(2024, output_format="json")
# CLI equivalent
europepmc-bulk harvest-abstracts --start-year 2024 --end-year 2024 --format json
See docs for full usage.
License
MIT — see LICENSE.
Citing Europe PMC
If you use this package to collect data from Europe PMC, please cite:
The Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Research, 2014.