# pub-lake
Aggregate publication metadata from bioRxiv, OpenAlex, and more.
- PyPI package: https://pypi.org/project/pub-lake/
- Free software: MIT License
## Features
- bioRxiv preprints: fetch metadata for preprints from the bioRxiv API and enrich it with OpenAlex topics.
- Checkpointed ingestion: resume interrupted data ingestions without duplicating data.
## Installation

With uv (recommended):

```shell
uv add pub-lake
```

With pip:

```shell
pip install pub-lake
```
## Usage

### Command-Line Interface

```shell
# ingest preprints from the given dates into the database
uv run python -m pub_lake preprints fetch --start "2025-01-02" --end "2025-01-04" --polite "user@example.com"

# list preprints available in the database
uv run python -m pub_lake preprints list [--start "2025-01-02"] [--end "2025-01-04"]
```
### Python API

```python
from datetime import date

from pub_lake import config
from pub_lake.elt.pipeline import ingest_preprints
from pub_lake.interface.preprints import get_preprints
from pub_lake.models.preprints import DateInterval

# ingest preprints from these dates into the database
config.POLITE_EMAIL = "user@example.com"
interval = DateInterval(start=date(2025, 1, 2), end=date(2025, 1, 4))
ingest_preprints(interval)

# list preprints available in the database
preprints = get_preprints(interval)
print(preprints.df.to_string())
```
## How it works
The package follows an ELT (Extract, Load, Transform) architecture and stores data in a relational database (SQLite by default). Key steps:
- Extract: Fetch raw metadata from bioRxiv and OpenAlex APIs.
- Load: Store the raw metadata in the database.
- Transform: Clean, normalize, and aggregate the data.
The transformed data can then be queried to return a unified view of publication metadata.
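The three stages can be sketched end to end. The functions and the `raw` table below are illustrative stand-ins, not pub-lake's internals: a real `extract` would call the bioRxiv and OpenAlex APIs, and the canned records exist only to make the sketch runnable.

```python
import json
import sqlite3


def extract(source: str) -> list[dict]:
    # Extract: in the real pipeline this calls an external API;
    # here we return a canned record for illustration.
    return [{"doi": "10.1101/0001", "title": " A Preprint ", "source": source}]


def load(conn: sqlite3.Connection, records: list[dict]) -> None:
    # Load: store the raw payloads verbatim, without cleaning.
    conn.execute("CREATE TABLE IF NOT EXISTS raw (payload TEXT)")
    conn.executemany("INSERT INTO raw VALUES (?)",
                     [(json.dumps(r),) for r in records])


def transform(conn: sqlite3.Connection) -> list[tuple]:
    # Transform: clean (trim titles) and deduplicate into a unified view.
    rows = [json.loads(p) for (p,) in conn.execute("SELECT payload FROM raw")]
    return sorted({(r["doi"], r["title"].strip()) for r in rows})


conn = sqlite3.connect(":memory:")
load(conn, extract("biorxiv"))
load(conn, extract("openalex"))
print(transform(conn))
```

Note that because loading happens before any cleaning, the raw payloads stay available for re-transformation if the cleaning logic later changes.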
### Ingestion Pipeline
Data ingestion uses the medallion architecture with Bronze, Silver, and Gold layers.
- Raw preprint data is fetched from external sources (bioRxiv, OpenAlex) and loaded as-is into separate bronze-layer tables for each source.
- Bronze preprints from each source are cleaned and deduplicated into a single silver-layer table with one row per source per preprint.
- Silver preprints from each source are aggregated into a gold-layer table with one row per preprint, combining metadata from all sources.
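The layer-to-layer flow above can be sketched with SQL over an in-memory SQLite database. The table and column names are simplified stand-ins for the package's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Bronze: raw rows, one table per source, loaded as-is (duplicates allowed).
CREATE TABLE bronze_biorxiv (doi TEXT, title TEXT);
CREATE TABLE bronze_openalex (doi TEXT, topic TEXT);
INSERT INTO bronze_biorxiv VALUES ('10.1101/0001', 'A Preprint'),
                                  ('10.1101/0001', 'A Preprint');  -- duplicate
INSERT INTO bronze_openalex VALUES ('10.1101/0001', 'Genomics');

-- Silver: cleaned and deduplicated, one row per source per preprint.
CREATE TABLE silver_preprints AS
    SELECT DISTINCT doi, 'biorxiv' AS source, title, NULL AS topic
    FROM bronze_biorxiv
    UNION
    SELECT DISTINCT doi, 'openalex', NULL, topic
    FROM bronze_openalex;

-- Gold: one row per preprint, combining metadata from all sources.
CREATE TABLE gold_preprints AS
    SELECT doi, MAX(title) AS title, MAX(topic) AS topic
    FROM silver_preprints
    GROUP BY doi;
""")
print(conn.execute("SELECT * FROM gold_preprints").fetchall())
```

The `MAX` aggregate is just one simple way to pick a non-NULL value per column when collapsing source rows; the real transformation may combine sources differently.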
```mermaid
architecture-beta
    service date_interval(database)[date_interval]

    group sources(cloud)[Sources]
    service biorxiv(server)[bioRxiv] in sources
    service openalex(server)[OpenAlex] in sources
    date_interval:T --> L:biorxiv
    date_interval:R --> L:openalex

    group bronze(database)[Bronze]
    service bronze_biorxiv(database)[bronze_biorxiv] in bronze
    service bronze_openalex(database)[bronze_openalex] in bronze
    biorxiv:R --> L:bronze_biorxiv
    openalex:R --> L:bronze_openalex

    group silver(database)[Silver]
    service silver_preprints(database)[silver_preprints] in silver
    bronze_biorxiv:R --> T:silver_preprints
    bronze_openalex:R --> L:silver_preprints

    group gold(database)[Gold]
    service gold_preprints(database)[gold_preprints] in gold
    silver_preprints:R --> L:gold_preprints
```
Benefits of this architecture:
- Modularity: Each layer-to-layer transformation can be tested and run independently. Adding columns to the silver and gold layers does not require re-ingesting bronze data.
- Data Provenance: Raw data is preserved in the bronze layer for auditing. Gold-layer data can be traced back to its bronze source rows.
Drawbacks:
- Storage Overhead: Storing multiple layers increases database size.
## Development

### Project Structure

`src/pub_lake/` has the following structure:

- `cli/`: command-line interface for interacting with the package.
- `elt/`: core logic for the Extract, Load, Transform pipeline.
- `models/`: database schema and data models.
- `interface/`: methods for querying the final, cleaned data.
- `config.py`: configuration, such as database connections and API keys.
### Releasing a New Version

To release a new version of pub-lake to PyPI:

- Update the version number:

  ```shell
  uv run --group publish bump2version --current-version <current-version> [major|minor|patch] pyproject.toml
  ```

- Update the uv.lock file:

  ```shell
  uv lock
  ```

- Update the changelog in `docs/history.md`.

- Build the distribution:

  ```shell
  just clean
  uv run --group publish python -m build
  ```

- Check the distribution:

  ```shell
  uv run --group publish twine check dist/*
  ```

- Upload to TestPyPI (optional but recommended):

  ```shell
  uv run --group publish twine upload --repository testpypi dist/*
  ```

- Upload to PyPI:

  ```shell
  uv run --group publish twine upload dist/*
  ```
## Credits
This package was created with Cookiecutter and the audreyfeldroy/cookiecutter-pypackage project template.