pub-lake

Aggregate publication metadata from bioRxiv, OpenAlex, and more.

Features

  1. bioRxiv preprints: fetch metadata for preprints from the bioRxiv API and enrich it with OpenAlex topics.
  2. Checkpointed ingestion: resume an interrupted ingestion run without duplicating data.

Installation

With uv (recommended):

uv add pub-lake

With pip:

pip install pub-lake

Usage

Command-Line Interface

# ingest preprints from the given dates into the database
uv run python -m pub_lake preprints fetch --start "2025-01-02" --end "2025-01-04" --polite "user@example.com"

# list preprints available in the database
uv run python -m pub_lake preprints list [--start "2025-01-02"] [--end "2025-01-04"]

Python API

from datetime import date
from pub_lake import config
from pub_lake.elt.pipeline import ingest_preprints
from pub_lake.interface.preprints import get_preprints
from pub_lake.models.preprints import DateInterval

# ingest preprints from these dates into the database
config.POLITE_EMAIL = "user@example.com"
interval = DateInterval(start=date(2025, 1, 2), end=date(2025, 1, 4))
ingest_preprints(interval)

# list preprints available in the database
preprints = get_preprints(interval)
print(preprints.df.to_string())

How it works

The package follows an ELT (Extract, Load, Transform) architecture and stores data in a relational database (SQLite by default). Key steps:

  1. Extract: Fetch raw metadata from bioRxiv and OpenAlex APIs.
  2. Load: Store the raw metadata in the database.
  3. Transform: Clean, normalize, and aggregate the data.

The data can then be queried to return a unified view of publication metadata.
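
For example, assuming preprints.df is a pandas DataFrame (the Usage section calls .to_string() on it) and that the unified view includes a column of OpenAlex topics, the result can be explored with ordinary pandas operations. The "topic" column name below is illustrative rather than a documented part of the schema:

from datetime import date

from pub_lake.interface.preprints import get_preprints
from pub_lake.models.preprints import DateInterval

# Query the unified view for a date range.
interval = DateInterval(start=date(2025, 1, 2), end=date(2025, 1, 4))
preprints = get_preprints(interval)

# Count preprints per OpenAlex topic ("topic" is an assumed column name).
print(preprints.df["topic"].value_counts())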

Ingestion Pipeline

Data ingestion uses the medallion architecture with Bronze, Silver, and Gold layers.

  1. Raw preprint data is fetched from external sources (bioRxiv, OpenAlex) and loaded as-is into separate bronze-layer tables for each source.
  2. Bronze preprints from each source are cleaned and deduplicated into a single silver-layer table with one row per source per preprint.
  3. Silver preprints from each source are aggregated into a gold-layer table with one row per preprint, combining metadata from all sources.

The Mermaid diagram below shows the data flow between layers; a runnable sketch of the transformations follows it.

architecture-beta
    service date_interval(database)[date_interval]

    group sources(cloud)[Sources]

    service biorxiv(server)[bioRxiv] in sources
    service openalex(server)[OpenAlex] in sources

    date_interval:T --> L:biorxiv
    date_interval:R --> L:openalex


    group bronze(database)[Bronze]

    service bronze_biorxiv(database)[bronze_biorxiv] in bronze
    service bronze_openalex(database)[bronze_openalex] in bronze

    biorxiv:R --> L:bronze_biorxiv
    openalex:R --> L:bronze_openalex


    group silver(database)[Silver]

    service silver_preprints(database)[silver_preprints] in silver
    bronze_biorxiv:R --> T:silver_preprints
    bronze_openalex:R --> L:silver_preprints


    group gold(database)[Gold]

    service gold_preprints(database)[gold_preprints] in gold
    silver_preprints:R --> L:gold_preprints
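
The layer-to-layer transformations can be pictured with a small, self-contained sketch. The table names match the diagram above, but the columns and SQL are purely illustrative and are not the package's actual schema or queries:

import sqlite3

con = sqlite3.connect(":memory:")

# Bronze: raw rows loaded as-is, one table per source (columns are assumptions).
con.executescript("""
CREATE TABLE bronze_biorxiv   (doi TEXT, title TEXT, raw_json TEXT);
CREATE TABLE bronze_openalex  (doi TEXT, topic TEXT, raw_json TEXT);
CREATE TABLE silver_preprints (doi TEXT, source TEXT, title TEXT, topic TEXT);
CREATE TABLE gold_preprints   (doi TEXT, title TEXT, topic TEXT);
""")
con.execute("INSERT INTO bronze_biorxiv VALUES ('10.1101/2025.01.02.000001', 'A preprint', '{}')")
con.execute("INSERT INTO bronze_openalex VALUES ('10.1101/2025.01.02.000001', 'Genomics', '{}')")

# Silver: clean and deduplicate into one row per source per preprint.
con.executescript("""
INSERT INTO silver_preprints
  SELECT DISTINCT doi, 'biorxiv', title, NULL FROM bronze_biorxiv;
INSERT INTO silver_preprints
  SELECT DISTINCT doi, 'openalex', NULL, topic FROM bronze_openalex;
""")

# Gold: aggregate to one row per preprint, combining metadata from all sources.
con.execute("""
INSERT INTO gold_preprints
  SELECT doi, MAX(title), MAX(topic) FROM silver_preprints GROUP BY doi
""")

print(con.execute("SELECT * FROM gold_preprints").fetchall())

Running the sketch prints a single gold row that combines the bioRxiv title with the OpenAlex topic.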

Benefits of this architecture:

  • Modularity: Each layer-to-layer transformation can be tested and run independently. Adding columns to the silver & gold layers does not require re-ingesting bronze data.
  • Data Provenance: Raw data is preserved in the bronze layer for auditing. Gold-layer data can be traced back to its bronze source rows (see the query sketch after the Drawbacks list).

Drawbacks:

  • Storage Overhead: Storing multiple layers increases database size.
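
The data-provenance point above can be expressed as a query: given a row in gold_preprints, the original raw rows can be looked up in the bronze tables. A minimal sketch, assuming the tables share a DOI column as in the sketch above:

import sqlite3

def bronze_rows_for(con: sqlite3.Connection, doi: str) -> list[tuple]:
    # Return the raw bronze_biorxiv rows behind a gold_preprints row.
    # Assumes a shared "doi" column, which is an illustrative schema choice.
    return con.execute(
        """
        SELECT b.*
        FROM gold_preprints AS g
        JOIN bronze_biorxiv AS b ON b.doi = g.doi
        WHERE g.doi = ?
        """,
        (doi,),
    ).fetchall()

Called with the connection from the previous sketch, this returns the raw bioRxiv row behind the aggregated preprint.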

Development

Project Structure

src/pub_lake/ has the following structure:

  • cli/: command-line interface for interacting with the package.
  • elt/: core logic for the Extract, Load, Transform pipeline.
  • models/: database schema and data models.
  • interface/: methods for querying the final, cleaned data.
  • config.py: configuration, such as database connections and API keys.

Releasing a New Version

To release a new version of pub-lake to PyPI:

  1. Update the version number:
    uv run --group publish bump2version --current-version <current version> [major|minor|patch] pyproject.toml
    
  2. Update the uv.lock file:
    uv lock
    
  3. Update the changelog in docs/history.md.
  4. Build the distribution:
    just clean
    uv run --group publish python -m build
    
  5. Check the distribution:
    uv run --group publish twine check dist/*
    
  6. Upload to TestPyPI (optional but recommended):
    uv run --group publish twine upload --repository testpypi dist/*
    
  7. Upload to PyPI:
    uv run --group publish twine upload dist/*
    

Credits

This package was created with Cookiecutter and the audreyfeldroy/cookiecutter-pypackage project template.
