pub-lake

Aggregate publication metadata from bioRxiv, OpenAlex, and more.

Features

  1. bioRxiv preprints: fetch metadata for preprints from the bioRxiv API and enrich it with OpenAlex topics.
  2. Checkpointed ingestion: resume an interrupted ingestion run without duplicating data.

Installation

With uv (recommended):

uv add pub-lake

With pip:

pip install pub-lake

Usage

Command-Line Interface

# ingest preprints from the given dates into the database
uv run python -m pub_lake preprints fetch --start "2025-01-02" --end "2025-01-04" --polite "user@example.com"

# list preprints available in the database
uv run python -m pub_lake preprints list [--start "2025-01-02"] [--end "2025-01-04"]

Python API

from datetime import date
from pub_lake import config
from pub_lake.elt.pipeline import ingest_preprints
from pub_lake.interface.preprints import get_preprints
from pub_lake.models.preprints import DateInterval

# ingest preprints from these dates into the database
config.POLITE_EMAIL = "user@example.com"
interval = DateInterval(start=date(2025, 1, 2), end=date(2025, 1, 4))
ingest_preprints(interval)

# list preprints available in the database
preprints = get_preprints(interval)
print(preprints.df.to_string())

How it works

The package follows an ELT (Extract, Load, Transform) architecture and stores data in a relational database (SQLite by default). Key steps:

  1. Extract: Fetch raw metadata from bioRxiv and OpenAlex APIs.
  2. Load: Store the raw metadata in the database.
  3. Transform: Clean, normalize, and aggregate the data.

The data can then be queried to return a unified view of publication metadata.
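
For example, assuming preprints.df is a pandas DataFrame (the Usage section calls .to_string() on it) and that the unified view includes a column of OpenAlex topics, the result can be explored with ordinary pandas operations. The "topic" column name below is illustrative rather than a documented part of the schema:

from datetime import date

from pub_lake.interface.preprints import get_preprints
from pub_lake.models.preprints import DateInterval

# Query the unified view for a date range.
interval = DateInterval(start=date(2025, 1, 2), end=date(2025, 1, 4))
preprints = get_preprints(interval)

# Count preprints per OpenAlex topic ("topic" is an assumed column name).
print(preprints.df["topic"].value_counts())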

Ingestion Pipeline

Data ingestion uses the medallion architecture with Bronze, Silver, and Gold layers.

  1. Raw preprint data is fetched from external sources (bioRxiv, OpenAlex) and loaded as-is into separate bronze-layer tables for each source.
  2. Bronze preprints from each source are cleaned and deduplicated into a single silver-layer table with one row per source per preprint.
  3. Silver preprints from each source are aggregated into a gold-layer table with one row per preprint, combining metadata from all sources.

The Mermaid diagram below shows the data flow between layers; a runnable sketch of the transformations follows it.

architecture-beta
    service date_interval(database)[date_interval]

    group sources(cloud)[Sources]

    service biorxiv(server)[bioRxiv] in sources
    service openalex(server)[OpenAlex] in sources

    date_interval:T --> L:biorxiv
    date_interval:R --> L:openalex


    group bronze(database)[Bronze]

    service bronze_biorxiv(database)[bronze_biorxiv] in bronze
    service bronze_openalex(database)[bronze_openalex] in bronze

    biorxiv:R --> L:bronze_biorxiv
    openalex:R --> L:bronze_openalex


    group silver(database)[Silver]

    service silver_preprints(database)[silver_preprints] in silver
    bronze_biorxiv:R --> T:silver_preprints
    bronze_openalex:R --> L:silver_preprints


    group gold(database)[Gold]

    service gold_preprints(database)[gold_preprints] in gold
    silver_preprints:R --> L:gold_preprints
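
The layer-to-layer transformations can be pictured with a small, self-contained sketch. The table names match the diagram above, but the columns and SQL are purely illustrative and are not the package's actual schema or queries:

import sqlite3

con = sqlite3.connect(":memory:")

# Bronze: raw rows loaded as-is, one table per source (columns are assumptions).
con.executescript("""
CREATE TABLE bronze_biorxiv   (doi TEXT, title TEXT, raw_json TEXT);
CREATE TABLE bronze_openalex  (doi TEXT, topic TEXT, raw_json TEXT);
CREATE TABLE silver_preprints (doi TEXT, source TEXT, title TEXT, topic TEXT);
CREATE TABLE gold_preprints   (doi TEXT, title TEXT, topic TEXT);
""")
con.execute("INSERT INTO bronze_biorxiv VALUES ('10.1101/2025.01.02.000001', 'A preprint', '{}')")
con.execute("INSERT INTO bronze_openalex VALUES ('10.1101/2025.01.02.000001', 'Genomics', '{}')")

# Silver: clean and deduplicate into one row per source per preprint.
con.executescript("""
INSERT INTO silver_preprints
  SELECT DISTINCT doi, 'biorxiv', title, NULL FROM bronze_biorxiv;
INSERT INTO silver_preprints
  SELECT DISTINCT doi, 'openalex', NULL, topic FROM bronze_openalex;
""")

# Gold: aggregate to one row per preprint, combining metadata from all sources.
con.execute("""
INSERT INTO gold_preprints
  SELECT doi, MAX(title), MAX(topic) FROM silver_preprints GROUP BY doi
""")

print(con.execute("SELECT * FROM gold_preprints").fetchall())

Running the sketch prints a single gold row that combines the bioRxiv title with the OpenAlex topic.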

Benefits of this architecture:

  • Modularity: Each layer-to-layer transformation can be tested and run independently. Adding columns to the silver & gold layers does not require re-ingesting bronze data.
  • Data Provenance: Raw data is preserved in the bronze layer for auditing. Gold-layer data can be traced back to its bronze source rows (see the query sketch after the Drawbacks list).

Drawbacks:

  • Storage Overhead: Storing multiple layers increases database size.
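
The data-provenance point above can be expressed as a query: given a row in gold_preprints, the original raw rows can be looked up in the bronze tables. A minimal sketch, assuming the tables share a DOI column as in the sketch above:

import sqlite3

def bronze_rows_for(con: sqlite3.Connection, doi: str) -> list[tuple]:
    # Return the raw bronze_biorxiv rows behind a gold_preprints row.
    # Assumes a shared "doi" column, which is an illustrative schema choice.
    return con.execute(
        """
        SELECT b.*
        FROM gold_preprints AS g
        JOIN bronze_biorxiv AS b ON b.doi = g.doi
        WHERE g.doi = ?
        """,
        (doi,),
    ).fetchall()

Called with the connection from the previous sketch, this returns the raw bioRxiv row behind the aggregated preprint.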

Development

Project Structure

src/pub_lake/ has the following structure:

  • cli/: command-line interface for interacting with the package.
  • elt/: core logic for the Extract, Load, Transform pipeline.
  • models/: database schema and data models.
  • interface/: methods for querying the final, cleaned data.
  • config.py: configuration, such as database connections and API keys.

Releasing a New Version

To release a new version of pub-lake to PyPI:

  1. Update the version number:
    uv run --group publish bump2version --current-version <current version> [major|minor|patch] pyproject.toml
    
  2. Update the uv.lock file:
    uv lock
    
  3. Update the changelog in docs/history.md.
  4. Build the distribution:
    just clean
    uv run --group publish python -m build
    
  5. Check the distribution:
    uv run --group publish twine check dist/*
    
  6. Upload to TestPyPI (optional but recommended):
    uv run --group publish twine upload --repository testpypi dist/*
    
  7. Upload to PyPI:
    uv run --group publish twine upload dist/*
    

Credits

This package was created with Cookiecutter and the audreyfeldroy/cookiecutter-pypackage project template.
