
pub-lake


Aggregate publication metadata from bioRxiv, OpenAlex, and more.

Features

  1. bioRxiv preprints: fetch metadata for preprints from the bioRxiv API and enrich it with OpenAlex topics.
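A minimal sketch of what fetching and enriching could look like, using the public bioRxiv details endpoint and the OpenAlex works endpoint. The function names are illustrative and pub-lake's internal implementation may differ; only the two API URLs and their response shapes are taken from the services' public documentation.

```python
import json
import urllib.request

BIORXIV_DETAILS = "https://api.biorxiv.org/details/biorxiv/{start}/{end}/{cursor}"
OPENALEX_WORK = "https://api.openalex.org/works/doi:{doi}?mailto={mailto}"

def biorxiv_url(start: str, end: str, cursor: int = 0) -> str:
    """Build the bioRxiv details-endpoint URL for a date interval."""
    return BIORXIV_DETAILS.format(start=start, end=end, cursor=cursor)

def parse_biorxiv_page(payload: dict) -> list[dict]:
    """Extract the preprint records from one page of the bioRxiv response."""
    return payload.get("collection", [])

def fetch_preprints(start: str, end: str) -> list[dict]:
    """Fetch metadata for all preprints posted between start and end (inclusive)."""
    records, cursor = [], 0
    while True:
        with urllib.request.urlopen(biorxiv_url(start, end, cursor)) as resp:
            page = json.load(resp)
        batch = parse_biorxiv_page(page)
        if not batch:
            break
        records.extend(batch)
        cursor += len(batch)  # the API pages results in blocks of up to 100
    return records

def openalex_topics(doi: str, mailto: str) -> list[str]:
    """Enrich a preprint DOI with topic labels from OpenAlex."""
    url = OPENALEX_WORK.format(doi=doi, mailto=mailto)
    with urllib.request.urlopen(url) as resp:
        work = json.load(resp)
    return [t["display_name"] for t in work.get("topics", [])]
```

The `mailto` parameter corresponds to OpenAlex's "polite pool", which is presumably what the `--polite` CLI flag feeds.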

How it works

The package follows an ELT (Extract, Load, Transform) architecture and stores data in a relational database (SQLite by default). Key steps:

  1. Extract: Fetch raw metadata from bioRxiv and OpenAlex APIs.
  2. Load: Store the raw metadata in the database.
  3. Transform: Clean, normalize, and aggregate the data.

Queries against the database then return a unified view of publication metadata.
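The Load and Transform steps above can be sketched against an in-memory SQLite database. The table and column names here are assumptions for illustration, not pub-lake's actual schema; the sketch also assumes a SQLite build with the JSON1 functions, which ships with recent Python.

```python
import json
import sqlite3

def load_raw(con: sqlite3.Connection, records: list[dict]) -> None:
    """Load: store each raw API record verbatim as a JSON blob."""
    con.execute("CREATE TABLE IF NOT EXISTS raw_preprints (payload TEXT)")
    con.executemany(
        "INSERT INTO raw_preprints (payload) VALUES (?)",
        [(json.dumps(r),) for r in records],
    )

def transform(con: sqlite3.Connection) -> None:
    """Transform: project the raw JSON into a clean, typed table."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS preprints"
        " (doi TEXT PRIMARY KEY, title TEXT, date TEXT)"
    )
    con.execute(
        """
        INSERT OR REPLACE INTO preprints (doi, title, date)
        SELECT json_extract(payload, '$.doi'),
               json_extract(payload, '$.title'),
               json_extract(payload, '$.date')
        FROM raw_preprints
        """
    )
```

Keeping the raw payloads around is the point of ELT over ETL: the transform can be re-run or revised later without refetching anything from the APIs.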

Installation

uv add pub-lake

See docs/installation.md for more details.

Usage

# ingest preprints from the given dates into the database
uv run python -m pub_lake preprints fetch --start "2025-01-02" --end "2025-01-04" --polite "eidens@embl.de"

# list preprints available in the database
uv run python -m pub_lake preprints list [--start "2025-01-02"] [--end "2025-01-04"]

See docs/usage.md for more details.

Development

Project Structure

src/pub_lake/ has the following structure:

  • cli.py: main entry point for the command-line interface.
  • elt/: core logic for the Extract, Load, Transform pipeline.
    • extract/: fetching data from external sources (e.g., bioRxiv, OpenAlex).
    • load/: loading raw data into the database.
    • transform/: cleaning and normalizing the loaded data.
  • models/: database schema and data models.
  • interface/: methods for querying the final, cleaned data.
  • config.py: configuration, such as database connections and API keys.

Credits

This package was created with Cookiecutter and the audreyfeldroy/cookiecutter-pypackage project template.

Download files

Source distribution

  • pub_lake-0.1.0.tar.gz (21.7 kB), uploaded via twine/6.2.0 on CPython/3.12.11.
    SHA256: 21c657c4051f2d8b8902f390f5d8737b9c08b8c5e779b7a44d61c3c8fe9a2586

Built distribution

  • pub_lake-0.1.0-py3-none-any.whl (19.4 kB, Python 3), uploaded via twine/6.2.0 on CPython/3.12.11.
    SHA256: 11f38402e96e92e71fea9be3a2db63d9d8b836d9e3d6a794f176d05a9fe09c0f
