pub-lake
Aggregate publication metadata from bioRxiv, OpenAlex, and more.
- PyPI package: https://pypi.org/project/pub-lake/
- Free software: MIT License
- Documentation: https://pub-lake.readthedocs.io.
Features
- bioRxiv preprints: fetch metadata for preprints from the bioRxiv API and enrich it with OpenAlex topics.
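As a rough illustration of the two external sources involved, the sketch below only builds request URLs and performs no network calls. The endpoint shapes follow the public bioRxiv and OpenAlex API documentation as far as we know; the function names are hypothetical and are not part of pub-lake, and the package's own requests may differ.

```python
from urllib.parse import quote, urlencode

# Base endpoints of the public APIs (assumed from their documentation,
# not taken from pub-lake's source code).
BIORXIV_API = "https://api.biorxiv.org/details/biorxiv"
OPENALEX_API = "https://api.openalex.org/works"


def biorxiv_details_url(start: str, end: str, cursor: int = 0) -> str:
    """Build a bioRxiv 'details' URL for a date interval (YYYY-MM-DD).

    The cursor is the paging offset into the result set.
    """
    return f"{BIORXIV_API}/{start}/{end}/{cursor}"


def openalex_work_url(doi: str, mailto=None) -> str:
    """Build an OpenAlex work-lookup URL for a DOI.

    The optional mailto parameter identifies the caller to OpenAlex.
    """
    url = f"{OPENALEX_API}/https://doi.org/{quote(doi, safe='/')}"
    if mailto:
        url += "?" + urlencode({"mailto": mailto})
    return url


if __name__ == "__main__":
    print(biorxiv_details_url("2025-01-02", "2025-01-04"))
    # "10.1101/example" is a placeholder DOI, not a real preprint.
    print(openalex_work_url("10.1101/example", mailto="you@example.org"))
```

The topics that OpenAlex returns for a work would then be attached to the matching bioRxiv record by DOI. Whether the CLI's `--polite` option maps onto the `mailto` parameter is an assumption.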
How it works
The package follows an ELT (Extract, Load, Transform) architecture and stores data in a relational database (SQLite by default). Key steps:
- Extract: Fetch raw metadata from bioRxiv and OpenAlex APIs.
- Load: Store the raw metadata in the database.
- Transform: Clean, normalize, and aggregate the data.
The transformed data can then be queried to return a unified view of publication metadata.
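The three steps above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration of the ELT pattern, not pub-lake's actual implementation: the table names, record fields, and sample records are all hypothetical, and the extract step returns hard-coded data instead of calling the APIs.

```python
import json
import sqlite3

# Hypothetical raw records standing in for API responses; not real data.
RAW_PREPRINTS = [
    {"doi": "10.1101/example.1", "title": "  An Example Preprint ", "date": "2025-01-02"},
    {"doi": "10.1101/example.2", "title": "Another Example", "date": "2025-01-03"},
]
RAW_TOPICS = {"10.1101/example.1": ["Genomics", "Bioinformatics"]}


def extract():
    """Extract: stand-in for the API calls; returns raw records unchanged."""
    return RAW_PREPRINTS, RAW_TOPICS


def load(conn, preprints, topics):
    """Load: store the raw payloads as JSON text without modifying them."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_preprints (doi TEXT PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS raw_topics (doi TEXT PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO raw_preprints VALUES (?, ?)",
        [(p["doi"], json.dumps(p)) for p in preprints],
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_topics VALUES (?, ?)",
        [(doi, json.dumps(names)) for doi, names in topics.items()],
    )
    conn.commit()


def transform(conn):
    """Transform: clean titles and join topics into one unified table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publications "
        "(doi TEXT PRIMARY KEY, title TEXT, date TEXT, topics TEXT)"
    )
    rows = conn.execute(
        "SELECT p.doi, p.payload, t.payload FROM raw_preprints p "
        "LEFT JOIN raw_topics t ON t.doi = p.doi"
    ).fetchall()
    for doi, preprint_json, topics_json in rows:
        preprint = json.loads(preprint_json)
        topics = json.loads(topics_json) if topics_json else []
        conn.execute(
            "INSERT OR REPLACE INTO publications VALUES (?, ?, ?, ?)",
            (doi, preprint["title"].strip(), preprint["date"], ", ".join(topics)),
        )
    conn.commit()


def run_pipeline():
    """Run extract, load, and transform against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, *extract())
    transform(conn)
    return conn.execute(
        "SELECT doi, title, date, topics FROM publications ORDER BY doi"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Keeping the raw payloads in their own tables means the transform step can be rerun or changed later without fetching the data again, which is the usual motivation for ELT over ETL.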
Installation
uv add pub-lake
See docs/installation.md for more details.
Usage
# ingest preprints from the given dates into the database
uv run python -m pub_lake preprints fetch --start "2025-01-02" --end "2025-01-04" --polite "eidens@embl.de"
# list preprints available in the database
uv run python -m pub_lake preprints list [--start "2025-01-02"] [--end "2025-01-04"]
See docs/usage.md for more details.
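To script the fetch command, for example from a scheduled job, one option is to assemble the same command line in Python and pass it to `subprocess.run`. The sketch below uses only the flags shown above; the helper name `build_fetch_command` is hypothetical and not part of pub-lake, and the date validation is an added convenience.

```python
import subprocess
from datetime import date


def build_fetch_command(start: str, end: str, polite_email: str) -> list[str]:
    """Return the argument list for `pub_lake preprints fetch`.

    Dates must be ISO formatted (YYYY-MM-DD); date.fromisoformat raises
    ValueError for malformed input.
    """
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("start must not be after end")
    return [
        "uv", "run", "python", "-m", "pub_lake", "preprints", "fetch",
        "--start", start_date.isoformat(),
        "--end", end_date.isoformat(),
        "--polite", polite_email,
    ]


if __name__ == "__main__":
    command = build_fetch_command("2025-01-02", "2025-01-04", "you@example.org")
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the email address and dates.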
Development
Project Structure
src/pub_lake/ has the following structure:
- cli.py: main entry point for the command-line interface.
- elt/: core logic for the Extract, Load, Transform pipeline.
  - extract/: fetching data from external sources (e.g., bioRxiv, OpenAlex).
  - load/: loading raw data into the database.
  - transform/: cleaning and normalizing the loaded data.
- models/: database schema and data models.
- interface/: methods for querying the final, cleaned data.
- config.py: configuration, such as database connections and API keys.
Credits
This package was created with Cookiecutter and the audreyfeldroy/cookiecutter-pypackage project template.