neoval-py-utils
neoval-py-utils is a Python utilities package developed by Neoval to assist with the Extract, Load and Transform (ELT/ETL) of data from Google Cloud Platform (GCP) services.
The main difference between this utilities package and the APIs provided by BigQuery is a faster export. Running a BigQuery extract_job to a bucket and downloading the result is faster, and can be improved further by increasing the machine's download speed. The package also uses local caching so that the same query is not needlessly executed and downloaded again. With this package the user can also create databases that can be embedded on a machine for a website or application.
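The caching idea can be illustrated with a minimal sketch. This is not the package's implementation: `cached_query`, `run_query`, and the JSON files on disk are hypothetical names and choices, and the 12-hour lifetime is taken from the default mentioned in the usage examples.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

TTL_SECONDS = 12 * 60 * 60  # assumed default lifetime of a cache entry (12 hours)


def cached_query(sql: str, run_query: Callable[[str], list], cache_dir: Path,
                 ttl: float = TTL_SECONDS) -> list:
    """Return cached rows for `sql` if still fresh, else run the query and store the rows."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key each cache entry on a hash of the query text.
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < ttl:
        return json.loads(path.read_text())
    rows = run_query(sql)  # hypothetical callable standing in for the BigQuery export
    path.write_text(json.dumps(rows))
    return rows
```

Keying on the query text means that an identical query within the lifetime is served from disk, while any change to the SQL triggers a fresh export.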
Functionalities include:
- exporter
  - Exports data from BigQuery (bq) to a pandas DataFrame, pyArrow Table or Google Cloud Storage (GCS).
  - The source can be a bq query or a bq table.
- ipdb
  - Builds and prepares embedded in-process databases (IPDB) from BigQuery datasets.
  - Supports SQLite and DuckDB, configured with a YAML file (see the examples below).
  - Supports templating for transformations after the initial build.
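As a rough illustration of the build-then-prepare idea, the sketch below uses only the standard library. It does not use this package: the rows, the column names, and the `word_totals` view are made up for the example, and `string.Template` is only one possible way to template SQL.

```python
import sqlite3
from string import Template

# Hypothetical rows standing in for a table exported from BigQuery.
rows = [("hamlet", "the", 995), ("hamlet", "and", 706), ("macbeth", "the", 620)]

# Build step: create the table in an in-process (embedded) SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shakespeare (corpus TEXT, word TEXT, word_count INTEGER)")
conn.executemany("INSERT INTO shakespeare VALUES (?, ?, ?)", rows)

# Prepare step: render a SQL template and apply it after the initial build.
view_template = Template(
    "CREATE VIEW $view AS "
    "SELECT word, SUM(word_count) AS total FROM $table GROUP BY word"
)
conn.execute(view_template.substitute(view="word_totals", table="shakespeare"))
conn.commit()

# The view is queryable like any table in the embedded database.
totals = dict(conn.execute("SELECT word, total FROM word_totals"))
```

Because the database runs in-process, an application can ship the resulting file and query it without a separate database server.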
Development
All development must take place on a feature branch and a pull request is required; a user is not allowed to commit directly to main. The automated workflow in this repo (using python-semantic-release) requires the use of angular style commit messages to update the package version and CHANGELOG. All commits must be formatted in this way before a user is able to merge a PR; a user who may want to develop without using this format for all commits can simply squash non-angular commit messages prior to merge. A PR may only be merged by the rebase and merge method. This is to ensure that only angular style commits end up on main.
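For readers unfamiliar with the format, an angular style commit header looks like `type(scope): subject`, with the scope optional. The check below is a small sketch of that shape and is not part of this repo's workflow. The list of types is the commonly used set and is an assumption; the set accepted by a particular python-semantic-release configuration may differ.

```python
import re

# Commonly used Angular commit types (assumption, see lead-in).
TYPES = ("build", "chore", "ci", "docs", "feat", "fix", "perf", "refactor", "style", "test")

HEADER = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"  # commit type, e.g. feat or fix
    r"(\((?P<scope>[^()\s]+)\))?"           # optional scope in parentheses
    r": (?P<subject>\S.*)$"                 # colon, one space, then the subject
)


def is_angular_header(message: str) -> bool:
    """Check whether the first line of a commit message looks like `type(scope): subject`."""
    lines = message.splitlines()
    return bool(lines) and HEADER.match(lines[0]) is not None
```

For example, `is_angular_header("feat(exporter): add parquet export")` passes, while a free-form message such as `"updated stuff"` does not and would need to be squashed before merge.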
Upon merge to main, the deploy workflow will facilitate the following:
- bump the version in pyproject.toml
- update the CHANGELOG using all commits added
- tag and release, if required
- publish to PyPI
Getting Started
Prerequisites
TODO
Tests
For the integration tests to pass you will need to be authenticated with a Google project, with storage admin and BigQuery job permissions.
You can authenticate with GOOGLE_APPLICATION_CREDENTIALS as an environment variable or by running gcloud auth application-default login.
Specify the GCP project with gcloud config set project <project-id>.
Run unit and integration tests with poetry run task test.
To run the tests with coverage, use poetry run task test-with-coverage.
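Before running the integration tests it can help to check which credentials file is likely to be picked up. The helper below is a hypothetical sketch, not part of this package. It assumes the default location that gcloud auth application-default login writes to on Linux and macOS; Windows and custom gcloud config directories use other paths.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_credentials(env: Optional[Mapping[str, str]] = None,
                     home: Optional[Path] = None) -> Optional[Path]:
    """Return the credentials file that would likely be used, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        # An explicit path takes priority; report it only if the file exists.
        path = Path(explicit)
        return path if path.is_file() else None
    # Assumed default location on Linux/macOS for application-default credentials.
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None
```

A return value of `None` suggests that one of the two authentication steps above still needs to be done.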
Usage
TODO installation with PyPI
The examples assume that neoval-py-utils is successfully installed as a dependency and that you have permissions for GCP storage and BigQuery.
Examples of usage
Export BQ datasets or Queries >> Dataframe or GCS
from neoval_py_utils.exporter import Exporter
# To query a BigQuery table and return a polars DataFrame. Results are cached and kept for 12 hours by default.
exporter = Exporter()  # To use the cache, pass a path to the constructor, e.g. Exporter(cache_dir="./cache")
pl_df = exporter.export("SELECT word FROM `bigquery-public-data.samples.shakespeare` GROUP BY word ORDER BY word DESC LIMIT 3")
# `export` is aliased by `<` operator. Will give same results as above.
pl_df = exporter < "SELECT word FROM `bigquery-public-data.samples.shakespeare` GROUP BY word ORDER BY word DESC LIMIT 3"
# To export a whole table
all_pl_df = exporter.export("bigquery-public-data.samples.shakespeare")
# To export bigquery table to a parquet file in a gcp storage bucket. Returns a list of blobs.
blobs = exporter.bq_to_gcs("my-dataset.my-table")
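The `<` alias can be understood through Python's `__lt__` hook. The sketch below shows one way such an alias could be wired up. `QueryRunner` is a hypothetical stand-in, and this is an assumption about the mechanism rather than the package's actual code.

```python
class QueryRunner:
    """Hypothetical stand-in for an exporter-like class."""

    def export(self, query: str) -> str:
        # A real implementation would run the query; this sketch just echoes it.
        return f"result of: {query}"

    def __lt__(self, query: str) -> str:
        # `runner < query` delegates to `export`, so both spellings behave the same.
        return self.export(query)


runner = QueryRunner()
same = (runner < "SELECT 1") == runner.export("SELECT 1")
```

The operator form is purely a shorthand; calling the method by name is equivalent and may read more clearly in larger code bases.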
Create In-process (Embedded) Databases
# Python CLI example to build an in-process db
# The upload bucket is optional; if given, the in-process db is uploaded to the specified bucket.
poetry run ipdb build <DBT_DATASET> <GCLOUD_PROJECT_ID> <DB_PATH> <CONFIG_PATH> --upload-bucket <UPLOAD_BUCKET>
# To run it locally in this repo (ensure your PYTHONPATH=./src), you can run
poetry run ipdb build samples bigquery-public-data tests/artifacts/in_process_db tests/resources/good.config.yaml
Example of config.yaml
sqlite:
  - name: shakespeare
    primary_key: null
duckdb:
  - name: shakespeare
    primary_key: null
    description: "Word counts from Shakespeare work - gcp public dataset"
# To apply sql templates after the in-process db is built
poetry run ipdb prepare <DBT_DATASET> <GCLOUD_PROJECT_ID> <DB_PATH> <TEMPLATES_PATH>
# If you would like to run it locally in this repo, you can run
poetry run ipdb prepare samples bigquery-public-data tests/artifacts/in_process_db tests/resources/templates
# For more info you can run
poetry run ipdb --help # which will return
Usage: ipdb [OPTIONS] COMMAND [ARGS]...
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ build Build the in process database(s). │
│ make-config Prints a default configuration to be used with the build command. │
│ prepare Run scripts to add views/virtual tables/etc. to the database(s). │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯