
phidata

Building Blocks for Data Engineering



A Python library of data engineering building blocks.

A Python library of OSS data tools; use it to deliver high-quality data products on the cheap.

Honestly, our goal is just to save money by running OSS on the cheap. So we run everything locally using Docker, and in production on AWS. And because we build on OSS, we're OSS too, under an MPL-2.0 license.

How it works

  • Phidata converts infrastructure, tools, and data assets into Python classes.
  • These classes are then put together to build data platforms, ML APIs, AI apps, etc.
  • Run your platform locally for development using Docker: phi ws up dev:docker
  • Run it in production on AWS: phi ws up prod:aws
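The classes-as-building-blocks idea can be sketched with plain dataclasses. This is a conceptual illustration only: the Workspace and App classes below are stand-ins, not phidata's actual API.

```python
# Conceptual sketch (NOT the phidata API): infrastructure and apps
# modelled as plain Python classes that compose into a platform.
from dataclasses import dataclass, field


@dataclass
class App:
    name: str
    image: str  # docker image to run locally or on AWS


@dataclass
class Workspace:
    name: str
    apps: list = field(default_factory=list)

    def up(self, env: str) -> list:
        # In phidata, `phi ws up` would create Docker containers (dev)
        # or AWS resources (prod); here we just report what would run.
        return [f"{env}: start {app.name} ({app.image})" for app in self.apps]


ws = Workspace(name="data-platform", apps=[App("jupyter", "phidata/jupyter")])
print(ws.up("dev:docker"))
```

Because the platform definition is ordinary Python, it can be version controlled, reviewed, and shared across teams like any other code.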

Advantages

  • Automate the grunt work
  • Recipes for common data tasks
  • Everything is version controlled: Infra, Apps and Workflows
  • Identical dev and production environments for data development at scale
  • Multiple teams work together, sharing code and defining dependencies in a pythonic way
  • Formatting (black), linting (ruff), type-checking (mypy) and testing (pytest) included



Quickstart

Let's build a data product using crypto data. Open the Terminal and follow along to download sample data and analyze it in a Jupyter notebook.

Setup

Create a Python virtual environment

python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

Install and initialize phidata

pip install phidata
phi init

If you encounter errors, try updating pip using python -m pip install --upgrade pip

Create workspace

A workspace is a directory containing the source code for your data platform. Run phi ws init to create a new workspace.

Press Enter to select the default name (data-platform) and template (aws-data-platform)

phi ws init

cd into the new workspace directory

cd data-platform

Run your first workflow

The first step of building a data product is collecting the data. The workflows/crypto/prices.py file contains an example task for downloading crypto data locally to a CSV file. Run it using

phi wf run crypto/prices
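To get a feel for what such a task does, here is a simplified, self-contained sketch that writes price rows into a ds-partitioned local CSV. write_prices is a hypothetical helper for illustration, not the code in workflows/crypto/prices.py.

```python
# Simplified sketch of a "download prices to CSV" task: write rows
# under a ds=<date> partition directory, like a date-partitioned table.
# (Hypothetical helper, not phidata's actual implementation.)
import csv
import tempfile
from pathlib import Path


def write_prices(rows: list, base_dir: str, ds: str) -> Path:
    """Write (ticker, price) rows under a ds=<date> partition directory."""
    part_dir = Path(base_dir) / f"ds={ds}"
    part_dir.mkdir(parents=True, exist_ok=True)
    out_file = part_dir / "crypto_prices.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ticker", "price"])  # header
        writer.writerows(rows)
    return out_file


out = write_prices(
    [("BTC", 30000.0), ("ETH", 2000.0)],
    tempfile.mkdtemp(),  # stand-in for storage/tables/crypto_prices
    "2023-01-01",
)
print(out)
```

The real task fetches live data; the partition layout (one directory per ds value) is what makes incremental daily runs cheap to re-run and back-fill.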

Note how we define the output as a CsvTableLocal object with partitions and pre-write checks:

# Step 1: Define CsvTableLocal for storing data
# Path: `storage/tables/crypto_prices`
crypto_prices_local = CsvTableLocal(
    name="crypto_prices",
    database="crypto",
    partitions=["ds"],
    write_checks=[NotEmpty()],
)

Check out data-platform/storage/tables/crypto_prices for the CSVs.
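The pre-write check pattern can be sketched in a few lines. The semantics assumed here (NotEmpty refuses to write a table with zero rows) are an illustration, not phidata's implementation.

```python
# Sketch of a pre-write check in the spirit of write_checks=[NotEmpty()].
# Assumed semantics: every check must pass before rows are written.
class NotEmpty:
    def check(self, rows) -> bool:
        return len(rows) > 0


def write_table(rows, checks) -> bool:
    # Run every check before writing; abort if any check fails.
    if not all(c.check(rows) for c in checks):
        return False
    # ... write rows to the partitioned CSV here ...
    return True


print(write_table([("BTC", 30000.0)], [NotEmpty()]))  # True: rows written
print(write_table([], [NotEmpty()]))                  # False: write aborted
```

Running checks before the write, rather than after, means a failed workflow run never leaves a bad partition on disk.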

Run your first App

Docker is a great tool for testing locally. Your workspace comes pre-configured with a Jupyter notebook for analyzing data. Install Docker Desktop and, after the engine is running, start the workspace using

phi ws up

Press Enter to confirm. Verify the container is running using the Docker dashboard or docker ps:

docker ps --format 'table {{.Names}}\t{{.Image}}'

NAMES               IMAGE
jupyter-container   phidata/jupyter-aws-dp:dev

Jupyter UI

Open localhost:8888 in a new tab to view the JupyterLab UI. Password: admin

Navigate to notebooks/examples/crypto_prices.ipynb and run all cells.

Shutdown

Play around and then stop the workspace using

phi ws down

Next

Check out the documentation for more information or chat with us on Discord.


Project details


Release history

This version

1.5.1
