Building blocks for Data Engineering

phidata


Phidata is a set of building blocks for data engineering.

It makes data tools plug-n-play so teams can deliver high-quality, reliable data products.

How it works

  • You start with a codebase that has data tools pre-configured
  • Enable the Apps you need: Airflow, Superset, Jupyter, MLflow
  • Build data products (tables, metrics) in a dev environment running locally on Docker
  • Write pipelines in Python or SQL. Use GPT-3 to generate boilerplate code
  • Run production on AWS. Infrastructure is also pre-configured

Advantages

  • Automate the grunt work
  • Recipes for common data tasks
  • Everything is version controlled: Infra, Apps and Workflows
  • Matching dev and production environments for data development at scale
  • Multiple teams working together share code and define dependencies in a pythonic way
  • Formatting (black), linting (ruff), type-checking (mypy) and testing (pytest) included


Quickstart

Let's build a data product using crypto data. Open the Terminal and follow along to download sample data and analyze it in a Jupyter notebook.

Setup

Create a Python virtual environment

python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

Install and initialize phidata

pip install phidata
phi init

If you encounter errors, try updating pip using python -m pip install --upgrade pip

Create workspace

A workspace is a directory containing the source code for your data platform. Run phi ws init to create a new workspace.

Press Enter to select the default name (data-platform) and template (aws-data-platform)

phi ws init

cd into the new workspace directory

cd data-platform

Run your first workflow

The first step of building a data product is collecting the data. The workflows/crypto/prices.py file contains an example task for downloading crypto data locally to a CSV file. Run it using

phi wf run crypto/prices

Note how we define the output as a CsvTableLocal object with partitions and pre-write checks

# Step 1: Define CsvTableLocal for storing data
# Path: `storage/tables/crypto_prices`
crypto_prices_local = CsvTableLocal(
    name="crypto_prices",
    database="crypto",
    partitions=["ds"],
    write_checks=[NotEmpty()],
)
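To make the idea of partitions and pre-write checks concrete, here is a minimal standard-library sketch. It is not phidata's implementation: `not_empty`, `write_partitioned_csv`, the `ds=<value>` directory naming and the `data.csv` file name are all hypothetical, chosen only to illustrate running checks before writing one CSV per partition value.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]
Check = Callable[[List[Row]], bool]


def not_empty(rows: List[Row]) -> bool:
    """Hypothetical pre-write check: pass only if there is at least one row."""
    return len(rows) > 0


def write_partitioned_csv(
    rows: List[Row],
    base_dir: Path,
    partition_key: str,
    write_checks: List[Check],
) -> List[Path]:
    """Run every check, then write one CSV per distinct partition value."""
    for check in write_checks:
        if not check(rows):
            raise ValueError(f"pre-write check failed: {check.__name__}")

    # Group rows by the value of the partition column, e.g. ds=2023-01-01.
    groups: Dict[str, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)

    written: List[Path] = []
    for value, group in sorted(groups.items()):
        # Assumed layout: <base_dir>/<partition_key>=<value>/data.csv
        part_dir = base_dir / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        out_path = part_dir / "data.csv"
        with out_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        written.append(out_path)
    return written
```

Running the checks before any file is opened means a failed check leaves the storage directory untouched, which is the point of a pre-write check.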

Check out data-platform/storage/tables/crypto_prices for the CSVs
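For a quick look at what the workflow wrote, a small script can walk the table directory and count rows per file. This is only a sketch: `count_csv_rows` is a hypothetical helper, it searches recursively because the exact file layout under the table directory is not described here, and it assumes the first row of each CSV is a header.

```python
import csv
from pathlib import Path
from typing import Dict


def count_csv_rows(table_dir: Path) -> Dict[str, int]:
    """Map each CSV under table_dir (as a relative path) to its data-row count."""
    counts: Dict[str, int] = {}
    for csv_path in sorted(table_dir.rglob("*.csv")):
        with csv_path.open(newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # assumption: the first row is a header
            name = csv_path.relative_to(table_dir).as_posix()
            counts[name] = sum(1 for _ in reader)
    return counts
```

From inside the workspace directory, it could be called as `count_csv_rows(Path("storage/tables/crypto_prices"))`.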

Run your first App

Docker is a great tool for testing locally. Your workspace comes pre-configured with a Jupyter notebook for analyzing data. Install Docker Desktop and, once the engine is running, start the workspace using

phi ws up

Press Enter to confirm. Verify the container is running using the Docker dashboard or docker ps

docker ps --format 'table {{.Names}}\t{{.Image}}'

NAMES               IMAGE
jupyter-container   phidata/jupyter-aws-dp:dev
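If you want to script this verification rather than read the table by eye, the output above can be parsed into a name-to-image mapping. The sketch below assumes the two-column table format shown above; `parse_docker_ps_table` and `is_running` are hypothetical helper names, not part of phidata or Docker.

```python
from typing import Dict


def parse_docker_ps_table(output: str) -> Dict[str, str]:
    """Parse a two-column NAMES/IMAGE table into {container name: image}."""
    containers: Dict[str, str] = {}
    lines = [line for line in output.splitlines() if line.strip()]
    for line in lines[1:]:  # the first non-empty line is the header
        parts = line.split()  # columns are padded with whitespace
        if len(parts) >= 2:
            containers[parts[0]] = parts[1]
    return containers


def is_running(output: str, name: str) -> bool:
    """Return True if a container with this name appears in the table."""
    return name in parse_docker_ps_table(output)


# Sample text copied from the output shown above.
sample = """NAMES               IMAGE
jupyter-container   phidata/jupyter-aws-dp:dev
"""
print(is_running(sample, "jupyter-container"))
```

In practice the text would come from running the docker ps command, for example via `subprocess.run` with captured output, instead of the hard-coded sample.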

Jupyter UI

Open localhost:8888 in a new tab to view the JupyterLab UI. Password: admin

Navigate to notebooks/examples/crypto_prices.ipynb and run all cells.
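If localhost:8888 does not load, one quick way to tell whether anything is listening on that port is a plain TCP connection attempt. This is a generic sketch using Python's standard socket module; `port_is_open` is a hypothetical helper and it only shows that the port accepts connections, not that JupyterLab itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

For the setup above, the call would be `port_is_open("localhost", 8888)`.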

Shutdown

Play around and then stop the workspace using

phi ws down

Next

Check out the documentation for more information or chat with us on Discord


This version: 1.3.0
