
phidata

Build data products as code



Phidata is a toolkit for building high-quality, reliable data products.

Use phidata to create tables, metrics and dashboards for analytics and machine learning.

Features:

  • Build data products as code.
  • Build a data platform with dev and production environments.
  • Manage tables as Python objects and build a data lake as code.
  • Run Airflow and Superset locally on Docker, and in production on AWS.
  • Manage everything in one codebase using engineering best practices.

Quick start

This guide shows how to:

  1. Run Airflow, Superset, Jupyter and Postgres locally on Docker.
  2. Run workflows and create Postgres tables.

To follow along, you need Docker installed and a Python 3 environment.

Install phidata

Create a python virtual environment

python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

Install and initialize phidata

pip install phidata
phi init

Create workspace

A workspace is the directory containing the code for your data platform. It is version controlled using git and shared with your team.

Run phi ws init to create a new workspace in the current directory. Press enter to create a default workspace using the aws blueprint.

phi ws init

cd into the new directory

cd data-platform

Run Apps

Apps are open-source tools like Airflow, Superset and Jupyter that run your data products.

Open workspace/settings.py and enable the apps you want to run (line 24). Note: each app uses a lot of memory, so you may need to increase the memory allocated to Docker.

pg_dbs_enabled: bool = True
superset_enabled: bool = True
jupyter_enabled: bool = True
airflow_enabled: bool = True
traefik_enabled: bool = True

Then run phi ws up to create the docker resources. Give the containers about 5 minutes to start and the apps to initialize.

phi ws up

Deploying workspace: data-platform

--**-- Docker env: dev
--**-- Confirm resources:
  -+-> Network: starter-aws
  -+-> Container: dev-pg-starter-aws-container
  -+-> Container: airflow-db-starter-aws-container
  -+-> Container: airflow-redis-starter-aws-container
  -+-> Container: airflow-ws-container
  -+-> Container: airflow-scheduler-container
  -+-> Container: airflow-worker-container
  -+-> Container: jupyter-container
  -+-> Container: superset-db-starter-aws-container
  -+-> Container: superset-redis-starter-aws-container
  -+-> Container: superset-ws-container
  -+-> Container: superset-init-container
  -+-> Container: traefik

Network: starter-aws
Total 13 resources
Confirm deploy [Y/n]:

Check out Superset

Open localhost:8410 in your browser to view the Superset UI.

  • User: admin
  • Pass: admin
  • Logs: docker logs -f superset-ws-container

Check out Airflow

Open localhost:8310 in a separate browser or private window to view the Airflow UI.

  • User: admin
  • Pass: admin
  • Logs: docker logs -f airflow-ws-container

Check out Jupyter

Open localhost:8888 in a browser to view the JupyterLab UI.

  • Pass: admin
  • Logs: docker logs -f jupyter-container

Run workflows

Install dependencies

Before running workflows, we need to install dependencies like pandas and sqlalchemy. The workspace includes a script that installs them; run it:

./scripts/install.sh

Or install dependencies manually using pip:

pip install --editable ".[dev]"

Download crypto prices to a file

The workflows/crypto/prices.py file contains a task that pulls crypto prices from coingecko.com and stores them at storage/crypto/crypto_prices.csv. Run it using the phi wf run command:

phi wf run crypto/prices

Note how we define the file as a File object:

crypto_prices_file = File(
    name="crypto_prices.csv",
    file_dir="crypto",
)
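
For context, the work inside the task itself is just an HTTP call to the CoinGecko API followed by a CSV write. Here is a minimal, standalone sketch of that logic using plain requests and csv (this is not the phidata task API, and the coin list is only illustrative):

# Standalone sketch of what the prices task does; not the phidata task API.
import csv
from datetime import datetime, timezone

import requests

coins = ["bitcoin", "ethereum"]  # illustrative subset
resp = requests.get(
    "https://api.coingecko.com/api/v3/simple/price",
    params={"ids": ",".join(coins), "vs_currencies": "usd"},
)
resp.raise_for_status()

with open("storage/crypto/crypto_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ds", "ticker", "usd"])
    ds = datetime.now(timezone.utc).isoformat()
    for ticker, quote in resp.json().items():
        writer.writerow([ds, ticker, quote["usd"]])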

While this works as a toy example, storing data locally is of limited use. We want to either load this data into a database or store it in cloud storage like S3.
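
For the cloud-storage route, a one-off upload of the CSV with boto3 is enough to illustrate the idea; the bucket name below is a placeholder:

# Sketch: push the local CSV to S3 with boto3; the bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="storage/crypto/crypto_prices.csv",
    Bucket="my-data-platform-bucket",
    Key="crypto/crypto_prices.csv",
)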

Let's load this data into a Postgres table running locally on Docker.

Download crypto prices to a postgres table

The workflows/crypto/prices_pg.py file contains a workflow that loads the crypto price data into a Postgres table: crypto_prices_daily. Run it using the phi wf run command:

phi wf run crypto/prices_pg

We define the table using a PostgresTable object:

crypto_prices_daily_pg = PostgresTable(
    name="crypto_prices_daily",
    db_app=PG_DB_APP,
    airflow_conn_id=PG_DB_CONN_ID,
)
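
Under the hood, the load step amounts to writing a DataFrame into that table. A rough equivalent using pandas and SQLAlchemy directly (not the phidata workflow API), pointed at the local dev database whose credentials are listed below:

# Rough equivalent of the load using pandas + SQLAlchemy directly.
# Connection details match the local dev database listed under "Credentials".
# Assumes a Postgres driver (e.g. psycopg2) is installed.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://starter-aws:starter-aws@127.0.0.1:5432/starter-aws")

df = pd.read_csv("storage/crypto/crypto_prices.csv")  # file written by the previous workflow
df.to_sql("crypto_prices_daily", engine, if_exists="append", index=False)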

You can now query the table using the database tool of your choice.

Credentials:

  • Host: 127.0.0.1
  • Port: 5432
  • User: starter-aws
  • Pass: starter-aws
  • Database: starter-aws

We're big fans of TablePlus for database management.
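
If you'd rather check the table from Python than a GUI, a quick read with pandas against the same dev credentials works too:

# Quick sanity check of the table from Python, using the dev credentials above.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://starter-aws:starter-aws@127.0.0.1:5432/starter-aws")
print(pd.read_sql("select * from crypto_prices_daily limit 5", engine))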

Next steps

  1. Deploy to AWS.
  2. Enable traefik and use airflow.dp and superset.dp local domains.
  3. Read the documentation to learn more about phidata.

Shutdown workspace

Shut down all resources using phi ws down:

phi ws down

or shut down individual apps by name:

phi ws down --app jupyter

phi ws down --app airflow

phi ws down --app superset
