A toolkit for data engineering
phidata
Build data products as code
Phidata is a toolkit for building high-quality, reliable data products.
Use phidata to create tables, metrics and dashboards for analytics and machine learning.
Features:
- Build data products as code.
- Build a data platform with dev and production environments.
- Manage tables as Python objects and build a data lake as code.
- Run Airflow and Superset locally on Docker and in production on AWS.
- Manage everything in one codebase using engineering best practices.
More Information:
- Website: phidata.com
- Documentation: https://docs.phidata.com
- Chat: Discord
Quick start
This guide shows how to:
- Run Airflow, Superset, Jupyter and Postgres locally on Docker.
- Run workflows and create Postgres tables.
To follow along, you need:
- Python 3.7+
- Docker Desktop
Install phidata
Create a Python virtual environment
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
Install and initialize phidata
pip install phidata
phi init
Create workspace
A workspace is the directory containing the code for your data platform. It is version controlled using git and shared with your team.
Run phi ws init to create a new workspace in the current directory. Press enter to create a default workspace using the aws blueprint.
phi ws init
cd into the new directory:
cd data-platform
Run Apps
Apps are open-source tools like Airflow, Superset and Jupyter that run your data products.
Open workspace/settings.py and enable the apps you want to run (line 24). Note: each app uses a lot of memory, so you may need to increase the memory allocated to Docker.
pg_dbs_enabled: bool = True
superset_enabled: bool = True
jupyter_enabled: bool = True
airflow_enabled: bool = True
traefik_enabled: bool = True
Then run phi ws up to create the Docker resources. Allow about 5 minutes for the containers to start and the apps to initialize.
phi ws up
Deploying workspace: data-platform
--**-- Docker env: dev
--**-- Confirm resources:
-+-> Network: starter-aws
-+-> Container: dev-pg-starter-aws-container
-+-> Container: airflow-db-starter-aws-container
-+-> Container: airflow-redis-starter-aws-container
-+-> Container: airflow-ws-container
-+-> Container: airflow-scheduler-container
-+-> Container: airflow-worker-container
-+-> Container: jupyter-container
-+-> Container: superset-db-starter-aws-container
-+-> Container: superset-redis-starter-aws-container
-+-> Container: superset-ws-container
-+-> Container: superset-init-container
-+-> Container: traefik
Network: starter-aws
Total 13 resources
Confirm deploy [Y/n]:
Check out Superset
Open localhost:8410 in your browser to view the Superset UI.
- User: admin
- Pass: admin
- Logs:
docker logs -f superset-ws-container
Check out Airflow
Open localhost:8310 in a separate browser or private window to view the Airflow UI.
- User: admin
- Pass: admin
- Logs:
docker logs -f airflow-ws-container
Check out Jupyter
Open localhost:8888 in a browser to view the JupyterLab UI.
- Pass: admin
- Logs:
docker logs -f jupyter-container
Run workflows
Install dependencies
Before running workflows, we need to install dependencies like pandas and sqlalchemy.
The workspace includes a script to install dependencies; run it:
./scripts/install.sh
Or install dependencies manually using pip:
pip install --editable ".[dev]"
Download crypto prices to a file
The workflows/crypto/prices.py file contains a task that pulls crypto prices from coingecko.com and stores them at storage/crypto/crypto_prices.csv. Run it using the phi wf run command:
phi wf run crypto/prices
Note how we define the file as a File object:
crypto_prices_file = File(
name="crypto_prices.csv",
file_dir="crypto",
)
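For context, here is a rough sketch of what the download step does, written with plain requests and csv rather than phidata's task API; the coin list and CSV columns are assumptions for illustration:

import csv
import requests

# Hypothetical coin list for this sketch; the actual task may pull a different set.
COINS = ["bitcoin", "ethereum", "dogecoin"]

# CoinGecko's public simple/price endpoint returns current prices in USD.
resp = requests.get(
    "https://api.coingecko.com/api/v3/simple/price",
    params={"ids": ",".join(COINS), "vs_currencies": "usd"},
    timeout=10,
)
resp.raise_for_status()
prices = resp.json()  # e.g. {"bitcoin": {"usd": 27000.0}, ...}

# Write the prices to the same path used by the File object above.
with open("storage/crypto/crypto_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticker", "usd"])
    for coin, quote in prices.items():
        writer.writerow([coin, quote["usd"]])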
While this works as a toy example, storing data locally is not of much use. We want to either load this data into a database or store it in cloud storage like S3.
Let's load this data into a Postgres table running locally on Docker.
Download crypto prices to a postgres table
The workflows/crypto/prices_pg.py file contains a workflow that loads crypto price data to a Postgres table: crypto_prices_daily. Run it using the phi wf run command:
phi wf run crypto/prices_pg
We define the table using a PostgresTable object:
crypto_prices_daily_pg = PostgresTable(
name="crypto_prices_daily",
db_app=PG_DB_APP,
airflow_conn_id=PG_DB_CONN_ID,
)
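Under the hood, the workflow reads the downloaded prices and appends them to this table. A rough plain pandas + sqlalchemy equivalent (not phidata's actual API; it assumes psycopg2 is installed and uses the local dev credentials listed below) looks something like:

import pandas as pd
from sqlalchemy import create_engine

# Local dev Postgres started by phi ws up (credentials listed below).
engine = create_engine(
    "postgresql+psycopg2://starter-aws:starter-aws@127.0.0.1:5432/starter-aws"
)

# Append the CSV produced earlier to the crypto_prices_daily table.
df = pd.read_csv("storage/crypto/crypto_prices.csv")
df.to_sql("crypto_prices_daily", engine, if_exists="append", index=False)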
You can now query the table using the database tool of your choice.
Credentials:
- Host: 127.0.0.1
- Port: 5432
- User: starter-aws
- Pass: starter-aws
- Database: starter-aws
We're big fans of TablePlus for database management.
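If you prefer querying from Python, here is a minimal sketch using sqlalchemy and pandas (installed earlier as dev dependencies) with the connection details above; it assumes psycopg2 is available:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://starter-aws:starter-aws@127.0.0.1:5432/starter-aws"
)

# Pull a few rows from the table created by the workflow.
df = pd.read_sql("SELECT * FROM crypto_prices_daily LIMIT 5", engine)
print(df)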
Next steps
- Deploy to AWS.
- Enable traefik and use the airflow.dp and superset.dp local domains.
- Read the documentation to learn more about phidata.
Shutdown workspace
Shut down all resources using phi ws down:
phi ws down
Or shut down individual apps by name:
phi ws down --app jupyter
phi ws down --app airflow
phi ws down --app superset
More Information:
- Website: phidata.com
- Documentation: https://docs.phidata.com
- Chat: Discord