phidata
Building Blocks for Data Engineering
Phidata is a set of building blocks for data engineering. It makes data tools plug-n-play so teams can deliver high-quality, reliable data products.
How it works
- You start with a codebase that has data tools pre-configured
- Enable the Apps you need: Airflow, Superset, Jupyter, MLflow
- Build data products (tables, metrics) in a dev environment running locally on Docker
- Write pipelines in Python or SQL; use GPT-3 to generate boilerplate code (a plain-Python sketch follows this list)
- Run production on AWS; infrastructure is also pre-configured
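To give a flavor of what a pipeline step looks like, here is a minimal plain-Python sketch of the quickstart's download task. It deliberately avoids phidata's own task and table APIs (those appear later in the quickstart); the API endpoint and column names are illustrative assumptions, not the contents of the real `workflows/crypto/prices.py`.

```python
# A minimal, plain-Python sketch of a pipeline step: fetch prices and
# write a partitioned CSV. The endpoint and schema are illustrative
# assumptions -- the real task lives in workflows/crypto/prices.py.
from datetime import date
from pathlib import Path
import csv

import requests


def download_crypto_prices(ds: str) -> Path:
    resp = requests.get(
        "https://api.coingecko.com/api/v3/simple/price",
        params={"ids": "bitcoin,ethereum", "vs_currencies": "usd"},
        timeout=30,
    )
    resp.raise_for_status()
    prices = resp.json()

    # Write one CSV per run date ("ds"), mirroring the ds partition
    # used by the quickstart's CsvTableLocal.
    out_dir = Path(f"storage/tables/crypto_prices/ds={ds}")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "prices.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ds", "ticker", "price_usd"])
        for ticker, quote in prices.items():
            writer.writerow([ds, ticker, quote["usd"]])
    return out_file


if __name__ == "__main__":
    print(download_crypto_prices(ds=date.today().isoformat()))
```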
Advantages
- Automate the grunt work
- Recipes for common data tasks
- Everything is version controlled: Infra, Apps and Workflows
- Equal `dev` and `production` environments for data development at scale
- Multiple teams working together share code and define dependencies in a pythonic way
- Formatting (`black`), linting (`ruff`), type-checking (`mypy`) and testing (`pytest`) included
More Information:
- Website: phidata.com
- Documentation: https://docs.phidata.com
- Chat: Discord
Quickstart
Let's build a data product using crypto data. Open the Terminal and follow along to download sample data and analyze it in a Jupyter notebook.
Setup
Create a python virtual environment
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
Install and initialize phidata
pip install phidata
phi init
If you encounter errors, try updating pip:
python -m pip install --upgrade pip
Create workspace
A workspace is a directory containing the source code for your data platform. Run `phi ws init` to create a new workspace. Press Enter to select the default name (`data-platform`) and template (`aws-data-platform`):
phi ws init
cd into the new workspace directory:
cd data-platform
Run your first workflow
The first step of building a data product is collecting the data. The `workflows/crypto/prices.py` file contains an example task for downloading crypto data locally to a CSV file. Run it using:
phi wf run crypto/prices
Note how we define the output as a `CsvTableLocal` object with partitions and pre-write checks:
# Step 1: Define a CsvTableLocal for storing data
# Path: `storage/tables/crypto_prices`
# (CsvTableLocal and NotEmpty come from phidata; imports omitted in this excerpt)
crypto_prices_local = CsvTableLocal(
    name="crypto_prices",
    database="crypto",
    # Partition output by run date ("ds")
    partitions=["ds"],
    # Pre-write check: fail the write if the table is empty
    write_checks=[NotEmpty()],
)
Check out `data-platform/storage/tables/crypto_prices` for the CSVs.
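Before starting any apps, you can spot-check the output with pandas. This is a minimal sketch; it assumes the `ds=...` partition directory layout implied by `partitions=["ds"]`, which may differ from what your workspace actually writes.

```python
# Spot-check the workflow output (a sketch; the partition directory
# layout is an assumption based on partitions=["ds"]).
from pathlib import Path

import pandas as pd

table_dir = Path("storage/tables/crypto_prices")
partition_files = sorted(table_dir.rglob("*.csv"))
print(f"Found {len(partition_files)} partition file(s)")

# Concatenate every partition into one frame and preview it
df = pd.concat((pd.read_csv(p) for p in partition_files), ignore_index=True)
print(df.head())
```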
Run your first App
Docker is a great tool for testing locally. Your workspace comes pre-configured with a Jupyter notebook for analyzing data. Install Docker Desktop and, once the engine is running, start the workspace using:
phi ws up
Press Enter to confirm. Verify the container is running using the Docker dashboard or `docker ps`:
docker ps --format 'table {{.Names}}\t{{.Image}}'
NAMES IMAGE
jupyter-container phidata/jupyter-aws-dp:dev
Jupyter UI
Open localhost:8888 in a new tab to view the JupyterLab UI. Password: admin
Navigate to `notebooks/examples/crypto_prices.ipynb` and run all cells.
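If you want to experiment beyond the bundled notebook, a typical analysis cell might look like the sketch below. The column names (`ds`, `ticker`, `price`) are assumptions; adjust them to match the actual `crypto_prices` schema, and note the table path is relative to where the notebook server runs.

```python
# A sketch of a typical analysis cell. Column names ("ds", "ticker",
# "price") are assumptions -- adjust to the actual crypto_prices schema.
from pathlib import Path

import pandas as pd

# Load every partition written by the crypto/prices workflow
df = pd.concat(
    (pd.read_csv(p) for p in Path("storage/tables/crypto_prices").rglob("*.csv")),
    ignore_index=True,
)

# Average price per coin per run date
summary = df.groupby(["ds", "ticker"])["price"].mean().reset_index()
print(summary.head())
```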
Shutdown
Play around and then stop the workspace using
phi ws down
Next
Check out the documentation for more information or chat with us on Discord.