# AWS Data Platform
This repo contains the code for building a data platform on AWS.
We enable 2 data environments:

- dev: A development environment running on Docker
- prd: A production environment running on AWS + k8s
## Setup
- Create + activate a virtual env:

```sh
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
```
- Install + init `phidata`:

```sh
pip install phidata
phi init
```

From the `data-platform` dir:
- Setup workspace:

```sh
phi ws setup
```
- Copy `workspace/example_secrets` to `workspace/secrets`:

```sh
cp -r workspace/example_secrets workspace/secrets
```
- Deploy dev containers to Docker using:

```sh
phi ws up
```
`phi` will create the following resources:

- Container: `dev-pg-dp-container`
- Network: `dp`
Optional: If something fails, try running again with debug logs:

```sh
phi ws up -d
```
Optional: Create a `.env` file:

```sh
cp example.env .env
```
## Using the dev environment
The `workspace/dev` directory contains the code for the dev resources. The `workspace/settings.py` file can be used to enable open-source applications like:

- Postgres App: for storing dev data (runs 1 container)
- Airflow App: for running DAGs & pipelines (runs 5 containers)
- Superset App: for visualizing dev data (runs 4 containers)
Update the `workspace/settings.py` file and run:

```sh
phi ws up
```
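For reference, the enable flags in `workspace/settings.py` look roughly like the sketch below. Only `airflow_enabled` and `superset_enabled` appear later in this README; the Postgres flag name is an assumption:

```python
# workspace/settings.py — minimal sketch, not the full file.
# Each flag toggles one of the apps listed above.
pg_enabled = True        # Postgres app (1 container) — flag name is an assumption
airflow_enabled = True   # Airflow app (5 containers)
superset_enabled = True  # Superset app (4 containers)
```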
TIP: The `phi ws ...` commands use `--env dev` and `--config docker` by default; these defaults are set in the `workspace/config.py` file. Running `phi ws up` is equivalent to running `phi ws up --env dev --config docker`.
### Run Airflow

- Set `airflow_enabled = True` in `workspace/settings.py` and run `phi ws up`
- Check out the airflow webserver running in the `airflow-ws-container`:
  - url: http://localhost:8310/
  - user: `admin`
  - pass: `admin`
### Superset webserver

- Set `superset_enabled = True` in `workspace/settings.py` and run `phi ws up`
- Check out the superset webserver running in the `superset-ws-container`:
  - url: http://localhost:8410/
  - user: `admin`
  - pass: `admin`
## Format + lint workspace

Format with `black` & lint with `mypy` using:

```sh
./scripts/format.sh
```

If you need to install these packages, run:

```sh
pip install black mypy
```
## Upgrading phidata version

- Activate the virtualenv:

```sh
source ~/.venvs/dpenv/bin/activate
```

- Upgrade phidata:

```sh
pip install phidata --upgrade
```

- Rebuild local images & recreate containers:

```sh
CACHE=f phi ws up --env dev --config docker
```
## Optional: Install workspace locally

Install the workspace & python packages locally in your virtual env using:

```sh
./scripts/install.sh
```

This will:

- Install python packages from `requirements.txt`
- Install the python project in `--editable` mode
- Install `requirements-airflow.txt` without dependencies, for code completion
This enables:

- Running `black` & `mypy` locally
- Running workflows locally
- Editor auto-completion
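Roughly, the script runs steps equivalent to the sketch below; the real `./scripts/install.sh` may differ in flags and paths:

```sh
# Sketch of ./scripts/install.sh — assumed equivalent of the 3 steps above
pip install -r requirements.txt        # python packages from requirements.txt
pip install --editable .               # the python project, in editable mode
pip install --no-deps \
  -r workspace/dev/airflow_resources/requirements-airflow.txt  # for code completion only
```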
## Add python packages

Following PEP-631, we should add dependencies to the `pyproject.toml` file.

To add a new package:

- Add the module to the `pyproject.toml` file (see the example below).
- Run `./scripts/upgrade.sh`. This script updates the `requirements.txt` file.
- Optional: Run `./scripts/install.sh` to install the new dependencies in a local virtual env.
- Run `CACHE=f phi ws up` to recreate images + containers.
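For example, a PEP-631 style dependency entry in `pyproject.toml` looks like this; the package name is just a placeholder, not a project requirement:

```toml
[project]
dependencies = [
  "pandas",  # placeholder — add your package here
]
```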
## Adding airflow providers

Airflow requirements are stored in the `workspace/dev/airflow_resources/requirements-airflow.txt` file.

To add new airflow providers:

- Add the module to the `workspace/dev/airflow_resources/requirements-airflow.txt` file (see the example line below).
- Optional: Run `./scripts/install.sh` to install the new dependencies in a local virtual env.
- Run `CACHE=f phi ws up --name airflow` to recreate images + containers.
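For example, to add the Amazon provider (shown only as an illustration), append its pip package name to the file:

```
apache-airflow-providers-amazon
```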
To force recreate all images & containers, use the `CACHE` env variable:

```sh
CACHE=false phi ws up \
  --env dev \
  --config docker \
  --type image|container \
  --name airflow|superset|pg \
  --app airflow|superset
```
## Shut down workspace

```sh
phi ws down
```

## Restart all resources

```sh
phi ws restart
```

## Restart all containers

```sh
phi ws restart --type container
```

## Restart traefik app

```sh
phi ws restart --app traefik
```

## Restart airflow app

```sh
phi ws restart --app airflow
```
## Add environment/secret variables to your apps

The containers read environment variables using the `env_file` param and secrets using the `secrets_file` param. These files are stored in the `workspace/env` and `workspace/secrets` directories respectively.
### Airflow

To add env variables to your airflow containers:

- Update the `workspace/env/dev_airflow_env.yml` file (example below).
- Restart all airflow containers using:

```sh
phi ws restart --name airflow --type container
```
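For illustration, assuming the file is a flat `KEY: value` mapping (the exact schema isn't shown in this README), an entry might look like:

```yaml
# workspace/env/dev_airflow_env.yml — hypothetical example entry
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: "16"
```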
To add secret variables to your airflow containers:

- Update the `workspace/secrets/dev_airflow_secrets.yml` file.
- Restart all airflow containers using:

```sh
phi ws restart --name airflow --type container
```
### Test a DAG

```sh
# Open a shell in the airflow-ws or airflow-worker container
docker exec -it airflow-ws-container zsh
docker exec -it airflow-worker-container zsh

# Test run a DAG using its module name
python -m workflow.dir.file

# Test run a DAG file
python /mnt/workspaces/data-platform/workflow/dir/file.py

# List DAGs
airflow dags list

# List tasks in a DAG
airflow tasks list \
  -S /mnt/workspaces/data-platform/workflow/dir/file.py \
  -t dag_name

# Test an airflow task
airflow tasks test dag_name task_name 2022-07-01
```
## Recreate everything

Notes:

- Use `data-platform` as the working directory
- Deactivate an existing venv using `deactivate` if needed
```sh
echo "*- Deleting venv"
rm -rf ~/.venvs/dpenv

echo "*- Deleting af-db-dp-volume volume"
docker volume rm af-db-dp-volume

echo "*- Recreating venv"
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

echo "*- Install phi"
pip install phidata
phi init

echo "*- Setup + deploying workspace"
phi ws setup
CACHE=f phi ws up
```