
AWS Data Platform

This repo contains the code for building a data platform on AWS.

We enable two data environments:

  1. dev: a development environment running on Docker
  2. prd: a production environment running on AWS + k8s

Setup

  1. Create + activate a virtual environment:
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
  2. Install + init phidata:
pip install phidata
phi init

From the data-platform directory:

  3. Set up the workspace:
phi ws setup
  4. Copy workspace/example_secrets to workspace/secrets:
cp -r workspace/example_secrets workspace/secrets
  5. Deploy the dev containers to docker:
phi ws up

phi will create the following resources:

  • Container: dev-pg-dp-container
  • Network: dp
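
To verify, you can inspect the created resources directly with docker:

# confirm the dev postgres container and the dp network exist
docker ps --filter "name=dev-pg-dp-container"
docker network ls --filter "name=dp"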

Optional: If something fails, try running again with debug logs:

phi ws up -d

Optional: Create a .env file:

cp example.env .env

Using the dev environment

The workspace/dev directory contains the code for the dev resources. The workspace/settings.py file can be used to enable open-source applications such as:

  1. Postgres App: for storing dev data (runs 1 container)
  2. Airflow App: for running dags & pipelines (runs 5 containers)
  3. Superset App: for visualizing dev data (runs 4 containers)
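
These apps are toggled by boolean flags in workspace/settings.py (airflow_enabled and superset_enabled are shown in the sections below). To list the flags your settings file exposes:

grep "enabled" workspace/settings.py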

Update the workspace/settings.py file and run:

phi ws up

TIP: The phi ws ... commands use --env dev and --config docker by default. These defaults are set in the workspace/config.py file.

Running phi ws up is equivalent to running phi ws up --env dev --config docker

Run Airflow

  1. Set airflow_enabled = True in workspace/settings.py and run phi ws up
  2. Check out the airflow webserver running in the airflow-ws-container:
  • url: http://localhost:8310/
  • user: admin
  • pass: admin
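
To confirm the webserver is responding before you log in, a quick check with standard curl:

# expect an HTTP status line from the airflow webserver
curl -sI http://localhost:8310/ | head -n 1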

Run Superset

  1. Set superset_enabled = True in workspace/settings.py and run phi ws up
  2. Check out the superset webserver running in the superset-ws-container:
  • url: http://localhost:8410/
  • user: admin
  • pass: admin
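
The Superset app runs 4 containers; assuming their names all contain "superset" (as superset-ws-container does), you can confirm they are up with:

docker ps --filter "name=superset"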

Format + lint workspace

Format with black & lint with mypy using:

./scripts/format.sh

If black or mypy are not installed, run:

pip install black mypy
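
The script presumably wraps direct tool invocations along these lines (a sketch; the exact targets in format.sh may differ):

# format the codebase, then type-check it
black .
mypy .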

Upgrading phidata version

Activate the virtualenv: source ~/.venvs/dpenv/bin/activate

  1. Upgrade phidata:
pip install phidata --upgrade
  2. Rebuild local images & recreate containers:
CACHE=f phi ws up --env dev --config docker
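
To confirm the upgrade took effect in your virtualenv:

pip show phidata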

Optional: Install workspace locally

Install the workspace & python packages locally in your virtual env using:

./scripts/install.sh

This will:

  1. Install python packages from requirements.txt
  2. Install the python project in --editable mode
  3. Install requirements-airflow.txt without dependencies, for code completion

This enables:

  1. Running black & mypy locally
  2. Running workflows locally
  3. Editor auto-completion
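
Based on the steps listed above, install.sh presumably boils down to something like this sketch (the real script may differ):

# 1. install python packages
pip install -r requirements.txt
# 2. install this project in editable mode
pip install --editable .
# 3. install airflow requirements without dependencies, for code completion
pip install --no-deps -r workspace/dev/airflow_resources/requirements-airflow.txt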

Add python packages

Following PEP-631, we should add dependencies to the pyproject.toml file.

To add a new package:

  1. Add the module to the pyproject.toml file.
  2. Run: ./scripts/upgrade.sh. This script updates the requirements.txt file.
  3. Optional: Run: ./scripts/install.sh to install the new dependencies in a local virtual env.
  4. Run CACHE=f phi ws up to recreate images + containers
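
For example, to add a hypothetical pandas dependency:

# 1. add "pandas" under the [project] dependencies in pyproject.toml, then:
./scripts/upgrade.sh    # regenerates requirements.txt
./scripts/install.sh    # optional: update the local virtual env
CACHE=f phi ws up       # recreate images + containers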

Adding airflow providers

Airflow requirements are stored in the workspace/dev/airflow_resources/requirements-airflow.txt file.

To add new airflow providers:

  1. Add the module to the workspace/dev/airflow_resources/requirements-airflow.txt file.
  2. Optional: Run: ./scripts/install.sh to install the new dependencies in a local virtual env.
  3. Run CACHE=f phi ws up --name airflow to recreate images + containers
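
For example, to add the Slack provider (an illustrative choice):

echo "apache-airflow-providers-slack" >> workspace/dev/airflow_resources/requirements-airflow.txt
./scripts/install.sh                # optional: update the local virtual env
CACHE=f phi ws up --name airflow    # recreate the airflow images + containers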

To force recreate all images & containers, use the CACHE env variable:

CACHE=false phi ws up \
  --env dev \
  --config docker \
  --type image|container \
  --name airflow|superset|pg \
  --app airflow|superset
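
For example, to force recreate only the airflow images and containers:

CACHE=false phi ws up --env dev --config docker --name airflow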

Shut down workspace

phi ws down

Restart all resources

phi ws restart

Restart all containers

phi ws restart --type container

Restart traefik app

phi ws restart --app traefik

Restart airflow app

phi ws restart --app airflow

Add environment/secret variables to your apps

The containers read env variables using the env_file param and secrets using the secrets_file param. These files are stored in the workspace/env and workspace/secrets directories respectively.

Airflow

To add env variables to your airflow containers:

  1. Update the workspace/env/dev_airflow_env.yml file.
  2. Restart all airflow containers using: phi ws restart --name airflow --type container

To add secret variables to your airflow containers:

  1. Update the workspace/secrets/dev_airflow_secrets.yml file.
  2. Restart all airflow containers using: phi ws restart --name airflow --type container
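
For example, to add a hypothetical env variable and apply it (assuming the env file is a flat YAML key: value mapping):

# append an illustrative variable to the dev airflow env file
cat >> workspace/env/dev_airflow_env.yml <<'EOF'
MY_FEATURE_FLAG: "true"
EOF
phi ws restart --name airflow --type container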

Test a DAG

# Open a shell in the airflow-ws or airflow-worker container
docker exec -it airflow-ws-container zsh
docker exec -it airflow-worker-container zsh

# Test run the DAGs using module name
python -m workflow.dir.file

# Test run the DAG file
python /mnt/workspaces/data-platform/workflow/dir/file.py

# List DAGs
airflow dags list

# List tasks in DAG
airflow tasks list dag_name \
  -S /mnt/workspaces/data-platform/workflow/dir/file.py

# Test airflow task
airflow tasks test dag_name task_name 2022-07-01

Recreate everything

Notes:

  • Use data-platform as the working directory
  • Deactivate existing venv using deactivate if needed
echo "*- Deleting venv"
rm -rf ~/.venvs/dpenv

echo "*- Deleting af-db-dp-volume volume"
docker volume rm af-db-dp-volume

echo "*- Recreating venv"
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

echo "*- Install phi"
pip install phidata
phi init

echo "*- Setup + deploying workspace"
phi ws setup
CACHE=f phi ws up
