
Infrastructure for AI applications and machine learning pipelines


PackYak AWS CDK

PackYak is a next-generation framework for building and deploying Data Lakehouses in AWS with a Git-like versioned developer workflow that simplifies how Data Scientists and Data Engineers collaborate.

It enables you to deploy your entire Data Lakehouse, ETL, and Machine Learning platforms on AWS with no external dependencies, maintain your Data Tables with Git-like versioning semantics, and scale data production with Dagster-like Software-defined Asset Graphs.

It combines 5 key technologies into one framework that makes scaling Data Lakehouses and Data Science teams dead simple:

  1. Git-like versioning of Data Tables with Project Nessie - no more worrying about the version of data, simply use branches, tags and commits to freeze data or roll back mistakes.
  2. Software-defined Assets (as seen in Dagster) - think of your data pipelines in terms of the data it produces. Greatly simplify how data is produced, modified over time and backfilled in the event of errors.
  3. Infrastructure-as-Code (AWS CDK and Pulumi) - deploy in minutes and manage it all yourself with minimal effort.
  4. Apache Spark - write your ETL as simple Python processes that are then scaled automatically over a managed AWS EMR Spark cluster.
  5. Streamlit - build Streamlit applications that integrate the Data Lakehouse and Apache Spark to provide interactive reports and exploratory tools over the versioned data lake.
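The software-defined asset model in point 2 can be sketched in a few lines of plain Python. This is a toy illustration of the concept (assets declared as functions with upstream dependencies, materialized in dependency order), not PackYak's or Dagster's actual API; all names here are hypothetical:

```python
# Toy sketch of software-defined assets: each asset is a function that
# declares the assets it depends on; materializing an asset recursively
# materializes its upstream dependencies first.
assets = {}  # asset name -> (list of upstream asset names, producer function)

def asset(name, deps=()):
    """Register a function as a named asset with upstream dependencies."""
    def register(fn):
        assets[name] = (list(deps), fn)
        return fn
    return register

def materialize(name, cache=None):
    """Compute an asset, materializing its upstream assets first (memoized)."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps, fn = assets[name]
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

@asset("raw_orders")
def raw_orders():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

@asset("order_total", deps=["raw_orders"])
def order_total(orders):
    return sum(o["amount"] for o in orders)

assert materialize("order_total") == 35
```

Thinking of pipelines this way means a backfill after an upstream fix is just re-materializing the downstream assets, rather than re-running an imperative job graph by hand.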

Get Started

Install Docker

If you haven't already, install Docker.

Install Python Poetry & Plugins

# Install the Python Poetry CLI
curl -sSL https://install.python-poetry.org | python3 -

# Add the export plugin to generate narrow requirements.txt
poetry self add poetry-plugin-export

Install the packyak CLI:

pip install packyak

Create a new Project

packyak new my-project
cd ./my-project

Deploy to AWS

poetry run cdk deploy

Git-like Data Catalog (Project Nessie)

PackYak comes with a Construct for hosting a Project Nessie catalog that supports Git-like versioning of the tables in a Data Lakehouse.

It deploys an AWS DynamoDB version store and an API hosted on AWS Lambda or AWS ECS. The Nessie server is stateless and can be scaled easily with minimal-to-zero operational overhead.

Create a DynamoDBNessieVersionStore

from packyak.aws_cdk import DynamoDBNessieVersionStore

versionStore = DynamoDBNessieVersionStore(
  scope=stack,
  id="VersionStore",
  versionStoreName="my-version-store",
)

Create a Bucket to store the Data Tables (e.g. Parquet files). This bucket will store the repository's data.

from aws_cdk.aws_s3 import Bucket

myRepoBucket = Bucket(
  scope=stack,
  id="MyCatalogBucket",
)

Create the Nessie Catalog Service

from packyak.aws_cdk import NessieECSCatalog

# hosted on AWS ECS
myCatalog = NessieECSCatalog(
  scope=stack,
  id="MyCatalog",
  vpc=vpc,
  warehouseBucket=myRepoBucket,
  catalogName=lakeHouseName,
  versionStore=versionStore,
)

Create a Branch

Branch off the main branch of data into a dev branch to "freeze" the data as of a particular commit:

CREATE BRANCH dev FROM main
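The versioning semantics behind that statement can be sketched as a toy model in plain Python (branches as named pointers to immutable commits; this is an illustration of the idea, not Nessie's implementation, and all names are hypothetical):

```python
# Toy model of Git-like table versioning: a commit is an immutable snapshot
# of all tables, and a branch is just a named pointer to a commit.
commits = {0: {}}        # commit id -> {table name: rows}
branches = {"main": 0}   # branch name -> commit id
next_commit_id = 1

def commit(branch, table, rows):
    """Write a new immutable snapshot on `branch` with `table` updated."""
    global next_commit_id
    snapshot = {**commits[branches[branch]], table: rows}
    commits[next_commit_id] = snapshot
    branches[branch] = next_commit_id
    next_commit_id += 1

def create_branch(name, from_branch):
    """CREATE BRANCH name FROM from_branch: point at the same commit."""
    branches[name] = branches[from_branch]

def read(branch, table):
    return commits[branches[branch]].get(table)

commit("main", "orders", [1, 2])
create_branch("dev", "main")          # dev is frozen at main's current commit
commit("main", "orders", [1, 2, 3])   # main moves on...
assert read("dev", "orders") == [1, 2]     # ...but dev still sees the old data
assert read("main", "orders") == [1, 2, 3]
```

In this model, rolling back a mistake is just resetting a branch pointer to an earlier commit id, which is exactly the "freeze data or roll back mistakes" workflow described above.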

Deploy a Spark Cluster

Create an EMR Cluster for processing data

spark = Cluster(
  scope=stack,
  id="Spark",
  clusterName="my-cluster",
  vpc=vpc,
  catalogs={
    # use the Nessie Catalog as the default data catalog for Spark SQL queries
    "spark_catalog": myCatalog,
  },
  installSSMAgent=True,
)

Configure SparkSQL to be served over JDBC

sparkSQL = spark.jdbc(port=10001)

Deploy Streamlit Site

Stand up a Streamlit Site to serve interactive reports and applications over your data.

site = StreamlitSite(
  scope=stack,
  id="Site",
  # Point it at the Streamlit site entrypoint
  home="app/home.py",
  # Where the Streamlit pages/tabs are; defaults to `dirname(home)/pages/*.py`
  # pages="app/pages"
)

Deploy to AWS

packyak deploy

Or via the AWS CDK CLI:

poetry run cdk deploy

