Infrastructure for AI applications and machine learning pipelines

These details have not been verified by PyPI

Project description

PackYak

Note: PackYak is still in active development.

PackYak is an open source platform for building versioned Data Lakehouses in AWS with Python and the AWS CDK.

With one CLI command, you can spin up a world-class Data Platform in your own AWS account, with Git versioning of Data, SageMaker Domains, Spark Clusters, Streamlit Sites and more.

Maintain your Data Lake just like you do your code - with Git! Leverage branches, tags and commits to version your table schemas and data. Provide a consistent view of the data for consumers, enable rapid experimentation and roll back mistakes with ease.

Development of ETL or ML training jobs is made easy with the yak CLI. Easily set up remote sessions on your Spark, Ray or Dask clusters to enjoy the power of cloud computing without giving up the experience of your local IDE.

How it Works

PackYak combines modern Software Development, Cloud Engineering and Data Engineering practices into one Python framework:

Git-like versioning of Data Tables with Project Nessie - no more worrying about which version of the data you have; use branches, tags and commits to freeze data or roll back mistakes.
Software-defined Assets (as seen in Dagster) - think of your data pipelines in terms of the data they produce. This greatly simplifies how data is produced, modified over time and backfilled in the event of errors.
Infrastructure-as-Code (AWS CDK and Pulumi) - deploy in minutes and manage it all yourself with minimal effort.
Apache Spark - write your ETL as simple Python processes that are then scaled automatically over a managed AWS EMR Spark Cluster.
Streamlit - build Streamlit applications that integrate the Data Lakehouse and Apache Spark to provide interactive reports and exploratory tools over the versioned data lake.
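
The software-defined asset idea from the list above can be illustrated without any framework. The sketch below is a toy and is not PackYak's or Dagster's API: the `asset` decorator, the `ASSETS` registry and the `materialize` function are hypothetical names chosen for this example. Each function declares which upstream assets it needs, and materializing one asset computes its dependencies first.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: asset name -> (upstream asset names, compute function)
ASSETS: Dict[str, Tuple[Tuple[str, ...], Callable]] = {}


def asset(*deps: str) -> Callable:
    """Register a function as the producer of the asset named after it."""
    def register(fn: Callable) -> Callable:
        ASSETS[fn.__name__] = (deps, fn)
        return fn
    return register


def materialize(name: str, cache: Optional[Dict[str, object]] = None) -> object:
    """Compute an asset after recursively computing its upstream assets."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps, fn = ASSETS[name]
        cache[name] = fn(*(materialize(dep, cache) for dep in deps))
    return cache[name]


@asset()
def raw_orders() -> List[dict]:
    # Stand-in for reading a source table
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 5}]


@asset("raw_orders")
def large_orders(raw_orders: List[dict]) -> List[dict]:
    # Derived asset: defined by the data it produces, not by a schedule
    return [order for order in raw_orders if order["amount"] >= 10]


print(materialize("large_orders"))
```

Because every asset is described by what it produces and what it depends on, re-running `materialize` for one name is enough to backfill it and everything upstream of it.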

Get Started

Install Docker

If you haven't already, install Docker.

Install Python Poetry & Plugins

# Install the Python Poetry CLI
curl -sSL https://install.python-poetry.org | python3 -

# Add the export plugin to generate narrow requirements.txt
poetry self add poetry-plugin-export

Install the `packyak` CLI:

pip install packyak

Create a new Project

packyak new my-project
cd ./my-project

Deploy to AWS

poetry run cdk deploy

Git-like Data Catalog (Project Nessie)

PackYak comes with a Construct for hosting a Project Nessie catalog that supports Git-like versioning of the tables in a Data Lakehouse.

It deploys with an AWS DynamoDB version store and an API hosted in AWS Lambda or AWS ECS. The Nessie server is stateless, so it can be scaled easily with minimal-to-zero operational overhead.

Create a `DynamoDBNessieVersionStore`

from packyak.aws_cdk import DynamoDBNessieVersionStore

versionStore = DynamoDBNessieVersionStore(
  scope=stack,
  id="VersionStore",
  versionStoreName="my-version-store",
)

Create a Bucket to store Data Tables (e.g. Parquet files). This bucket holds the repository's data.

from aws_cdk.aws_s3 import Bucket

# S3 bucket that holds the table files for the catalog
myRepoBucket = Bucket(
  scope=stack,
  id="MyCatalogBucket",
)

Create the Nessie Catalog Service

# hosted on AWS ECS
myCatalog = NessieECSCatalog(
  scope=stack,
  id="MyCatalog",
  vpc=vpc,
  warehouseBucket=myRepoBucket,
  catalogName=lakeHouseName,
  versionStore=versionStore,
)

Create a Branch

Branch off the main branch of data into a dev branch to "freeze" the data as of a particular commit.

CREATE BRANCH dev FROM main
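
To make the "freeze" behaviour concrete, here is a minimal in-memory sketch of the branching model. It is not Nessie's implementation or API; `ToyCatalog` and its methods are hypothetical names. The point it illustrates is that a branch is only a pointer to an immutable commit, so writes to `main` after the branch was created do not change what `dev` sees.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCatalog:
    """Hypothetical in-memory stand-in for a versioned catalog."""
    # Each commit is a snapshot: table name -> rows
    commits: List[Dict[str, tuple]] = field(default_factory=lambda: [{}])
    # Each branch is a pointer to a commit index
    branches: Dict[str, int] = field(default_factory=lambda: {"main": 0})

    def create_branch(self, name: str, source: str) -> None:
        # The new branch starts at the commit the source branch points to
        self.branches[name] = self.branches[source]

    def write(self, branch: str, table: str, rows: tuple) -> None:
        # Copy the current snapshot, apply the change, advance only this branch
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot[table] = rows
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1

    def read(self, branch: str, table: str) -> tuple:
        return self.commits[self.branches[branch]][table]


catalog = ToyCatalog()
catalog.write("main", "orders", (1, 2))
catalog.create_branch("dev", "main")        # analogous to CREATE BRANCH dev FROM main
catalog.write("main", "orders", (1, 2, 3))  # main keeps moving
print(catalog.read("dev", "orders"))
print(catalog.read("main", "orders"))
```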

Deploy a Spark Cluster

Create an EMR Cluster for processing data

spark = Cluster(
  scope=stack,
  id="Spark",
  clusterName="my-cluster",
  vpc=vpc,
  catalogs={
    # use the Nessie Catalog as the default data catalog for Spark SQL queries
    "spark_catalog": myCatalog,
  },
  installSSMAgent=True,
)
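
For orientation, wiring a Nessie catalog into Spark by hand usually comes down to a handful of Spark properties. The sketch below builds them as a plain dictionary, using the property names documented by Apache Iceberg and Project Nessie. The settings PackYak generates for the `catalogs` argument may differ; `nessie_catalog_conf`, the catalog name, the URI and the bucket path are placeholders for this example.

```python
from typing import Dict


def nessie_catalog_conf(name: str, uri: str, warehouse: str, ref: str = "main") -> Dict[str, str]:
    """Return Spark properties for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie REST endpoint
        f"{prefix}.ref": ref,              # branch or tag that queries read from
        f"{prefix}.warehouse": warehouse,  # where table files are stored
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
    }


conf = nessie_catalog_conf(
    name="nessie",
    uri="http://nessie.example.internal:19120/api/v1",  # placeholder endpoint
    warehouse="s3://my-catalog-bucket/warehouse",       # placeholder bucket
)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

The `ref` property is what ties a Spark session to a branch such as `dev`, so two sessions can query different versions of the same tables.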

SSH into the Spark Cluster

yak ssh makes it easy to develop on AWS EMR, SageMaker and EC2 instances using your local VS Code IDE by facilitating SSH connections to the host over AWS SSM without complicated networking rules or bastion hosts. Everything is secured by AWS IAM.

yak ssh {ec2-instance-id}

Initialize a SparkSession

To create a SparkSession that uses your code repository's .venv virtual environment instead of the system Python, use the init_session helper:

from packyak.spark import init_session

spark = init_session()

Tip: This is usually found in the first cell of a Jupyter notebook.
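
As a rough illustration of what "using the repository's .venv" involves, the sketch below locates a `.venv` interpreter and builds the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables that Spark reads to pick its Python. This is not how init_session is implemented; `find_venv_python` and `pyspark_env` are hypothetical helpers, and the `.venv/bin/python` layout assumes a POSIX virtual environment.

```python
import sys
from pathlib import Path
from typing import Dict, Optional


def find_venv_python(start: Path) -> Optional[Path]:
    """Walk up from `start` looking for a .venv/bin/python interpreter."""
    for directory in [start, *start.parents]:
        candidate = directory / ".venv" / "bin" / "python"
        if candidate.exists():
            return candidate
    return None


def pyspark_env(start: Path) -> Dict[str, str]:
    """Environment variables that point Spark at the project's interpreter."""
    # Fall back to the running interpreter when no .venv is found
    python = find_venv_python(start) or Path(sys.executable)
    return {
        "PYSPARK_PYTHON": str(python),
        "PYSPARK_DRIVER_PYTHON": str(python),
    }


print(pyspark_env(Path.cwd()))
```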

If you want to customize the SparkSession further, use session_builder instead:

from packyak.spark import session_builder
spark = session_builder().getOrCreate()

Remote VS Code over SSH

Once connected to a remote host, you can use VS Code's Remote SSH to start editing code and running commands on the remote host with the comfort of your local VS Code IDE.

SSH in and forward port 22 to a local port of your choice:

yak ssh {ec2-instance-id} -L 9001:localhost:22

Configure the remote host in your ~/.ssh/config:

Host emr
  HostName localhost
  Port 9001
  User root
  IdentityFile ~/.ssh/id_rsa
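
If you forward to different local ports for different hosts, the entry above can be generated rather than typed. The helper below is a hypothetical convenience and not part of the yak CLI; it only renders the same five lines for a given alias and port.

```python
def ssh_config_entry(
    alias: str,
    local_port: int,
    user: str = "root",
    identity_file: str = "~/.ssh/id_rsa",
) -> str:
    """Render an SSH config entry for a host reached through a forwarded local port."""
    lines = [
        f"Host {alias}",
        "  HostName localhost",
        f"  Port {local_port}",
        f"  User {user}",
        f"  IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


print(ssh_config_entry("emr", 9001), end="")
```

With the entry in place, `ssh emr` connects through the forwarded port, and the same `emr` alias can be selected as the host in VS Code's Remote SSH.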

Configure SparkSQL to be served over JDBC

sparkSQL = spark.jdbc(port=10001)
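
A client then needs a connection URL for that port. The sketch below assumes the endpoint behaves like a Spark Thrift Server, which speaks the HiveServer2 protocol and is addressed with `jdbc:hive2://` URLs. That is an assumption about what `spark.jdbc` exposes, not something the project states, and `spark_jdbc_url` is a hypothetical helper.

```python
def spark_jdbc_url(host: str = "localhost", port: int = 10001, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL (assumes a Spark Thrift Server endpoint)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Assumes the port is reachable locally, for example through an SSH port forward
print(spark_jdbc_url())
```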

Deploy Streamlit Site

Stand up a Streamlit Site to serve interactive reports and applications over your data.

site = StreamlitSite(
  scope=stack,
  # Point it at the Streamlit site entrypoint
  home="app/home.py",
  # Where the Streamlit pages/tabs are, defaults to `dirname(home)/pages/*.py`
  # pages="app/pages"
)
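
The default for `pages` described in the comment above can be pictured as a simple path rule. The function below is a sketch of that rule only, not PackYak's code; `resolve_pages` is a hypothetical name.

```python
from pathlib import Path
from typing import List, Optional


def resolve_pages(home: str, pages: Optional[str] = None) -> List[Path]:
    """Return the page scripts for a site, defaulting to dirname(home)/pages/*.py."""
    pages_dir = Path(pages) if pages is not None else Path(home).parent / "pages"
    if not pages_dir.is_dir():
        return []
    return sorted(pages_dir.glob("*.py"))


print(resolve_pages("app/home.py"))
```

This mirrors Streamlit's own multipage convention, where scripts in a `pages` directory next to the entrypoint become additional pages.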

Deploy to AWS

packyak deploy

Or via the AWS CDK CLI:

poetry run cdk deploy

Release history

This version

0.4.22

Apr 10, 2024

0.4.21

Mar 12, 2024

0.4.20

Mar 10, 2024

0.4.19

Mar 9, 2024

0.4.18

Mar 9, 2024

0.4.17

Mar 9, 2024

0.4.16

Mar 9, 2024

0.4.15

Feb 28, 2024

0.4.14

Feb 28, 2024

0.4.13

Feb 28, 2024

0.4.11

Feb 28, 2024

0.4.10

Feb 28, 2024

0.4.9

Feb 28, 2024

0.4.8

Feb 27, 2024

0.4.7

Feb 27, 2024

0.4.6

Feb 26, 2024

0.4.5

Feb 26, 2024

0.4.3

Feb 23, 2024

0.4.2

Feb 23, 2024

0.4.1

Feb 23, 2024

0.4.0

Feb 23, 2024

0.3.6

Feb 23, 2024

0.3.5

Feb 22, 2024

0.3.4

Feb 22, 2024

0.3.3

Feb 22, 2024

0.1.3

Feb 12, 2024

0.1.2

Jan 17, 2024

0.1.1

Jan 12, 2024

0.1.0

Jan 10, 2024

Download files

Source Distribution

packyak-0.4.22.tar.gz (26.5 kB)

Uploaded Apr 10, 2024 Source

Built Distribution

packyak-0.4.22-py3-none-any.whl (37.6 kB)

Uploaded Apr 10, 2024 Python 3

File details

Details for the file packyak-0.4.22.tar.gz.

File metadata

Download URL: packyak-0.4.22.tar.gz
Upload date: Apr 10, 2024
Size: 26.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.13 Darwin/23.1.0

File hashes

Hashes for packyak-0.4.22.tar.gz
Algorithm	Hash digest
SHA256	`cda87d6f80b6b920e6621f144ecaf4e6c936f859fd0b4a02df9d4cecf2895864`
MD5	`1daf9e1944acd6af882744a3ab2be5ed`
BLAKE2b-256	`2cec997518a3993f1c6e4001036105f5d5104dd449fda468459f7537f4d9f4ff`

File details

Details for the file packyak-0.4.22-py3-none-any.whl.

File metadata

Download URL: packyak-0.4.22-py3-none-any.whl
Upload date: Apr 10, 2024
Size: 37.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.13 Darwin/23.1.0

File hashes

Hashes for packyak-0.4.22-py3-none-any.whl
Algorithm	Hash digest
SHA256	`90664b7cae519ba764f8fe662f7fde0ec414a9a6a05ac15aae673ba17acd3e71`
MD5	`ca6c199d0f1bcc927e910daccb2798f3`
BLAKE2b-256	`8be9b33cf7668a1f4d66d5559e7df2846249810a249ddf831cdb4a83937a6f97`
