A CDK Python app for deploying foundational infrastructure for InsuranceLake in AWS

Project description

InsuranceLake Infrastructure

The InsuranceLake solution consists of two codebases: Infrastructure and ETL. This codebase and the documentation that follows are specific to the Infrastructure. For more comprehensive documentation, including several ways to get started quickly, refer to the InsuranceLake ETL with CDK Pipeline README.

This solution helps you deploy ETL processes and data storage resources to create an InsuranceLake. It uses Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines for continuous delivery. The solution was originally based on the AWS blog post Deploy data lake ETL jobs using CDK Pipelines.

CDK Pipelines is a construct library module for painless continuous delivery of CDK applications. CDK stands for Cloud Development Kit. It is an open-source software development framework for defining your cloud application resources using familiar programming languages.

Specifically, this solution helps you to:

- Deploy a 3 Cs (Collect, Cleanse, Consume) InsuranceLake
- Deploy the ETL jobs needed to make common insurance industry data sources available in a data lake
- Use PySpark Glue jobs and supporting resources to perform data transforms in a modular approach
- Build and replicate the application in multiple environments quickly
- Deploy ETL jobs from a central deployment account to multiple AWS environments such as Dev, Test, and Prod
- Leverage the self-mutating feature of CDK Pipelines: the pipeline itself is infrastructure as code and can be changed as part of the deployment
- Increase the speed of prototyping, testing, and deployment of new ETL jobs
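
The multi-environment deployment described in the list above can be pictured with a small configuration sketch. The account IDs, Regions, and helper names below are hypothetical placeholders rather than values or functions from this codebase; the sketch only illustrates one way a central deployment account might map logical environments to target accounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetEnvironment:
    """Where one logical environment is deployed (all values are placeholders)."""
    name: str
    account_id: str
    region: str


# Hypothetical mapping; real account IDs and Regions come from your own configuration.
ENVIRONMENTS = {
    "Dev": TargetEnvironment("Dev", "111111111111", "us-east-2"),
    "Test": TargetEnvironment("Test", "222222222222", "us-east-2"),
    "Prod": TargetEnvironment("Prod", "333333333333", "us-east-2"),
}


def deployment_order(environments):
    """Return the configured environments in promotion order: Dev, Test, Prod."""
    order = ["Dev", "Test", "Prod"]
    return [environments[name] for name in order if name in environments]


def stack_name(env, logical_id):
    """Build an environment-qualified stack name, for example 'Dev-Vpc'."""
    return f"{env.name}-{logical_id}"
```

A pipeline definition could iterate over `deployment_order(ENVIRONMENTS)` and add one deploy stage per entry, so that adding an environment is a configuration change rather than a code change.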

InsuranceLake High Level Architecture

- Architecture
  - InsuranceLake
  - Infrastructure
- Codebase
  - Source Code Structure
  - Automation Scripts
- Authors and Reviewers
- License Summary

Architecture

This section describes the overall InsuranceLake architecture and its infrastructure component.

InsuranceLake

As shown in the figure below, we use Amazon S3 for storage, with three S3 buckets:

- Collect bucket to store raw data in its original format
- Cleanse/Curate bucket to store the data that meets the quality and consistency requirements of the lake
- Consume bucket for data that is used by analysts and data consumers of the lake (for example, Amazon QuickSight and Amazon SageMaker)
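
As a rough illustration of the three-bucket layout, the sketch below derives one bucket name per zone. The naming pattern, the `insurancelake` prefix, and the function name are assumptions made for this example; the names the stacks actually generate may differ. The only S3 rules relied on are that bucket names are lowercase and between 3 and 63 characters long.

```python
# The three zones described above, in the order data flows through them.
ZONES = ("collect", "cleanse", "consume")


def bucket_names(environment, account_id, region, prefix="insurancelake"):
    """Return a hypothetical bucket name for each zone of the lake."""
    names = {}
    for zone in ZONES:
        # S3 bucket names must be lowercase.
        name = f"{environment}-{prefix}-{account_id}-{region}-{zone}".lower()
        # S3 bucket names must be 3 to 63 characters long.
        if not 3 <= len(name) <= 63:
            raise ValueError(f"Invalid bucket name length for {name!r}")
        names[zone] = name
    return names
```

Including the account ID and Region in the name is one common way to keep names globally unique across environments.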

InsuranceLake is designed to support a number of source systems with different file formats and data partitions. To demonstrate, we have provided a CSV parser and sample data files for a source system with two data tables, which are uploaded to the Collect bucket.

We use AWS Lambda and AWS Step Functions for orchestration and scheduling of ETL workloads. We then use AWS Glue with PySpark for ETL and data cataloging, Amazon DynamoDB for transformation persistence, and Amazon Athena for interactive queries and analysis. We also use various AWS services for logging, monitoring, security, authentication, authorization, notification, build, and deployment.
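
To make the orchestration step more concrete, here is a minimal sketch of how a Lambda function might turn an S3 event notification for a newly uploaded file into input for a Step Functions execution. The function name and the `source_bucket` and `source_key` field names are hypothetical; the actual trigger logic lives in the ETL codebase and may be organized differently.

```python
import json
from urllib.parse import unquote_plus


def build_execution_inputs(event):
    """Convert an S3 event notification into JSON inputs, one per uploaded object."""
    inputs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        inputs.append(json.dumps({"source_bucket": bucket, "source_key": key}))
    return inputs
```

In a deployed function, each returned string could be passed as the `input` argument of the Step Functions `start_execution` call made through boto3.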

Note: AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. These two services are not used in this solution but can be added.

Conceptual Data Lake

Infrastructure

The figure below represents the infrastructure resources we provision for the data lake:

- Amazon Virtual Private Cloud (VPC)
- Subnets
- Security Groups
- Route Table(s)
- VPC Endpoints
- Amazon S3 buckets for:
  1. Collect data
  2. Cleanse/Curate data
  3. Consume data

Data Lake Infrastructure Architecture

Codebase

Source Code Structure

The table below explains how this source code is structured:

File / Folder	Description
app.py	Application entry point.
code_commit_stack.py	Optional stack to deploy an empty CodeCommit repository for mirroring.
pipeline_stack.py	Pipeline stack entry point.
pipeline_deploy_stage.py	Pipeline deploy stage entry point.
s3_bucket_zones_stack.py	Stack that creates the S3 buckets: raw, conformed, and purpose-built. It also creates an S3 bucket for server access logging and an AWS KMS key to enable server-side encryption for all buckets.
tagging.py	Program to tag all provisioned resources.
vpc_stack.py	Contains all resources related to the VPC used by the data lake infrastructure and services. This includes the VPC, Security Groups, and VPC Endpoints (both Gateway and Interface types).
test	This folder contains pytest unit tests.
resources	This folder has static resources such as architecture diagrams.
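
As an illustration of the kind of work a tagging helper does, the sketch below builds a tag dictionary and applies it through a callable. The tag keys, default values, and function names are assumptions for this example and are not taken from tagging.py. In an AWS CDK app, the callable could be `Tags.of(scope).add`, which applies a tag to a construct and everything beneath it.

```python
def build_tags(environment, application="InsuranceLake", extra=None):
    """Return the tags to apply to every resource (the keys are hypothetical)."""
    tags = {"Environment": environment, "Application": application}
    tags.update(extra or {})
    return tags


def apply_tags(add_tag, tags):
    """Apply each tag through a callable that accepts a key and a value."""
    for key, value in sorted(tags.items()):
        add_tag(key, value)
```

Passing the callable in keeps the helper independent of any one construct scope and easy to unit test.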

Automation Scripts

This repository has the following automation scripts to complete steps before the deployment:

Script	Purpose
bootstrap_deployment_account.sh	Used to bootstrap the deployment account.
bootstrap_target_account.sh	Used to bootstrap target environments, for example, dev, test, and production.
configure_account_secrets.py	Used to configure account secrets for the GitHub access token.
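
For orientation, storing a GitHub access token in AWS Secrets Manager might look like the sketch below. This is not the content of configure_account_secrets.py: the function name is hypothetical, and the client is passed in as a parameter so the logic can be exercised without AWS credentials. With boto3, the client would be created with `boto3.client("secretsmanager")`, whose `create_secret` and `put_secret_value` operations are used here.

```python
def store_github_token(secrets_client, secret_name, token):
    """Create the secret, or update its value if it already exists."""
    try:
        secrets_client.create_secret(Name=secret_name, SecretString=token)
        return "created"
    except secrets_client.exceptions.ResourceExistsException:
        # The secret already exists, so store the token as a new version.
        secrets_client.put_secret_value(SecretId=secret_name, SecretString=token)
        return "updated"
```

Returning a status string instead of printing the token keeps the secret value out of logs and terminal history.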

Authors and Reviewers

The following people are involved in the design, architecture, development, testing, and review of this solution:

Cory Visi, Senior Solutions Architect, Amazon Web Services
Ratnadeep Bardhan Roy, Senior Solutions Architect, Amazon Web Services
Isaiah Grant, Cloud Consultant, 2nd Watch, Inc.
Muhammad Zahid Ali, Data Architect, Amazon Web Services
Ravi Itha, Senior Data Architect, Amazon Web Services
Justiono Putro, Cloud Infrastructure Architect, Amazon Web Services
Mike Apted, Principal Solutions Architect, Amazon Web Services
Nikunj Vaidya, Senior DevOps Specialist, Amazon Web Services

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

Project details

Release history

Version	Release date
3.2 (this version)	Mar 3, 2024
3.1.0	Jan 5, 2024
3.0.1	Dec 12, 2023
2.5.0	Oct 5, 2023
2.4.1	Sep 25, 2023
2.3.0	Sep 23, 2023

Download files

Distribution	File	Size	Uploaded
Source Distribution	aws-insurancelake-infrastructure-3.2.tar.gz	21.1 kB	Mar 3, 2024
Built Distribution (Python 3)	aws_insurancelake_infrastructure-3.2-py3-none-any.whl	22.3 kB	Mar 3, 2024

Hashes for aws-insurancelake-infrastructure-3.2.tar.gz
Algorithm	Hash digest
SHA256	`12d3e133b3784419e697091bad3bd7a14ff75e20f06c98a7c69d1013cbf56331`
MD5	`a3c3bfa24de51e99ce5ea52a1ff1ac7d`
BLAKE2b-256	`87072fde20cb896737b9e3bbc4a3a50498161735c6c4fc2a37f3d1728e397a3c`

Hashes for aws_insurancelake_infrastructure-3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`db498539c9dba5f72d62d40aa1ff71c2b53a6ca1d7b4e6f9bd097776fc7cb3cd`
MD5	`949b6c95b3e44a7ead2197950fcd2fd6`
BLAKE2b-256	`27feb60ea07ff0b963677ea456ea96794cfcd6cca7860483ac20ecdce370e846`
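
The published digests can be checked against a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the file path is whatever location you downloaded the distribution to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 digest matches the published value."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For example, `verify_download` could be called with the path to the downloaded source distribution and the SHA256 value listed for it; a result of `False` means the file should not be installed.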