
A collection of Apache Beam transforms

Project description

DOJO-Beam-Transforms

Welcome to DOJO-Beam-Transforms, a repository dedicated to sharing advanced Apache Beam transformations, custom DoFn classes, and best practices for scalable data processing, curated by the team at DOJO-Smart-Ways.

Table of Contents

  1. DOJO-Beam-Transforms
  2. About DOJO-Smart-Ways
  3. What You'll Find Here
  4. Dependency Versions for Release 1.0.0
  5. Quick Start Guide
  6. Pipeline Deployment with Docker Image

About DOJO-Smart-Ways

DOJO-Smart-Ways is committed to advancing data engineering, providing solutions that enhance data processing capabilities, and sharing knowledge within the data engineering community. Our focus is on creating efficient, scalable solutions for real-world data challenges.

What You'll Find Here

This repository contains:

  • Custom Apache Beam Transformations: Reusable code snippets for specific data preparation tasks.
  • Data Processing Recipes: Step-by-step guides for common and advanced data processing scenarios.
  • Integration Examples: How to integrate Apache Beam pipelines with BigQuery and other cloud services for end-to-end data processing solutions.
  • Performance Optimization Tips: Best practices for optimizing your Apache Beam pipelines for performance and cost.

Dependency Versions for Release 1.0.0

The following dependencies, at the versions listed, are required by and compatible with version 1.0.0 of the dojo-beam-transforms package:

Apache Beam SDK Version

  • apache-beam[dataframe,gcp,interactive] == 2.58.1

Other Dependencies

  • pandas == 2.0.3
  • pandas-datareader == 0.10.0
  • PyMuPDF == 1.23.22
  • pypinyin == 0.51.0
  • unidecode == 1.3.8
  • openpyxl == 3.0.10
  • fsspec == 2024.6.1
  • gcsfs == 2024.6.1

Compatible Python Versions

The following Python versions have been tested and are confirmed to be compatible with this release:

  • Python 3.10
  • Python 3.11

Please ensure that your environment meets these requirements for optimal performance and compatibility.
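
If you prefer to manage these pins in a requirements file, the list above translates directly. The layout below is just one way to express it; the file name requirements.txt is a convention, not something shipped with the package:

    # requirements.txt -- one way to pin the versions listed above
    apache-beam[dataframe,gcp,interactive]==2.58.1
    pandas==2.0.3
    pandas-datareader==0.10.0
    PyMuPDF==1.23.22
    pypinyin==0.51.0
    unidecode==1.3.8
    openpyxl==3.0.10
    fsspec==2024.6.1
    gcsfs==2024.6.1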

Quick Start Guide

Streamline your setup with DOJO-Beam-Transforms! Get started by following these steps:

  1. Initialize Your Development Environment:

    Start by creating a new branch for your project within the DOJO-Beam-Transforms repository. This lets you develop and iterate on generic classes tailored to your project's needs. Use the following commands to clone the repository and switch to a new branch named after your project:

    # Clone the DOJO-Beam-Transforms repository
    git clone https://github.com/DOJO-Smart-Ways/DOJO-Beam-Transforms.git
    cd DOJO-Beam-Transforms
    
    # Create and switch to a new branch named 'project_name'
    git checkout -b project_name
    

    Once your project-specific development is underway, you can seamlessly integrate these changes into your Jupyter notebook environment. Execute the command below to install the project branch directly:

    !pip install git+https://github.com/DOJO-Smart-Ways/DOJO-Beam-Transforms.git@project_name#egg=dojo-beam-transforms
    

    This method allows for continuous development and testing within your project's scope, enabling a more efficient workflow.

  2. Consolidate Progress and Manage Dependencies:

    After validating your enhancements or new features, merge your work from the project_name branch into the main branch. This consolidates your efforts and ensures the broader project benefits from your contributions. If your development introduced new dependencies, also update the setup.py file to include them (see the sketch after this section), so that anyone pulling from the main branch or installing the package gets a version with all necessary dependencies resolved.

By following this integrated approach, you maintain a clean and organized development process, facilitating collaboration and ensuring that your enhancements are systematically incorporated into the DOJO-Beam-Transforms project.
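
A minimal, hypothetical excerpt of such a setup.py update is sketched below; the real file's layout and metadata may differ, and the versions shown are simply the pins from the dependency list above:

    # Hypothetical excerpt of setup.py -- the actual file may look different.
    from setuptools import find_packages, setup

    setup(
        name='dojo-beam-transforms',
        packages=find_packages(),
        install_requires=[
            'apache-beam[dataframe,gcp,interactive]==2.58.1',
            'pandas==2.0.3',
            # Add any new dependency introduced by your branch here, pinned to a tested version.
        ],
    )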

  3. Utilize the Components:

    Bring the power of DOJO-Beam-Transforms into your pipeline with ease:

    import apache_beam as beam

    from pipeline_components.input_file import read_json, read_pdf, read_and_apply_headers, read_bq
    from pipeline_components import data_enrichment as de
    from pipeline_components import data_cleaning as dc

    def process_delivery_requests(temp_location, output_table):
        pipeline = beam.Pipeline()

        # Reading the initial data
        delivery_requests, invalid_delivery_requests = read_json(pipeline, 'bucket/location/file.json', identifier='')

        # Cleaning the data
        cleaned_data = (
            delivery_requests
            | 'Keep Only BR Currency' >> beam.ParDo(dc.KeepColumnValues('Currency', ['R$', '$']))
            | 'Replace , with . on Coordinates' >> beam.ParDo(dc.ReplacePatterns(['Longitude', 'Latitude'], ',', '.'))
        )

        # Enriching the data
        enriched_data = (
            cleaned_data
            | 'Convert to String' >> beam.ParDo(de.ColumnsToStringConverter(), ['destination', 'origin'])
        )

        # Writing the final output to BigQuery
        enriched_data | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            table=output_table,
            schema='SCHEMA_AUTODETECT',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            custom_gcs_temp_location=temp_location
        )

        # Run the pipeline
        pipeline.run().wait_until_finish()

    if __name__ == '__main__':
        temp_location = 'path/to/temp/location'
        output_table = 'project-id:dataset.table'
        process_delivery_requests(temp_location, output_table)
    

Pipeline Deployment with Docker Image

Benefits of Saving a Docker Image

Saving your Docker image provides several advantages, including consistency across environments, ease of deployment, and faster start-up times. By saving the image, you ensure that the exact environment used in development is replicated in production, reducing the chances of discrepancies or bugs. Additionally, storing Docker images allows for easy rollbacks to previous versions if needed, and simplifies the process of scaling deployments across multiple instances.

Storage Options

In the example below, the Docker image is stored in Google Cloud's Artifact Registry, a managed service that allows you to securely store and manage your container images. While the Artifact Registry is a convenient option, especially for projects already using Google Cloud, Docker images can also be stored in other commonly used registries, including:

  • Docker Hub: A popular and widely used registry for storing public and private images.
  • Amazon Elastic Container Registry (ECR): A service provided by AWS for managing Docker containers within the AWS ecosystem.
  • Azure Container Registry (ACR): A managed Docker container registry service provided by Microsoft Azure.

Prerequisites

  • Docker installed on your machine.
  • Google Cloud SDK installed.

  1. Clone the Dockerfile

  2. Build the Docker Image
     Inside the folder where the Dockerfile is located, run:

     docker build -t [IMAGE_NAME] .

  3. Authenticate with Google Cloud
     Configure Docker to authenticate requests to Artifact Registry with the following command:

     gcloud auth configure-docker [REGION]-docker.pkg.dev

  4. Tag Your Docker Image
     Use the following command:

     docker tag [IMAGE_NAME]:[VERSION] [REGION]-docker.pkg.dev/[PROJECT_ID]/[REPOSITORY]/[IMAGE_NAME]

  5. Push the Docker Image to Artifact Registry
     Use the following command:

     docker push [REGION]-docker.pkg.dev/[PROJECT_ID]/[REPOSITORY]/[IMAGE_NAME]

  6. Run the Dataflow Pipeline with the Custom Container
     Add these two parameters to your pipeline options:

     pipeline_options = {
        'sdk_container_image': '[REGION]-docker.pkg.dev/[PROJECT_ID]/[REPOSITORY]/[IMAGE_NAME]',
        'sdk_location': 'container'
     }
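
For reference, below is a minimal sketch of how those two options might be combined with standard Dataflow options in Python. The runner, project, region, and bucket values are placeholders to adapt to your environment, and the pipeline body is left out:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder values ([PROJECT_ID], [REGION], bucket paths) are assumptions to replace.
    options = PipelineOptions.from_dictionary({
        'runner': 'DataflowRunner',
        'project': '[PROJECT_ID]',
        'region': '[REGION]',
        'temp_location': 'gs://[BUCKET]/temp',
        'sdk_container_image': '[REGION]-docker.pkg.dev/[PROJECT_ID]/[REPOSITORY]/[IMAGE_NAME]',
        'sdk_location': 'container',
    })

    with beam.Pipeline(options=options) as pipeline:
        ...  # build your pipeline with DOJO-Beam-Transforms components here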

Embark on your data processing journey with DOJO-Beam-Transforms today!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dojo-beam-transforms-1.1.0.tar.gz (22.5 kB)


Built Distribution

dojo_beam_transforms-1.1.0-py3-none-any.whl (25.3 kB)


File details

Details for the file dojo-beam-transforms-1.1.0.tar.gz.

File metadata

  • Download URL: dojo-beam-transforms-1.1.0.tar.gz
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.0

File hashes

Hashes for dojo-beam-transforms-1.1.0.tar.gz
  • SHA256: ae18efdc24c605fbe97e91042e120ce8d277d3fcbd04aeeffcc375f8912d7158
  • MD5: 8d18bd8a61b839eeb4b73e67776e792c
  • BLAKE2b-256: 1cf1d77f176b17377e47572db36c04c15f5511f9ca11f60b01e09e49ead9e274


File details

Details for the file dojo_beam_transforms-1.1.0-py3-none-any.whl.

File hashes

Hashes for dojo_beam_transforms-1.1.0-py3-none-any.whl
  • SHA256: ae9555b4891c3a9337e70d9dc597e0abd15c00cdec2747b57b97f097eb7eaee1
  • MD5: 0370b9d3610abe901d32b38ec6883463
  • BLAKE2b-256: a6f2d23696dce0ef48a4e887487f1a2b5f893783d61ae9088a0da5b505111f0a

