
A basic set of data pipeline tools for data engineers handling CRM or loyalty data

Project description

About CETL

CETL is a Python library that provides a comprehensive set of tools for building and managing data pipelines. It is designed to assist data engineers in handling Extract, Transform, and Load (ETL) tasks more effectively by simplifying the process and reducing the amount of manual labor involved.

CETL is particularly useful for Python developers who work with data on a regular basis. It uses popular data containers such as pandas dataframes, JSON objects, and PySpark dataframes to provide a familiar interface for developers. This allows users to easily integrate CETL into their existing data pipelines and workflows.

The library is intended to make the ETL process more straightforward by automating many of the technical details involved in data processing and movement. CETL includes a wide range of functions and tools for handling complex data formats, such as CSV, Excel, and JSON files, as well as for working with a variety of data sources, including databases, APIs, and cloud storage services.

One of the key benefits of CETL is its ability to handle large datasets, making it suitable for use in high-performance data processing environments. CETL also includes features for data profiling, data validation, data transformation, and data mapping, allowing users to build sophisticated data pipelines that can handle a wide range of data processing tasks.

Overall, CETL is a powerful data pipeline tool that can help data engineers to improve their productivity and streamline the ETL process. By providing a comprehensive set of functions and tools for working with data, CETL makes it easier to develop and maintain complex ETL pipelines, reducing the amount of time and effort required to manage data processing tasks.


User Guide

Example 1

GenerateDataFrame is a Python class that represents a transformation step in a data pipeline. It can be used to generate a dummy dataframe without reading actual data from a file, and its main purpose is to help developers test their data processing pipelines.

With GenerateDataFrame, developers can quickly and easily create test data that mimics the structure of their actual data. This can be particularly useful when working with large datasets or when data is not readily available. By generating dummy data, developers can test their pipeline's functionality without having to rely on real data sources.

GenerateDataFrame is particularly useful in situations where developers need to test their pipeline's ability to handle different types of data and perform various data transformations. This can include testing the pipeline's ability to handle missing data, data outliers, and data formatting issues.

Overall, GenerateDataFrame is a powerful tool that can help developers to streamline the testing process and ensure the accuracy and efficiency of their data processing pipelines. By allowing developers to generate dummy data, it provides a quick and easy way to test their pipeline's functionality and identify any potential issues before deploying to production.

from cetl import make_pipeline
from cetl.pandas_modules import generateDataFrame
pipe = make_pipeline(generateDataFrame())
df = pipe.transform("")
print(df)
  customer_id first_name last_name title
0         111      peter      Hong   Mr.
1         222   YuCheung      Wong   Mr.
2         333      Cindy      Wong  Mrs.

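Because the pipeline returns an ordinary pandas DataFrame, the generated dummy data can be combined with plain pandas code to exercise downstream checks, for example for missing values. The snippet below is only a minimal sketch: the null check is ordinary pandas and is not part of CETL; the only CETL calls are make_pipeline, generateDataFrame, and transform as shown above.

import numpy as np

from cetl import make_pipeline
from cetl.pandas_modules import generateDataFrame

# Build the same dummy-data pipeline as in Example 1.
pipe = make_pipeline(generateDataFrame())
df = pipe.transform("")

# Inject a missing value to simulate a data-quality issue (plain pandas).
df.loc[0, "first_name"] = np.nan

# Illustrative downstream check: report required columns that contain nulls.
required = ["customer_id", "first_name", "last_name"]
missing = df[required].isna().sum()
print(missing[missing > 0])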
Example 2

from cetl import build_pipeline
from cetl.pandas_modules import generateDataFrame, unionAll
from cetl.functional_modules import dummyStart, parallelTransformer

pipe = build_pipeline(  dummyStart(),
                        parallelTransformer([generateDataFrame(), generateDataFrame()]), 
                        unionAll())
df = pipe.transform("")
print(df)
  customer_id first_name last_name title
0         111      peter      Hong   Mr.
1         222   YuCheung      Wong   Mr.
2         333      Cindy      Wong  Mrs.
0         111      peter      Hong   Mr.
1         222   YuCheung      Wong   Mr.
2         333      Cindy      Wong  Mrs.
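The duplicated rows appear because parallelTransformer runs both generateDataFrame branches and unionAll combines their outputs. In plain pandas terms the final step behaves roughly like the sketch below; pd.concat here is only an illustration of the observed result, not CETL's actual implementation.

import pandas as pd

# One branch's output, as produced by generateDataFrame in the example above.
branch = pd.DataFrame({
    "customer_id": [111, 222, 333],
    "first_name": ["peter", "YuCheung", "Cindy"],
    "last_name": ["Hong", "Wong", "Wong"],
    "title": ["Mr.", "Mr.", "Mrs."],
})

# Approximate unionAll's effect by concatenating the two branch outputs.
print(pd.concat([branch, branch]))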

Alternatively, you can achieve the same result by passing a JSON-style configuration to the DataPipeline object.

from cetl import DataPipeline
cfg = {"pipeline":[ {"type":"dummyStart", "module_type":"functional"},
                    {"type":"parallelTransformer", "transformers":[
                        {"type":"generateDataFrame"},
                        {"type":"generateDataFrame"}
                    ]},
                    {"type":"unionAll"}
]}

pipe = DataPipeline(cfg)
df = pipe.transform("")
print(df)
  customer_id first_name last_name title
0         111      peter      Hong   Mr.
1         222   YuCheung      Wong   Mr.
2         333      Cindy      Wong  Mrs.
0         111      peter      Hong   Mr.
1         222   YuCheung      Wong   Mr.
2         333      Cindy      Wong  Mrs.
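Since the configuration is plain JSON, it can also be kept in a file and loaded with the standard library before constructing the pipeline. The file name below is only an example; the assumption is simply that DataPipeline accepts the same dictionary either way.

import json

from cetl import DataPipeline

# Load the same configuration from a JSON file (hypothetical path).
with open("pipeline_config.json") as f:
    cfg = json.load(f)

pipe = DataPipeline(cfg)
df = pipe.transform("")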

Render the graph

Note: make sure the Graphviz executable is installed.
Both a PNG file and an SVG file will be exported.

pipe = pipe.build_digraph()
pipe.save_png("./sample.png")
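If you are not sure whether Graphviz is available, a quick standard-library check (not part of CETL) can be run before rendering:

import shutil

# Rendering relies on the Graphviz "dot" executable being on PATH.
if shutil.which("dot") is None:
    raise RuntimeError("Graphviz is not installed or not on PATH")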

sample.png

This release fixes the issue of UnboundLocalError: local variable 'pre_transformer_key' referenced before assignment.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cetl-0.2.9.tar.gz (39.6 kB)

Uploaded Source

Built Distribution

cetl-0.2.9-py3-none-any.whl (58.1 kB)

Uploaded Python 3
