ETL package for AUTO1 challenge

Project description

guacamoleETL - Document

An ETL pipeline tool that

  • pre-processes the data:
    • extracts data from a .txt file (Challenge_me.txt)
    • removes rows that contain invalid information
  • transforms the data according to the given specifications into a matrix (list of lists)
  • loads the data into a .csv file (output.csv)
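
The pre-process and load steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: the semicolon delimiter, the header row, the sample data, and the helper names extract_rows, drop_incomplete, and write_csv are all assumptions made for the example.

```python
import csv
import io

NA = "-"  # placeholder that marks a missing value


def extract_rows(text, delimiter=";"):
    """Parse delimited text into dicts, stripping surrounding whitespace.

    Assumes the first line is a header row (an assumption for this sketch).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in reader
    ]


def drop_incomplete(rows, required):
    """Keep only rows that have a real value in every required column."""
    return [row for row in rows if all(row[col] != NA for col in required)]


def write_csv(matrix, header, out):
    """Write a header line and a list of lists to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(matrix)


# Hypothetical sample input; the second data row has no horsepower value.
sample = "make ; horsepower ; price\n audi ; 102,0 ; 1395000\n bmw ; - ; 1643000 \n"
rows = extract_rows(sample)
clean = drop_incomplete(rows, ["horsepower", "price"])
```

Passing an open file object (or io.StringIO) to write_csv keeps the sketch independent of where output.csv actually lives.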

Installation

This tool can be installed with pip.
Copy and paste this command into a terminal and run it:

pip install guacamoleETL

Usage

This ETL pipeline can be used as part of predictive model training, feeding the data directly to the model:

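import guacamoleETL

dataFile = 'Challenge_me.txt'

# Write the transformed data to a .csv file (output.csv)
guacamoleETL.load(dataFile)
# Get the transformed data as a matrix (list of lists)
result = guacamoleETL.transform(dataFile)
# model_training is a placeholder for your own training function
predictive_model = model_training(result)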

Functions

  • extract_data(txt_file): Extracts data from a .txt file into a temporary .csv file
    Leading and trailing whitespace is removed during extraction
  • clean_up(): Removes rows with invalid information
    Rows with the placeholder '-' (NA) in any of the specified columns are excluded
  • transform(path): Transforms the pre-processed data according to the given specifications into a matrix
    • engine-location is one-hot encoded into two columns, engine-location_front and engine-location_rear
    • num-of-cylinders is converted from a word to an integer using a predefined dictionary
    • engine-size is converted to an integer
    • weight is converted to an integer
    • horsepower is converted from a string in German decimal notation to a float
    • aspiration is replaced by aspiration_turbo, in which turbo engines are marked as 1
    • price is converted from minor units to major units
    • make is kept in the dataset unchanged
  • load(path): Loads the data from the previous transformation into a .csv file
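
The column rules described for transform can be illustrated on a single row. The sketch below is an assumption-laden example rather than the package's implementation: the contents of CYLINDER_WORDS, the output column order, the helper names, and the factor of 100 minor units per major unit are all chosen for illustration.

```python
# Hypothetical word-to-integer dictionary; the package's own mapping may differ.
CYLINDER_WORDS = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}


def german_to_float(text):
    """Convert German decimal notation to float.

    '.' is treated as a thousands separator and ',' as the decimal mark.
    """
    return float(text.replace(".", "").replace(",", "."))


def transform_row(row):
    """Turn one cleaned row (dict of strings) into a list of typed values."""
    location = row["engine-location"]
    return [
        int(location == "front"),                 # engine-location_front
        int(location == "rear"),                  # engine-location_rear
        CYLINDER_WORDS[row["num-of-cylinders"]],  # word -> integer
        int(row["engine-size"]),
        int(row["weight"]),
        german_to_float(row["horsepower"]),
        int(row["aspiration"] == "turbo"),        # aspiration_turbo
        int(row["price"]) / 100,                  # assumes 100 minor units per major unit
        row["make"],                              # kept unchanged
    ]


# Hypothetical sample row.
example = {
    "engine-location": "front",
    "num-of-cylinders": "four",
    "engine-size": "109",
    "weight": "2337",
    "horsepower": "102,0",
    "aspiration": "std",
    "price": "1395000",
    "make": "audi",
}
matrix = [transform_row(example)]
```

Applying transform_row to every cleaned row yields the matrix (list of lists) that the load step writes out.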

Architecture

All the functions are implemented in __init__.py. This decision was made for the following reasons:

  • For transform and load to be callable directly on the imported package (for example, guacamoleETL.transform), they must be imported or defined in __init__.py.
  • The functions are chained: transform takes the result of the pre-process step (extract and clean up), and load takes the result of transform. Keeping them in the same file makes the flow easier to follow.
  • This might not be the best architecture, but for a small project, simplicity is a good starting point.
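
The first point can be demonstrated with a throwaway package. The sketch below builds a hypothetical package named demo_etl in a temporary directory; its function bodies are stand-ins, not guacamoleETL's code. It only shows that names defined in __init__.py become attributes of the imported package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical __init__.py with two chained functions.
INIT_SOURCE = '''
def transform(path):
    return [["row from", path]]


def load(path):
    return transform(path)
'''

with tempfile.TemporaryDirectory() as root:
    package_dir = Path(root) / "demo_etl"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text(INIT_SOURCE)

    sys.path.insert(0, root)       # make the temporary package importable
    importlib.invalidate_caches()  # pick up the freshly written files
    try:
        demo_etl = importlib.import_module("demo_etl")
        # transform is reachable directly on the package object
        result = demo_etl.transform("Challenge_me.txt")
    finally:
        sys.path.remove(root)
```

If the functions lived in separate modules instead, the same package-level access would require importing them in __init__.py (for example with `from .transform import transform`).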

Download files

Source Distribution

guacamoleETL-0.3.0.tar.gz (3.1 kB)

Built Distribution

guacamoleETL-0.3.0-py3-none-any.whl (6.5 kB, Python 3)
