
ETL package for AUTO1 challenge

Project description

guacamoleETL - Documentation

An ETL pipeline tool that

  • pre-processes
    • extracts data from a .txt file (Challenge_me.txt)
    • cleans up the data by removing rows with invalid information
  • transforms the data, according to the given specifications, into a matrix (list of lists)
  • loads the data into a .csv file (output.csv)

Installation

This tool can be installed with pip. Copy and paste the following command into your terminal:

pip install guacamoleETL

Usage

This ETL pipeline can be used as part of predictive model training, feeding the data directly to the model:

import guacamoleETL

dataFile = 'Challenge_me.txt'

# Write the transformed data to output.csv
guacamoleETL.load(dataFile)

# Get the transformed matrix (list of lists)
result = guacamoleETL.transform(dataFile)

# model_training is a placeholder for your own training routine
predictive_model = model_training(result)

Functions

  • extract_data(txt_file): Extract data from a .txt file into a temporary .csv file.
    Leading and trailing whitespace is removed during extraction.
  • clean_up(): Clean up the data by removing invalid information.
    Rows with the placeholder '-' (NA) in any of the specified columns are excluded.
  • transform(path): Transform the pre-processed data, according to the given specifications, into a matrix (see the sketch after this list):
    • engine-location is split into two one-hot-encoded columns, engine-location_front and engine-location_rear
    • num-of-cylinders is converted from a word into an integer via a pre-defined dictionary
    • engine-size is converted to an integer
    • weight is converted to an integer
    • horsepower is converted from a German decimal-notation string (comma as decimal separator) into a float
    • aspiration is replaced by aspiration_turbo, so that turbo engines are marked as 1
    • price is converted from minor currency units to major units
    • make is not transformed but kept in the dataset
  • load(path): Load the data from the previous transformation into a .csv file (output.csv)
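
For illustration, the clean-up and transform rules above could look roughly like the following plain-Python sketch. The helper names (clean_rows, transform_row, transform_file), the cylinder dictionary, the conversion factor for price, and the output column order are assumptions made for this sketch; the package's actual internals may differ.

import csv

# Hypothetical word-to-integer mapping for num-of-cylinders (values assumed for this sketch)
CYLINDER_WORDS = {
    'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'eight': 8, 'twelve': 12,
}

def clean_rows(rows, required_columns):
    """Drop rows that contain the '-' (NA) placeholder in any required column."""
    return [row for row in rows
            if all(row[col].strip() != '-' for col in required_columns)]

def transform_row(row):
    """Apply the column rules described above to one record (a dict of strings)."""
    return [
        row['make'],                                    # kept as-is
        1 if row['aspiration'] == 'turbo' else 0,       # aspiration_turbo flag
        CYLINDER_WORDS[row['num-of-cylinders']],        # word -> integer
        int(row['engine-size']),                        # engine-size -> integer
        int(row['weight']),                             # weight -> integer
        float(row['horsepower'].replace(',', '.')),     # German decimal notation -> float
        int(row['price']) / 100,                        # minor -> major units (factor of 100 assumed)
        1 if row['engine-location'] == 'front' else 0,  # engine-location_front
        1 if row['engine-location'] == 'rear' else 0,   # engine-location_rear
    ]

def transform_file(csv_path):
    """Read the pre-processed .csv and return the transformed matrix (list of lists)."""
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        rows = clean_rows(list(reader), reader.fieldnames)
    return [transform_row(row) for row in rows]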

Architecture

All the functions are implemented in __init__.py. This decision was made for the following reasons:

  • To call the transform and load functions directly on the package after import (e.g., guacamoleETL.transform), they must be imported or defined in __init__.py; a minimal sketch follows this list.
  • The functions are closely connected: transform takes the result of pre-processing (extract and clean up), and load takes the result of transform, so the flow is easier to follow when they all live in the same file.
  • This might not be the best possible architecture, but when starting small, simplicity is a reasonable trade-off.
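
As a structural illustration of the first point, a package whose functions are defined (or re-exported) in __init__.py exposes them directly on the package name. The docstring-only layout below is a sketch, not the actual guacamoleETL source.

# guacamoleETL/__init__.py -- structural sketch only, not the actual source

def extract_data(txt_file):
    """Extract data from a .txt file into a temporary .csv file."""
    ...

def clean_up():
    """Remove rows containing the '-' (NA) placeholder."""
    ...

def transform(path):
    """Transform the pre-processed data into a matrix (list of lists)."""
    ...

def load(path):
    """Write the transformed data to output.csv."""
    ...

Because these names live at package level, import guacamoleETL is enough to call guacamoleETL.transform(...) and guacamoleETL.load(...) directly, matching the usage shown above.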

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

guacamoleETL-0.3.0.tar.gz (3.1 kB)

Uploaded Source

Built Distribution

guacamoleETL-0.3.0-py3-none-any.whl (6.5 kB)

Uploaded Python 3

File details

Details for the file guacamoleETL-0.3.0.tar.gz.

File metadata

  • Download URL: guacamoleETL-0.3.0.tar.gz
  • Upload date:
  • Size: 3.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.5.2

File hashes

Hashes for guacamoleETL-0.3.0.tar.gz

  • SHA256: 2c5ccdcfec63b51a6c6e39ff6177b3b17f297e53927e12c7fbc178319b50c47a
  • MD5: a331408b17a67742197856510acd52ee
  • BLAKE2b-256: fef8f32e0da8ab4a45f488d7dd1cb49b31664624e831e45c5ef552d9e9e1e0fe

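To check a downloaded file against the hashes above, a small hashlib snippet is sufficient; the file path below is just an example.

import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the printed digest with the SHA256 value listed above.
print(sha256_of('guacamoleETL-0.3.0.tar.gz'))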

File details

Details for the file guacamoleETL-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: guacamoleETL-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 6.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.5.2

File hashes

Hashes for guacamoleETL-0.3.0-py3-none-any.whl

  • SHA256: c8cd67437a41fb259151da2682366e5c6806c5cb8316f02b3e97aaeb24a57fb3
  • MD5: 175c6fd4fa2ef3002f7a12d929b355ff
  • BLAKE2b-256: bcf4c0508079f244172f00f7353c59999e2ced8ab4e9c5b662facdcc0ae554b1

