ETL package for AUTO1 challenge
guacamoleETL - Document
An ETL pipeline tool that
- pre-processes the input, which covers extraction and clean-up
- extracts data from a .txt file (Challenge_me.txt)
- cleans up the data by excluding rows with invalid information
- transforms the data according to the given specifications into a matrix (list of lists)
- loads the data into a .csv file (output.csv)
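The pre-processing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: the function names extract_rows and drop_invalid, the semicolon delimiter, the header row, and the sample values are all assumptions, since the layout of Challenge_me.txt is not described here.

```python
import csv
import io

NA_PLACEHOLDER = "-"  # the README describes '-' as the NA placeholder


def extract_rows(text, delimiter=";"):
    """Parse delimited text into dicts, stripping surrounding whitespace.

    The delimiter and the presence of a header row are assumptions.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    return [
        dict(zip(header, (cell.strip() for cell in row)))
        for row in reader
        if row  # csv.reader yields an empty list for a blank line
    ]


def drop_invalid(rows, required_columns):
    """Keep only rows with no NA placeholder in the required columns."""
    return [
        row for row in rows
        if all(row.get(col) != NA_PLACEHOLDER for col in required_columns)
    ]


# Made-up sample input: the second data row has no horsepower value.
sample = (
    "make; horsepower; price\n"
    " audi ; 102,5 ; 1395000 \n"
    "bmw; -; 1643000\n"
)
rows = drop_invalid(extract_rows(sample), ["horsepower", "price"])
print(rows)
```

Only the audi row survives, because the bmw row carries the placeholder in one of the required columns.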
Installation
This tool can be installed with pip.
Copy this command and run it in a terminal:
pip install guacamoleETL
Usage
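This ETL pipeline can be used as part of predictive model training, feeding the transformed data directly to the model:
import guacamoleETL

data_file = 'Challenge_me.txt'
# Write the transformed data to a .csv file
guacamoleETL.load(data_file)
# Get the transformed data as a matrix (list of lists)
result = guacamoleETL.transform(data_file)
# model_training stands in for your own training function
predictive_model = model_training(result)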
Functions
- extract_data(txt_file):
  Extract data from a .txt file to a temporary .csv file.
  Leading and trailing whitespace is removed during the extraction.
- clean_up():
  Clean up the data with invalid information.
  Rows with the placeholder '-' (NA) in any of the specified columns are excluded.
- transform(path):
  Transform the pre-processed data according to the given specifications into a matrix.
  - engine-location is split into two one-hot-encoded columns, engine-location_front and engine-location_rear
  - num-of-cylinders is transformed from a word into an integer through a pre-defined dictionary
  - engine-size is transformed into an integer
  - weight is transformed into an integer
  - horsepower is transformed from a German decimal notation string into a float
  - aspiration is replaced by aspiration_turbo, so that turbo engines are marked as 1
  - price is converted from minor units to major units
  - make is not transformed but kept in the dataset
- load(path):
  Load the data from the previous transformation into a .csv file.
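The column rules above, together with the final load step, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the package's implementation: the names transform_row, parse_german_decimal, load_matrix, CYLINDER_WORDS and HEADER are hypothetical, as are the output column order, the raw values 'front', 'rear' and 'turbo', the contents of the word dictionary, and the factor of 100 between minor and major units. The sample row is made up.

```python
import csv
import io

# Hypothetical word-to-number mapping; the package's own dictionary is not shown.
CYLINDER_WORDS = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}

# Assumed output column order.
HEADER = [
    "engine-location_front", "engine-location_rear", "num-of-cylinders",
    "engine-size", "weight", "horsepower", "aspiration_turbo", "price", "make",
]


def parse_german_decimal(text):
    """Convert strictly German notation ('1.234,5') into a float."""
    return float(text.replace(".", "").replace(",", "."))


def transform_row(row):
    """Apply the per-column rules to one cleaned row (a dict of strings)."""
    location = row["engine-location"]
    return [
        1 if location == "front" else 0,           # engine-location_front
        1 if location == "rear" else 0,            # engine-location_rear
        CYLINDER_WORDS[row["num-of-cylinders"]],   # word to integer
        int(row["engine-size"]),
        int(row["weight"]),
        parse_german_decimal(row["horsepower"]),
        1 if row["aspiration"] == "turbo" else 0,  # aspiration_turbo
        int(row["price"]) / 100,                   # assumes 100 minor units per major unit
        row["make"],                               # kept unchanged
    ]


def load_matrix(matrix, header, handle):
    """Write the header and the matrix rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(matrix)


sample_row = {
    "engine-location": "front", "num-of-cylinders": "four",
    "engine-size": "130", "weight": "2548", "horsepower": "111,0",
    "aspiration": "std", "price": "1349500", "make": "alfa-romero",
}
matrix = [transform_row(sample_row)]

# An in-memory buffer keeps the example free of side effects; a real file
# would be opened with open("output.csv", "w", newline="").
buffer = io.StringIO()
load_matrix(matrix, HEADER, buffer)
print(buffer.getvalue())
```

Keeping the row-level rules in one function makes each rule easy to check in isolation before the whole matrix is written out.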
Architecture
All the functions are implemented in __init__.py. This decision was made for the following reasons:
- After the package is imported, the transform and load functions can only be called directly on the package (for example guacamoleETL.transform) if they are imported or defined in __init__.py.
- The functions are connected to each other: the transform function takes the result of the pre-processing (extract and clean up), and the load function takes the result of the transform function. The flow is easier to follow when they are all in the same file.
- This might not be the best architecture, but when starting small, simplicity is always a good consideration.
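For comparison, the "imported in __init__.py" option mentioned in the first reason can be demonstrated with a throwaway package. The package name demo_etl, the submodule pipeline.py, and the stub transform function are all hypothetical; the sketch only shows how a re-export in __init__.py makes a submodule's function callable directly on the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package named demo_etl (hypothetical) in a temp directory.
root = Path(tempfile.mkdtemp())
package = root / "demo_etl"
package.mkdir()

# The logic lives in a submodule...
(package / "pipeline.py").write_text(
    "def transform(path):\n"
    "    return [['transformed', path]]\n"
)
# ...and __init__.py re-exports it, so callers can write demo_etl.transform(...).
(package / "__init__.py").write_text("from .pipeline import transform\n")

# Make the temp directory importable and load the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_etl = importlib.import_module("demo_etl")

print(demo_etl.transform("Challenge_me.txt"))
```

Without the re-export line, demo_etl.transform would raise an AttributeError and callers would have to import demo_etl.pipeline explicitly, which is the trade-off the single-file layout avoids.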