No project description provided

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

csv-etl

A rules based approach to performing ETL operations on csv files.

Installation

Requires Python 3

Using pip

pip install csv-etl

Building from source

git clone https://github.com/winslowdibona/csv_etl.git
cd csv_etl
pip install -r requirements.txt
python setup.py bdist_wheel
pip install dist/csv_etl-0.0.1-py3-none-any.whl

Quick Introduction

The goal of this project is to provide a re-usable way to perform ETL operations on csv files. This implementation takes a given row of data from a csv file and applies a set of rules to generate a new format of the data. A rule knows three things

Where the original data is coming from (source)
Where the new data is going to (target)
The data type the new data is expected to conform to (output_type)
- output_type can be the following options and will result in the correlating python types
  - Integer : int
  - Decimal : decimal.Decimal
  - String : str
  - Date : datetime.datetime
Operations to perform on the retrieved data (operations)
- The operations is a list of python operations to perform on the data
- Each operation will run through a python eval statement to derive the calculated value
- The values are passed into the eval statement as a list under the variable name s
- Example
  - values = [2, 3]
  - operations = ["s[0] * s[1]"]
  - We would get a value of 6

The root Rule class in this library contains those three properties.

The Rule class is subclassed, and the subclasses are where the logic for extracting/modifying the data takes place.

The current Rule subclasses are

StaticRule
- Use Case: Static value for every row
SingleSourceRule
- Use Case: Extract value from source column, place in new target column
DateRule
- Use Case: Gather date part values from multiple columns, generate date value
CalculationRule
- Use Case: Perform arithmetic calculations with values from columns

Usage

With a config file

from csv_etl.rules import load_rules_from_yaml
from csv_etl import CSVConverter

rules = load_rules_from_yaml("path/to/config/file")
csv_converter = CSVConverter(rules)
result = csv_converter.convert("path/to/csv/file")

Configuration examples can be found here

The basic structure of the configuration file is:

rules:
  -
    target: The new key
    operation: extract || static
    to_type: Integer || String || Decimal || Date
    source: Where to pull data from
    operations: list of python operations to perform

Creating rules programmatically

from csv_etl.rules import (
  StaticRule,
  SingleSourceRule,
  DateRule,
  OutputType
)
from csv_etl import CSVConverter

rules = [
  StaticRule(
  		source='kg',
  		target='Unit',
  		output_type=OutputType.String
  ),
  SingleSourceRule(
  		source='Order Number',
  		target='OrderId',
  		output_type=OutputType.Integer
  ),
  DateRule(
		source={'day': 'Day', 'month': 'Month', 'year': 'Year'},
		target='OrderDate',
		output_type=OutputType.Date
  ),
  CalculationRule(
		source=['num1', 'num2'],
		target='Target',
		output_type=OutputType.Decimal,
		operations=['s[0] * s[1]']
    )
]
csv_converter = CSVConverter(rules)
result = csv_converter.convert("path/to/csv_file")

# By default, the result is a json array. But you can receive is back as a csv string
result = csv_converter.convert("path/to/csv/file", to='csv')

# You can also write the result to a specified file
result = csv_converter.convert("path/to/csv/file", to='csv', outfile="path/to/outfile")

Example Use Case

A more detailed use case of this library can be found here

Helpful Make Commands

Generating Documentation

make docs
open html/csv_etl/index.html

Running Tests

make pytest

Getting Test Coverage

make test-cov
open cov_html/index.html

Docker Playground

If you have Docker installed you can run

make play

to get a python 3 terminal running with the contents of the directory copied over. This can be useful for doing some quick testing.

What Next?

Expand on Date Rule

The current implementation of the DateRule is fairly rigid. It only covers the use case of there being day/month/year columns in the given csv. It also assumes the data values are numbers and not string representations of the months like March

This could be expanded upon to

Handle month strings
Handle different date formats
- YYYY-mm-dd
- mm-dd-YYYY
- dd-mm-YYYY

Better error handling

Currently the errors are just printed to the console. Might be nice to have an option to gather them and have them represented in the resulting data set somehow.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.1.3

Dec 18, 2020

0.1.2

Dec 18, 2020

0.1.1

Dec 17, 2020

0.1.0

Dec 17, 2020

0.0.3

Dec 17, 2020

This version

0.0.2

Dec 16, 2020

0.0.1

Dec 16, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

csv_etl-0.0.2.tar.gz (6.0 kB view hashes)

Uploaded Dec 16, 2020 Source

Built Distribution

csv_etl-0.0.2-py3-none-any.whl (6.3 kB view hashes)

Uploaded Dec 16, 2020 Python 3

Hashes for csv_etl-0.0.2.tar.gz

Hashes for csv_etl-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`e5e1ceda78d857b722d2698bf4857b7feb927f47095e3b49cc377d8b91fad1e0`
MD5	`b0aa3486ea471090099c7b577894510f`
BLAKE2b-256	`c88bd5d3194f604a3e861581db862929c920e01d7899f0c29974445f92486dc3`

Hashes for csv_etl-0.0.2-py3-none-any.whl

Hashes for csv_etl-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c59ebbb11a2c5842a877994bcce41adfa24aae263ac37e0608898bdeba97e64d`
MD5	`365dd9fdb031f01e260bf261e22dee83`
BLAKE2b-256	`3f2cf85898f0e4088b7f78409dc20f04cdc2a92d04a9fbd2e52c160e1c17981f`