Extract, Transform and Load pipeline application

# twiddlepy

`twiddlepy` is a Python library for building end-to-end extract, transform and load (ETL) pipelines. Using a mapper file
and optional user-defined functions, your data can be transformed into a better-suited format.

## Features

- Extract, Transform and Load pipelines
- Multiple datasource options for extracting data
- Multiple repository options for loading data
- Support for mapping input data

## Installation

Twiddlepy is available on PyPI:

`pip install twiddlepy`

Alternatively, to install directly from the repository, run `python setup.py install`, or drop the `twiddlepy` directory anywhere on your `PYTHONPATH`.

## Connectors

There are a number of data repository connectors available with Twiddlepy. Currently implemented connectors include:

### Data Source (Input)

- File Based
  - CSV
  - Excel Document
  - Support for custom file loading (e.g. HTML)
- Database
  - MySQL
  - MSSQL
  - Oracle
  - SQLite
  - MongoDB

### Repository (Output)

- File Based
  - CSV
- Apache Solr

## Usage

Create a runnable Python file (for example `run.py`) with the following code:

```python
from twiddlepy.config import config
from twiddlepy.driver import TwiddleDriver

driver = TwiddleDriver(config)
driver.process_data()
```

### Example Project Structure

```
.
|-- mapper
|   |-- mapper.csv
|-- local_functions.py
|-- run.py (File that runs Twiddle)
|-- twiddle.cfg
```

### User Configuration

Importing `config` from `twiddlepy.config` loads the default configuration items for each of the processes,
and also looks for a user-defined configuration file in the directory the application is run from.

All of the configuration items, including their default values, can be found [here](twiddlepy/data/twiddle_defaults.cfg).
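
The layering described above (defaults first, then a user file that overrides them) can be illustrated with the standard library's `configparser`. This is only a sketch of the general pattern, not twiddlepy's own loader: the section and option names (`datasource`, `type`, `path`) are hypothetical, and the real option names are the ones listed in `twiddle_defaults.cfg`.

```python
import configparser

# Hypothetical defaults; the real ones ship in twiddlepy/data/twiddle_defaults.cfg.
DEFAULTS = """
[datasource]
type = csv
path = ./input
"""

# Hypothetical user overrides, as they might appear in a local twiddle.cfg.
USER = """
[datasource]
path = ./my_data
"""


def load_layered(defaults_text, user_text):
    """Read the defaults first, then let user values override matching options."""
    parser = configparser.ConfigParser()
    parser.read_string(defaults_text)
    parser.read_string(user_text)  # a later read overrides earlier values
    return parser


cfg = load_layered(DEFAULTS, USER)
print(cfg["datasource"]["type"])  # value kept from the defaults
print(cfg["datasource"]["path"])  # value replaced by the user file
```

With this pattern a user file only needs to list the options it changes; everything else falls back to the defaults.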

### Mapper File

A user-defined mapper file describes the input data that will be extracted from
the data source. The mapper file is a CSV file whose columns specify the data mappings.
The following columns must be defined in the mapper:

| Column Name | Description | Options |
| :-------------------: | :-----------------------------------------------------------------: | :----------------------------------------------------: |
| dataset | The dataset twiddlepy will use mappings for | Any name (string) |
| source_field_name | A name of a source field | Any name (string) |
| source_field_type | The data type of the source field | One of: "str", "int", "float", "double", "timestamp" |
| allow_missing | Allow the column to be missing in the dataset | One of: "y", "n" (Yes or No) |
| min | Data Validation: minimum allowed value | Any numeric value |
| max | Data Validation: maximum allowed value | Any numeric value |
| allowed_values | Data Validation: list of allowed values | Any array of values |
| unit | The unit the column is represented by | Any name (string) e.g. kg |
| repository | The repository name the column belongs to | Any name (string) |
| repository_field_name | The name the column will be renamed to for data loading | Any name (string) |
| repository_field_type | The data type that will be applied to the column when loading | One of: "string", "integer", "float", "double", "date" |
| ignore | Mark column to be ignored by the mapping process (for historic datasets) | One of: "y", "n" (Yes or No) |
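
As a rough illustration of what such a file might look like, the sketch below writes a small mapper CSV with the columns from the table and checks that none of them is missing. The dataset, field and repository names are made up for the example, and leaving validation cells such as `min`, `max` and `allowed_values` empty is an assumption; how twiddlepy treats empty cells, and the exact format it expects for `allowed_values`, is not covered here.

```python
import csv
import io

# Column names taken from the table above, in the same order.
MAPPER_COLUMNS = [
    "dataset", "source_field_name", "source_field_type", "allow_missing",
    "min", "max", "allowed_values", "unit",
    "repository", "repository_field_name", "repository_field_type", "ignore",
]

# Hypothetical mapping rows for an illustrative "weather" dataset.
ROWS = [
    {"dataset": "weather", "source_field_name": "temp",
     "source_field_type": "float", "allow_missing": "n",
     "min": "-50", "max": "60", "unit": "C",
     "repository": "weather_repo", "repository_field_name": "temperature",
     "repository_field_type": "float", "ignore": "n"},
    {"dataset": "weather", "source_field_name": "station",
     "source_field_type": "str", "allow_missing": "y",
     "repository": "weather_repo", "repository_field_name": "station_name",
     "repository_field_type": "string", "ignore": "n"},
]


def build_mapper_csv(rows):
    """Serialise mapping rows to CSV text; unset cells are written empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MAPPER_COLUMNS,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def missing_columns(csv_text):
    """Return the required mapper columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    return [name for name in MAPPER_COLUMNS if name not in present]


mapper_text = build_mapper_csv(ROWS)
print(mapper_text)
print("Missing columns:", missing_columns(mapper_text))
```

The resulting text could be saved as `mapper/mapper.csv` to match the example project structure.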

## Contribute

As a company, we welcome any input to fix or improve the project. We don't currently have a style guide,
but we plan to add one in the future. We're very interested to hear
what you think about Twiddlepy and any improvements you would like to see, so please raise any issues in the tracker!

## Contact

Got a problem or query and want to discuss it with us directly? Contact us at <info@mediaintegration.co.uk>. We also have a website with more
information about the company [here](http://www.mediaintegration.co.uk).

