
Layered extractor of data

Project description

dExtract is an object-oriented framework for applying transformations to data sequentially.

The minimal structure requires:

1. An object of the Sequential class
2. Objects inheriting from the BaseLayer class

Data flows between layers by sending the transform output to the subsequent layer. The model can be run completely or from a specific index.
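The flow described above can be sketched in a few lines. This is a minimal illustration, not dExtract's actual code: the names Layer, Pipeline, run and start are hypothetical stand-ins chosen for this example.

```python
class Layer:
    """Hypothetical base layer: subclasses override transform()."""

    def transform(self, data):
        raise NotImplementedError


class Strip(Layer):
    def transform(self, data):
        return [item.strip() for item in data]


class Upper(Layer):
    def transform(self, data):
        return [item.upper() for item in data]


class Pipeline:
    """Hypothetical stand-in for a sequential model."""

    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def run(self, data, start=0):
        # Each layer receives the output of the previous one.
        # `start` skips the layers before that index.
        for layer in self.layers[start:]:
            data = layer.transform(data)
        return data


pipeline = Pipeline()
pipeline.add(Strip())
pipeline.add(Upper())

full = pipeline.run([" a ", "b "])        # runs Strip, then Upper
partial = pipeline.run([" a "], start=1)  # runs Upper only
```

Running from an index is useful when the output of an earlier layer is already available and only the later layers need to be re-applied.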

Sequential Class

The Sequential class describes a pipeline of transformations.

- Layers are saved in a dictionary keyed by layer type and a per-type count.
- The summary method returns the components of the model for visualization.
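One way to picture this storage scheme is sketched below. The key format ("Cleaner_2"), the class name Registry and the output of summary are assumptions made for illustration; the source does not show dExtract's actual key format or summary output.

```python
from collections import Counter


class Extractor:
    """Placeholder layer used only for this sketch."""


class Cleaner:
    """Placeholder layer used only for this sketch."""


class Registry:
    """Hypothetical layer store keyed by layer type and per-type count."""

    def __init__(self):
        self.layers = {}          # insertion-ordered: key -> layer
        self._counts = Counter()  # how many layers of each type so far

    def add(self, layer):
        kind = type(layer).__name__
        self._counts[kind] += 1
        key = f"{kind}_{self._counts[kind]}"
        self.layers[key] = layer
        return key

    def summary(self):
        # One line per layer, in the order the layers were added.
        return "\n".join(f"{i}: {key}" for i, key in enumerate(self.layers))


registry = Registry()
registry.add(Extractor())
registry.add(Cleaner())
registry.add(Cleaner())
report = registry.summary()
```

Because Python dictionaries preserve insertion order, the same dictionary can serve both as a lookup by name and as the ordered list of pipeline steps.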

Types of Layers

There are 4 main types of layers inheriting from the BaseLayer.

1. Slicer (DataLayer)

The Slicer layer requires a data input, which is split into a series of ‘slices’ based on data sparsity/density. Row and column slices are calculated and then combined to find ‘boxed areas’.

The output can be defined as a list or as a search dictionary, which allows specific values in the resulting slices to be identified quickly.

The type of each row and column is defined by simple statistical thresholds. These can be customized through user parameters.

There are 3 main data types: DATA (2), HEADING (1), EMPTY (0)
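As a rough sketch of what threshold-based typing could look like, the function below labels one row (or column) of cells. The rule and both thresholds (min_fill, min_numeric) are assumptions for illustration only; the source does not specify which statistics dExtract uses.

```python
EMPTY, HEADING, DATA = 0, 1, 2


def is_number(cell):
    """True if the cell can be read as a number."""
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False


def classify(cells, min_fill=0.5, min_numeric=0.5):
    """Label a row or column as EMPTY, HEADING or DATA (hypothetical rule)."""
    filled = [c for c in cells if c not in (None, "")]
    # Too sparse: treat the whole row or column as empty.
    if not cells or len(filled) / len(cells) < min_fill:
        return EMPTY
    # Mostly numeric cells suggest data; mostly text suggests a heading.
    numeric = sum(is_number(c) for c in filled)
    return DATA if numeric / len(filled) >= min_numeric else HEADING


row_types = [
    classify(["", None, "", "note"]),
    classify(["Country", "Year", "Value"]),
    classify(["FR", "2019", "3.5"]),
]
```

Applying such a function to every row and every column yields the sequence of types that the slicing step consumes.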

The data is then divided according to the DECISIONS matrix in slicing.py.

Decisions matrix to create slices, with (prev_type, cur_type) as key:

1. start_head (SH) - Begin counting records for a new HEADING section
2. end_head (EH) - Finish counting records for current HEADING section
3. start_data (SD) - Begin counting records for a new DATA section
4. end_data (ED) - Finish counting records for current DATA section
5. start_slice (SS) - Begin counting records for a new SLICE
6. end_slice (ES) - Finish counting records for current SLICE

SH, EH, SD, ED, SS, ES = 0, 1, 2, 3, 4, 5

DECISIONS = {(-1,0): [0,0,0,0,0,0],
             (-1,1): [1,0,0,0,1,0],
             (-1,2): [0,0,1,0,1,0],
             (0,0) : [0,0,0,0,0,0],
             (0,1) : [1,0,0,0,1,0],
             (0,2) : [0,0,1,0,1,0],
             (1,0) : [0,1,0,0,0,1],
             (1,1) : [0,0,0,0,0,0],
             (1,2) : [0,1,1,0,0,0],
             (2,0) : [0,0,0,1,0,1],
             (2,1) : [1,0,0,1,1,1],
             (2,2) : [0,0,0,0,0,0]}
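To show how such a table can drive the slicing, here is a sketch of a walker over a sequence of row types. The DECISIONS table is copied from above. The function find_slices, its return format, and the trailing EMPTY sentinel are assumptions for illustration, not the code in slicing.py.

```python
SH, EH, SD, ED, SS, ES = 0, 1, 2, 3, 4, 5

DECISIONS = {(-1, 0): [0, 0, 0, 0, 0, 0],
             (-1, 1): [1, 0, 0, 0, 1, 0],
             (-1, 2): [0, 0, 1, 0, 1, 0],
             (0, 0): [0, 0, 0, 0, 0, 0],
             (0, 1): [1, 0, 0, 0, 1, 0],
             (0, 2): [0, 0, 1, 0, 1, 0],
             (1, 0): [0, 1, 0, 0, 0, 1],
             (1, 1): [0, 0, 0, 0, 0, 0],
             (1, 2): [0, 1, 1, 0, 0, 0],
             (2, 0): [0, 0, 0, 1, 0, 1],
             (2, 1): [1, 0, 0, 1, 1, 1],
             (2, 2): [0, 0, 0, 0, 0, 0]}


def find_slices(types):
    """Return slices with half-open (start, end) ranges (hypothetical format)."""
    slices, current = [], None
    head_start = data_start = None
    prev = -1  # -1 is the key used in the table before the first record
    # A trailing EMPTY (0) lets the table itself close any open slice.
    for i, cur in enumerate(list(types) + [0]):
        flags = DECISIONS[(prev, cur)]
        # Apply the "end" actions before the "start" actions.
        if flags[EH]:
            current["head"] = (head_start, i)
        if flags[ED]:
            current["data"] = (data_start, i)
        if flags[ES]:
            current["end"] = i
            slices.append(current)
            current = None
        if flags[SS]:
            current = {"start": i, "end": None, "head": None, "data": None}
        if flags[SH]:
            head_start = i
        if flags[SD]:
            data_start = i
        prev = cur
    return slices


# EMPTY, HEADING, DATA, DATA, EMPTY, DATA, HEADING, DATA
result = find_slices([0, 1, 2, 2, 0, 2, 1, 2])
```

Note that a HEADING followed directly by DATA, key (1,2), ends the heading and starts the data without starting a new slice, so a header and its body stay in the same slice.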

2. Cleaner (DataLayer)

The Cleaner layer will apply predefined transformations based on kwargs. Each transformation is applied on an individual basis.
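A common way to implement this kind of keyword-driven behaviour is a lookup table from keyword name to function. The sketch below assumes that design; the transformation names drop_empty_rows and strip_cells are hypothetical and are not taken from clean.py.

```python
def drop_empty_rows(rows):
    """Remove rows in which every cell is None or an empty string."""
    return [row for row in rows if any(c not in (None, "") for c in row)]


def strip_cells(rows):
    """Strip surrounding whitespace from string cells."""
    return [[c.strip() if isinstance(c, str) else c for c in row]
            for row in rows]


# Hypothetical registry: keyword name -> transformation function.
TRANSFORMS = {
    "drop_empty_rows": drop_empty_rows,
    "strip_cells": strip_cells,
}


def clean(rows, **kwargs):
    """Apply each enabled transformation, one at a time, in keyword order."""
    for name, enabled in kwargs.items():
        if name not in TRANSFORMS:
            raise KeyError(f"unknown transformation: {name}")
        if enabled:
            rows = TRANSFORMS[name](rows)
    return rows


cleaned = clean([[" a ", ""], ["", None]],
                strip_cells=True, drop_empty_rows=True)
```

With this layout, adding a transformation means writing one function and registering it under a new keyword.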

The clean.py helper includes all transformations and is easily extensible. Open the helper for a complete definition of each transformation.

3. Extractor (BaseLayer)

The Extractor layer provides an interface to retrieve external data.

As of v0.9dev1, csv, xl (Excel) and single Excel sheet extraction are supported.
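A dispatch on the extraction type could look like the sketch below. The function extract and the type name "sheet" are assumptions for illustration; only "csv" and "xl" appear in the text above. The Excel branches rely on pandas.read_excel, which needs pandas and an Excel engine installed, so pandas is imported only when those branches run.

```python
import csv
import os
import tempfile


def extract(ext_type, path, sheet_name=0):
    """Load external data by type (hypothetical sketch)."""
    if ext_type == "csv":
        with open(path, newline="") as handle:
            return list(csv.reader(handle))
    if ext_type == "xl":
        import pandas as pd  # third-party; imported lazily
        # sheet_name=None makes pandas return a dict of all sheets.
        return pd.read_excel(path, sheet_name=None)
    if ext_type == "sheet":
        import pandas as pd
        return pd.read_excel(path, sheet_name=sheet_name)
    raise ValueError(f"unsupported extraction type: {ext_type}")


# Demonstrate the csv branch with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as handle:
        handle.write("a,b\n1,2\n")
    rows = extract("csv", sample)
```

Reading a whole workbook returns a dictionary of sheets, which is the kind of nested input a later flattening step has to handle.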

Future development includes:

- SAS files through the SAS7BDAT package
- Asynchronous feed-forward extraction (allow the model to run in chunks)
- Web scraping (both files and websites)

4. Flattener (BaseLayer)

The Flattener layer transforms a nested dictionary of data into a single level dictionary.

It converts all inputs into DataFrames, treats the dictionary keys along each path as ‘levels’, and concatenates them into a DataFrame ID.

Based on the input names dictionary or list, each dataframe is then assigned a new name matching the resulting ID.
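The flattening and renaming steps can be sketched with plain dictionaries. This is an illustration under assumptions: the separator "_", the function names flatten and rename, and the use of strings in place of DataFrames are all choices made for this example, and only the dictionary form of the names input is handled.

```python
def flatten(nested, parent=(), sep="_"):
    """Collapse a nested dict into one level, joining keys into an ID."""
    flat = {}
    for key, value in nested.items():
        path = parent + (str(key),)
        if isinstance(value, dict):
            # Descend one level, carrying the keys seen so far.
            flat.update(flatten(value, path, sep))
        else:
            flat[sep.join(path)] = value
    return flat


def rename(flat, names):
    """Replace IDs that appear in `names`; keep the others unchanged."""
    return {names.get(key, key): value for key, value in flat.items()}


# Strings stand in for the DataFrames a real pipeline would carry.
nested = {"file": {"sheet1": {0: "table0", 1: "table1"}, "sheet2": "table2"}}
flat = flatten(nested)
named = rename(flat, {"file_sheet2": "summary"})
```

Each resulting ID records the full path of keys that led to the value, so the origin of every table remains visible after flattening.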


Sample Usage

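# Sequential, Extractor and Cleaner come from dextract; ext_type, path, file,
# kwargs, country and date are supplied by the caller.
model = Sequential()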
model.add(Extractor(ext_type, path, file, **kwargs))

model.add(Cleaner(clean_type = 'data', ignore_empty_cols = True,
                  ignore_empty_rows = True, delete_by_threshold = 0.82))

model.add(Cleaner(treat_axis_as_data = 'both', header_row = 0,
                  delete_escape_chars = True, drop_empty_columns = True))

model.add(Cleaner(compress_header = True, columns_as_row = True,
                  treat_axis_as_data = 'columns'))

model.add(Cleaner(index_as_col = True, transpose_output = True,
                  rename_columns = {'iH':'Row_ID', 'iH_y': 'Measure',
                                    'value':'Value'},
                  add_columns = {'Sheet': kwargs.get('sheet_name'),
                                 'File_Name': file,
                                 'Country': country,
                                 'File_Date': date,}))
