
Advanced data cleaning, data wrangling and feature extraction tools for ML engineers

Project description

1 What is AI-STAC

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” — John Tukey

Augmented Intent - Single Task Accelerator components (AI-STAC) is a unique approach to data recovery, discovery, synthesis and modeling that innovates the approach to data science and its transition to production. Its origins came from an incubator project that shadowed a team of Ph.D. data scientists in connection with the development and delivery of machine learning initiatives to define measurable benefit propositions for customer success. From this, a number of observable 'capabilities' were identified as unique and separate concerns. The challenge for the data scientists, and in turn the production teams, was to effectively leverage that separation of concerns, distributing and loosely coupling the specialist capability needs to the appropriate skill sets.

In addition, the need to remove the opaque nature of the end-to-end machine learning process required better transparency and traceability, to better inform the broadest set of interested parties and to allow adaptation without leaving behind the code 'sludge' of redundant ideas. AI-STAC is a disruptive innovation that changes the way we approach the challenges of Machine Learning and Augmented Intelligence, introducing the idea of the 'Single Task Adaptive Component' around the core concept of 'Parameterised Intent'.

2 Main features

  • Machine Learning Capability Mapping

  • Parametrised Intent

  • Discovery Transitioning

  • Feature Cataloguing

  • Augmented Knowledge

3 Overview

AI-STAC is a change of approach aimed at improving the productivity of data scientists. This approach deconstructs the machine learning discovery vertical into a set of capabilities, ideas and knowledge. It presents a completely novel approach to the traditional process automation and model wrapping that is broadly offered as a solution to the considerable challenges that currently restrict the effectiveness of machine learning in the enterprise business.

To achieve this, the project offers advanced and specialized programming methods that are unique and novel in approach while maintaining familiarity with common tooling. These can be identified in five constructs.

1. Machine Learning Capability Mapping - Separation of capabilities, breaking the machine learning vertical into a set of decoupled and targeted layers of discrete and refined actions that collectively present a human-led (ethical AI) base truth to the next set of capabilities. This not only allows improved transparency of what is a messy and sometimes confusing set of discovery-orientated coded ideas, but also loosely couples and targets activities that are, generally, complex and specialized into identifiable and discrete capabilities that can be chained as separately allocated activities.

2. Parameterised Intent - A unique technique extracting the ideas and thinking of the data scientist from their discovery code and capturing it as intent with parameters that can be replayed against productionized code and data. This decoupling and separation of concerns between data, code and the intent of actions from that code on that data considerably improves time to market, code reuse, transparency of actions and the communication of ideas between data scientists and product delivery specialists.

3. Discovery Transitioning - A foundation of the separation of concerns between data provisioning and feature selection. As part of the Accelerated ML Discovery Vertical, Transitioning is a foundation base truth facilitating a transparent transition of the raw canonical dataset to a fit-for-purpose canonical dataset, enabling the optimisation of discovery analysis and the identification of features-of-interest for the data scientist, and creating a boundary separation of capabilities that decouples the Data Scientist from the Data Engineer. As output it also provides 'intelligent communication', not only to the Data Scientist through canonical fit-for-purpose datasets, but more generally offering powerful visual discovery tools and artefact generation for production architects, data and business SMEs and Stakeholders, and it is the initiator of Augmented Knowledge for an enriched and transparent shared view of the extended data knowledge.

4. Feature Cataloguing - With crossover skills within machine learning and advanced data heuristics, investigation identified commonality and separation across customer engagements that particularly challenged our Ph.D. data scientists in their effective delivery of customer success. As a result the project designed and developed Feature Cataloguing, a machine learning technique of extracting and engineering features and their characteristics, appropriately parameterised for model selection. This technique implements a juxtaposed view of how features are characterised and presented to the modelling layer. Traditionally features are directly mapped as a representation of the underlying data set; Feature Cataloguing instead treats each individual feature as its own set of characteristics as its representation. The resulting outcome considerably improves experimentation, cross-feature association, even when unrelated in the original data sets, and the reuse of identified features-of-interest across use cases and business domains.

5. Augmented Knowledge - This is the ability to capture information on data, activities and the rich stream of subject matter expertise, injected into the machine learning discovery vertical to provide an augmented n-view of the model build. This includes security, sensitivity, data value scaling, dictionary, observations, performance, optimisation, bias, etc. This enriched view of data allows, amongst other things, improved knowledge share, AI explainability, feature transparency, and accountability that feeds into AI ethics and insight analysis.

4 Background

Born out of the frustration of time constraints and the inability to show business value within a business expectation, this project aims to provide a set of tools to quickly produce visual and observational results. It also aims to improve the communication outputs needed by ML delivery to talk to Pre-Sales, Stakeholders, Business SMEs, Data SMEs, product coders and tooling engineers while still remaining within familiar code paradigms.

The package looks to build a set of outputs as part of standard data wrangling and ML exploration that, by their nature, are familiar tools to the various reliant people and processes. For example, data dictionaries for SMEs, visual representations for clients and stakeholders, and configuration contracts for architects, tool builders and data ingestion.

4.1 Discovery Transition

Discovery Transition is the first and key part of an end-to-end process of discovery, productization and tooling. It defines the 'intelligence' and business differentiators of everything downstream.

To become effective in the Discovery Transition phase, the ability to micro-iterate within distinct layers enables the needed adaptive delivery and quicker returns on the ML use case.

The building and discovery of an ML model can be broken down into three Separation of Concerns (SoC) or Scope of Responsibility (SoR) for the ML engineer and ML model builder.

  • Data Preparation

  • Feature Engineering

  • Model selection and optimisation

with a fourth discipline of insight, interpretation and profiling as an outcome. These three SoCs can be perceived as eight distinct disciplines.

4.2 Conceptual Eight stages of Model preparation

  1. Connectivity (data sourcing and persisting, fit-for-purpose, quality, quantity, veracity, connectivity)

  2. Data Discovery (filter, selection, typing, cleaning, valuing, validating)

  3. Augmented Knowledge (observation, visualisation, knowledge, value scale)

  4. Data Attribution (attribute mapping, quantitative attribute characterisation, predictor selection)

  5. Feature Engineering (feature modelling, dirty clustering, time series, qualitative feature characterisation)

  6. Feature Framing (hypothesis function, specialisation, custom model framing, model/feature selection)

  7. Model Train (selection, optimisation, testing, training)

  8. Model Predict (learning, feedback loops, opacity testing, insight, profiling, stabilization)

Though conceptual, they do represent a set of needed disciplines and the complexity of the journey to quality output.

4.3 Layered approach and Capability Mapping

The idea behind the conceptual eight stages of Machine Learning is to layer the preparation and reuse of the activities undertaken by the ML Data Engineer and ML Modeller, providing a platform for micro-iterations rather than a constant repetition of repeatable tasks through the stack. It also facilitates contractual definitions between the different disciplines that allow loose coupling and automated regeneration of the different stages of model build. Finally, it reduces the cross-discipline commitments by creating a 'by-design' set of contracts targeted at, and written in, the language of the consumer.

The concept is to be able to quickly run over a single aspect of ML discovery and then present a stable base for the next layer to iterate against. This micro-iteration approach allows for quick-to-market adaptive delivery.

5 Getting Started

The discovery-transition-ds package is a Python/pandas implementation of the AI-STAC Transition component, specifically aimed at Python, NumPy and Pandas based Data Science activities. It is built to be very lightweight in terms of package dependencies, requiring nothing beyond what would be found in a basic Data Science environment. It is designed to be used easily within multiple Python-based interfaces such as Jupyter, an IDE or command-line Python.

6 Installation

6.1 package install

The best way to install AI-STAC component packages is directly from the Python Package Index repository using pip. All AI-STAC components are based on a pure Python foundation package, aistac-foundation:

$ pip install aistac-foundation

The AI-STAC component package for the Transition is discovery-transition-ds and pip installed with:

$ pip install discovery-transition-ds

If you want to upgrade your current version then use pip install with the --upgrade flag:

$ pip install --upgrade discovery-transition-ds

6.2 First Time Env Setup

In order to ease the startup of tasks a number of environment variables are available to pre-assign where and how configuration and data can be collected. This can considerably reduce the burden of setup and help in the migration of the outcome contracts between environments.

In this section we will look at a couple of primary environment variables and demonstrate later how these are used in the Component. In the following example we are assuming a local file reference but this is not the limit of how one can use the environment variables to locate data from multiple different connection mediums. Examples of other connectors include AWS S3, Hive, Redis, MongoDB, Azure Blob Storage, or specific connectors can be created very quickly using the AI-STAC foundation abstracts.

If you are on linux or MacOS:

  1. Open the current user’s profile into a text editor.

$> vi ~/.bash_profile

2. Add the export command for each environment variable, setting your preferred paths. In this example I am setting them to a demo projects folder:

# where to find the properties contracts
export HADRON_PM_PATH=~/projects/demo/contracts

# The default path for the source and the persisted data
export HADRON_DEFAULT_PATH=~/projects/demo/data

3. In addition to the default environment variables you can set specific component environment variables. This is particularly useful with the Transition component as source data tends to sit separately from our interim storage. For Transition you replace the DEFAULT with TRANSITION, and in this case specify that this is the SOURCE path:

# specific to the transition component source path
export HADRON_TRANSITION_SOURCE_PATH=/tmp/data/sftp

4. Save your changes.

5. Re-run your bash_profile and check the variables have been set:

$> source ~/.bash_profile
$> env

7 Transition Task - Setup

The Transition Component is a 'Capability' component and a 'Separation of Concern' dealing specifically with the transition of data from the connectivity of the data source to the persistence of the 'data-of-interest' that has been prepared appropriately for the language canonical, in this case a 'Pandas DataFrame'.

In the following example we are assuming a local file reference and are using the default AI-STAC Connector Contracts for Data Sourcing and Persisting, but this is not the limit of how one can connect to data retrieval and storage. Examples of other connectors include AWS S3, Hive, Redis, MongoDB, Azure Blob Storage, or specific connectors can be created very quickly using the AI-STAC foundation abstracts.

7.1 Instantiation

The Transition class is the encapsulating class for the Transitioning Capability, providing a wrapper for transitioning functionality, and is imported as:

from ds_discovery import Transition

The easiest way to instantiate the Transition class is to use the Factory Instantiation method .from_env(...) that takes advantage of the environment variables set up in the previous section. In order to differentiate each instance of the Transition Component, we assign it a Task name that we can use going forward to retrieve or re-create our Transition instance with all its 'Intent'.

tr = Transition.from_env(task_name='demo')

7.2 Augmented Knowledge

Once you have instantiated the Transition Task it is important to add a description of the task as a future reminder and for others using this task; when using the MasterLedger component (not covered in this tutorial) it also allows for a quick reference overview of all the tasks in the ledger.

tr.set_description("A Demo task used as an example for the Transitioning tutorial")

Note: the description should be a short summary of the task. If we need to be more verbose, and as good practice, we can also add notes, that are timestamped and cataloged, to help augment knowledge about this task that is carried as part of the Property Contract.

In the Transition Component notes are cataloged within named sections:

  • source - notes about the source data that help describe what it is, where it came from and any SME knowledge of interest

  • schema - data schemas to capture and report on the outcome data set

  • observations - observations of interest or enhancements to the understanding of the task

  • actions - actions needed, to be taken or already taken within the task

Each catalog can have multiple labels which in turn can have multiple text entries, each text keyed by timestamp. Though the catalog set is fixed, labels can be any reference label.

The following example adds a description to the source catalog:

tr.add_notes(catalog='source', label='describe', text="The source of this demo is a synthetic data set")

To retrieve the list of allowed catalog sections we use the property method:

tr.notes_catalog
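
The same method can be used for any of the catalog sections. As an illustrative sketch (the label and text here are made up for the example), an observation might be captured as:

tr.add_notes(catalog='observations', label='quality', text="A number of the date columns appear to be free-text and may need a date_format hint")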

We now have our Transition instance and, had we previously set it up, it will contain the previously set Property Contract.

7.3 One-Time Connectors Settings

With each component task we need to set up its connectivity, defining the Connector Contracts which control the loose coupling between where data is sourced and persisted and the code that uses it. Though we can define each Connector Contract ourselves, it is easier to take advantage of the template connectors set up as part of the Factory initialisation method.

Though we can define as many Connector Contracts as we like, by its nature the Transition task has three key connectors that need to be set up as a 'one-off' task. Once these are set they are stored in the Property Contract and thus do not need to be set again.

7.3.1 Source Contract

Firstly we need to set up the ‘Source Contract’ that specifies the data to be sourced. Because we are taking advantage of the environment variable HADRON_TRANSITION_SOURCE_PATH we only need to pass the source file name. In this example we are also going to pass two ‘optional’ extra parameters that get passed directly to the Source reader, sep= and encoding=

tr.set_source(uri_file='demo_data.txt', sep='\t', encoding='Latin1')

7.3.2 Persist Contract

Secondly we need to specify where we are going to persist our data once we have transitioned it. Again we are going to take advantage of what our Factory Initialisation method set up for us and allow the Transition task to define our output based on constructed template Connector Contracts.

tr.set_persist()

7.3.3 Dictionary Contract

Finally, and optionally, we set up a Data Dictionary Connector that allows us to output a data dictionary of the source or persist schema to a persisted state that can be shared with other parties of interest.

tr.set_dictionary()

Now we have set up the Connector Contracts we no longer need to reference this code again, as the information has been stored in the Property Contract. We will look later at how we can report on these connectors and observe their settings.

We are ready to go. The Transition task is ready to use.

8 Transition Task - Intent

8.1 Instantiate the Task

The easiest way to instantiate the Transition class is to use the Factory Instantiation method .from_env(...) that takes advantage of the environment variables set up in the previous section. In order to differentiate each instance of the Transition Component, we assign it a Task name that we can use going forward to retrieve or re-create our Transition instance with all its 'Intent'.

tr = Transition.from_env(task_name='demo')

8.2 Loading the Source Canonical

df = tr.load_source_canonical()

8.3 Canonical Reporting

tr.canonical_report(df)

8.4 Parameterised Intent

Parameterised intent is a core concept and represents the intended action and defining functions of the component. Each method is known as a component intent, and its parameters are the task parameterisation of that intent. The intent and its parameters are saved and can be replayed using the run_intent_pipeline(canonical) method.

The following sections give a brief description of each intent method and its parameters. To retrieve the list of available intent methods in code, run:

tr.intent_model.__dir__()
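
As a minimal sketch of the pattern, and assuming the 'demo' task and source set up earlier: calling an intent method both applies the action to the canonical and, by default (subject to the save_intent parameter), records the parameterised intent so it can be replayed later. Here auto_clean_header, documented in the next section, is used as the example intent.

df = tr.load_source_canonical()
df = tr.intent_model.auto_clean_header(df, case='lower')
# replay the recorded intent against a freshly loaded canonical
df_replay = tr.intent_model.run_intent_pipeline(tr.load_source_canonical())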

8.4.1 auto_clean_header

def auto_clean_header(self, df, case=None, rename_map: dict=None, replace_spaces: str=None, inplace: bool=False,
                      save_intent: bool=None, intent_level: [int, str]=None):

    clean the headers of a pandas DataFrame replacing space with underscore

    :param df: the pandas.DataFrame whose headers are to be cleaned
    :param rename_map: a from:to dictionary of headers to rename
    :param case: changes the headers to lower, upper, title, snake. if none of these then no change
    :param replace_spaces: character to replace spaces with. Default is '_' (underscore)
    :param inplace: if the passed pandas.DataFrame should be used or a deep copy
    :param save_intent: (optional) if the intent contract should be saved to the property manager
    :param intent_level: (optional) the level of the intent,
                    If None: defaults to 0 unless the global intent_next_available is true, then -1
                    if -1: added to a level above any current instance of the intent section, level 0 if not found
                    if int: added to the level specified, overwriting any that already exists
    :return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame.
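
As an illustrative call (the parameter values here are only examples):

df = tr.intent_model.auto_clean_header(df, case='lower', replace_spaces='_')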

8.4.2 auto_drop_correlated

Uses 'brute force' techniques to remove highly correlated columns based on the threshold, set by default to 0.998.

df: the Canonical data to remove correlated columns from
threshold: (optional) the threshold correlation between columns. Default 0.998
inc_category: (optional) if category type columns should be converted to numeric representations
sample_percent: a sample percentage between 0.5 and 1 to avoid over-fitting. Default is 0.85
random_state: a random state to be applied to the test/train split. Default is None
inplace: if the passed Canonical should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy Canonical
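
An example call, with an illustrative threshold:

df = tr.intent_model.auto_drop_correlated(df, threshold=0.95)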

8.4.3 auto_remove_columns

Auto removes columns that are all np.NaN, contain only a single value or have a predominant value greater than the predominant_max threshold.

df: the pandas.DataFrame to auto remove columns from
null_min: the minimum proportion of null values. Default 0.998 (99.8% nulls)
predominant_max: the maximum proportion a single value can predominate. Default 0.998
nulls_list: can be boolean or a list: if boolean and True then nulls_list equals ['NaN', 'nan', 'null', '', 'None', ' ']; if a list then these are considered potential null values
auto_contract: if the auto_category or to_category should be returned
test_size: a test percentage split from the df to avoid over-fitting. Default is 0 for no split
random_state: a random state to be applied to the test/train split. Default is None
drop_empty_row: also drop any rows where all the values are empty
inplace: if the passed pandas.DataFrame should be changed or a copy returned (see return)
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
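
For example, to also treat common string representations as nulls (illustrative only):

df = tr.intent_model.auto_remove_columns(df, nulls_list=True)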

8.4.4 auto_to_category

Auto categorises columns that have a maximum number of unique values and a minimum proportion of nulls, and are of object dtype.

df: the pandas.DataFrame to auto categorise
unique_max: the maximum number of unique values in the column. Default 20
null_max: the maximum proportion of nulls in the column, between 0 and 1. Default 0.7 (70% nulls allowed)
fill_nulls: a value to fill nulls that can then be identified as a category type
nulls_list: potential null values to replace
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
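
An illustrative call, passing the default unique_max explicitly:

df = tr.intent_model.auto_to_category(df, unique_max=20)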

8.4.5 to_bool_type

Converts columns to bool based on the map.

df: the pandas.DataFrame to get the column headers from
bool_map: a mapping of what to make True and False
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
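
For example, assuming a hypothetical 'has_account' column holding 'Y'/'N' flags:

df = tr.intent_model.to_bool_type(df, bool_map={'Y': True, 'N': False}, headers=['has_account'])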

8.4.6 to_category_type

Converts columns to categories.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
as_num: if True returns the category as a category code
fill_nulls: a value to fill nulls that can then be identified as a category type
nulls_list: potential null values to replace
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
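
For example, assuming hypothetical 'gender' and 'status' columns:

df = tr.intent_model.to_category_type(df, headers=['gender', 'status'])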

8.4.7 to_date_type

Converts columns to date types.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
inplace: if the passed pandas.DataFrame should be used or a deep copy
as_num: if True returns the number of days since 0001-01-01 00:00:00, with the fraction being hours/mins/secs
year_first: specifies if to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil)
day_first: specifies if to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred preference, normally %m-%d-%Y (but not strict)
date_format: if the date can't be inferred, uses the date format, e.g. format='%Y%m%d'
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
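
For example, assuming a hypothetical 'start_date' column held in day-first format:

df = tr.intent_model.to_date_type(df, headers=['start_date'], day_first=True)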

8.4.8 to_float_type

Converts columns to float type.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
precision: how many decimal places to set the return values. If None the number is unchanged
fillna: {num_value, 'mean', 'mode', 'median'}. Default np.nan. If num_value, replaces NaN with this number value; if 'mean', replaces NaN with the mean of the column; if 'mode', replaces NaN with a mode of the column (random sample if more than one); if 'median', replaces NaN with the median of the column
errors: {'ignore', 'raise', 'coerce'}. Default 'coerce'. If 'raise', invalid parsing will raise an exception; if 'coerce', invalid parsing will be set as NaN; if 'ignore', invalid parsing will return the input
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
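
An illustrative call, selecting headers by a hypothetical 'amount' regex:

df = tr.intent_model.to_float_type(df, regex='amount', precision=2, fillna='mean')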

8.4.9 to_int_type

Converts columns to int type.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
fillna: {num_value, 'mean', 'mode', 'median'}. Default 0. If num_value, replaces NaN with this number value; if 'mean', replaces NaN with the mean of the column; if 'mode', replaces NaN with a mode of the column (random sample if more than one); if 'median', replaces NaN with the median of the column
errors: {'ignore', 'raise', 'coerce'}. Default 'coerce'. If 'raise', invalid parsing will raise an exception; if 'coerce', invalid parsing will be set as NaN; if 'ignore', invalid parsing will return the input
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
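
For example, assuming a hypothetical 'age' column:

df = tr.intent_model.to_int_type(df, headers=['age'], fillna=0)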

8.4.10 to_normalised

Normalises columns to a float type.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
precision: how many decimal places to set the return values. If None the number is unchanged
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
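
An illustrative call across the numeric columns (the dtype and precision values are only examples):

df = tr.intent_model.to_normalised(df, dtype='number', precision=3)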

8.4.11 to_numeric_type

Converts columns to a numeric type.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
precision: how many decimal places to set the return values. If None the number is unchanged
fillna: {num_value, 'mean', 'mode', 'median'}. Default np.nan. If num_value, replaces NaN with this number value (must be a value, not a string); if 'mean', replaces NaN with the mean of the column; if 'mode', replaces NaN with a mode of the column (random sample if more than one); if 'median', replaces NaN with the median of the column
errors: {'ignore', 'raise', 'coerce'}. Default 'coerce'. If 'raise', invalid parsing will raise an exception; if 'coerce', invalid parsing will be set as NaN; if 'ignore', invalid parsing will return the input
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
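
An illustrative call, selecting headers by a hypothetical 'rate' regex:

df = tr.intent_model.to_numeric_type(df, regex='rate', errors='coerce')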

8.4.12 to_remove

Removes columns from the pandas.DataFrame.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
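
For example, assuming a hypothetical 'free_text_notes' column is to be removed:

df = tr.intent_model.to_remove(df, headers=['free_text_notes'])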

8.4.13 to_select

Selects columns from the pandas.DataFrame.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
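
For example, assuming hypothetical 'age' and 'income' columns are to be kept:

df = tr.intent_model.to_select(df, headers=['age', 'income'])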

8.4.14 to_str_type

Converts columns to object type.

df: the pandas.DataFrame to get the column headers from
headers: a list of headers to drop or filter on type
drop: to drop or not drop the headers
dtype: the column types to include or exclude. Default None, else int, float, bool, object, 'number'
exclude: to exclude or include the dtypes
regex: a regular expression to search the headers
re_ignore_case: True if the regex should ignore case. Default is False
use_string_type: if the dtype 'string' should be used or kept as object type
fill_nulls: a value to fill nulls that can then be identified as a category type
nulls_list: can be boolean or a list: if boolean and True then nulls_list equals ['NaN', 'nan', 'null', '', 'None', np.nan, None]; if a list then these are considered potential null values to replace
inplace: if the passed pandas.DataFrame should be used or a deep copy
save_intent: (optional) if the intent contract should be saved to the property manager
intent_level: (optional) the level of the intent. If None: defaults to 0 unless the global intent_next_available is true, then -1; if -1: added to a level above any current instance of the intent section, level 0 if not found; if int: added to the level specified, overwriting any that already exists
return: if inplace, returns a formatted cleaner contract for this method, else a deep copy pandas.DataFrame
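
For example, assuming a hypothetical 'postcode' column:

df = tr.intent_model.to_str_type(df, headers=['postcode'], use_string_type=True)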

8.5 Persist the Transitioned Canonical

8.5.1 Save Clean Canonical

tr.save_clean_canonical(df_clean)

8.5.2 Save Data Dictionary

tr.save_dictionary(tr.canonical_report(df, stylise=False))

8.6 Run Pipeline

8.6.1 Locally

df_clean = tr.intent_model.run_intent_pipeline(df)

8.6.2 End-to-End

tr.run_transition_pipeline()

9 Transparency and Traceability

9.1 Environ Report

tr.report_environ()

9.2 Connectors Report

tr.report_connectors()

9.3 Intent Report

tr.report_intent()

9.4 Run Book Report

tr.report_run_book()

9.5 Notes Report

tr.report_notes()

9.6 Schema Report

10 Reference

10.1 Python version

Python 3.6 or less is not supported. Although Python 3.7 is supported, it is recommended to install discovery-transition-ds against the latest Python 3.8.x or greater whenever possible.

10.2 Pandas version

Pandas 0.25.x and above are supported, but it is highly recommended to use the latest 1.0.x release, the first major release of Pandas.

10.3 GitHub Project

discovery-transition-ds: https://github.com/Gigas64/discovery-transition-ds.

10.4 Change log

See CHANGELOG.

10.5 Licence

BSD-3-Clause: LICENSE.

10.6 Authors

Gigas64 (@gigas64) created discovery-transition-ds.
