discovery-capability

Data Science to production accelerator

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Introduction

Unfortunately, 85% of data science projects fail because of a lack of understanding of the real business problem. This is usually caused by poor communication between data scientists and business teams, which leaves the two groups disconnected. Project Hadron has been built to bridge the gap between data scientists and data engineers, and more specifically between machine learning business outcomes, or use cases, and a product pipeline. It translates the work of data scientists into meaningful, production-ready solutions that can be easily integrated into a DevOps CI/CD pipeline.

Project Hadron addresses data selection, feature engineering and feature transformation as part of the critical preprocessing of a machine learning pipeline or system data pipeline. At its core, the code uses PyArrow as its canonical data format, combined with Pandas as a directed specialist toolset. PyArrow complements Pandas by providing a more memory-efficient in-memory representation, enabling efficient data interchange between different systems, supporting distributed computing, and enhancing compatibility with other programming languages. When used together, Pandas and PyArrow form a powerful combination for handling diverse data processing tasks efficiently.

Data Selection and Feature Engineering

Data selection and feature engineering are the art and science of converting raw data into a form that optimises the success of the next step in a pipeline. This involves a skilled blend of domain expertise, intuition and mathematics. They are the most essential part of building a usable machine learning or data project and take up an average of 80% of a project’s time, even with hundreds of cutting-edge machine learning algorithms appearing.

Prof Domingos, the author of ‘The Master Algorithm’, says:

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."

Preprocessing

The term “data preprocessing” is commonly used in the field of data science and machine learning to refer to data selection and feature engineering as the steps taken to clean, format, and organize raw data into a form suitable for model evaluation and tuning.

Figure: machine learning pipeline (docs/images/introduction/machine_learning_pipeline_v01.png)

Main features

Data Preparation
Feature Selection
Feature Engineering
Feature Cataloguing
Augmented Knowledge
Synthetic Feature Build

Feature transformers

Project Hadron is a Python library with multiple transformers to engineer and select features for use across synthetic builds, statistics and machine learning. The transformers cover:

Missing data imputation
Categorical encoding
Variable Discretisation
Outlier capping or removal
Numerical transformation
Redundant feature removal
Synthetic variable creation
Synthetic multivariate
Synthetic model distributions
Datetime features
Time series

Project Hadron lets you set the optimal parameters for each transformer, so that different engineering procedures can be applied to different variables and feature subsets.
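To make a few of the transformer categories above concrete, here is a minimal sketch written in plain pandas. It is not Project Hadron's API: the `params` dictionary, column names and data are hypothetical and only illustrate the idea of applying different parameters to different variables.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "income": [32_000.0, 41_000.0, np.nan, 38_000.0, 250_000.0],
        "segment": ["a", "b", "a", None, "b"],
    }
)

# Hypothetical per-column parameters, so each variable gets its own treatment.
params = {
    "income": {"cap_quantiles": (0.05, 0.95), "bins": 3},
    "segment": {"fill_value": "missing"},
}

out = df.copy()

# Missing data imputation: median for the numeric column, a label for the categorical one.
out["income"] = out["income"].fillna(out["income"].median())
out["segment"] = out["segment"].fillna(params["segment"]["fill_value"])

# Outlier capping: clip values to the configured quantile range.
low, high = out["income"].quantile(list(params["income"]["cap_quantiles"]))
out["income"] = out["income"].clip(lower=low, upper=high)

# Variable discretisation: equal-width bins returned as integer codes.
out["income_band"] = pd.cut(out["income"], bins=params["income"]["bins"], labels=False)

# Categorical encoding: one indicator column per category.
out = pd.get_dummies(out, columns=["segment"], prefix="segment")
print(out)
```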

Background

Born out of the frustration of time constraints and the inability to show business value within business expectations, this project aims to provide a set of tools to quickly build production-ready data science disciplines within a component-based solution. The components demonstrate coupling and cohesion between each discipline while providing a separation of concerns.

It also aims to improve the communication outputs that ML delivery needs to talk to pre-sales, stakeholders, business SMEs, data SMEs, product coders and tooling engineers, while still remaining within familiar code paradigms.

Getting Started

The discovery-transition-ds package is a set of Python components that are focussed on data science. They are a concrete implementation of the Project Hadron abstract core. The package is built to be very lightweight in terms of dependencies, requiring nothing beyond what would be found in a basic data science environment. It is designed to be used easily within multiple Python-based interfaces such as Jupyter, an IDE or a terminal Python session.

Package Installation

The best way to install the component packages is directly from the Python Package Index repository using pip.

The component package is discovery-transition-ds and is installed with pip:

python -m pip install discovery-transition-ds

If you want to upgrade your current version, use pip's upgrade option:

python -m pip install -U discovery-transition-ds

This will also install or update dependent third-party packages. The dependencies are limited to Python and related data science tooling such as pandas, numpy, scipy and scikit-learn, plus the visualisation packages matplotlib and seaborn. They therefore have a limited footprint and are non-disruptive in a machine learning environment.

Get the Source Code

discovery-transition-ds is actively developed on GitHub, where the code is always available.

You can clone the public repository with:

$ git clone git@github.com:project-hadron/discovery-transition-ds.git

Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages by running:

$ cd discovery-transition-ds
$ python -m pip install .

Release Process and Rules

For versions released after 3.5.27, the following rules govern and describe how discovery-transition-ds produces a new release.

To find the current version of discovery-transition-ds, from your terminal run:

$ python -c "import ds_discovery; print(ds_discovery.__version__)"
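If importing the package is not convenient, the installed version can also be read from the package metadata. The following is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is made up for this example, and the distribution name is the one used in the text above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version if the package is installed, otherwise None.
print(installed_version("discovery-transition-ds"))
```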

Major Releases

A major release will include breaking changes. When it is versioned, it will be versioned as vX.0.0. For example, if the previous release was v10.2.7 the next version will be v11.0.0.

Breaking changes are changes that break backwards compatibility with prior versions. If the project were to change an existing method's signature or alter a class or method name, that would only happen in a major release. The majority of changes to the dependent core abstraction will result in a major release. Major releases may also include miscellaneous bug fixes that have significant implications.

Project Hadron is committed to providing a good user experience and as such, committed to preserving backwards compatibility as much as possible. Major releases will be infrequent and will need strong justifications before they are considered.

Minor Releases

A minor release will include additional methods, or noticeable changes to code made in a backward-compatible manner, and miscellaneous bug fixes. If the previous version released was v10.2.7, a minor release would be versioned as v10.3.0.

Minor releases will be backwards compatible with releases that have the same major version number. In other words, all versions that would start with v10. should be compatible with each other.

Patch Releases

A patch release includes small and encapsulated code changes that do not directly affect a major or minor release, for example changing round(... to np.around(..., and bug fixes that were missed when the project released the previous version. If the previous version released was v10.2.7, the patch release would be versioned as v10.2.8.
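The numbering rules above can be summarised in a short sketch. The `Version` class below is a hypothetical helper written for this illustration only; it is not part of discovery-transition-ds. It bumps a version according to the release kind and treats two versions as compatible when they share a major number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Accepts "v10.2.7" or "10.2.7".
        major, minor, patch = text.lstrip("v").split(".")
        return cls(int(major), int(minor), int(patch))

    def bump(self, kind: str) -> "Version":
        if kind == "major":  # breaking changes
            return Version(self.major + 1, 0, 0)
        if kind == "minor":  # backward-compatible additions
            return Version(self.major, self.minor + 1, 0)
        if kind == "patch":  # small, encapsulated fixes
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown release kind: {kind!r}")

    def compatible_with(self, other: "Version") -> bool:
        # Releases sharing a major version should be compatible with each other.
        return self.major == other.major

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}"


previous = Version.parse("v10.2.7")
for kind in ("major", "minor", "patch"):
    print(kind, previous.bump(kind))
```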

Reference

Python version

Python 3.7 and earlier are not supported. It is recommended to install discovery-transition-ds against the latest Python version whenever possible.
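A script that depends on the package could guard against an unsupported interpreter with a check like the one below. This is only a sketch: the `MIN_PYTHON` constant and the function name are assumptions for the example, with the minimum of 3.8 inferred from the statement that 3.7 and earlier are not supported.

```python
import sys

# Assumption: 3.8 is the lowest supported version, inferred from the text above.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later is required")
```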

Pandas version

Pandas 1.0.x and above are supported, but it is highly recommended to use the latest 1.0.x release, as 1.0 is the first major release of Pandas.

GitHub Project

discovery-transition-ds: https://github.com/project-hadron/discovery-transition-ds.

Change log

See CHANGELOG.

License

This project uses the following license: MIT License: https://opensource.org/license/mit/.

Authors

Gigas64 (@gigas64) created discovery-transition-ds.

Project details


Release history

0.23.19

Apr 22, 2024

0.23.18

Apr 21, 2024

0.23.17

Apr 7, 2024

0.23.16

Mar 29, 2024

0.23.15

Mar 28, 2024

0.23.14

Mar 28, 2024

0.23.13

Mar 28, 2024

0.23.12

Mar 26, 2024

0.23.11

Mar 25, 2024

0.23.10

Mar 23, 2024

0.23.9

Mar 23, 2024

0.23.8

Mar 20, 2024

0.23.7

Mar 20, 2024

0.23.6

Mar 20, 2024

0.23.5

Mar 14, 2024

0.23.4

Mar 12, 2024

0.23.3

Mar 12, 2024

0.23.2

Mar 11, 2024

0.23.1

Mar 11, 2024

0.23.0

Mar 11, 2024

0.22.22

Mar 11, 2024

0.22.21

Mar 11, 2024

0.22.20

Mar 11, 2024

0.22.19

Mar 11, 2024

0.22.18

Mar 11, 2024

0.22.17

Mar 11, 2024

0.22.16

Mar 8, 2024

0.22.15

Mar 8, 2024

0.22.14

Mar 8, 2024

0.22.13

Mar 8, 2024

0.22.12

Mar 8, 2024

0.22.11

Mar 7, 2024

0.22.10

Mar 7, 2024

0.22.9

Mar 7, 2024

0.22.8

Mar 7, 2024

0.22.7

Mar 7, 2024

0.22.6

Mar 7, 2024

0.22.5

Mar 7, 2024

0.22.4

Mar 7, 2024

0.22.3

Mar 6, 2024

0.22.2

Mar 5, 2024

0.22.1

Mar 5, 2024

0.22.0

Mar 5, 2024

0.21.13

Mar 4, 2024

0.21.12

Mar 4, 2024

0.21.11

Mar 3, 2024

0.21.10

Mar 3, 2024

0.21.9

Mar 3, 2024

0.21.8

Feb 26, 2024

0.21.7

Feb 25, 2024

0.21.6

Feb 24, 2024

0.21.5

Feb 24, 2024

0.21.3

Feb 22, 2024

0.21.2

Feb 22, 2024

0.21.1

Feb 21, 2024

0.21.0

Feb 21, 2024

0.20.6

Feb 20, 2024

0.20.5

Feb 20, 2024

0.20.4

Feb 19, 2024

0.20.3

Feb 19, 2024

0.20.2

Feb 18, 2024

0.20.1

Feb 17, 2024

0.20.0

Feb 14, 2024

0.19.26

Feb 14, 2024

0.19.25

Feb 14, 2024

0.19.24

Feb 14, 2024

0.19.17

Feb 13, 2024

0.19.16

Feb 12, 2024

0.19.15

Feb 12, 2024

0.19.14

Feb 12, 2024

0.19.13

Feb 11, 2024

0.19.12

Feb 9, 2024

0.19.11

Feb 8, 2024

0.19.10

Feb 5, 2024

0.19.9

Feb 4, 2024

0.19.8

Feb 4, 2024

0.19.7

Feb 4, 2024

0.19.6

Feb 4, 2024

0.19.5

Feb 3, 2024

0.19.4

Feb 2, 2024

0.19.3

Feb 2, 2024

0.19.2

Feb 2, 2024

0.19.1

Feb 1, 2024

0.19.0

Feb 1, 2024

0.18.6

Feb 1, 2024

0.18.5

Feb 1, 2024

0.18.4

Jan 31, 2024

0.18.3

Jan 31, 2024

0.18.2

Jan 31, 2024

0.18.1

Jan 28, 2024

0.17.8

Jan 25, 2024

0.17.6

Jan 23, 2024

0.17.5

Jan 23, 2024

0.17.4

Jan 22, 2024

0.17.3

Jan 22, 2024

0.17.2

Jan 22, 2024

0.17.1

Jan 22, 2024

0.17.0

Jan 21, 2024

0.16.5

Jan 21, 2024

This version

0.16.4

Jan 21, 2024

0.16.3

Jan 20, 2024

0.16.2

Jan 19, 2024

0.16.0

Jan 19, 2024

0.15.29

Jan 19, 2024

0.15.28

Jan 19, 2024

0.15.27

Jan 19, 2024

0.15.26

Jan 19, 2024

0.15.25

Jan 19, 2024

0.15.24

Jan 19, 2024

0.15.23

Jan 18, 2024

0.15.22

Jan 18, 2024

0.15.21

Jan 17, 2024

0.15.20

Jan 17, 2024

0.15.19

Jan 17, 2024

0.15.18

Jan 17, 2024

0.15.17

Jan 16, 2024

0.15.16

Jan 16, 2024

0.15.13

Jan 15, 2024

0.15.12

Jan 13, 2024

0.15.11

Jan 13, 2024

0.15.9

Jan 13, 2024

0.15.7

Jan 13, 2024

0.15.6

Jan 13, 2024

0.15.0

Jan 12, 2024

0.14.1

Jan 10, 2024

0.14.0

Jan 10, 2024

0.13.13

Jan 10, 2024

0.13.12

Jan 10, 2024

0.13.11

Jan 10, 2024

0.13.10

Jan 9, 2024

0.13.9

Jan 9, 2024

0.13.8

Jan 3, 2024

0.13.7

Jan 2, 2024

0.13.6

Jan 2, 2024

0.13.5

Jan 1, 2024

0.13.4

Jan 1, 2024

0.13.1

Dec 31, 2023

0.13.0

Dec 29, 2023

0.12.24

Dec 29, 2023

0.12.23

Dec 29, 2023

0.12.22

Dec 28, 2023

0.12.21

Dec 26, 2023

0.12.20

Dec 26, 2023

0.12.19

Dec 26, 2023

0.12.18

Dec 21, 2023

0.12.17

Dec 21, 2023

0.12.14

Dec 21, 2023

0.12.13

Dec 21, 2023

0.12.12

Dec 21, 2023

0.12.11

Dec 21, 2023

0.12.10

Dec 20, 2023

0.12.9

Dec 20, 2023

0.12.8

Dec 20, 2023

0.12.7

Dec 20, 2023

0.12.6

Dec 20, 2023

0.12.5

Dec 20, 2023

0.12.4

Dec 19, 2023

0.12.3

Dec 19, 2023

0.12.2

Dec 17, 2023

0.12.1

Dec 16, 2023

0.12.0

Dec 4, 2023

0.11.8

Nov 21, 2023

0.11.7

Nov 18, 2023

0.11.6

Nov 17, 2023

0.11.5

Nov 13, 2023

0.11.4

Nov 13, 2023

0.11.3

Nov 12, 2023

0.11.1

Nov 12, 2023

0.11.0

Nov 12, 2023

0.10.25

Nov 12, 2023

0.10.24

Nov 9, 2023

0.10.23

Nov 2, 2023

0.10.22

Nov 1, 2023

0.10.21

Oct 31, 2023

0.10.20

Oct 16, 2023

0.10.19

Oct 16, 2023

0.10.18

Oct 15, 2023

0.10.17

Oct 15, 2023

0.10.16

Oct 14, 2023

0.10.15

Oct 14, 2023

0.10.14

Oct 14, 2023

0.10.13

Oct 11, 2023

0.10.12

Oct 9, 2023

0.10.11

Oct 9, 2023

0.10.10

Oct 8, 2023

0.10.9

Oct 6, 2023

0.10.8

Oct 5, 2023

0.10.7

Oct 4, 2023

0.10.5

Oct 3, 2023

0.10.4

Oct 1, 2023

0.10.3

Sep 28, 2023

0.10.2

Sep 27, 2023

0.10.1

Sep 25, 2023

0.10.0

Sep 25, 2023

0.9.6

Sep 25, 2023

0.9.5

Sep 24, 2023

0.9.4

Sep 23, 2023

0.9.2

Sep 23, 2023

0.9.1

Sep 20, 2023

0.9.0

Sep 19, 2023

0.8.15

Sep 16, 2023

0.8.14

Sep 15, 2023

0.8.13

Sep 14, 2023

0.8.12

Sep 14, 2023

0.8.11

Sep 14, 2023

0.8.10

Sep 14, 2023

0.8.9

Sep 14, 2023

0.8.8

Sep 14, 2023

0.8.7

Sep 14, 2023

0.8.6

Sep 13, 2023

0.8.5

Sep 13, 2023

0.8.4

Sep 12, 2023

0.8.3

Sep 12, 2023

0.8.2

Sep 12, 2023

0.8.1

Sep 11, 2023

0.8.0

Sep 10, 2023

0.7.6

Sep 10, 2023

0.7.5

Aug 29, 2023

0.7.4

Aug 29, 2023

0.7.3

Aug 28, 2023

0.7.2

Aug 28, 2023

0.7.1

Aug 25, 2023

0.7.0

Aug 25, 2023

0.6.7

Aug 25, 2023

0.6.6

Aug 16, 2023

0.6.5

Aug 16, 2023

0.6.3

Aug 13, 2023

0.6.2

Aug 12, 2023

0.6.1

Aug 11, 2023

0.6.0

Aug 10, 2023

0.5.13

Aug 9, 2023

0.5.12

Aug 9, 2023

0.5.11

Aug 8, 2023

0.5.10

Aug 7, 2023

0.5.9

Aug 7, 2023

0.5.8

Aug 7, 2023

0.5.7

Aug 7, 2023

0.5.6

Aug 2, 2023

0.5.5

Aug 1, 2023

0.5.4

Aug 1, 2023

0.5.3

Aug 1, 2023

0.5.2

Aug 1, 2023

0.5.1

Jul 27, 2023

0.5.0

Jul 27, 2023

0.4.11

Jul 27, 2023

0.4.10

Jul 27, 2023

0.4.9

Jul 27, 2023

0.4.8

Jul 25, 2023

0.4.7

Jul 24, 2023

0.4.6

Jul 24, 2023

0.4.5

Jul 24, 2023

0.4.4

Jul 24, 2023

0.4.3

Jul 24, 2023

0.4.2

Jul 24, 2023

0.4.1

Jul 24, 2023

0.4.0

Jul 24, 2023

0.3.15

Jul 21, 2023

0.3.14

Jul 19, 2023

0.3.13

Jul 19, 2023

0.3.12

Jul 19, 2023

0.3.11

Jul 17, 2023

0.3.10

Jul 10, 2023

0.3.9

Jul 10, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

discovery-capability-0.16.4.tar.gz (6.6 MB)

Uploaded Jan 21, 2024 Source

Built Distribution

discovery_capability-0.16.4-py3-none-any.whl (6.8 MB)

Uploaded Jan 21, 2024 Python 3

Hashes for discovery-capability-0.16.4.tar.gz

Algorithm	Hash digest
SHA256	`5a643096f60bd18a6216744f70ee5024d249f6c29785b91563d90a0d415aa1df`
MD5	`2d27a52e7dff5bfeb463e78f2a439ff0`
BLAKE2b-256	`4a4b8182689f6d7d83555682b167ec8a5a5b7f55acc264ed593223569ba03955`

Hashes for discovery_capability-0.16.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`7adab07cf9ca75ff3a98f4113c89a0add0ac5a450cab5ba18d0eb2765e9fa8d4`
MD5	`dbf79838dce04974c93c8c801de80911`
BLAKE2b-256	`5d677c11c0be6cf37de2be18f24d4f45ecb9c2672ffd93df36e4f5df19ee4ff9`