This project is a collection of methods that are frequently used in Python data science projects.

Data Science Utils: Frequently Used Methods for Data Science

Data Science Utils extends the Scikit-Learn and Matplotlib APIs with simple methods that simplify tasks and visualizations over data.

Code Examples and Documentation

Let's see some code examples and outputs.

You can read the full documentation, with all the code examples, at: https://datascienceutils.readthedocs.io/en/latest/

In the documentation you can find more methods and more examples.

Plot Confusion Matrix

The following example uses the iris dataset from scikit-learn, so let's import it first:

import numpy
from sklearn import datasets

IRIS = datasets.load_iris()
RANDOM_STATE = numpy.random.RandomState(0)

Let's train an SVM classifier on all the target labels and plot the confusion matrix:

from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm

from ds_utils.metrics import plot_confusion_matrix


x = IRIS.data
y = IRIS.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5, random_state=RANDOM_STATE)

# Create a simple classifier
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=RANDOM_STATE))
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

plot_confusion_matrix(y_test, y_pred, [0, 1, 2])
pyplot.show()

And the following image will be shown:

(Figure: multi label classification confusion matrix)
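The plot visualizes per-class counts of true versus predicted labels. As a minimal sketch, the same counts can be inspected numerically with scikit-learn's own `confusion_matrix`. This block does not use `ds_utils`, and whether `plot_confusion_matrix` computes its values this way internally is an assumption, not something stated above.

```python
import numpy
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Same data and split settings as the example above
iris = datasets.load_iris()
random_state = numpy.random.RandomState(0)
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.5, random_state=random_state)

# Same one-vs-rest linear SVM as the example above
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=random_state))
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(matrix)
```

Each row sums to the number of test samples of that class, so the diagonal holds the correctly classified counts.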

Generate Decision Paths

We'll create a simple decision tree classifier and print its decision paths as Python-like code:

from sklearn.tree import DecisionTreeClassifier

from ds_utils.visualization_aids import generate_decision_paths

x = IRIS.data
y = IRIS.target

# Create decision tree classifier object
clf = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=3)

# Train model
clf.fit(x, y)
print(generate_decision_paths(clf, IRIS.feature_names, IRIS.target_names.tolist(),
                              "iris_tree"))

The following text will be printed:

def iris_tree(petal width (cm), petal length (cm)):
    if petal width (cm) <= 0.8000:
        # return class setosa with probability 0.9804
        return ("setosa", 0.9804)
    else:  # if petal width (cm) > 0.8000
        if petal width (cm) <= 1.7500:
            if petal length (cm) <= 4.9500:
                # return class versicolor with probability 0.9792
                return ("versicolor", 0.9792)
            else:  # if petal length (cm) > 4.9500
                # return class virginica with probability 0.6667
                return ("virginica", 0.6667)
        else:  # if petal width (cm) > 1.7500
            if petal length (cm) <= 4.8500:
                # return class virginica with probability 0.6667
                return ("virginica", 0.6667)
            else:  # if petal length (cm) > 4.8500
                # return class virginica with probability 0.9773
                return ("virginica", 0.9773)
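For comparison, scikit-learn ships a plain-text rule dump, `sklearn.tree.export_text`. The sketch below is not part of Data Science Utils and produces an indented rule listing rather than the function-style output above. It uses an integer seed of `0`, which is an assumption made to keep the block self-contained.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# Same depth limit as the example above; the integer seed is an assumption
clf = DecisionTreeClassifier(random_state=0, max_depth=3)
clf.fit(iris.data, iris.target)

# One line per split or leaf, indented by tree depth
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

`export_text` reports class indices at the leaves rather than class names and probabilities, which is one reason the function-style output above can be easier to read.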

Extract Significant Terms from Subset

This method extracts the terms that most distinguish a subset of documents from the full corpus. It is based on the Elasticsearch significant_text aggregation.

import pandas

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pandas.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame, 
                                               "content")

And the following table will be the output for terms:

third	one	and	this	the	is	first	document	second
1.0	1.0	1.0	0.67	0.67	0.67	0.5	0.25	0.0
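To illustrate the general idea of comparing a subset (foreground) against the full corpus (background), here is a minimal standard-library sketch. The scoring used, absolute change times relative change in document frequency, is an assumption chosen for illustration. It is not the formula used by `extract_significant_terms_from_subset`, so it will not reproduce the table above.

```python
import re
from collections import Counter

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
subset = corpus[2:]  # the last two documents


def document_frequencies(documents):
    """Return the fraction of documents that contain each lowercase word."""
    counts = Counter()
    for document in documents:
        counts.update(set(re.findall(r"\w+", document.lower())))
    return {term: count / len(documents) for term, count in counts.items()}


background = document_frequencies(corpus)
foreground = document_frequencies(subset)

# Illustrative score (an assumption, not the library's formula):
# (foreground - background) * (foreground / background), floored at zero
scores = {}
for term, background_share in background.items():
    foreground_share = foreground.get(term, 0.0)
    if foreground_share > background_share:
        scores[term] = ((foreground_share - background_share)
                        * (foreground_share / background_share))
    else:
        scores[term] = 0.0

for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.2f}")
```

Under this illustrative score, terms that appear only in the subset rank highest, while terms that are equally or less common in the subset score zero.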

Excited?

Read about all the modules in the documentation to see more methods and abilities (such as drawing a decision tree):

Metrics - Methods that help calculate and/or visualize the evaluation performance of an algorithm.
Preprocess - Methods for processing data before training.
Strings - Methods that help manipulate and process strings in a dataframe.
Visualization Aids - Methods that visualize ML output by drawing or printing it.

Contributing

Interested in contributing to Data Science Utils? Great! You're welcome, and we would love to have you. We follow the Python Software Foundation Code of Conduct and the Matplotlib Usage Guide.

No matter your level of technical skill, you can be helpful. We appreciate bug reports, user testing, feature requests, bug fixes, product enhancements, and documentation improvements.

Thank you for your contributions!

Find a Bug?

Check if there's already an open issue on the topic. If needed, file an issue.

Open Source

Data Science Utils is released under the MIT License.

Installing Data Science Utils

Data Science Utils is compatible with Python 3.6 or later. The simplest way to install Data Science Utils and its dependencies is from PyPI with pip, Python's preferred package installer:

pip install data-science-utils

Note that this package is an active project and routinely publishes new releases with more methods. In order to upgrade Data Science Utils to the latest version, use pip as follows:

pip install -U data-science-utils

Alternatively you can install from source by cloning the repo and running:

git clone https://github.com/idanmoradarthas/DataScienceUtils.git
cd DataScienceUtils
python setup.py install

Or install using pip from source:

pip install git+https://github.com/idanmoradarthas/DataScienceUtils.git

If you're using Anaconda, you can install using conda:

conda install -c idanmorad data-science-utils
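After installing by any of the routes above, one way to confirm which version is present is to query the installed distribution metadata. This is a minimal sketch using the standard-library `importlib.metadata`, which requires Python 3.8 or later (an assumption beyond the Python 3.6 minimum stated above). The helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution="data-science-utils"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version is not None else "data-science-utils is not installed")
```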
