
Basic data exploration, fast and easy

Project description

Cmotions Dataexplore

cmo_dataexplore is a Python library created by The Analytics Lab, which is powered by Cmotions. It makes it easy to explore a new dataset and create insightful graphs along the way. Nothing fancy, but very useful for our consultants, mostly because this package is used in our other packages, which makes our workflow simpler and more efficient.

Since we love to share what we do, why not do that with our packages as well? That is why we've decided to make (almost) all of our packages open source. This way we hope to give back to the community that brings us so much. Enjoy!

Installation

Install cmo_dataexplore using pip:

pip install cmo-dataexplore

Usage

import cmo_dataexplore as cdx
import cmo_dataviz as cdv
from cmo_dataexplore.resources import get_data_path
import pandas as pd
import matplotlib.pyplot as plt

# retrieve data
df = pd.read_csv(get_data_path("example.csv"))

# init the Explore object
ex = cdx.explore(data=df, dependent_var="Quit_yn", verbose=True)
# records with a missing value in the dependent variable are removed
# columns that are completely empty are removed
# a numeric dependent variable is converted to string

# list all columns with missing values
ex.retrieve_columns_with_missings()

# get all statistics of the entire dataframe
ex.data_stats()

# check out the number of unique values of each non-numeric variable
ex.categories_counter()

# identify all columns that have Near-Zero Variance (the dependent variable is automatically excluded)
nzv_df, nzv_cols = ex.calculate_NZV(frequency_ratio_thresh=95/5, percentage_unique_thresh=10, verbose=True)
nzv_cols

# mutual information score
ex.calculate_MI_scores(plot=False, ax=None)

# predictive power score for the target variable
ex.calculate_PPS_target(plot=False, ax=None)

# use the predictive power score to find relationships in the data
ex.calculate_PPS(plot=False, ax=None)

# retrieve all columns with a correlation higher than your preferred threshold
ex.retrieve_columns_highly_correlated(corr_thresh=.8)

# create all histograms at once
# this excludes columns with more than 'max_categories' categories
ex.show_hists(nr_plot_cols=4, color_by=ex.dependent_var, bins=10, max_categories=50, plotsize=(15,15))

# create all relation plots at once
# this excludes columns with more than 'max_categories' categories
ex.show_relations(bins=10, max_categories=50, nr_plot_cols=3)
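
The Near-Zero Variance step in the usage example takes a frequency ratio threshold and a percentage-unique threshold. The sketch below shows one common way such a check is computed with plain pandas. It is not the implementation used by cmo_dataexplore: the function name `near_zero_variance`, the returned column names, and the rule that a column is flagged only when both thresholds are crossed are assumptions made for illustration.

```python
import pandas as pd


def near_zero_variance(df, frequency_ratio_thresh=95 / 5, percentage_unique_thresh=10):
    """Hypothetical helper: flag columns dominated by a single value."""
    rows = []
    for col in df.columns:
        # value_counts sorts from most to least frequent by default
        counts = df[col].value_counts(dropna=True)
        if counts.empty:
            continue  # nothing to measure in a completely empty column
        # frequency ratio: most common value count / second most common value count
        if len(counts) > 1:
            freq_ratio = float(counts.iloc[0] / counts.iloc[1])
        else:
            freq_ratio = float("inf")  # only one distinct value
        # percentage of distinct values relative to the number of rows
        pct_unique = 100.0 * len(counts) / len(df)
        # assumption: a column is flagged only when both thresholds are crossed
        is_nzv = freq_ratio > frequency_ratio_thresh and pct_unique < percentage_unique_thresh
        rows.append(
            {
                "column": col,
                "frequency_ratio": freq_ratio,
                "percentage_unique": pct_unique,
                "nzv": bool(is_nzv),
            }
        )
    return pd.DataFrame(rows)


# made-up demo data, not the example.csv shipped with the package
demo = pd.DataFrame(
    {
        "mostly_a": ["a"] * 98 + ["b"] * 2,  # one value dominates
        "balanced": ["x", "y"] * 50,         # two equally common values
        "row_id": list(range(100)),          # every value is unique
    }
)
result = near_zero_variance(demo)
flagged = result.loc[result["nzv"], "column"].tolist()
print(flagged)
```

With the default thresholds, only the column where one value accounts for 98 of 100 rows is flagged in this demo.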

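The correlation threshold in the usage example can be pictured as a filter on a pairwise correlation matrix. The following is a minimal sketch with plain pandas, not the code behind `retrieve_columns_highly_correlated`: the helper name `highly_correlated_pairs`, the use of Pearson correlation, and the choice to compare absolute values against the threshold are all assumptions.

```python
import pandas as pd


def highly_correlated_pairs(df, corr_thresh=0.8):
    """Hypothetical helper: list numeric column pairs with |correlation| above a threshold."""
    # Pearson correlation of the numeric columns, taken in absolute value (assumption)
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    # walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > corr_thresh:
                pairs.append((cols[i], cols[j], value))
    return pairs


# made-up demo data
demo_numeric = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 11],  # moves almost in lockstep with "a"
        "c": [5, 1, 4, 2, 3],   # only weakly related to "a" and "b"
    }
)
pairs = highly_correlated_pairs(demo_numeric, corr_thresh=0.8)
print(pairs)
```

In this demo only the pair ("a", "b") exceeds the threshold, so one of the two columns would be a candidate for removal before modelling.
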
Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

License

GNU General Public License v3.0

Contributors

Jeanine Schoonemann
Contact us
