Basic data exploration, fast and easy
Cmotions Dataexplore
cmo_dataexplore is a Python library created by The Analytics Lab, which is powered by Cmotions. It makes it easy to explore a new dataset and create insightful graphs along the way. Nothing fancy, but very useful for our consultants, mostly because the package is also used in our other packages, which makes our workflow simpler and more efficient.
We love to share what we do, so why not do the same with our packages? That is why we've decided to make (almost) all of our packages open source. This way we hope to give back to the community that brings us so much. Enjoy!
Installation
Install cmo_dataexplore using pip
pip install cmo-dataexplore
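To confirm that the install worked, a small check from Python can help. The sketch below uses only the standard library; the helper names `is_importable` and `installed_version` are made up for this example and are not part of cmo_dataexplore. It assumes the import name `cmo_dataexplore` and the distribution name `cmo-dataexplore` shown in this README.

```python
from importlib import metadata, util


def is_importable(module_name):
    """Return True if `module_name` can be found on the current Python path."""
    return util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Names taken from this README: import name vs. pip distribution name.
    print("importable:", is_importable("cmo_dataexplore"))
    print("version:", installed_version("cmo-dataexplore"))
```

If the first line prints `False`, the package is most likely installed in a different environment than the one running the script.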
Usage
import cmo_dataexplore as cdx
import cmo_dataviz as cdv
from cmo_dataexplore.resources import get_data_path
import pandas as pd
import matplotlib.pyplot as plt
# retrieve data
df = pd.read_csv(get_data_path("example.csv"))
# init the Explore object
ex = cdx.explore(data=df, dependent_var="Quit_yn", verbose=True)
# if there are records with missings in the dependent variable, these will be removed
# if there are columns that are completely empty, these will be removed
# also if the dependent variable is numeric, this will be changed to string
# list all columns with missing values
ex.retrieve_columns_with_missings()
# get all statistics of the entire dataframe
ex.data_stats()
# check out the number of unique values of each non-numeric variable
ex.categories_counter()
# identify all columns that have Near-Zero Variance (the dependent variable is automatically excluded)
nzv_df, nzv_cols = ex.calculate_NZV(frequency_ratio_thresh=95/5, percentage_unique_thresh=10, verbose=True)
nzv_cols
# mutual information score
ex.calculate_MI_scores(plot=False, ax=None)
# predictive power score for the target variable
ex.calculate_PPS_target(plot=False, ax=None)
# use the predictive power score to find relationships in the data
ex.calculate_PPS(plot=False, ax=None)
# retrieve all columns with a correlation higher than your preferred threshold
ex.retrieve_columns_highly_correlated(corr_thresh=.8)
# create all histograms at once
# this excludes columns with more than 'max_categories' categories
ex.show_hists(nr_plot_cols=4, color_by=ex.dependent_var, bins=10, max_categories=50, plotsize=(15,15))
# create all relation plots at once
# this excludes columns with more than 'max_categories' categories
ex.show_relations(bins=10, max_categories=50, nr_plot_cols=3)
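To give a feel for what the two thresholds passed to `calculate_NZV` above could mean, here is a minimal sketch of a near-zero variance check written with plain pandas. It is not the library's implementation. It assumes the common convention (known from R's caret package) in which a column is flagged when the ratio between its most frequent and second most frequent value exceeds `frequency_ratio_thresh` and its percentage of unique values is below `percentage_unique_thresh`. The function name `near_zero_variance` and the toy data are invented for this example.

```python
import pandas as pd


def near_zero_variance(df, frequency_ratio_thresh=95 / 5, percentage_unique_thresh=10):
    """Return one row per column with its frequency ratio, % unique values and an NZV flag."""
    rows = []
    for col in df.columns:
        # value_counts sorts from most to least frequent and skips missing values
        counts = df[col].value_counts(dropna=True)
        if counts.empty:
            # a completely empty column has no ratio to compute
            continue
        if len(counts) > 1:
            # most frequent value count divided by the second most frequent
            freq_ratio = counts.iloc[0] / counts.iloc[1]
        else:
            # a single distinct value: treat the ratio as infinite
            freq_ratio = float("inf")
        pct_unique = 100 * len(counts) / len(df)
        rows.append(
            {
                "column": col,
                "frequency_ratio": freq_ratio,
                "percentage_unique": pct_unique,
                "nzv": bool(
                    freq_ratio > frequency_ratio_thresh
                    and pct_unique < percentage_unique_thresh
                ),
            }
        )
    return pd.DataFrame(rows)


# Toy data: 'flag' is dominated by one value, the other columns are not.
toy = pd.DataFrame(
    {
        "flag": ["a"] * 99 + ["b"],
        "balanced": ["x", "y"] * 50,
        "id": range(100),
    }
)
report = near_zero_variance(toy)
print(list(report.loc[report["nzv"], "column"]))  # ['flag']
```

In the toy data, `flag` has a frequency ratio of 99 and only 2% unique values, so it is the only column flagged. Whether cmo_dataexplore applies exactly these rules is an assumption; check the package source if the details matter.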
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.
License
GNU General Public License v3.0
Contributors
Jeanine Schoonemann
Contact us
File details
Details for the file cmo_dataexplore-0.0.1.tar.gz.
File metadata
- Download URL: cmo_dataexplore-0.0.1.tar.gz
- Upload date:
- Size: 811.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | c76c495495c1527a7e6b88efe9bbba06e2e1284040c8ed3ad09d25f9e6dcd71b
MD5 | 27843182703571a3a9598d352b146f7e
BLAKE2b-256 | 625d7a8aa1cd4cda18cefe11f01709f3b9868669867ea52ddb5497d431fa533d
File details
Details for the file cmo_dataexplore-0.0.1-py3-none-any.whl.
File metadata
- Download URL: cmo_dataexplore-0.0.1-py3-none-any.whl
- Upload date:
- Size: 811.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | a7b2cb8b266b70fba6413866128f234d98eef475316c9f6d689da788e13b3156
MD5 | 6c3015110289db58f14eb4013f624ef2
BLAKE2b-256 | 7892bd6c3fb7e6ee93f1ef44cada400f42114aeb463503c58791d71828e26457