Basic data exploration, fast and easy

Cmotions Dataexplore

cmo_dataexplore is a Python library created by The Analytics Lab, which is powered by Cmotions. It makes it easy to explore a new dataset and create insightful graphs along the way. Nothing fancy, but very useful for our consultants, mostly because our other packages build on it, which makes our workflow simpler and more efficient.

We love to share what we do, so we have decided to make (almost) all of our packages open source. This way we hope to give back to the community that brings us so much. Enjoy!

Installation

Install cmo_dataexplore using pip:

pip install cmo-dataexplore

Usage

import cmo_dataexplore as cdx
import cmo_dataviz as cdv
from cmo_dataexplore.resources import get_data_path
import pandas as pd
import matplotlib.pyplot as plt

# retrieve data
df = pd.read_csv(get_data_path("example.csv"))

# init the Explore object
ex = cdx.explore(data=df, dependent_var="Quit_yn", verbose=True)
# on init, the Explore object cleans the data:
# - records with a missing value in the dependent variable are removed
# - columns that are completely empty are removed
# - a numeric dependent variable is converted to string

# list all columns with missing values
ex.retrieve_columns_with_missings()

# get all statistics of the entire dataframe
ex.data_stats()

# check out the number of unique values of each non-numeric variable
ex.categories_counter()

# identify all columns that have Near-Zero Variance (the dependent variable is automatically excluded)
nzv_df, nzv_cols = ex.calculate_NZV(frequency_ratio_thresh=95/5, percentage_unique_thresh=10, verbose=True)
nzv_cols

# mutual information score
ex.calculate_MI_scores(plot=False, ax=None)

# predictive power score for the target variable
ex.calculate_PPS_target(plot=False, ax=None)

# use the predictive power score to find relationships in the data
ex.calculate_PPS(plot=False, ax=None)

# retrieve all columns with a correlation higher than your preferred threshold
ex.retrieve_columns_highly_correlated(corr_thresh=.8)

# create all histograms at once
# this excludes columns with more than 'max_categories' categories
ex.show_hists(nr_plot_cols=4, color_by=ex.dependent_var, bins=10, max_categories=50, plotsize=(15,15))

# create all relation plots at once
# this excludes columns with more than 'max_categories' categories
ex.show_relations(bins=10, max_categories=50, nr_plot_cols=3)
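
The two thresholds passed to calculate_NZV above can be easier to read with a small worked example. The sketch below is not the library's implementation. It assumes the common Near-Zero Variance convention (as used by, for example, the nearZeroVar function in R's caret package), where a column is flagged when its frequency ratio (count of the most common value divided by the count of the second most common value) is above the frequency threshold and its percentage of unique values is below the uniqueness threshold. The function name near_zero_variance and the toy columns are hypothetical.

```python
from collections import Counter


def near_zero_variance(values, frequency_ratio_thresh=95 / 5, percentage_unique_thresh=10):
    """Return (frequency_ratio, percentage_unique, is_nzv) for one column.

    Hypothetical helper: assumes the caret-style definition of
    Near-Zero Variance, which may differ from what cmo_dataexplore does.
    """
    counts = Counter(values).most_common()
    if not counts:
        raise ValueError("column is empty")

    top = counts[0][1]
    # A constant column has no second value; treat its ratio as infinite.
    second = counts[1][1] if len(counts) > 1 else 0
    frequency_ratio = top / second if second else float("inf")

    # Share of distinct values relative to the number of records, in percent.
    percentage_unique = 100 * len(counts) / len(values)

    is_nzv = (
        frequency_ratio > frequency_ratio_thresh
        and percentage_unique < percentage_unique_thresh
    )
    return frequency_ratio, percentage_unique, is_nzv


# Toy columns (made up for illustration)
mostly_constant = ["no"] * 98 + ["yes"] * 2
balanced = ["no"] * 60 + ["yes"] * 40

result_constant = near_zero_variance(mostly_constant)
result_balanced = near_zero_variance(balanced)

print(result_constant)  # (49.0, 2.0, True)
print(result_balanced)  # (1.5, 2.0, False)
```

Under this convention, the first toy column is flagged because one value dominates 49 to 1, while the second is not, even though both have only two distinct values.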

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

License

GNU General Public License v3.0

Contributors

Jeanine Schoonemann
Contact us
