Ethan (tll549)'s Exploratory Functions


Email: el@tll.tl

Installation

Use the first line for first-time installation, and the second to upgrade anytime.

pip install TEF
pip install TEF -U
Collecting TEF
  Downloading https://files.pythonhosted.org/packages/6b/fd/ced559ecf12b31a9ba3e83b493e2c18fef275635c98f9cc20bf30bad6aac/TEF-0.2.3-py3-none-any.whl
Installing collected packages: TEF
  Found existing installation: TEF 0.2.2
    Uninstalling TEF-0.2.2:
      Successfully uninstalled TEF-0.2.2
Successfully installed TEF-0.2.3
import TEF

Quick start

Skip to the docs below if you want the details of each function.

Imagine you've got a dirty dataset:

import pandas as pd
import numpy as np

np.random.seed(555)
raw = pd.DataFrame({'datetime_col': ['1990-01-01', float('nan'), float('nan'), float('nan'), 
                                     '2017-01-01', float('nan')],
                    'int_col': [34, 645, 23, 4, 0, 6],
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['zzzzz', 'zzzz', ' ', 'nan', 'x', '   ']})
raw.bool_col = raw.bool_col.astype(object) # set it to object for demo
raw
  datetime_col  int_col   float_col bool_col category_col object_col
0   1990-01-01       34      132.54     True            a      zzzzz
1          NaN      645         NaN    False            a       zzzz
2          NaN       23    21399.23     True            b
3          NaN        4        0.00    False            b        nan
4   2017-01-01        0      434.74    False            b          x
5          NaN        6  4592309.23    False            c

auto_set_dtypes

So the first thing you can do is pass it to auto_set_dtypes(), which tries to detect the dtype of every column and returns the dataframe with those dtypes set for you.

Long story short, it uses a regular expression to detect the datetime pattern, and sets a column to category if its number of levels is lower than 10 (adjust with max_num_lev). Here, because the dataset is tiny, I manually set the column at index 5 to object for the demo.

d = TEF.auto_set_dtypes(raw, set_object=[5])
before dtypes: float64(1), int64(1), object(4)
after  dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(1)

dfmeta

Now comes the main function of this package, dfmeta(). It tries to give you a detailed overview so that you never need to type .head(), .describe(), .info(), .dtypes, etc.

desc = {
    "datetime_col"            : "some explanation about this columns",
    "int_col"                 : "you <br>can use line break here",
    "float_col"               : "<li>use</li> <li>bullet</li> <li>points</li>",
    "bool_col"                : "in case anything is <mark><b>IMPORTANT</b></mark>",
    "category_col"            : "<a target='_blank' rel='noopener noreferrer' href='https://github.com/tll549/TEF'>add a link to TEF</a>",
    "object_col"              : "<b>bold</b> and <i>italic</i> ofc"
}
TEF.dfmeta(d, description=desc)
shape: (6, 6)
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 388.0+ bytes

[image: the styled meta dataframe]

The background is colored by dtype, and each row is actually a column in the original dataset: every row is a descriptive summary of that column (set transpose=False if you want it the other way).

  • idx shows the index of that column
  • dtype
  • description is a place where you can put your own explanation, explained in more detail below
  • NaNs shows the number of nulls and the percentage
  • unique counts shows the number of unique values in that column, and the percentage
  • unique levs tries to show all unique values, if there aren't too many (adjust with max_lev); for this tiny dataset it shows all of them because the default is 10
  • summary shows
    • for datetime, quantiles [0% (min), 25%, 50% (median), 75%, 100% (max)]
    • for int and float, quantiles, mean, standard error, CV (coefficient of variation, std/mean), and skewness, followed by a star (*) if it fails the normality test (skewtest), plus another skewtest after taking the log
    • for bool, category, and object, the percentage of every level, if there aren't too many
  • possible NaNs tries to detect potential nulls caused by hand-coded values; for instance, sometimes a space ' ' or the string 'nan' actually means a NaN (see the sketch after this list)
  • possible dup lev tries to detect potentially duplicated levels, such as 'object1111' that should actually be the same value as 'object111', just with a typo
  • the remaining 3 columns are random samples from the dataset, because we humans always like an example
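
Once possible NaNs are flagged, replacing them is plain pandas rather than TEF; a minimal sketch, reusing the flagged values from object_col above:

d.object_col = d.object_col.replace({'nan': np.nan, ' ': np.nan, '   ': np.nan})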

You may wonder: the original 6*6 dataset has become a larger 6*12 meta dataframe, so is that useful? Indeed, it isn't for a small dataset, but imagine an insanely large one of 10 million rows * 500 columns: the meta dataframe will still only be 500*12! It is always a good idea to start from summary statistics and examples.

descriptions

Column names are not always clear and straightforward, so we usually want to save a dictionary/explanation somewhere.

This function takes a dictionary, description, in which you can put brief explanations and come back to them anytime during your analysis. It can also be exported to an HTML file, exactly as you see it, so you can keep it somewhere in your working directory!

First, use get_desc_template() to get a template, copy and paste it into a cell, and start filling in your explanations; HTML code is okay too! See the example above for what you can do. If you have a lot of columns, consider using get_desc_template_file(), adding the explanations in a separate .py file, and importing it back with %run desc.py.

TEF.get_desc_template(d)
desc = {
    "datetime_col"            : "",
    "int_col"                 : "",
    "float_col"               : "",
    "bool_col"                : "",
    "category_col"            : "",
    "object_col"              : ""
}

saving as HTML

In the end, you may want to save this (hopefully) beautiful meta dataframe somewhere. Use save_html_standard() to automatically generate an HTML file; it removes unique levs, possible errors, and samples for a concise output. Check the doc for dfmeta_to_htmlfile() if you want to configure it. Or use print_html_standard() to get HTML code you can copy and paste anywhere you want.

TEF.save_html_standard(d, desc, 'meta_d.html', 'This is an header in the html file')
'meta_d.html saved'

plot_1var

Now, the next step is usually to plot histograms and get a sense of the distributions. This is exactly what plot_1var() does: it handles each dtype and chooses an appropriate plot to summarize it.

TEF.plot_1var(d)

[plot]

[plot]

[plot]

quantiles: [0.0, 4.5, 14.5, 31.25, 645.0], mean: 118.67, NaNs: 0
ignored 16.67% 0s and 0 NaNs

[plot]

[plot]

quantiles: [0.0, 132.54, 434.74, 21399.23, 4592309.23], mean: 922855.15, NaNs: 16.67%
ignored 16.67% 0s and 16.67% NaNs

[plot]

[plot]

100.00% displayed

[plot]

plot_1var_by_cat_y

Now suppose you have a dependent variable of main interest, and you are wondering how the independent variables relate to it. Pass the categorical dependent variable as the second argument and it will do the rest!

TEF.plot_1var_by_cat_y(d, 'category_col')
0 not yet for datetime

[plot]

NaNs: 0

[plot]

NaNs: 16.67%

[plot]

[plot]

Documentation

auto_set_dtypes

auto_set_dtypes(df, max_num_lev=10, 
                set_datetime=[], set_category=[], set_int=[], set_object=[], set_bool=[],
                set_datetime_by_pattern=r'\d{4}-\d{2}-\d{2}',
                verbose=1)

required packages: numpy, pandas, re, io (imported automatically when the function is called, just a heads up)

description

  • sets a column to datetime if the pattern looks like '2018-08-08'
    • it's designed for datasets where all datetime columns share the same format, like 2019-06-06 06:06:06 (such as files downloaded from DOMO)
  • sets a column to category if its number of unique levels is less than max_num_lev
  • the set_* args can be used for manual configuration; set_object is handy for ID columns
  • it will also try to detect possible ID columns by searching for the string 'id' in column names, if verbose >= 1
  • notice that a True/False column containing nulls stays object, so it should be set to bool manually (see the sketch after this list)
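
For such a column, a minimal sketch of forcing it with the documented set_bool arg (the column index here is hypothetical):

d = TEF.auto_set_dtypes(raw, set_bool=[3])  # hypothetical: column 3 holds True/False plus NaNs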

args

  • df: pandas dataframe
  • max_num_lev: the maximum number of unique levels for a column to be converted to category
  • set_{datetime/category/int/object/bool}: a list of indices, e.g. [0, 3, 5], force these columns to be set to those dtypes
  • set_datetime_by_pattern: a regular expression string, recommend using the default
  • verbose: int/string, 0/False, 1/'summary', or 2/'detailed'; different types of printouts showing the transformations

example

import pandas as pd
import numpy as np
np.random.seed(555)
raw = pd.DataFrame({'datetime_col': ['1990-01-01', float('nan'), float('nan'), float('nan'),
                                     '2017-01-01', float('nan')],
                    'int_col': [34, 645, 23, 4, 0, 6],
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['z', 'y', ' ', 'nan', 'x', '   ']})
raw.bool_col = raw.bool_col.astype(object) # set to object so the printout below matches
d = TEF.auto_set_dtypes(raw, set_object=[5])
before dtypes: float64(1), int64(1), object(4)
after  dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(1)

dfmeta

dfmeta(df, max_lev=10, transpose=True, sample=True, description=None,
           style=True, color_bg_by_type=True, highlight_nan=0.5, in_cell_next_line=True,
           verbose=True, drop=None,
           check_possible_error=True, dup_lev_prop=0.7,
           save_html=None)

required packages: numpy, pandas, io, scipy.stats.skew, scipy.stats.skewtest

description

  • return meta data for the given dataset, see above quick start for example
  • use dfmeta_to_htmlfile to save the returned object to html, save_html_standard to set with default configurations

args

  • df: pandas dataframe
  • max_lev: int, the maximum acceptable number of unique levels
  • transpose: bool, if True (default), each column of the original dataframe becomes a row of the meta dataframe; set it to False to keep columns as columns
  • sample:
    • True: sample 3 rows
    • False: don't sample
    • 'head': sample the first 3 rows
    • int: sample the first int rows
  • description: dict, where keys are column names and values are descriptions for those columns; can contain HTML code
  • style: bool, if True, return a styled object (add .render() to get the raw HTML code); if False, return a plain pandas dataframe instead, which overrides color_bg_by_type, highlight_nan, and in_cell_next_line
  • color_bg_by_type: bool, color the cell background by dtype, column by column; forced to False if style=False
  • highlight_nan: float [0, 1] or False, the proportion of NaNs above which they are highlighted; forced to False if style=False
  • in_cell_next_line: bool, if True, use 'br' to separate elements in a list; if False, use ', '
  • verbose: bool, whether to print the shape, memory usage, etc. at the beginning
  • drop: columns (or rows if transpose=True) to drop; doesn't support NaNs and dtypes for now
  • check_possible_error: bool, whether to check for possible NaNs and duplicate levels
  • dup_lev_prop: float [0, 1], the similarity threshold above which two levels are flagged as possible duplicates
  • save_html: a list of two strings [filename, head], e.g. ['cancelCasesDict.html', 'Cancel Cases Dictionary'] (see the sketch after this list)
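
For instance, to write the HTML file in the same call, a sketch using the documented save_html arg (the filename and header are arbitrary):

TEF.dfmeta(d, description=desc, save_html=['meta_d.html', 'Data Dictionary'])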

example

import pandas as pd
import numpy as np
np.random.seed(555)
raw = pd.DataFrame({'datetime_col': ['1990-01-01', float('nan'), float('nan'), float('nan'),
                                     '2017-01-01', float('nan')],
                    'int_col': [34, 645, 23, 4, 0, 6],
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['z', 'y', ' ', 'nan', 'x', '   ']})
raw.bool_col = raw.bool_col.astype(object)
d = TEF.auto_set_dtypes(raw, verbose=0, set_object=[5])

desc = {
    "datetime_col"            : "some explanation about this columns",
    "int_col"                 : "you <br>can use line break here",
    "float_col"               : "<li>use</li> <li>bullet</li> <li>points</li>",
    "bool_col"                : "in case anything is <mark><b>IMPORTANT</b></mark>",
    "category_col"            : "<a target='_blank' rel='noopener noreferrer' href='https://github.com/tll549/TEF'>add a link to TEF</a>",
    "object_col"              : "<b>bold</b> and <i>italic</i> ofc"
}
TEF.dfmeta(d, description=desc)
shape: (6, 6)
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 388.0+ bytes

get_desc_template

get_desc_template(df)

A function that takes the original dataframe and prints a description template for you to fill in. See above for an example.

Nothing to configure.

example

TEF.get_desc_template(d)
desc = {
    "datetime_col"            : "",
    "int_col"                 : "",
    "float_col"               : "",
    "bool_col"                : "",
    "category_col"            : "",
    "object_col"              : ""
}

get_desc_template_file

get_desc_template_file(df, filename='desc.py')

Similar to the above, but saves the template as a .py file in the working directory.
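
example

A minimal sketch (the %run line assumes a Jupyter/IPython session, and that the filled-in template in desc.py assigns the dict to desc):

TEF.get_desc_template_file(d, filename='desc.py')  # writes the template to desc.py
# fill in your explanations in desc.py, then in a notebook cell:
# %run desc.py
# TEF.dfmeta(d, description=desc)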

dfmeta_to_htmlfile

dfmeta_to_htmlfile(styled_df, filename, head, original_df=None)

description

  • saves the styled meta dataframe to an HTML file

args

  • styled_df: the object returned by dfmeta
  • filename: string, can include a file path
  • head: the header in that html file (in h1 tag)
  • original_df: the original dataframe that was passed to dfmeta; used to generate the verbose printout at the beginning of the file, can be omitted

example

meta = TEF.dfmeta(d, verbose=0)
TEF.dfmeta_to_htmlfile(meta, 'test.html', 'header', d)
'test.html saved'

save_html_standard

save_html_standard(df, description, filename, head)

A function that calls dfmeta first with description=description, check_possible_error=False, sample=False, verbose=False, drop=['unique levs'], then passes the result to dfmeta_to_htmlfile to generate the HTML file.

It's the recommended standard way of generating a data dictionary.

example

TEF.save_html_standard(d, desc, 'meta_d.html', 'This is an header in the html file')
'meta_d.html saved'

print_html_standard

print_html_standard(df, description)

A function that prints out all the HTML code, using the standard configuration. If you want to configure it yourself, note that the styled dataframe returned by dfmeta is just a styled pandas object; add .render() to get access to the HTML.
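
example

A minimal sketch, reusing d and desc from above; the printed HTML can be pasted anywhere:

TEF.print_html_standard(d, desc)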

plot_1var

plot_1var(df, max_num_lev=20, log_numeric=True, cols=None, save_plt=None)

required packages: numpy, pandas, seaborn, matplotlib.pyplot

description

Plots a chart for every column, according to its dtype.

args

  • df: pandas dataframe
  • max_num_lev: skip a column if it has too many levels; not needed if you used the auto_set_dtypes function
  • log_numeric: bool, for numerical columns, plot two more plots with the log taken
  • cols: a list of ints, the columns to plot; specify it if you don't want to plot all columns, can be used together with the save_plt arg
  • save_plt: string, if not None, every plot is saved to the working directory with the string as filename prefix; a folder path is okay, but you need to create the folder yourself first (see the sketch after the example below)

example

import pandas as pd
import numpy as np
np.random.seed(555)
raw = pd.DataFrame({'int_col': [34, 645, 23, 4, 0, 6], # all positive
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['z', 'y', ' ', 'nan', 'x', '   ']})
raw.bool_col = raw.bool_col.astype(object)
d = TEF.auto_set_dtypes(raw, verbose=0, set_object=[4])

TEF.plot_1var(d, cols=[0], save_plt='prefix')
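
If you want the plots saved into a subfolder, create the folder first; a sketch (the folder name here is arbitrary):

import os
os.makedirs('plots', exist_ok=True)  # save_plt won't create the folder for you
TEF.plot_1var(d, save_plt='plots/eda_')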

plot_1var_by_cat_y

plot_1var_by_cat_y(df, y, max_num_lev=20, log_numeric=True,
    kind_for_num='boxen')

required packages: pandas, seaborn, matplotlib.pyplot

description

Plots a chart for every column against the given dependent variable y.

Note that saving is not implemented yet, nor are datetime columns; and cat_y means it can only handle a categorical y.

args

  • df: pandas dataframe
  • y: string, col name of the dependent var
  • max_num_lev: skip a column if it has too many levels; not needed if you used the auto_set_dtypes function
  • log_numeric: bool, take the log of numerical variables; note that 0s and negatives will be removed automatically
  • kind_for_num: string, the type of plot for numerical vars: 'boxen', 'box', 'violin', 'strip' (not recommended for big datasets), or 'swarm' (not recommended for big datasets)

example

import pandas as pd
import numpy as np
np.random.seed(555)
raw = pd.DataFrame({'int_col': [34, 645, 23, 4, 0, 6], # all positive
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['z', 'y', ' ', 'nan', 'x', '   ']})
raw.bool_col = raw.bool_col.astype(object)
d = TEF.auto_set_dtypes(raw, verbose=0, set_object=[4])

TEF.plot_1var_by_cat_y(d, 'category_col')

Utility Functions

reorder_col

reorder_col(df, to_move, after=None, before=None)

description

Reorders the columns by specifying the one to move and where to put it. At least one of the arguments after or before should be given. Returns the modified dataframe.

args

  • df: pandas dataframe
  • to_move: string, the name of the column to be moved
  • after/before: string, the name of the column that to_move should be placed after (or before)

example

import pandas as pd
import numpy as np
np.random.seed(555)
raw = pd.DataFrame({'int_col': [34, 645, 23, 4, 0, 6], # all positive
                    'float_col': [132.54, float('nan'), 21399.23, 0, 434.74, 4592309.23],
                    'bool_col': [True, False, True, False, False, False],
                    'category_col': ['a', 'a', 'b', 'b', 'b', 'c'],
                    'object_col': ['z', 'y', ' ', 'nan', 'x', '   ']})
TEF.reorder_col(raw, 'bool_col', before='int_col')

rename_cols_by_words

rename_cols_by_words(df, words=[], mapper={}, verbose=1)

description

NOTE: Think twice before using this function! Sometimes column names get too long with this logic; in that case, I personally wouldn't specify any words.

A function that renames the columns by

  • replacing spaces ' ' with '_'
    • so that you can use d.col_name instead of d['col_name']
  • renaming columns using the mapper dictionary
  • detecting the given words in the column names and separating them word by word with '_'

args

  • df: pandas dataframe
  • words: list of strings, words that should be detected and separated
  • mapper: dict, where keys are column name before renaming and values are after
  • verbose: int, 0, 1, or 2, how many messages you want printed
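
example

A minimal sketch with hypothetical column names, assuming the renaming behaves as described above:

import pandas as pd

d = pd.DataFrame(columns=['First Name', 'homeaddress'])
TEF.rename_cols_by_words(d, words=['home', 'address'])
# expected: 'First Name' -> 'First_Name' (space replaced)
#           'homeaddress' -> 'home_address' (detected words separated by '_')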
