This package provides fast tabular data investigation, helping to prepare datasets for ML model building and assisting developers in their projects when needed.
Project description
This package provides fast tabular data investigation, helping to prepare datasets for ML model building and assisting developers in their projects when needed. Most of the functions return a DataFrame or JSON as output.
pip install TabularDataInvestigation
from TabularDataInvestigation import tdi
tdi.find_index_for_null_values(df, return_type='dataframe')
Parameters:
df: pandas Dataframe
return_type(optional): Default = 'dataframe'
Sometimes we need to delete or fill null values cell by cell with different methods: some missing data carries meaning, and some is simply unnecessary. This function returns the indexes of all missing values per column, so we can handle each one individually in our project.
df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})
df
A B C
0 1.0 ! a
1 NaN 5 b
2 3.0 ? None
tdi.find_index_for_null_values(df)
A C
0 [1] [2]
tdi.find_index_for_null_values(df, return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
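For illustration, here is a pure-pandas sketch of the same check, reproducing the output shown above (the library's internal implementation may differ):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})

# Collect the row indexes of missing values per column, skipping clean columns
null_idx = {col: df.index[df[col].isna()].tolist()
            for col in df.columns if df[col].isna().any()}
print(null_idx)  # {'A': [1], 'C': [2]}
```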
tdi.check_error_data_types(df, return_type='dataframe')
Parameters:
df: pandas Dataframe
return_type (optional): Default = 'dataframe'
We sometimes see unusual behavior in a DataFrame: a column looks numeric, but on inspection its dtype is object or string. This function finds the values causing that.
df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9,'2,0']})
df
A B C
0 1 1 1.3
1 a 5 3.9
2 3 2 2,0
tdi.check_error_data_types(df)
columns error_data error_index
0 A [a] [1]
1 C [2,0] [2]
tdi.check_error_data_types(df, return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
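A pure-pandas sketch of this check: values that fail numeric coercion are the ones breaking the column's dtype (the library's exact logic may differ):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9, '2,0']})

# A value that cannot be coerced to a number is what forces the column to object dtype
errors = {}
for col in df.columns:
    coerced = pd.to_numeric(df[col], errors='coerce')
    mask = coerced.isna() & df[col].notna()
    if mask.any():
        errors[col] = df.loc[mask, col].tolist()
print(errors)  # {'A': ['a'], 'C': ['2,0']}
```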
tdi.check_num_of_min_category(df, minimum_threshold=3, return_type='dataframe')
Parameters:
df: pandas Dataframe
minimum_threshold (optional): the minimum count of a category (Default = 3)
return_type (optional): how to return the output (Default = 'dataframe')
This function finds the categories that occur no more than minimum_threshold times, together with the row indexes where they appear.
df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'y','x'], 'C': ['p', 'p', 'q','q']})
df
A B C
0 b x p
1 a x p
2 b y q
3 a x q
tdi.check_num_of_min_category(df, minimum_threshold=1)
columns category index
0 B [y] [2]
tdi.check_num_of_min_category(df, minimum_threshold=1, return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
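The same check can be sketched with pandas value_counts, matching the output above (a sketch, not the library's implementation):

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'y', 'x'],
                   'C': ['p', 'p', 'q', 'q']})

minimum_threshold = 1
rare = {}
for col in df.columns:
    counts = df[col].value_counts()
    low = counts[counts <= minimum_threshold].index.tolist()
    if low:
        # Record each rare category and the rows where it occurs
        rare[col] = {cat: df.index[df[col] == cat].tolist() for cat in low}
print(rare)  # {'B': {'y': [2]}}
```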
tdi.check_col_with_one_category(df, return_type='dataframe')
Parameters:
df: pandas Dataframe
return_type (optional): how to return the output (Default = 'dataframe')
Sometimes a categorical column has no variation: every value in the column is the same. This function finds those column(s).
df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'x','x'], 'C': ['p', 'p', 'q','q']})
df
A B C
0 b x p
1 a x p
2 b x q
3 a x q
tdi.check_col_with_one_category(df)
columns category_name
0 B [x]
tdi.check_col_with_one_category(df,return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
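A one-line pandas sketch of this check, using nunique (the library's implementation may differ):

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'x', 'x'],
                   'C': ['p', 'p', 'q', 'q']})

# A column with exactly one distinct value carries no information
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['B']
```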
tdi.find_special_char_index(df, return_type='dataframe')
Parameters:
df: pandas Dataframe
return_type (optional): how to return the output (Default = 'dataframe')
This function finds the indexes of cells that hold double spaces or special characters in the DataFrame.
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})
df
A B C
0 1 ! 1.2
1 2 5 2.6
2 3 ? 3,2
tdi.find_special_char_index(df)
columns has_special_char_at
0 A []
1 B [0, 2]
2 C [2]
tdi.find_special_char_index(df, return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
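A regex-based sketch that reproduces the output above; note the exact character set the library flags is an assumption here (letters, digits, dots, and single spaces treated as clean):

```python
import re
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})

# Flag anything that is not alphanumeric, a dot, or whitespace,
# plus runs of two or more spaces (the library's exact rules may differ)
pattern = re.compile(r'[^0-9A-Za-z.\s]|\s{2,}')
hits = {col: [i for i, v in enumerate(df[col].astype(str)) if pattern.search(v)]
        for col in df.columns}
print(hits)  # {'A': [], 'B': [0, 2], 'C': [2]}
```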
tdi.duplicate_columns(df)
Parameters:
df: pandas Dataframe
This function returns a list of column names that contain the same values; the column names may differ, but the data is identical.
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})
df
A B C
0 1 ! 1
1 2 5 2
2 3 ? 3
l = tdi.duplicate_columns(df)
l
['A', 'C']
So here 'A' and 'C' columns contain the same data
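The same result can be sketched with pandas Series.equals, which compares values and dtype (a sketch of the idea, not the library's implementation):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})

# A column is a duplicate if any other column holds exactly the same data
dupes = [col for col in df.columns
         if any(df[col].equals(df[other]) for other in df.columns if other != col)]
print(dupes)  # ['A', 'C']
```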
tdi.correlated_columns(df, return_type='dataframe')
Parameters:
df: pandas Dataframe
return_type (optional): how to return the output (Default = 'dataframe')
This function returns a DataFrame or JSON listing pairs of distinct columns whose data is more than 90% correlated.
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})
df
A B C
0 1 a 1
1 2 5 2
2 3 b 3
tdi.correlated_columns(df, return_type='dataframe')
/usr/local/lib/python3.10/dist-packages/TabularDataInvestigation/tdi.py:116: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
correalated_df = df.corr()
columns correlated_columns correlation
0 A [C] [1.0]
1 C [A] [1.0]
tdi.correlated_columns(df, return_type='json')
{"type":"string"}
The return_type argument is optional ('dataframe' or 'json'); the default is 'dataframe'.
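The underlying check can be sketched with DataFrame.corr; passing numeric_only=True skips the non-numeric column and avoids the FutureWarning shown in the transcript (the 0.9 cutoff follows the description above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})

# numeric_only=True restricts the correlation matrix to numeric columns
corr = df.corr(numeric_only=True)
pairs = [(a, b, corr.loc[a, b])
         for a in corr.columns for b in corr.columns
         if a != b and abs(corr.loc[a, b]) > 0.9]
print(pairs)  # [('A', 'C', 1.0), ('C', 'A', 1.0)]
```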
Source Distribution
Hashes for TabularDataInvestigation-0.0.7.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 79605d274fba3b2cc89f21d11a17f3e63656692f858a7f2ec6751066c7b776dc
MD5 | 43fe498db9e7674eb3c5a197270052c1
BLAKE2b-256 | 3826b0cb3e190a1de1aae98ff094007b71f7ad824b3dd2f639f24e419a1a89f1