
This package provides fast tabular data investigation, helping to make data eligible for ML model building and assisting developers in their projects.

Project description

This package provides fast tabular data investigation, helping to make data eligible for ML model building and assisting developers in their projects. Most of the functions return a dataframe or json as output.

pip install TabularDataInvestigation
from TabularDataInvestigation import tdi

tdi.find_index_for_null_values(df, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • return_type(optional): Default = 'dataframe'

Output : DataFrame

Sometimes we need to drop or fill null values cell by cell, with a different method for each. Some missing data carries meaning and some is unnecessary. This function returns the indexes of all missing cells so we can handle them individually in our project.

df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})
df
     A  B     C
0  1.0  !     a
1  NaN  5     b
2  3.0  ?  None
tdi.find_index_for_null_values(df)
     A    C
0  [1]  [2]
tdi.find_index_for_null_values(df, return_type='json')
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. From the output we understand that column "A" has a null value at index "1".
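For intuition, the same per-column null indexes can be computed with plain pandas. This is a minimal sketch of the idea, not the library's actual implementation:

```python
import pandas as pd

# Sketch of what find_index_for_null_values appears to compute:
# for each column that has nulls, the row indexes of the missing cells.
df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})

null_indexes = {
    col: df.index[df[col].isna()].tolist()
    for col in df.columns
    if df[col].isna().any()
}
print(null_indexes)  # {'A': [1], 'C': [2]}
```

The resulting dict mirrors the library's dataframe output: columns with no nulls (here "B") are simply omitted.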

tdi.check_error_data_types(df, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • return_type(optional): Default = 'dataframe'

Output : DataFrame

We often see unusual behaviour in a dataframe caused by data type issues. For example, a column that should be numeric shows up as object or string type because of erroneous values in the data. This function finds the cause in the dataframe.

df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9,'2,0']})
df
   A  B    C
0  1  1  1.3
1  a  5  3.9
2  3  2  2,0
tdi.check_error_data_types(df)
  columns error_data error_index
0       A        [a]         [1]
1       C      [2,0]         [2]
tdi.check_error_data_types(df, return_type='json')
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. The output above shows that column "A" has the erroneous value "a" at index "1".
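One common way to express this check in plain pandas is to flag values that fail numeric conversion. This is a hedged sketch of the idea, not the library's actual code:

```python
import pandas as pd

# Sketch of the idea behind check_error_data_types: in a column that is
# expected to be numeric, flag the values pd.to_numeric cannot convert.
df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9, '2,0']})

errors = {}
for col in df.columns:
    converted = pd.to_numeric(df[col], errors='coerce')
    bad = df[col][converted.isna() & df[col].notna()]  # failed conversions only
    if not bad.empty:
        errors[col] = {'error_data': bad.tolist(), 'error_index': bad.index.tolist()}
print(errors)
# {'A': {'error_data': ['a'], 'error_index': [1]},
#  'C': {'error_data': ['2,0'], 'error_index': [2]}}
```

Note that `errors='coerce'` turns unconvertible values into NaN, which is why the mask also checks `df[col].notna()` to avoid flagging genuinely missing cells.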

tdi.check_num_of_min_category(df, minimum_threshold=3, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • minimum_threshold(optional): the minimum count of a category (Default = 3)
  • return_type(optional): how to return the output (Default = 'dataframe')

Output : DataFrame

Categories whose count falls at or below the minimum threshold are often too rare to be useful for model training, so this function reports them along with their row indexes.

df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'y','x'], 'C': ['p', 'p', 'q','q']})
df
   A  B  C
0  b  x  p
1  a  x  p
2  b  y  q
3  a  x  q
tdi.check_num_of_min_category(df, minimum_threshold=1)
  columns category index
0       B      [y]   [2]
tdi.check_num_of_min_category(df, minimum_threshold=1, return_type='json')
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. The output above shows that column "B" has a rare category "y" at index "2", because we set the minimum threshold to "1".
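The same rare-category report can be sketched with `value_counts`. This is an illustrative reconstruction, assuming the threshold means "count at or below `minimum_threshold`" as the example output suggests:

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'y', 'x'],
                   'C': ['p', 'p', 'q', 'q']})
minimum_threshold = 1

rare = {}
for col in df.columns:
    counts = df[col].value_counts()
    rare_cats = counts[counts <= minimum_threshold].index.tolist()
    if rare_cats:
        # row indexes where any rare category occurs
        idx = df.index[df[col].isin(rare_cats)].tolist()
        rare[col] = {'category': rare_cats, 'index': idx}
print(rare)  # {'B': {'category': ['y'], 'index': [2]}}
```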

tdi.check_col_with_one_category(df, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • return_type(optional): how to return the output (Default = 'dataframe')

Output : DataFrame

Sometimes a categorical column has no variation, i.e. all of its values are the same. Such a column carries no information for a model, and this function finds those column(s).

df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'x','x'], 'C': ['p', 'p', 'q','q']})
df
   A  B  C
0  b  x  p
1  a  x  p
2  b  x  q
3  a  x  q
tdi.check_col_with_one_category(df)
  columns category_name
0       B           [x]
tdi.check_col_with_one_category(df,return_type='json')
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. The output above shows that column "B" has only one category, whose value is "x".
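In plain pandas this amounts to checking `nunique` per column. A minimal sketch of the idea:

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'x', 'x'],
                   'C': ['p', 'p', 'q', 'q']})

# Columns with exactly one distinct value, mapped to that value.
constant_cols = {
    col: df[col].unique().tolist()
    for col in df.columns
    if df[col].nunique(dropna=False) == 1
}
print(constant_cols)  # {'B': ['x']}
```

Such constant columns are usually safe to drop before modelling, since they cannot help any estimator discriminate between rows.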

tdi.find_special_char_index(df, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • return_type(optional): how to return the output (Default = 'dataframe')

Output : DataFrame

This function finds the indexes in the dataframe that contain double spaces or special characters.

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})
df
   A  B    C
0  1  !  1.2
1  2  5  2.6
2  3  ?  3,2
tdi.find_special_char_index(df)
  columns has_special_char_at
0       A                  []
1       B              [0, 2]
2       C                 [2]
tdi.find_special_char_index(df, return_type='json')
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. The output above shows that column "B" has special characters at indexes [0, 2], and column "C" has a special character at index [2].
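A regex-based sketch of this check is straightforward. The character class below (letters, digits, whitespace and '.') is an assumption chosen to reproduce the example output; the library may use a different definition of "special character":

```python
import re
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})

# Flag a cell if its string form contains a character outside
# [A-Za-z0-9 . whitespace], or a double space.
pattern = re.compile(r'[^A-Za-z0-9.\s]|  ')
special = {
    col: [i for i, v in df[col].items() if pattern.search(str(v))]
    for col in df.columns
}
print(special)  # {'A': [], 'B': [0, 2], 'C': [2]}
```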

tdi.duplicate_columns(df)

Parameters (Input):

  • df: pandas Dataframe

Output : List

This function returns a list of column names that contain the same values. It also handles the case where the column names differ but the data is identical.

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})
df
   A  B  C
0  1  !  1
1  2  5  2
2  3  ?  3
l = tdi.duplicate_columns(df)
l
['A', 'C']

So here, columns 'A' and 'C' contain the same data.
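Detecting value-identical columns can be sketched in pandas by transposing the frame and looking for duplicate rows. This is an illustrative approach, not necessarily the one the library uses:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})

# After transposing, each original column becomes a row, so duplicated()
# with keep=False marks every member of a duplicate group.
dup_mask = df.T.duplicated(keep=False)
dup_cols = df.columns[dup_mask].tolist()
print(dup_cols)  # ['A', 'C']
```

Because this compares values rather than names, renaming 'C' to anything else would not change the result.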

tdi.correlated_columns(df, return_type='dataframe')

Parameters (Input):

  • df: pandas Dataframe
  • return_type(optional): how to return the output (Default = 'dataframe')

Output : DataFrame

This function returns a dataframe or json listing pairs of columns whose data is more than 90% correlated, even though the columns are distinct.

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})
df
   A  B  C
0  1  a  1
1  2  5  2
2  3  b  3
tdi.correlated_columns(df, return_type='dataframe')
/usr/local/lib/python3.10/dist-packages/TabularDataInvestigation/tdi.py:116: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  correalated_df = df.corr()
  columns correlated_columns correlation
0       A                [C]       [1.0]
1       C                [A]       [1.0]
tdi.correlated_columns(df, return_type='json')
/usr/local/lib/python3.10/dist-packages/TabularDataInvestigation/tdi.py:116: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  correalated_df = df.corr()
{"type":"string"}

Here return_type is optional ('dataframe' or 'json'). Default: dataframe. The output above shows that column A is correlated with column C, along with the correlation value.
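A sketch of the same check with plain pandas, assuming a 0.9 absolute-correlation cutoff as the description states. Passing `numeric_only=True` to `DataFrame.corr` also avoids the FutureWarning shown in the output above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})

# Pearson correlation over numeric columns only; non-numeric 'B' is skipped.
corr = df.corr(numeric_only=True)
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(pairs)  # [('A', 'C', 1.0)]
```

Highly correlated column pairs are often pruned to one representative before modelling, since the second column adds little independent signal.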
