Data profiler for Pandas

Project description

Panda-Helper: Quickly and easily inspect data

Panda-Helper is a simple data-profiling utility for Pandas DataFrames and Series

Assess data quality and usefulness with minimal effort

Quickly perform initial data exploration, so you can move on to more in-depth analysis

DataFrame profiles:

Report shape
Detect duplicated rows
Display series names and data types
Calculate distribution statistics on the number of null values per row, giving a view of data completeness
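
As a rough illustration of what these items mean, the same quantities can be approximated with plain pandas calls. This is only a sketch of the underlying ideas, not Panda-Helper's actual implementation; the variable names are made up for this example, and the data mirrors the small DataFrame used in the usage section.

```python
import pandas as pd

# Toy data (same values as the usage example in this README).
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 4],
        "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
        "amount": [100.00, None, 1400.00, 85.12, 85.12],
        "survey": [None, None, None, "online", "online"],
    }
)

shape = df.shape                               # (rows, columns)
duplicated_rows = int(df.duplicated().sum())   # rows identical to an earlier row
dtypes = df.dtypes                             # data type of each column
nulls_per_row = df.isna().sum(axis=1)          # number of nulls in each row

# Distribution of nulls per row, with extra percentiles (an assumption:
# the percentile list is chosen to resemble the sample report).
null_summary = nulls_per_row.describe(
    percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
)

print(shape, duplicated_rows)
print(dtypes)
print(null_summary)
```

A row-wise null count of zero across all percentiles, as in the toll-station sample, indicates a fully populated table.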

Sample DataFrame profile
Vehicles passing through toll stations

DataFrame-Level Info
-------------------------  ------------
DF Shape                   (1586280, 6)
Duplicated Rows             2184

Column Name                 Data Type
--------------------------  -----------
Plaza ID                    int64
Date                        object
Hour                        int64
Direction                   object
# Vehicles - ETC (E-ZPass)  int64
# Vehicles - Cash/VToll     int64

Summary of Nulls Per Row
--------------------------  -----------
count                       1.58628e+06
min                         0
1%                          0
5%                          0
25%                         0
50%                         0
75%                         0
95%                         0
99%                         0
max                         0
median                      0
mean                        0
median absolute deviation   0
standard deviation          0
skew                        0

Series profiles report:

Series data type
Count of non-null values in the series
Number of unique values
Count of null values
Counts and frequency of the most and least common values
Distribution statistics for numeric data
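
For orientation, the items above correspond roughly to the following plain pandas operations. This is a sketch under the assumption that frequencies are computed over non-null values; it is not taken from Panda-Helper's source, and the variable names are illustrative only.

```python
import pandas as pd

# Toy series (same values as the "amount" column in the usage example).
amount = pd.Series([100.00, None, 1400.00, 85.12, 85.12], name="amount")

dtype = amount.dtype                      # series data type
count = int(amount.count())               # non-null values
unique_values = int(amount.nunique())     # distinct non-null values
null_values = int(amount.isna().sum())    # null values

# Counts and share of each value, most common first.
freq = amount.value_counts().to_frame("count")
freq["pct_of_total"] = freq["count"] / count

# Basic distribution statistics for numeric data.
stats = amount.describe()

print(dtype, count, unique_values, null_values)
print(freq)
print(stats)
```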

Sample profile of categorical data
Direction vehicles are traveling

Direction Info
----------------  -------
Data Type         object
Count             1586280
Unique Values     2
Null Values       0

Value      Count  % of total
-------  -------  ------------
I         814100  51.32%
O         772180  48.68%

Sample profile of numeric data
Hourly vehicle counts at tolling points

# Vehicles - ETC (E-ZPass) Info
---------------------------------  -------
Data Type                          int64
Count                              1586280
Unique Values                      8987
Null Values                        0

  Value    Count  % of total
-------  -------  ------------
      0     3137  0.20%
     43     1762  0.11%
     44     1743  0.11%
     40     1712  0.11%
     42     1699  0.11%
     41     1682  0.11%
     39     1676  0.11%
     37     1673  0.11%
     48     1659  0.10%
     46     1654  0.10%
     38     1646  0.10%
     45     1641  0.10%
     36     1636  0.10%
     52     1574  0.10%
     47     1572  0.10%
     50     1571  0.10%
     51     1555  0.10%
     53     1547  0.10%
     55     1543  0.10%
     34     1534  0.10%
   8269        1  0.00%
   8438        1  0.00%
   8876        1  0.00%
   8261        1  0.00%
   8694        1  0.00%

Statistic                            Value
-------------------------  ---------------
count                          1.58628e+06
min                            0
1%                            25
5%                            68
25%                          407
50%                         1054
75%                         2071
95%                         3583
99%                         6308
max                        16854
median                      1054
mean                        1373.16
median absolute deviation    751
standard deviation          1253.1
skew                           1.69154

Installing Panda-Helper

pip install panda-helper

Using Panda-Helper

Profiling a DataFrame
Create the DataFrameProfile and then display it or save the profile.

import pandas as pd
import pandahelper.reports as ph

# Small example dataset: the last two rows are identical,
# and "amount" and "survey" contain missing values.
data = {
    "user_id": [1, 2, 3, 4, 4],
    "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
    "amount": [100.00, None, 1400.00, 85.12, 85.12],
    "survey": [None, None, None, "online", "online"],
}
df = pd.DataFrame(data)

# Build the profile, then display it
df_profile = ph.DataFrameProfile(df)
df_profile

DataFrame-Level Info
-------------------------  ------
DF Shape                   (5, 4)
Obviously Duplicated Rows  1

Column Name    Data Type
-------------  -----------
user_id        int64
transaction    object
amount         float64
survey         object

Summary of Nulls Per Row
--------------------------  --------
count                       5
min                         0
1%                          0
5%                          0
25%                         0
50%                         1
75%                         1
95%                         1.8
99%                         1.96
max                         2
median                      1
mean                        0.8
median absolute deviation   1
standard deviation          0.83666
skew                        0.512241
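
The fractional percentiles in the report above (1.8 and 1.96) are consistent with linear interpolation between sorted values, which is pandas' default quantile method. The sketch below reproduces them by hand using only the standard library; the `percentile` helper is hypothetical and written for this illustration, and whether Panda-Helper computes percentiles this way internally is an assumption.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty sequence, q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Nulls per row for the five rows of the example DataFrame.
nulls_per_row = [1, 2, 1, 0, 0]

p50 = percentile(nulls_per_row, 0.50)
p95 = percentile(nulls_per_row, 0.95)
p99 = percentile(nulls_per_row, 0.99)
print(p50, round(p95, 2), round(p99, 2))
```

With five rows, the 95th percentile sits at sorted position 3.8, which is 80% of the way from the fourth value (1) to the fifth (2).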

df_profile.save_report("df_profile.txt")

Profiling a Series
Create the SeriesProfile and then display it or save it. That's it!

series_profile = ph.SeriesProfile(df["amount"])
series_profile

amount Info
-------------  -------
Data Type      float64
Count          4
Unique Values  3
Null Values    1

  Value    Count  % of total
-------  -------  ------------
  85.12        2  50.00%
 100           1  25.00%
1400           1  25.00%

Statistic                       Value
-------------------------  ----------
count                         4
min                          85.12
1%                           85.12
5%                           85.12
25%                          85.12
50%                          92.56
75%                         425
95%                        1205
99%                        1361
max                        1400
median                       92.56
mean                        417.56
median absolute deviation     7.44
standard deviation          654.998
skew                          1.99931
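
The median absolute deviation in the report above can be checked by hand: take the median of the non-null values, then the median of the absolute distances from it. The sketch below uses only the standard library and assumes the unscaled definition of the statistic, which matches the reported value of 7.44.

```python
import statistics

# Non-null values of the "amount" series.
amounts = [100.00, 1400.00, 85.12, 85.12]

center = statistics.median(amounts)                     # median of the data
deviations = [abs(x - center) for x in amounts]         # distance of each value from the median
mad = statistics.median(deviations)                     # median of those distances

print(round(center, 2), round(mad, 2))
```

Because the median absolute deviation ignores how far the single large value (1400) lies from the rest, it stays small here while the standard deviation is large.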

series_profile.save_report("amount_profile.txt")

Sample data obtained from:

Release history

0.0.2 (this version): Jun 7, 2022
0.0.1: Jun 4, 2022
