Data profiler for Pandas
Project description
Panda-Helper: Quickly and easily inspect data
Panda-Helper is a simple data-profiling utility for Pandas' DataFrames and Series.
Assess data quality and usefulness with minimal effort.
Quickly perform initial data exploration, so you can move on to more in-depth analysis.
DataFrame profiles:
- Report shape
- Detect duplicated rows
- Display series names and data types
- Calculate distribution statistics on null values per row providing a view on data completeness
Sample DataFrame profile
Vehicles passing through toll stations
DataFrame-Level Info
------------------------- ------------
DF Shape (1586280, 6)
Duplicated Rows 2184
Column Name Data Type
-------------------------- -----------
Plaza ID int64
Date object
Hour int64
Direction object
# Vehicles - ETC (E-ZPass) int64
# Vehicles - Cash/VToll int64
Summary of Nulls Per Row
-------------------------- -----------
count 1.58628e+06
min 0
1% 0
5% 0
25% 0
50% 0
75% 0
95% 0
99% 0
max 0
median 0
mean 0
median absolute deviation 0
standard deviation 0
skew 0
Series profiles report the:
- Series data type
- Count of non-null values in the series
- Number of unique values
- Count of null values
- Counts and frequency of the most and least common values
- Distribution statistics for numeric-like data
Sample profile of categorical data
Direction vehicles are traveling
Direction Info
---------------- -------
Data Type object
Count 1586280
Unique Values 2
Null Values 0
Value Count % of total
------- ------- ------------
I 814100 51.32%
O 772180 48.68%
Sample profile of numeric data
Hourly vehicle counts at tolling points
# Vehicles - ETC (E-ZPass) Info
--------------------------------- -------
Data Type int64
Count 1586280
Unique Values 8987
Null Values 0
Value Count % of total
------- ------- ------------
0 3137 0.20%
43 1762 0.11%
44 1743 0.11%
40 1712 0.11%
42 1699 0.11%
41 1682 0.11%
39 1676 0.11%
37 1673 0.11%
48 1659 0.10%
46 1654 0.10%
38 1646 0.10%
45 1641 0.10%
36 1636 0.10%
52 1574 0.10%
47 1572 0.10%
50 1571 0.10%
51 1555 0.10%
53 1547 0.10%
55 1543 0.10%
34 1534 0.10%
8269 1 0.00%
8438 1 0.00%
8876 1 0.00%
8261 1 0.00%
8694 1 0.00%
Statistic Value
------------------------- ---------------
count 1.58628e+06
min 0
1% 25
5% 68
25% 407
50% 1054
75% 2071
95% 3583
99% 6308
max 16854
median 1054
mean 1373.16
median absolute deviation 751
standard deviation 1253.1
skew 1.69154
Installing Panda-Helper
pip install panda-helper
Using Panda-Helper
Profiling a DataFrame
Create the DataFrameProfile and then display it or save the profile.
import pandas as pd
import pandahelper as ph
data = {
"user_id": [1, 2, 3, 4, 4],
"transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
"amount": [100.00, None, 1400.00, 85.12, 85.12],
"survey": [None, None, None, "online", "online"],
}
df = pd.DataFrame(data)
df_profile = ph.DataFrameProfile(df)
df_profile
DataFrame-Level Info
------------------------- ------
DF Shape (5, 4)
Obviously Duplicated Rows 1
Column Name Data Type
------------- -----------
user_id int64
transaction object
amount float64
survey object
Summary of Nulls Per Row
-------------------------- --------
count 5
min 0
1% 0
5% 0
25% 0
50% 1
75% 1
95% 1.8
99% 1.96
max 2
median 1
mean 0.8
median absolute deviation 1
standard deviation 0.83666
skew 0.512241
df_profile.save_report("df_profile.txt")
Profiling a Series
Create the SeriesProfile and then display it or save it. That's it!
series_profile = ph.SeriesProfile(df["amount"])
series_profile
amount Info
------------- -------
Data Type float64
Count 4
Unique Values 3
Null Values 1
Value Count % of total
------- ------- ------------
85.12 2 50.00%
100 1 25.00%
1400 1 25.00%
Statistic Value
------------------------- ----------
count 4
min 85.12
1% 85.12
5% 85.12
25% 85.12
50% 92.56
75% 425
95% 1205
99% 1361
max 1400
median 92.56
mean 417.56
median absolute deviation 7.44
standard deviation 654.998
skew 1.99931
series_profile.save_report("amount_profile.txt")
Sample data obtained from:
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for panda_helper-0.0.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | fb500580d3e4767023d83493c9564f0d50e70f5574ebb575eaaa60808d51ec00 |
|
MD5 | 3720000cdb05ce54ed3165a618b9bb6e |
|
BLAKE2b-256 | 0911b07773d5e183f57ed6eca3fec5b4e0ee7069ca4c2909fc9d48aefb2bd46d |