Data profiler for Pandas
Panda-Helper: Quickly and easily inspect data
Panda-Helper creates data profiles for data in Pandas DataFrames and Series
Assess data quality and usefulness with minimal effort
Effortlessly perform initial data exploration, so you can move on to more in-depth analysis
DataFrame profiles quickly and easily:
- Report shape
- Detect duplicated rows
- Display series names and data types
- Provide distribution statistics on null values per row, giving a view of data completeness
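The DataFrame-level checks listed above can be approximated with plain pandas. The following is a minimal sketch, not Panda-Helper's implementation: the helper name `dataframe_overview` and the toy data are invented for illustration, and it assumes a duplicated row is one that matches an earlier row in every column.

```python
import pandas as pd


def dataframe_overview(df: pd.DataFrame) -> dict:
    """Hypothetical helper: gather DataFrame-level facts with plain pandas."""
    nulls_per_row = df.isna().sum(axis=1)  # one null count per row
    return {
        "shape": df.shape,
        # duplicated() flags every repeat of an earlier, fully identical row
        "duplicated_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        # distribution of null counts across rows (data completeness)
        "nulls_per_row": nulls_per_row.describe(
            percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
        ),
    }


# Toy data invented for this sketch
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "city": ["Oslo", "Lima", "Lima", None],
        "score": [0.5, 1.5, 1.5, None],
    }
)
overview = dataframe_overview(toy)
print(overview["shape"])
print(overview["duplicated_rows"])
print(overview["nulls_per_row"])
```

Returning a dictionary keeps the sketch short; a real profiler would also format these values for display.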
Sample DataFrame profile
Vehicles passing through toll stations
DataFrame-Level Info
-------------------------  ------------
DF Shape                   (1586280, 6)
Duplicated Rows            2184

Column Name                 Data Type
--------------------------  -----------
Plaza ID                    int64
Date                        object
Hour                        int64
Direction                   object
# Vehicles - ETC (E-ZPass)  int64
# Vehicles - Cash/VToll     int64

Summary of Nulls Per Row
--------------------------  -----------
count                       1.58628e+06
min                         0
1%                          0
5%                          0
25%                         0
50%                         0
75%                         0
95%                         0
99%                         0
max                         0
median                      0
mean                        0
median absolute deviation   0
standard deviation          0
skew                        0
Series profiles quickly and easily report the:
- Series data type
- Count of non-null values in the series
- Number of unique values
- Count of null values
- Counts and frequency of the most and least common values
- Distribution statistics for numeric data
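The Series-level items above map onto a handful of pandas calls. This is a rough sketch rather than Panda-Helper's own code: `series_overview` and its `top` parameter are hypothetical names, and it assumes percentages are taken over non-null values only.

```python
import pandas as pd


def series_overview(s: pd.Series, top: int = 3) -> dict:
    """Hypothetical helper: gather Series-level facts with plain pandas."""
    counts = s.value_counts()  # most common first; nulls are excluded
    freq = pd.DataFrame(
        {"count": counts, "pct_of_total": 100 * counts / counts.sum()}
    )
    info = {
        "dtype": str(s.dtype),
        "count": int(s.count()),  # non-null values only
        "unique_values": int(s.nunique()),
        "null_values": int(s.isna().sum()),
        "most_common": freq.head(top),
    }
    # Distribution statistics only make sense for numeric data
    if pd.api.types.is_numeric_dtype(s):
        info["stats"] = s.describe()
    return info


# Toy data invented for this sketch
directions = pd.Series(["I", "O", "I", None, "I"], name="Direction")
info = series_overview(directions)
print(info["most_common"])
```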
Sample profile of categorical data
Direction vehicles are traveling
Direction Info
----------------  -------
Data Type         object
Count             1586280
Unique Values     2
Null Values       0

Value      Count  % of total
-------  -------  ------------
I         814100  51.32%
O         772180  48.68%
Sample profile of numeric data
Hourly vehicle counts at tolling points
# Vehicles - ETC (E-ZPass) Info
---------------------------------  -------
Data Type                          int64
Count                              1586280
Unique Values                      8987
Null Values                        0

  Value    Count  % of total
-------  -------  ------------
      0     3137  0.20%
     43     1762  0.11%
     44     1743  0.11%
     40     1712  0.11%
     42     1699  0.11%
     41     1682  0.11%
     39     1676  0.11%
     37     1673  0.11%
     48     1659  0.10%
     46     1654  0.10%
     38     1646  0.10%
     45     1641  0.10%
     36     1636  0.10%
     52     1574  0.10%
     47     1572  0.10%
     50     1571  0.10%
     51     1555  0.10%
     53     1547  0.10%
     55     1543  0.10%
     34     1534  0.10%
   8269        1  0.00%
   8438        1  0.00%
   8876        1  0.00%
   8261        1  0.00%
   8694        1  0.00%

Statistic                            Value
-------------------------  ---------------
count                          1.58628e+06
min                            0
1%                            25
5%                            68
25%                          407
50%                         1054
75%                         2071
95%                         3583
99%                         6308
max                        16854
median                      1054
mean                        1373.16
median absolute deviation    751
standard deviation          1253.1
skew                           1.69154
Installing Panda-Helper
pip install panda-helper
Using Panda-Helper
Profiling a DataFrame
Create the DataFrameProfile and then display it or save the profile.
import pandas as pd
import pandahelper.reports as ph
data = {
    "user_id": [1, 2, 3, 4, 4],
    "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
    "amount": [100.00, None, 1400.00, 85.12, 85.12],
    "survey": [None, None, None, "online", "online"],
}
df = pd.DataFrame(data)
df_profile = ph.DataFrameProfile(df)
df_profile
DataFrame-Level Info
-------------------------  ------
DF Shape                   (5, 4)
Obviously Duplicated Rows  1

Column Name    Data Type
-------------  -----------
user_id        int64
transaction    object
amount         float64
survey         object

Summary of Nulls Per Row
--------------------------  --------
count                       5
min                         0
1%                          0
5%                          0
25%                         0
50%                         1
75%                         1
95%                         1.8
99%                         1.96
max                         2
median                      1
mean                        0.8
median absolute deviation   1
standard deviation          0.83666
skew                        0.512241
df_profile.save_report("df_profile.txt")
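The figures in the profile above can be cross-checked with plain pandas. This sketch does not call Panda-Helper; it rebuilds the same DataFrame so that it runs on its own, and it assumes the percentiles use linear interpolation between sorted values (the pandas default), which is consistent with the fractional 95% and 99% entries.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 4],
        "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
        "amount": [100.00, None, 1400.00, 85.12, 85.12],
        "survey": [None, None, None, "online", "online"],
    }
)

# Count missing values in each row
nulls_per_row = df.isna().sum(axis=1)
print(nulls_per_row.tolist())  # [1, 2, 1, 0, 0]

# Sorted counts are [0, 0, 1, 1, 2]; quantile q sits at position q * (n - 1).
# For q = 0.95: 0.95 * 4 = 3.8, so the value is 1 + 0.8 * (2 - 1) = 1.8
p95 = nulls_per_row.quantile(0.95)
p99 = nulls_per_row.quantile(0.99)

# The last two rows are identical, so one row repeats an earlier one
duplicates = int(df.duplicated().sum())
print(p95, p99, duplicates)
```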
Profiling a Series
Create the SeriesProfile and then display it or save it. That's it!
series_profile = ph.SeriesProfile(df["amount"])
series_profile
amount Info
-------------  -------
Data Type      float64
Count          4
Unique Values  3
Null Values    1

  Value    Count  % of total
-------  -------  ------------
  85.12        2  50.00%
 100           1  25.00%
1400           1  25.00%

Statistic                       Value
-------------------------  ----------
count                         4
min                          85.12
1%                           85.12
5%                           85.12
25%                          85.12
50%                          92.56
75%                         425
95%                        1205
99%                        1361
max                        1400
median                       92.56
mean                        417.56
median absolute deviation     7.44
standard deviation          654.998
skew                          1.99931
series_profile.save_report("amount_profile.txt")
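The less familiar statistics in the profile above can be reproduced by hand. This is a minimal sketch under two assumptions: the median absolute deviation is the median of absolute deviations from the median, and the skew is the bias-corrected sample skewness that `pandas.Series.skew` returns. Both reproduce the values shown, though Panda-Helper may compute them differently.

```python
import pandas as pd

# Same values as the "amount" column in the example; None becomes NaN
amount = pd.Series([100.00, None, 1400.00, 85.12, 85.12], name="amount")

median = amount.median()  # NaN is skipped: (85.12 + 100) / 2
mad = (amount - median).abs().median()  # median absolute deviation
skew = amount.skew()  # bias-corrected sample skewness

print(round(median, 2), round(mad, 2), round(skew, 5))
```

The single large value (1400) pulls the mean and standard deviation upward, while the median and median absolute deviation barely move, which is why the profile reports both kinds of statistic.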
Sample data obtained from: