A Python package for generating comprehensive data summaries and statistics, similar to Stata's codebook command.

`stata_codebook` Package

The `stata_codebook` package provides tools for generating detailed descriptive statistics and summaries of data frames, similar to Stata's codebook command, which is very useful for examining the variables in a dataset. In the words of the Stata documentation, "codebook examines the data in producing its results. For variables that codebook thinks are continuous, it presents the mean; the standard deviation; and the 10th, 25th, 50th, 75th, and 90th percentiles. For variables that it thinks are categorical, it presents a tabulation."

The package supports various features, including:

Summary statistics for numeric and categorical variables
Handling of columns with missing values
Detection of mixed data types
Normality testing with Shapiro-Wilk or Kolmogorov-Smirnov tests, depending on dataset size
Output formatting for academic or professional reports
Checks for embedded, leading, and trailing blanks in string variables
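
To make the blank check concrete, the sketch below classifies strings using plain Python string methods. The helper name `blank_issues` and its output labels are hypothetical and are not part of `stata_codebook`; the package's own detection logic may differ.

```python
def blank_issues(values):
    """Count strings with leading, trailing, or embedded blanks.

    Hypothetical helper for illustration; not the package's implementation.
    """
    counts = {"leading": 0, "trailing": 0, "embedded": 0}
    for value in values:
        if not isinstance(value, str):
            continue  # skip missing values and non-string entries
        if value != value.lstrip():
            counts["leading"] += 1
        if value != value.rstrip():
            counts["trailing"] += 1
        # An embedded blank is whitespace between the first and last
        # non-blank characters.
        if any(ch.isspace() for ch in value.strip()):
            counts["embedded"] += 1
    return counts


sample = [" Male", "Female ", "Non binary", "Male", None]
print(blank_issues(sample))
```

A string can fall into more than one category, so the three counts are independent of each other.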

Why use stata_codebook over built-in summary statistics?

While pandas offers built-in functions like describe() and value_counts() for summarizing data, the codebook package provides several advantages:

Comprehensive Overview
- Numeric and Categorical Data: Unlike describe(), which primarily focuses on numeric data, codebook provides a detailed summary of both numeric and categorical variables. It not only gives you the common statistics like mean, median, and standard deviation but also includes the top categories and their proportions for categorical variables.
- Handling of Missing Values: The codebook function provides a clear count of missing values for each variable, which is not directly offered by the describe() function.
Data Quality Checks
- Detection of Blanks: One of the unique features of the codebook function is its ability to detect embedded, leading, and trailing blanks in string data. This can be crucial for identifying and resolving data entry issues that might otherwise go unnoticed with standard summary statistics.
- Mixed Data Types: If a column contains mixed data types, the function will automatically detect and handle it, issuing warnings to alert you to potential data quality problems.
Advanced Statistical Insights
- Normality Testing: The codebook function includes normality testing (Shapiro-Wilk for small datasets (<5000 observations) and Kolmogorov-Smirnov for large datasets), providing you with p-values that can help you assess the distribution of your numeric data. This goes beyond what the standard describe() function offers.
- Confidence Intervals: In advanced mode, the function calculates 95% confidence intervals for both numeric variables and the proportions of the top categories in categorical variables, offering deeper insights into your data's variability.
Customizable and Readable Output
- Formatted Output: The codebook function rounds numerical results to a specified number of decimal places, ensuring that the output is easy to read and interpret. This is especially valuable for creating reports or presentations where clarity and professionalism are paramount.
- Consistent Display: By returning a DataFrame with all relevant statistics neatly organized, codebook makes it easier to compare variables side by side, which can be inefficient when using multiple pandas functions.
Easy to Use
- Single Command: With just one command, you can generate a detailed and well-rounded summary of one column or the entire DataFrame, saving time and reducing the risk of overlooking important details.
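
For contrast, the pandas-only route that this section compares against can be sketched as follows. It uses only standard pandas methods (`describe()`, `isna()`, `value_counts()`); the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, None],
    "gender": ["Male", "Female", "Female", "Male", None],
})

# Numeric summary: describe() skips non-numeric columns by default.
numeric_summary = df.describe()

# Missing values need a separate call.
missing_counts = df.isna().sum()

# Category frequencies need one call per categorical column.
gender_counts = df["gender"].value_counts()

print(numeric_summary)
print(missing_counts)
print(gender_counts)
```

Each call returns a separate object, so comparing variables side by side means combining the results by hand.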

1. Installation

The package can be installed directly from PyPI using pip:

pip install stata_codebook

2. Quick Start

Here's a quick example to get you started:

import pandas as pd
from stata_codebook import codebook

# Sample DataFrame
data = {
    'age': [25, 30, 35, 40, None],
    'income': [50000, 60000, 70000, 80000, 90000],
    'gender': ['Male', 'Female', 'Female', 'Male', None],
    'is_employed': [True, True, False, True, None]
}
df = pd.DataFrame(data)

# codebook for all dataset variables
codebook(df)

	Variable	Type	Unique values	Missing values	Blank issues	Range	25th percentile	50th percentile (Median)	75th percentile	Mean	Examples	Top categories	SD	95% CI	Normality test	p-value (normality)	Top category proportion	95% CI (top category)
0	age	float64	4	1	Not applicable	(25.0, 40.0)	28.75	32.5	36.25	32.5	[35.0, 25.0, 30.0]	-	-	-	-	-	NaN	NaN
1	income	int64	5	0	Not applicable	(50000, 90000)	60000.0	70000.0	80000.0	70000.0	[70000, 50000, 60000]	-	-	-	-	-	NaN	NaN
2	gender	object	2	1	No blanks detected	-	-	-	-	-	[Female, Male, Female]	{'Male': 2, 'Female': 2}	-	NaN	-	-	-	-
3	is_employed	object	2	1	No blanks detected	-	-	-	-	-	[False, True, True]	{True: 3, False: 1}	-	NaN	-	-	-	-

# codebook for a specific column in the dataset
codebook(df, column='income') # numerical column

	Variable	Type	Unique values	Missing values	Blank issues	Range	25th percentile	50th percentile (Median)	75th percentile	Mean	Examples	Top categories	SD	95% CI	Normality test	p-value (normality)
0	income	int64	5	0	Not applicable	(50000, 90000)	60000.0	70000.0	80000.0	70000.0	[70000, 50000, 60000]	-	-	-	-	-

# codebook for a specific column in the dataset
codebook(df, column='gender') # categorical column

	Variable	Type	Unique values	Missing values	Blank issues	Examples	Top categories	Range	25th percentile	50th percentile (Median)	75th percentile	Mean	SD	Normality test	p-value (normality)	Top category proportion	95% CI (top category)
0	gender	object	2	1	No blanks detected	[Female, Male, Female]	{'Male': 2, 'Female': 2}	-	-	-	-	-	-	-	-	-	-

# codebook for all dataset variables with additional statistics
codebook(df, advanced=True)

	Variable	Type	Unique values	Missing values	Blank issues	Range	25th percentile	50th percentile (Median)	75th percentile	Mean	Examples	Top categories	SD	95% CI	Normality test	p-value (normality)	Top category proportion	95% CI (top category)
0	age	float64	4	1	Not applicable	(25.0, 40.0)	28.75	32.5	36.25	32.5	[35.0, 25.0, 30.0]	-	6.455	(26.174, 38.826)	Shapiro-Wilk	0.972	NaN	NaN
1	income	int64	5	0	Not applicable	(50000, 90000)	60000.0	70000.0	80000.0	70000.0	[70000, 50000, 60000]	-	15811.388	(56140.707, 83859.293)	Shapiro-Wilk	0.967	NaN	NaN
2	gender	object	2	1	No blanks detected	-	-	-	-	-	[Female, Male, Female]	{'Male': 2, 'Female': 2}	-	NaN	-	-	0.50	(0.01, 0.99)
3	is_employed	object	2	1	No blanks detected	-	-	-	-	-	[False, True, True]	{True: 3, False: 1}	-	NaN	-	-	0.75	(0.326, 1.174)
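
The intervals in the advanced output above are consistent with a normal-approximation formula: mean ± 1.96 × SD / √n for numeric columns, and p ± 1.96 × √(p(1 − p)/n) for the top-category proportion, with n counting non-missing values. That formula is inferred from the printed numbers, not taken from the package source, and the function names below are hypothetical. A minimal sketch using only the standard library:

```python
import math
import statistics

Z_95 = 1.96  # assumed critical value for a 95% normal-approximation interval


def mean_ci(values, decimals=3):
    """95% CI for the mean of the non-missing values (sample SD, ddof=1)."""
    clean = [v for v in values if v is not None]
    mean = statistics.mean(clean)
    margin = Z_95 * statistics.stdev(clean) / math.sqrt(len(clean))
    return (round(mean - margin, decimals), round(mean + margin, decimals))


def proportion_ci(successes, n, decimals=3):
    """95% Wald interval for a proportion; it is not clipped to [0, 1]."""
    p = successes / n
    margin = Z_95 * math.sqrt(p * (1 - p) / n)
    return (round(p - margin, decimals), round(p + margin, decimals))


print(mean_ci([25, 30, 35, 40, None]))               # age
print(mean_ci([50000, 60000, 70000, 80000, 90000]))  # income
print(proportion_ci(3, 4))  # is_employed: True in 3 of 4 non-missing rows
```

Because a Wald interval is not bounded, its upper limit can exceed 1 in small samples, as in the `is_employed` row.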

3. Detailed Function Documentation

Function: `codebook`

Generates a detailed codebook for a DataFrame, or for a single column of it, providing descriptive statistics and data quality checks.

Parameters:

df (pandas.DataFrame): The DataFrame to analyze.
column (str, optional): If specified, only this column will be analyzed. Defaults to None.
advanced (bool, optional): If True, includes additional statistics like standard deviation, confidence intervals, and normality tests. Defaults to False.
decimal_places (int, optional): The number of decimal places to round numerical results. Defaults to 3.

Returns:

pandas.DataFrame: A DataFrame containing the codebook with descriptive statistics and data quality checks.

Example Usage:

# Generate an advanced codebook for a specific column
codebook(df, column='age', advanced=True, decimal_places=2)

	Variable	Type	Unique values	Missing values	Blank issues	Range	25th percentile	50th percentile (Median)	75th percentile	Mean	Examples	Top categories	SD	95% CI	Normality test	p-value (normality)
0	age	float64	4	1	Not applicable	(25.0, 40.0)	28.75	32.5	36.25	32.5	[35.0, 25.0, 30.0]	-	6.45	(26.18, 38.82)	Shapiro-Wilk	0.97

4. Notes

If a column contains all missing values, the function will skip detailed analysis for that column and indicate that it is entirely missing. The function automatically handles mixed data types by converting the column to an object type and issuing a warning.
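
To see in advance which columns would trigger these two behaviours, a check along the following lines can be run with pandas alone. This is a sketch of one way to detect the conditions, not the package's own logic, and the example DataFrame is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "score": [1, "two", 3.0, None],      # mixed int, str, and float
    "notes": [None, None, None, None],   # entirely missing
    "city": ["Oslo", "Bergen", "Oslo", None],
})

# Columns where every value is missing.
all_missing = [col for col in df.columns if df[col].isna().all()]

# Columns whose non-missing values have more than one Python type.
mixed = [
    col for col in df.columns
    if df[col].dropna().map(type).nunique() > 1
]

print(all_missing)
print(mixed)
```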

5. Output Explanation

Variable: The name of the variable.
Type: The data type of the variable.
Unique values: The number of unique non-null values.
Missing values: The number of missing (null) values.
Blank issues: Any detected issues with leading, trailing, or embedded blanks in string variables.
Range: The minimum and maximum values for numeric variables.
25th, 50th, 75th percentile: The respective percentiles for numeric variables.
Mean: The mean of numeric variables.
SD: The standard deviation for numeric variables (advanced mode).
95% CI: The 95% confidence interval for numeric variables (advanced mode).
Normality test: The type of normality test applied (advanced mode): Shapiro-Wilk for datasets with 5000 or fewer observations, Kolmogorov-Smirnov for larger datasets.
p-value (normality): The p-value from the normality test (advanced mode).
Top categories: The most frequent categories for categorical variables.
Top category proportion: The proportion of the top category for categorical variables (advanced mode).
95% CI (top category): The 95% confidence interval for the top category proportion (advanced mode).
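
The choice of normality test described above can be sketched with SciPy's `scipy.stats.shapiro` and `scipy.stats.kstest`. The helper name and the cutoff constant are assumptions for illustration; in particular, it is not confirmed on which side a column with exactly 5000 observations falls, or how the package parameterises the Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

SHAPIRO_MAX_N = 5000  # assumed cutoff, taken from the description above


def normality_test(values):
    """Return (test name, p-value) for the non-missing values.

    Hypothetical helper; not the package's implementation.
    """
    x = np.asarray(values, dtype=float)  # None becomes NaN
    x = x[~np.isnan(x)]
    if len(x) <= SHAPIRO_MAX_N:
        return "Shapiro-Wilk", stats.shapiro(x).pvalue
    # Kolmogorov-Smirnov against a normal with the sample's mean and SD.
    # Estimating the parameters from the same data makes the p-value approximate.
    result = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    return "Kolmogorov-Smirnov", result.pvalue


name, p_value = normality_test([25, 30, 35, 40, None])
print(name, round(p_value, 3))
```

A small p-value suggests the data depart from a normal distribution; with only a handful of observations the test has little power, so a large p-value is weak evidence of normality.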

6. FAQ/Troubleshooting

Q1: The codebook function isn't working for my DataFrame with mixed data types. What should I do?

A: The codebook function automatically detects and converts columns with mixed data types to object (string) type. If you see a warning about mixed types, ensure your data is clean and consistently typed, or allow the function to handle it automatically.

Q2: Why does the function skip some columns?

A: The function may skip columns if they contain all missing values (NaN). The output will indicate if a column is entirely missing.

Q3: How can I adjust the number of decimal places for numerical results?

A: You can adjust the decimal precision by setting the decimal_places parameter when calling the codebook function:

codebook(df, advanced=True, decimal_places=2)

License

Developed by: Mohsen Askar ceaser198511@gmail.com

Citation

If you use stata_codebook, please refer to this repository.

