A Python package for generating comprehensive data summaries and statistics, similar to Stata's codebook command.
stata_codebook Package
The Codebook Package
The stata_codebook package provides tools for generating detailed descriptive statistics and summaries of data frames, similar to Stata's codebook command, which is very useful for examining dataset variables.
According to the Stata documentation, "codebook examines the data in producing its results. For variables that codebook thinks are continuous, it presents the mean; the standard deviation; and the 10th, 25th, 50th, 75th, and 90th percentiles. For variables that it thinks are categorical, it presents a tabulation."
The package supports various features, including:
- Summary statistics for numeric and categorical variables
- Handling of columns with missing values
- Detection of mixed data types
- Normality testing with Shapiro-Wilk or Kolmogorov-Smirnov tests, depending on dataset size
- Output formatting for academic or professional reports
- Checks for embedded, leading, and trailing blanks in the variables
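To make the blank check concrete, here is a minimal sketch of how leading, trailing, and embedded blanks could be counted with plain pandas string methods. The helper name `blank_issues` and its return format are assumptions made for illustration; this is not the package's own implementation.

```python
import pandas as pd


def blank_issues(series: pd.Series) -> dict:
    """Count string values with leading, trailing, or embedded blanks.

    Hypothetical helper for illustration; not part of stata_codebook.
    """
    # Keep only real strings so numbers and missing values are ignored.
    text = series.dropna()
    text = text[text.map(lambda v: isinstance(v, str))].astype(str)
    stripped = text.str.strip()
    return {
        "leading": int((text != text.str.lstrip()).sum()),
        "trailing": int((text != text.str.rstrip()).sum()),
        # "Embedded" here means whitespace inside the stripped value.
        "embedded": int(stripped.str.contains(r"\s", regex=True).sum()),
    }


names = pd.Series([" Oslo", "Bergen ", "New York", "Tromso", None])
print(blank_issues(names))
```

Counting each kind of blank separately makes it easier to decide whether a simple `str.strip()` is enough or whether the values need closer inspection.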
Why use stata_codebook over built-in summary statistics?
While pandas offers built-in functions like describe() and value_counts() for summarizing data, the codebook package provides several advantages:
- Comprehensive Overview
  - Numeric and Categorical Data: Unlike describe(), which primarily focuses on numeric data, codebook provides a detailed summary of both numeric and categorical variables. It not only gives you common statistics like the mean, median, and standard deviation but also includes the top categories and their proportions for categorical variables.
  - Handling of Missing Values: The codebook function provides a clear count of missing values for each variable, which is not directly offered by the describe() function.
- Data Quality Checks
  - Detection of Blanks: One of the unique features of the codebook function is its ability to detect embedded, leading, and trailing blanks in string data. This can be crucial for identifying and resolving data entry issues that might otherwise go unnoticed with standard summary statistics.
  - Mixed Data Types: If a column contains mixed data types, the function will automatically detect and handle it, issuing warnings to alert you to potential data quality problems.
- Advanced Statistical Insights
  - Normality Testing: The codebook function includes normality testing (Shapiro-Wilk for small datasets (<5000 observations) and Kolmogorov-Smirnov for large datasets), providing you with p-values that can help you assess the distribution of your numeric data. This goes beyond what the standard describe() function offers.
  - Confidence Intervals: In advanced mode, the function calculates 95% confidence intervals for both numeric variables and the proportions of the top categories in categorical variables, offering deeper insights into your data's variability.
- Customizable and Readable Output
  - Formatted Output: The codebook function rounds numerical results to a specified number of decimal places, ensuring that the output is easy to read and interpret. This is especially valuable for creating reports or presentations where clarity and professionalism are paramount.
  - Consistent Display: By returning a DataFrame with all relevant statistics neatly organized, codebook makes it easier to compare variables side by side, a task that is inefficient when it requires combining multiple pandas functions.
- Easy to Use
  - Single Command: With just one command, you can generate a detailed and well-rounded summary of one column or the entire DataFrame, saving time and reducing the risk of overlooking important details.
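For comparison, the snippet below sketches the separate pandas calls you would typically combine by hand to get a similar overview. It uses only standard pandas functions; the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, None],
    "gender": ["Male", "Female", "Female", "Male", None],
})

# describe() summarizes numeric columns only by default.
numeric_summary = df.describe()

# Missing values need a separate call.
missing_counts = df.isna().sum()

# Category proportions need yet another call, one column at a time.
gender_shares = df["gender"].value_counts(normalize=True)

print(numeric_summary)
print(missing_counts)
print(gender_shares)
```

Each result comes back in a different shape, so lining them up per variable is left to you.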
1. Installation
The package can be installed directly from PyPI using pip:
pip install stata_codebook
2. Quick Start
Here's a quick example to get you started:
import pandas as pd
from stata_codebook import codebook

# Sample DataFrame
data = {
    'age': [25, 30, 35, 40, None],
    'income': [50000, 60000, 70000, 80000, 90000],
    'gender': ['Male', 'Female', 'Female', 'Male', None],
    'is_employed': [True, True, False, True, None]
}
df = pd.DataFrame(data)

# codebook for all dataset variables
codebook(df)
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | - | - | - | - | NaN | NaN |
1 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | - | - | - | - | NaN | NaN |
2 | gender | object | 2 | 1 | No blanks detected | - | - | - | - | - | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | NaN | - | - | - | - |
3 | is_employed | object | 2 | 1 | No blanks detected | - | - | - | - | - | [False, True, True] | {True: 3, False: 1} | - | NaN | - | - | - | - |
# codebook for a specific column in the dataset
codebook(df, column='income') # numerical column
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | - | - | - | - |
# codebook for a specific column in the dataset
codebook(df, column='gender') # categorical column
| | Variable | Type | Unique values | Missing values | Blank issues | Examples | Top categories | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | SD | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | gender | object | 2 | 1 | No blanks detected | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | - | - | - | - | - | - | - | - | - |
# codebook for all dataset variables with additional statistics
codebook(df, advanced=True)
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | 6.455 | (26.174, 38.826) | Shapiro-Wilk | 0.972 | NaN | NaN |
1 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | 15811.388 | (56140.707, 83859.293) | Shapiro-Wilk | 0.967 | NaN | NaN |
2 | gender | object | 2 | 1 | No blanks detected | - | - | - | - | - | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | NaN | - | - | 0.50 | (0.01, 0.99) |
3 | is_employed | object | 2 | 1 | No blanks detected | - | - | - | - | - | [False, True, True] | {True: 3, False: 1} | - | NaN | - | - | 0.75 | (0.326, 1.174) |
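The intervals in the advanced output above are consistent with a normal-approximation interval (mean ± 1.96 × SD / √n) for numeric variables and a Wald interval (p ± 1.96 × √(p(1 − p)/n)) for the top category proportion. That reading is inferred from the sample numbers, not confirmed from the package source, so treat the sketch below as an assumption. The helper names `mean_ci` and `proportion_ci` are hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # assumed critical value, inferred from the sample output


def mean_ci(values, decimals=3):
    """Normal-approximation 95% CI for the mean, ignoring missing values."""
    x = [v for v in values if v is not None]
    half_width = Z_95 * statistics.stdev(x) / math.sqrt(len(x))
    mean = statistics.mean(x)
    return round(mean - half_width, decimals), round(mean + half_width, decimals)


def proportion_ci(successes, n, decimals=3):
    """Wald 95% CI for a proportion; bounds are not clipped to [0, 1]."""
    p = successes / n
    half_width = Z_95 * math.sqrt(p * (1 - p) / n)
    return round(p - half_width, decimals), round(p + half_width, decimals)


print(mean_ci([25, 30, 35, 40, None]))  # age column
print(proportion_ci(3, 4))              # is_employed: True in 3 of 4 non-missing rows
```

With so few observations the Wald interval is very wide and can extend past 1, as the is_employed row shows, so intervals on small samples are best read as rough indications.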
3. Detailed Function Documentation
Function: codebook
Generates a detailed codebook for a DataFrame, or for a single variable in it, providing descriptive statistics and data quality checks.
Parameters:
- df (pandas.DataFrame): The DataFrame to analyze.
- column (str, optional): If specified, only this column will be analyzed. Defaults to None.
- advanced (bool, optional): If True, includes additional statistics like standard deviation, confidence intervals, and normality tests. Defaults to False.
- decimal_places (int, optional): The number of decimal places to round numerical results. Defaults to 3.
Returns:
- pandas.DataFrame: A DataFrame containing the codebook with descriptive statistics and data quality checks.
Example Usage:
# Generate an advanced codebook for a specific column
codebook(df, column='age', advanced=True, decimal_places=2)
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | 6.45 | (26.18, 38.82) | Shapiro-Wilk | 0.97 |
4. Notes
If a column contains all missing values, the function will skip detailed analysis for that column and indicate that it is entirely missing. The function automatically handles mixed data types by converting the column to an object type and issuing a warning.
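If you want to spot such columns before running the codebook, the two conditions in these notes can be checked with plain pandas. This is only a sketch: the helper `column_flags` is hypothetical, and the package may detect these cases differently.

```python
import pandas as pd


def column_flags(series: pd.Series) -> dict:
    """Flag all-missing columns and columns mixing several Python types.

    Hypothetical helper for illustration; not part of stata_codebook.
    """
    non_null = series.dropna()
    # Collect the distinct Python type names of the non-missing values.
    type_names = sorted({type(v).__name__ for v in non_null})
    return {
        "all_missing": bool(non_null.empty),
        "mixed_types": len(type_names) > 1,
        "types": type_names,
    }


df = pd.DataFrame({
    "empty": [None, None, None],
    "mixed": [1, "two", 3.0],
    "clean": ["a", "b", "c"],
})
flags = {name: column_flags(df[name]) for name in df.columns}
print(flags)
```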
5. Output Explanation:
- Variable: The name of the variable.
- Type: The data type of the variable.
- Unique values: The number of unique non-null values.
- Missing values: The number of missing (null) values.
- Blank issues: Any detected issues with leading, trailing, or embedded blanks in string variables.
- Range: The minimum and maximum values for numeric variables.
- 25th, 50th, 75th percentile: The respective percentiles for numeric variables.
- Mean: The mean of numeric variables.
- SD: The standard deviation for numeric variables (advanced mode).
- 95% CI: The 95% confidence interval for the mean of numeric variables (advanced mode).
- Normality test: The type of normality test applied: Shapiro-Wilk for datasets with 5000 or fewer observations, or Kolmogorov-Smirnov for larger datasets (advanced mode).
- p-value (normality): The p-value from the normality test (advanced mode).
- Top categories: The most frequent categories for categorical variables.
- Top category proportion: The proportion of the top category for categorical variables (advanced mode).
- 95% CI (top category): The 95% confidence interval for the top category proportion (advanced mode).
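The size-based choice of normality test described in this list can be sketched with SciPy as follows. The use of `scipy.stats`, the helper name `normality_test`, and the standardization step before the Kolmogorov-Smirnov test are all assumptions for illustration; the package's internals may differ.

```python
import numpy as np
from scipy import stats


def normality_test(values, threshold=5000):
    """Return (test name, p-value), choosing the test by sample size.

    Hypothetical helper; the threshold follows the description above.
    """
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values before testing
    if len(x) <= threshold:
        return "Shapiro-Wilk", float(stats.shapiro(x).pvalue)
    # Assumption: compare standardized values with a standard normal.
    z = (x - x.mean()) / x.std(ddof=1)
    return "Kolmogorov-Smirnov", float(stats.kstest(z, "norm").pvalue)


small = [25, 30, 35, 40, None]
large = np.random.default_rng(0).normal(size=6000)
print(normality_test(small))
print(normality_test(large))
```

A small p-value suggests the data deviate from a normal distribution. With very small samples the test has little power, so a large p-value is weak evidence of normality.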
6. FAQ/Troubleshooting
Q1: The codebook function isn't working for my DataFrame with mixed data types. What should I do?
A: The codebook function automatically detects and converts columns with mixed data types to object (string) type. If you see a warning about mixed types, ensure your data is clean and consistently typed, or allow the function to handle it automatically.
Q2: Why does the function skip some columns?
A: The function may skip columns if they contain all missing values (NaN). The output will indicate if a column is entirely missing.
Q3: How can I adjust the number of decimal places for numerical results?
A: You can adjust the decimal precision by setting the decimal_places parameter when calling the codebook function:
codebook(df, advanced=True, decimal_places=2)
License
Released under the MIT License. For more details, see the LICENSE file in the repository.
Copyright (C) 2024 stata_codebook
Developed by: Mohsen Askar ceaser198511@gmail.com
Citation
If you use stata_codebook, please refer to this repository.