A Python package for generating comprehensive data summaries and statistics, similar to Stata's codebook command.

Project description

stata_codebook Package

The stata_codebook package provides tools for generating detailed descriptive statistics and summaries of data frames, similar to Stata's codebook command, which is very useful for examining the variables in a dataset. According to the Stata documentation: "codebook examines the data in producing its results. For variables that codebook thinks are continuous, it presents the mean; the standard deviation; and the 10th, 25th, 50th, 75th, and 90th percentiles. For variables that it thinks are categorical, it presents a tabulation."

The package supports various features, including:

  • Summary statistics for numeric and categorical variables
  • Handling of columns with missing values
  • Detection of mixed data types
  • Normality testing with Shapiro-Wilk or Kolmogorov-Smirnov tests, depending on dataset size
  • Output formatting for academic or professional reports
  • Checks for embedded, leading, and trailing blanks in string variables
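
As an illustration of the last item, a blank check can be approximated with plain string comparisons. This is a minimal sketch, not the package's implementation: the helper name `blank_issues` and its returned labels are hypothetical, and "embedded" is assumed here to mean whitespace that remains inside a value once leading and trailing whitespace is stripped.

```python
def blank_issues(values):
    """Return a sorted list of blank problems found in an iterable of values.

    Hypothetical helper for illustration; non-string values (including None)
    are ignored.
    """
    issues = set()
    for value in values:
        if not isinstance(value, str):
            continue
        if value != value.lstrip():
            issues.add("leading")
        if value != value.rstrip():
            issues.add("trailing")
        # Assumption: "embedded" means whitespace remaining inside the value
        # after the outer whitespace has been stripped.
        if any(ch.isspace() for ch in value.strip()):
            issues.add("embedded")
    return sorted(issues)


print(blank_issues(["Male", " Female", "Non binary ", None]))
# ['embedded', 'leading', 'trailing']
```

Under this definition a legitimate multi-word value such as a city name is also reported as containing an embedded blank, so the result is a prompt for inspection rather than proof of an error.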

Why use stata_codebook over built-in summary statistics?

While pandas offers built-in functions like describe() and value_counts() for summarizing data, the codebook package provides several advantages:

  1. Comprehensive Overview

    • Numeric and Categorical Data: Unlike describe(), which primarily focuses on numeric data, codebook provides a detailed summary of both numeric and categorical variables. It not only gives you the common statistics like mean, median, and standard deviation but also includes the top categories and their proportions for categorical variables.

    • Handling of Missing Values: The codebook function provides a clear count of missing values for each variable, which is not directly offered by the describe() function.

  2. Data Quality Checks

    • Detection of Blanks: One of the unique features of the codebook function is its ability to detect embedded, leading, and trailing blanks in string data. This can be crucial for identifying and resolving data entry issues that might otherwise go unnoticed with standard summary statistics.

    • Mixed Data Types: If a column contains mixed data types, the function will automatically detect and handle it, issuing warnings to alert you to potential data quality problems.

  3. Advanced Statistical Insights

    • Normality Testing: The codebook function includes normality testing (Shapiro-Wilk for small datasets of fewer than 5000 observations, Kolmogorov-Smirnov for larger ones), providing p-values that help you assess the distribution of your numeric data. This goes beyond what the standard describe() function offers.

    • Confidence Intervals: In advanced mode, the function calculates 95% confidence intervals for both numeric variables and the proportions of the top categories in categorical variables, offering deeper insights into your data's variability.

  4. Customizable and Readable Output

    • Formatted Output: The codebook function rounds numerical results to a specified number of decimal places, ensuring that the output is easy to read and interpret. This is especially valuable for creating reports or presentations where clarity and professionalism are paramount.

    • Consistent Display: By returning a DataFrame with all relevant statistics neatly organized, codebook makes it easier to compare variables side by side, a task that is cumbersome when combining the output of several pandas functions.

  5. Easy to Use

    • Single Command: With just one command, you can generate a detailed and well-rounded summary of one column or the entire DataFrame, saving time and reducing the risk of overlooking important details.
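
To make the categorical side of point 1 concrete, the sketch below counts missing values, unique values, and top categories using only the standard library. It is not the package's code: `summarize_categorical` and its return keys are hypothetical names, and only `None` is treated as missing here.

```python
from collections import Counter


def summarize_categorical(values, top_n=3):
    """Summarize a categorical column given as a list (hypothetical helper)."""
    present = [v for v in values if v is not None]  # assumption: None = missing
    counts = Counter(present)
    top = counts.most_common(top_n)
    return {
        "unique": len(counts),
        "missing": len(values) - len(present),
        "top_categories": dict(top),
        # Share of non-missing values that fall in the most frequent category.
        "top_proportion": top[0][1] / len(present) if present else None,
    }


employed = [True, True, False, True, None]
print(summarize_categorical(employed))
# {'unique': 2, 'missing': 1, 'top_categories': {True: 3, False: 1}, 'top_proportion': 0.75}
```

With pandas alone, the same picture would typically require combining `isna().sum()`, `nunique()`, and `value_counts()` by hand.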

1. Installation

The package can be installed directly from PyPI using pip:

pip install stata_codebook

2. Quick Start

Here's a quick example to get you started:

import pandas as pd
from stata_codebook import codebook
# Sample DataFrame
data = {
    'age': [25, 30, 35, 40, None],
    'income': [50000, 60000, 70000, 80000, 90000],
    'gender': ['Male', 'Female', 'Female', 'Male', None],
    'is_employed': [True, True, False, True, None]
}
df = pd.DataFrame(data)
# codebook for all variables in the dataset
codebook(df)
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | - | - | - | - | NaN | NaN |
| 1 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | - | - | - | - | NaN | NaN |
| 2 | gender | object | 2 | 1 | No blanks detected | - | - | - | - | - | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | NaN | - | - | - | - |
| 3 | is_employed | object | 2 | 1 | No blanks detected | - | - | - | - | - | [False, True, True] | {True: 3, False: 1} | - | NaN | - | - | - | - |
# codebook for a specific column in the dataset
codebook(df, column='income') # numerical column
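| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | - | - | - | - |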
# codebook for a specific column in the dataset
codebook(df, column='gender') # categorical column
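| | Variable | Type | Unique values | Missing values | Blank issues | Examples | Top categories | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | SD | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | gender | object | 2 | 1 | No blanks detected | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | - | - | - | - | - | - | - | - | - |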
# codebook for the whole dataset with additional statistics
codebook(df, advanced=True)
| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) | Top category proportion | 95% CI (top category) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | 6.455 | (26.174, 38.826) | Shapiro-Wilk | 0.972 | NaN | NaN |
| 1 | income | int64 | 5 | 0 | Not applicable | (50000, 90000) | 60000.0 | 70000.0 | 80000.0 | 70000.0 | [70000, 50000, 60000] | - | 15811.388 | (56140.707, 83859.293) | Shapiro-Wilk | 0.967 | NaN | NaN |
| 2 | gender | object | 2 | 1 | No blanks detected | - | - | - | - | - | [Female, Male, Female] | {'Male': 2, 'Female': 2} | - | NaN | - | - | 0.50 | (0.01, 0.99) |
| 3 | is_employed | object | 2 | 1 | No blanks detected | - | - | - | - | - | [False, True, True] | {True: 3, False: 1} | - | NaN | - | - | 0.75 | (0.326, 1.174) |
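
The interval columns in this output can be checked by hand. The figures shown are consistent with a normal-approximation interval computed on the non-missing values: mean ± 1.96 × SD / √n for numeric columns and p ± 1.96 × √(p(1 − p)/n) for the top-category proportion. That formula is inferred from the numbers above rather than taken from the package source, so treat the sketch below as an assumption; `mean_ci` and `proportion_ci` are hypothetical helper names.

```python
import math
import statistics

Z_95 = 1.96  # assumed critical value for a 95% normal-approximation interval


def mean_ci(values, decimals=3):
    """Return (sd, (low, high)) for the non-missing values (hypothetical helper)."""
    data = [v for v in values if v is not None]
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    half_width = Z_95 * sd / math.sqrt(len(data))
    return round(sd, decimals), (round(mean - half_width, decimals),
                                 round(mean + half_width, decimals))


def proportion_ci(p, n, decimals=3):
    """Return the unclipped Wald interval for a proportion (hypothetical helper)."""
    half_width = Z_95 * math.sqrt(p * (1 - p) / n)
    return round(p - half_width, decimals), round(p + half_width, decimals)


print(mean_ci([25, 30, 35, 40, None]))  # (6.455, (26.174, 38.826))
print(proportion_ci(0.75, 4))           # (0.326, 1.174)
```

An interval of this form is not clipped to the range 0 to 1, which is why the upper bound for is_employed exceeds 1 with only four non-missing observations.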

3. Detailed Function Documentation

Function: codebook

Generates a detailed codebook for a DataFrame, or for a single column of it, providing descriptive statistics and data quality checks.

Parameters:

  • df (pandas.DataFrame): The DataFrame to analyze.
  • column (str, optional): If specified, only this column will be analyzed. Defaults to None.
  • advanced (bool, optional): If True, includes additional statistics like standard deviation, confidence intervals, and normality tests. Defaults to False.
  • decimal_places (int, optional): The number of decimal places to round numerical results. Defaults to 3.

Returns:

  • pandas.DataFrame: A DataFrame containing the codebook with descriptive statistics and data quality checks.

Example Usage:

# Generate an advanced codebook for a specific column
codebook(df, column='age', advanced=True, decimal_places=2)
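| | Variable | Type | Unique values | Missing values | Blank issues | Range | 25th percentile | 50th percentile (Median) | 75th percentile | Mean | Examples | Top categories | SD | 95% CI | Normality test | p-value (normality) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | age | float64 | 4 | 1 | Not applicable | (25.0, 40.0) | 28.75 | 32.5 | 36.25 | 32.5 | [35.0, 25.0, 30.0] | - | 6.45 | (26.18, 38.82) | Shapiro-Wilk | 0.97 |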

4. Notes

If a column contains all missing values, the function will skip detailed analysis for that column and indicate that it is entirely missing. The function automatically handles mixed data types by converting the column to an object type and issuing a warning.
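
The two checks described in this note can be sketched without pandas. This is an illustration under assumptions, not the package's code: `inspect_column` is a hypothetical helper, only `None` counts as missing, and a column is called mixed when its non-missing values have more than one Python type.

```python
import warnings


def inspect_column(name, values):
    """Classify a column as 'all missing', 'mixed', or 'ok' (hypothetical helper)."""
    present = [v for v in values if v is not None]  # assumption: None = missing
    if not present:
        return "all missing"
    type_names = sorted({type(v).__name__ for v in present})
    if len(type_names) > 1:
        # Warn, as the documentation says the package does; this sketch does
        # not convert the values to another type.
        warnings.warn(f"Column '{name}' has mixed types: {type_names}")
        return "mixed"
    return "ok"


print(inspect_column("notes", [None, None]))   # all missing
print(inspect_column("age", [25, 30, None]))   # ok
print(inspect_column("code", [1, "A2", 3.5]))  # mixed (also emits a UserWarning)
```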

5. Output Explanation

  • Variable: The name of the variable.
  • Type: The data type of the variable.
  • Unique values: The number of unique non-null values.
  • Missing values: The number of missing (null) values.
  • Blank issues: Any detected issues with leading, trailing, or embedded blanks in string variables.
  • Range: The minimum and maximum values for numeric variables.
  • 25th, 50th, 75th percentile: The respective percentiles for numeric variables.
  • Mean: The mean of numeric variables.
  • SD: The standard deviation for numeric variables (advanced mode).
  • 95% CI: The 95% confidence interval for numeric variables (advanced mode).
  • Normality test: The normality test applied (advanced mode), either Shapiro-Wilk for datasets with 5000 or fewer observations or Kolmogorov-Smirnov for larger datasets.
  • p-value (normality): The p-value from the normality test (advanced mode).
  • Top categories: The most frequent categories for categorical variables.
  • Top category proportion: The proportion of the top category for categorical variables (advanced mode).
  • 95% CI (top category): The 95% confidence interval for the top category proportion (advanced mode).
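
For numeric columns, the Range, percentile, and Mean entries can be reproduced with the standard library. The Quick Start values match linear interpolation between data points (the `inclusive` method of `statistics.quantiles`); that the package computes them this way is an assumption, and `numeric_summary` is a hypothetical helper.

```python
import statistics


def numeric_summary(values):
    """Return range, quartiles, and mean of the non-missing values."""
    data = sorted(v for v in values if v is not None)  # assumption: None = missing
    # method="inclusive" interpolates linearly between the observed points.
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": (data[0], data[-1]),
        "p25": q25,
        "p50": q50,
        "p75": q75,
        "mean": statistics.mean(data),
    }


print(numeric_summary([25, 30, 35, 40, None]))
# {'range': (25, 40), 'p25': 28.75, 'p50': 32.5, 'p75': 36.25, 'mean': 32.5}
```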

6. FAQ/Troubleshooting

Q1: The codebook function isn't working for my DataFrame with mixed data types. What should I do?

A: The codebook function automatically detects and converts columns with mixed data types to object (string) type. If you see a warning about mixed types, ensure your data is clean and consistently typed, or allow the function to handle it automatically.

Q2: Why does the function skip some columns?

A: The function may skip columns if they contain all missing values (NaN). The output will indicate if a column is entirely missing.

Q3: How can I adjust the number of decimal places for numerical results?

A: You can adjust the decimal precision by setting the decimal_places parameter when calling the codebook function:

codebook(df, advanced=True, decimal_places=2)

License

Released under the MIT License. For more details, see the LICENSE file in the repository. Copyright (C) 2024 stata_codebook

Developed by: Mohsen Askar ceaser198511@gmail.com

Citation

If you use stata_codebook, please refer to this repository.
