Python implementation of Stata's tabulate command for pandas DataFrames

pandas-tabulate

Requires Python 3.7+. License: MIT.

pandas-tabulate brings Stata's tabulate command to Python. It builds one-way frequency tables and two-way cross-tabulations directly from pandas Series, with optional percentages and tests of independence.

Key Features

  • Comprehensive tabulation: One-way and two-way frequency tables
  • Statistical analysis: Chi-square tests, Fisher's exact tests, and other statistical measures
  • Flexible formatting: Multiple output formats and customization options
  • Missing value handling: Configurable treatment of missing data
  • Stata compatibility: Familiar syntax and output format for Stata users
  • Performance optimized: Efficient implementation using pandas and NumPy

Installation

pip install pandas-tabulate

Quick Start

import pandas as pd
import pandas_tabulate as ptab

# Create sample data
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'education': ['High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'Low'],
    'income': [50000, 30000, 60000, 45000, 35000, 25000, 55000, 28000]
})

# One-way tabulation
result = ptab.tabulate(df['gender'])
print(result)

# Two-way tabulation with statistics
result = ptab.tabulate(df['gender'], df['education'], 
                      chi2=True, exact=True)
print(result)

Available Functions

Core Tabulation Functions

  • tabulate(var1, var2=None, **kwargs) - Main tabulation function
  • oneway(variable, **kwargs) - One-way frequency tables
  • twoway(var1, var2, **kwargs) - Two-way cross-tabulation

Statistical Tests

  • Chi-square test - Test of independence for categorical variables
  • Fisher's exact test - Exact test for small sample sizes
  • Likelihood ratio test - Alternative test of independence
  • Cramér's V - Measure of association strength
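
To make these statistics concrete, here is a minimal pure-Python sketch of how the Pearson chi-square statistic, the likelihood-ratio statistic, and Cramér's V are usually defined. It is not pandas-tabulate's implementation: the helper names (expected_counts, pearson_chi2, likelihood_ratio, cramers_v) are hypothetical, and the 2×2 counts are the gender by education table from the Quick Start data.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson chi-square statistic (no continuity correction)."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def likelihood_ratio(table):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E)); empty cells add 0."""
    exp = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row) if o > 0)

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi2(table) / (n * k))

# Rows: gender F, M; columns: education High, Low (Quick Start data)
observed = [[1, 3], [3, 1]]

chi2_stat = pearson_chi2(observed)
g2_stat = likelihood_ratio(observed)
v = cramers_v(observed)
print(f"chi2 = {chi2_stat:.3f}, G2 = {g2_stat:.3f}, V = {v:.3f}")
```

Both test statistics are compared against a chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom; Cramér's V rescales the Pearson statistic to the range 0 to 1.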

Output Options

  • Frequencies - Raw counts
  • Percentages - Row, column, and total percentages
  • Cumulative - Cumulative frequencies and percentages
  • Missing handling - Include/exclude missing values

Detailed Examples

One-way Tabulation

import pandas as pd
import pandas_tabulate as ptab

# Basic frequency table
df = pd.DataFrame({'status': ['A', 'B', 'A', 'C', 'B', 'A', 'C']})
result = ptab.oneway(df['status'])
print(result)

# With percentages and cumulative statistics
result = ptab.oneway(df['status'], 
                    percent=True, 
                    cumulative=True)
print(result)
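
For reference, the quantities a one-way table with percentages and cumulative statistics normally contains can be reproduced with plain pandas. This is a sketch of the arithmetic only, not of ptab.oneway's internals or its output layout; the column names Freq, Percent, and Cum are assumptions borrowed from Stata's display.

```python
import pandas as pd

status = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C'], name='status')

# Raw counts per category, ordered by category label
freq = status.value_counts().sort_index()
total = freq.sum()

table = pd.DataFrame({
    'Freq': freq,
    'Percent': (freq / total * 100).round(2),
    # Cumulate the counts first, then convert, so rounding errors do not add up
    'Cum': (freq.cumsum() / total * 100).round(2),
})
print(table)
```

Cumulating the counts before converting to percentages guarantees that the last row reads exactly 100.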

Two-way Cross-tabulation

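# Sample data from Quick Start (the one-way example above reassigned df)
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'education': ['High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'Low']
})

# Basic cross-tabulation
result = ptab.twoway(df['gender'], df['education'])
print(result)

# With row and column percentages
result = ptab.twoway(df['gender'], df['education'],
                    row_percent=True,
                    col_percent=True)
print(result)

# With statistical tests
result = ptab.twoway(df['gender'], df['education'],
                    chi2=True,
                    exact=True,
                    cramers_v=True)
print(result)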
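
As a cross-check on what row and column percentages mean, the same cells can be computed with pandas.crosstab. This sketch uses only pandas and makes no claim about how ptab.twoway arranges its output; the margin label Total is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'education': ['High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'Low'],
})

# Cell counts with row and column totals
counts = pd.crosstab(df['gender'], df['education'],
                     margins=True, margins_name='Total')

# Row percentages: each row sums to 100
row_pct = pd.crosstab(df['gender'], df['education'], normalize='index') * 100

# Column percentages: each column sums to 100
col_pct = pd.crosstab(df['gender'], df['education'], normalize='columns') * 100

print(counts)
print(row_pct)
print(col_pct)
```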

Missing Value Handling

import numpy as np

# Data with missing values
df_missing = pd.DataFrame({
    'var1': ['A', 'B', np.nan, 'A', 'C'],
    'var2': ['X', np.nan, 'Y', 'X', 'Y']
})

# Exclude missing values (default)
result = ptab.twoway(df_missing['var1'], df_missing['var2'])

# Include missing values
result = ptab.twoway(df_missing['var1'], df_missing['var2'], 
                    missing=True)
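
The difference between the two settings can be illustrated with plain pandas: excluding missing values amounts to dropping incomplete pairs before counting, while including them treats "missing" as a category of its own. This is a sketch of the concept, not of how the missing option is implemented; the label (missing) is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'var1': ['A', 'B', np.nan, 'A', 'C'],
    'var2': ['X', np.nan, 'Y', 'X', 'Y'],
})

# Exclude: keep only rows where both variables are observed
complete = df_missing.dropna(subset=['var1', 'var2'])
excluded = pd.crosstab(complete['var1'], complete['var2'])

# Include: give missing values an explicit label so they get their own row/column
labelled = df_missing.fillna('(missing)')
included = pd.crosstab(labelled['var1'], labelled['var2'])

print(excluded)   # 3 observations
print(included)   # all 5 observations
```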

Stata to Python Translation Guide

Stata Command | pandas-tabulate Equivalent
tabulate var1 | ptab.oneway(df['var1'])
tabulate var1, missing | ptab.oneway(df['var1'], missing=True)
tabulate var1 var2 | ptab.twoway(df['var1'], df['var2'])
tabulate var1 var2, chi2 | ptab.twoway(df['var1'], df['var2'], chi2=True)
tabulate var1 var2, exact | ptab.twoway(df['var1'], df['var2'], exact=True)
tabulate var1 var2, row col | ptab.twoway(df['var1'], df['var2'], row_percent=True, col_percent=True)

Function Reference

tabulate(var1, var2=None, **kwargs)

Main tabulation function. It performs a one-way tabulation when only var1 is given and a two-way cross-tabulation when var2 is also supplied.

Parameters:

  • var1: pandas Series - First variable
  • var2: pandas Series, optional - Second variable for cross-tabulation
  • percent: bool, default False - Show percentages
  • cumulative: bool, default False - Show cumulative statistics
  • chi2: bool, default False - Perform chi-square test
  • exact: bool, default False - Perform Fisher's exact test
  • missing: bool, default False - Include missing values

Returns:

  • TabulationResult object with tables and statistics
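
The one-way versus two-way dispatch described above can be pictured roughly as follows. tabulate_sketch is a hypothetical stand-in written with plain pandas; it returns ordinary pandas objects rather than a TabulationResult and ignores all keyword options.

```python
import pandas as pd

def tabulate_sketch(var1, var2=None):
    """Hypothetical dispatcher: one-way counts if var2 is None, else a cross-tab."""
    if var2 is None:
        # One-way: frequency of each category, ordered by label
        return var1.value_counts().sort_index()
    # Two-way: contingency table of var1 (rows) by var2 (columns)
    return pd.crosstab(var1, var2)

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'education': ['High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'Low'],
})

one = tabulate_sketch(df['gender'])
two = tabulate_sketch(df['gender'], df['education'])
print(one)
print(two)
```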

Statistical Tests

All statistical tests return results with:

  • Test statistic
  • p-value
  • Degrees of freedom (where applicable)
  • Critical value
  • Interpretation
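
To show how these five pieces relate for a chi-square test, the sketch below derives them with SciPy for the Quick Start gender by education counts. The dictionary layout, the 0.05 significance level, and the wording of the interpretation are assumptions for illustration; the fields of the actual result object may be named and formatted differently.

```python
from scipy.stats import chi2, chi2_contingency

# Rows: gender F, M; columns: education High, Low
observed = [[1, 3], [3, 1]]

# correction=False gives the plain Pearson statistic (no Yates correction)
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05  # assumed significance level
critical = chi2.ppf(1 - alpha, dof)  # reject independence if stat > critical

summary = {
    'statistic': stat,
    'p_value': p_value,
    'df': dof,
    'critical_value': critical,
    'interpretation': (
        'reject independence' if stat > critical
        else 'fail to reject independence'
    ) + f' at alpha = {alpha}',
}
for name, value in summary.items():
    print(f'{name}: {value}')
```

Comparing the statistic with the critical value and comparing the p-value with alpha are equivalent decisions; reporting both is a convenience.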

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/brycewang-stanford/pandas-tabulate.git
cd pandas-tabulate
pip install -e ".[dev]"
python -m pytest tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by Stata's tabulate command
  • Built on pandas, NumPy, and SciPy
  • Thanks to the open-source community for feedback and contributions

Support


If this package helps your research, please consider starring the repository!
