A pandas-based library that replicates Stata's tabulate functionality

pdtab: Pandas-based Tabulation Library

pdtab is a comprehensive Python library that replicates the functionality of Stata's tabulate command using pandas as the backend. This library provides powerful one-way, two-way, and summary tabulations with statistical tests and measures of association.

Overview

Stata's tabulate command is one of the most widely used tools for creating frequency tables and cross-tabulations in statistical analysis. pdtab brings this functionality to Python, offering:

Complete Stata compatibility: Replicates all major features of Stata's tabulate command
Statistical tests: Chi-square tests, Fisher's exact test, likelihood-ratio tests
Association measures: Cramér's V, Goodman and Kruskal's gamma, Kendall's τb
Flexible output: Console tables, HTML, and visualization options
Weighted analysis: Support for frequency, analytic, and importance weights
Missing value handling: Comprehensive options for dealing with missing data

Integration with Broader Ecosystem

pdtab is part of a comprehensive econometric and statistical analysis ecosystem:

PyStataR

The pdtab library will be integrated into PyStataR, a comprehensive Python package that bridges Stata and R functionality in Python. PyStataR aims to provide Stata users with familiar commands and workflows while leveraging Python's powerful data science ecosystem.

StasPAI

For users interested in AI-powered econometric analysis, StasPAI offers a related project focused on integrating statistical analysis with artificial intelligence methods. StasPAI provides advanced econometric modeling capabilities enhanced by machine learning approaches.

These projects together form a unified toolkit for modern econometric analysis, combining the best of Stata's user-friendly interface, R's statistical capabilities, and Python's machine learning ecosystem.

Installation

pip install pdtab

Or install from source:

git clone https://github.com/brycewang-stanford/pdtab.git
cd pdtab
pip install -e .

Requirements

Python 3.8+
pandas >= 1.0.0
numpy >= 1.18.0
scipy >= 1.4.0
matplotlib >= 3.0.0 (for plotting)
seaborn >= 0.11.0 (for enhanced plotting)

🎯 Design Philosophy

pdtab is designed as a pure Python library focused exclusively on providing Stata's tabulate functionality through a clean, programmatic API.

Key Design Decisions:

No Command-Line Interface: pdtab is intentionally designed as a library-only package to maintain simplicity and focus on the Python ecosystem
Jupyter-First Approach: Optimized for data science workflows in Jupyter notebooks and Python scripts
Programmatic Access: All functionality accessible through Python functions with comprehensive options
Integration Ready: Designed to integrate seamlessly with pandas, matplotlib, and the broader PyData ecosystem

This design ensures pdtab remains lightweight, maintainable, and perfectly suited for modern data science workflows.

Quick Start

Basic One-way Tabulation

import pandas as pd
import pdtab

# Create sample data
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'],
    'income': [35000, 45000, 75000, 40000, 55000, 80000]
})

# One-way frequency table
result = pdtab.tabulate('gender', data=data)
print(result)

gender      Freq    Percent    Cum
Male          3      50.00   50.00
Female        3      50.00  100.00
Total         6     100.00  100.00
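
To make the three columns concrete, here is a minimal plain-pandas sketch that rebuilds them by hand. It is not pdtab's implementation; it assumes only the standard pandas `value_counts` and `cumsum` methods and re-creates the gender column from the sample data.

```python
import pandas as pd

# Same gender column as the sample data above
gender = pd.Series(
    ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'], name='gender'
)

freq = gender.value_counts(sort=False)      # count per category
table = pd.DataFrame({
    'Freq': freq,
    'Percent': 100 * freq / freq.sum(),     # share of all non-missing rows
})
table['Cum'] = table['Percent'].cumsum()    # running total of Percent

print(table.round(2))
```

Comparing a hand-built table like this against `pdtab.tabulate` output can be a quick sanity check on small data.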

Two-way Cross-tabulation with Statistics

# Two-way table with chi-square test
result = pdtab.tabulate('gender', 'education', data=data, chi2=True, exact=True)
print(result)

Summary Tabulation

# Summary statistics by group
result = pdtab.tabulate('gender', data=data, summarize='income')
print(result)

Summary of income by gender

gender     Mean     Std. Dev.   Freq
Male     55000.0    20000.0      3
Female   55000.0    21794.5      3
Total    55000.0    18708.3      6
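
The same kind of summary can be cross-checked with a plain pandas `groupby`. The sketch below is independent of pdtab and assumes the sample standard deviation (pandas' default, `ddof=1`), which is also the convention Stata's `summarize` uses.

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'income': [35000, 45000, 75000, 40000, 55000, 80000],
})

# Per-group mean, sample standard deviation (ddof=1) and count
by_group = data.groupby('gender')['income'].agg(['mean', 'std', 'count'])

# The same three statistics over all rows, for the Total line
total = data['income'].agg(['mean', 'std', 'count'])

print(by_group)
print(total)
```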

Main Functions

`tabulate(varname1, varname2=None, data=None, **options)`

Main tabulation function supporting:

One-way options:

missing=True: Include missing values as a category
sort=True: Sort by frequency (descending)
plot=True: Create bar chart
nolabel=True: Show numeric codes instead of labels
generate='prefix': Create indicator variables

Two-way options:

chi2=True: Pearson's chi-square test
exact=True: Fisher's exact test
lrchi2=True: Likelihood-ratio chi-square
V=True: Cramér's V
gamma=True: Goodman and Kruskal's gamma
taub=True: Kendall's τb
row=True: Row percentages
column=True: Column percentages
cell=True: Cell percentages
expected=True: Expected frequencies
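
The percentage and expected-frequency options correspond to simple arithmetic on the table margins. The following standard-library sketch shows that arithmetic for a 2×2 table of counts; the variable names are illustrative only and are not part of pdtab's API.

```python
# Observed counts: rows are one variable, columns the other
table = [[30, 18], [38, 14]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# row=True: each cell as a percentage of its row total
row_pct = [[100 * x / rt for x in row] for row, rt in zip(table, row_totals)]

# column=True: each cell as a percentage of its column total
col_pct = [[100 * x / ct for x, ct in zip(row, col_totals)] for row in table]

# cell=True: each cell as a percentage of the grand total
cell_pct = [[100 * x / n for x in row] for row in table]

# expected=True: counts expected under independence
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

print(row_pct)
print(expected)
```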

Summary options:

summarize='variable': Variable to summarize
means=False: Suppress means
standard=False: Suppress standard deviations
freq=False: Suppress frequencies

`tab1(varlist, data=None, **options)`

Create one-way tables for multiple variables:

results = pdtab.tab1(['gender', 'education'], data=data)
for var, result in results.items():
    print(f"\n{var}:")
    print(result)

`tab2(varlist, data=None, **options)`

Create all possible two-way tables:

# Assumes data also has a 'region' column (the Quick Start sample does not)
results = pdtab.tab2(['gender', 'education', 'region'], data=data, chi2=True)
for (var1, var2), result in results.items():
    print(f"\n{var1} × {var2}:")
    print(result)
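
Assuming "all possible two-way tables" means each unordered pair of variables once (as in Stata's `tab2`), the set of tables for a given variable list can be enumerated with the standard library. This sketch only lists the pairs; it does not call pdtab.

```python
from itertools import combinations

varlist = ['gender', 'education', 'region']

# Every unordered pair of variables, in list order
pairs = list(combinations(varlist, 2))

for var1, var2 in pairs:
    print(f"{var1} × {var2}")

# k variables give k * (k - 1) / 2 tables
print(len(pairs))
```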

`tabi(table_data, **options)`

Immediate tabulation from supplied data:

# From string (Stata format)
result = pdtab.tabi("30 18 \\ 38 14", exact=True)

# From list
result = pdtab.tabi([[30, 18], [38, 14]], chi2=True)
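
The string form uses a backslash to separate rows, as in Stata. Note that `"\\"` in Python source is a single backslash character. The helper below, `parse_tabi_string`, is a hypothetical name used only for illustration (it is not part of pdtab); it sketches how such a string maps onto the list form.

```python
def parse_tabi_string(text):
    """Split a Stata-style 'a b \\ c d' string into rows of integer counts."""
    rows = [chunk.split() for chunk in text.split("\\")]
    # Skip empty chunks, e.g. from a trailing backslash
    return [[int(value) for value in row] for row in rows if row]


print(parse_tabi_string("30 18 \\ 38 14"))
```

Because `str.split()` ignores runs of whitespace and newlines, the same helper also handles a multi-line string with one table row per line.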

Visualization

Create plots directly from tabulation results:

# Bar chart for one-way table
result = pdtab.tabulate('gender', data=data, plot=True)

# Heatmap for two-way table  
result = pdtab.tabulate('gender', 'education', data=data)
fig = pdtab.viz.create_tabulation_plots(result, plot_type='heatmap')

Statistical Tests

Supported Tests

Pearson's Chi-square Test: Tests independence in contingency tables
Likelihood-ratio Chi-square: Alternative to Pearson's chi-square
Fisher's Exact Test: Exact test for small samples (especially 2×2 tables)
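
As a reference for what these three tests compute, here is a standard-library sketch using the textbook formulas. It is not pdtab's code: the function names are illustrative, the chi-square p-value shortcut holds only for 1 degree of freedom (a 2×2 table), and the Fisher test shown is the two-sided 2×2 version that sums all tables no more probable than the observed one.

```python
from math import comb, erfc, log, sqrt


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


def pearson_chi2(table):
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, exp)
               for o, e in zip(o_row, e_row))


def lr_chi2(table):
    exp = expected_counts(table)
    # Empty cells contribute zero to the likelihood-ratio statistic
    return 2 * sum(o * log(o / e)
                   for o_row, e_row in zip(table, exp)
                   for o, e in zip(o_row, e_row) if o > 0)


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))


table = [[30, 18], [38, 14]]
chi2 = pearson_chi2(table)
g2 = lr_chi2(table)
p_chi2 = erfc(sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
p_exact = fisher_exact_2x2(table)

print(f"Pearson chi2(1) = {chi2:.4f}, p = {p_chi2:.3f}")
print(f"LR chi2(1) = {g2:.4f}")
print(f"Fisher exact p = {p_exact:.3f}")
```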

Association Measures

Cramér's V: Measure of association (0-1 scale)
Goodman and Kruskal's Gamma: For ordinal variables (-1 to 1)
Kendall's τb: Rank correlation with tie correction (-1 to 1)
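
The three measures can be written directly from a table of counts. The sketch below uses the usual textbook definitions (the unsigned form of Cramér's V, and concordant/discordant pair counts for gamma and τb) and treats the row and column order as the ordinal order. The function name is illustrative; pdtab's own implementation may differ in detail.

```python
from math import sqrt


def association_measures(table):
    """Return (Cramér's V, gamma, tau-b) for a table of counts."""
    n_rows, n_cols = len(table), len(table[0])
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)

    # Pearson chi-square, needed for Cramér's V
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(n_rows) for j in range(n_cols))
    cramers_v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

    # Concordant (p) and discordant (q) pairs of observations
    p = q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                p += table[i][j] * sum(table[k][j + 1:])
                q += table[i][j] * sum(table[k][:j])
    gamma = (p - q) / (p + q)

    # tau-b removes pairs tied on the row or the column variable
    n0 = n * (n - 1) / 2
    tied_rows = sum(r * (r - 1) / 2 for r in rows)
    tied_cols = sum(c * (c - 1) / 2 for c in cols)
    tau_b = (p - q) / sqrt((n0 - tied_rows) * (n0 - tied_cols))
    return cramers_v, gamma, tau_b


v, gamma, tau_b = association_measures([[30, 18], [38, 14]])
print(f"V = {v:.4f}, gamma = {gamma:.4f}, tau-b = {tau_b:.4f}")
```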

Weighted Analysis

Support for different weight types:

# Frequency weights (assumes data has a 'freq_weight' column)
result = pdtab.tabulate('gender', data=data, weights='freq_weight')

# Analytic weights (assumes data has an 'analytic_weight' column)
result = pdtab.tabulate('gender', data=data, weights='analytic_weight')
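
For frequency weights, the usual interpretation is that each row stands for that many identical observations, so the weighted frequency of a category is the sum of its weights. The sketch below shows that interpretation in plain pandas with a made-up `freq_weight` column; how pdtab distinguishes weight types internally is not shown here.

```python
import pandas as pd

# Illustrative data: each row represents freq_weight identical observations
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'freq_weight': [2, 3, 1],
})

# Weighted frequency = sum of weights within each category
weighted_freq = df.groupby('gender')['freq_weight'].sum()
weighted_pct = 100 * weighted_freq / weighted_freq.sum()

print(weighted_freq)
print(weighted_pct)
```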

Missing Value Handling

Flexible options for missing data:

# Exclude missing values (default)
result = pdtab.tabulate('gender', data=data)

# Include missing as category
result = pdtab.tabulate('gender', data=data, missing=True)

# Subpopulation analysis (assumes 'analysis_sample' is a column in data)
result = pdtab.tabulate('gender', data=data, subpop='analysis_sample')
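
The difference between excluding and including missing values can be seen with plain pandas. This sketch uses `Series.value_counts` and its `dropna` parameter on a small made-up column; it illustrates the idea behind `missing=True` rather than pdtab's own code.

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', None, 'Male'], name='gender')

# Default: missing values are dropped before counting
without_missing = gender.value_counts()

# dropna=False keeps missing values as their own category
with_missing = gender.value_counts(dropna=False)

print(without_missing)
print(with_missing)
```

Percentages change as well, because the denominator is 3 in the first table and 4 in the second.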

Export Options

Export results in multiple formats:

result = pdtab.tabulate('gender', 'education', data=data)

# Export to dictionary
data_dict = result.to_dict()

# Export to HTML
html_table = result.to_html()

# Save plot
fig = pdtab.viz.create_tabulation_plots(result)
pdtab.viz.save_plot(fig, 'crosstab.png')

Advanced Examples

Complex Two-way Analysis

# Comprehensive two-way analysis
# (clinical_data is a DataFrame with 'treatment' and 'outcome' columns; it is not defined here)
result = pdtab.tabulate(
    'treatment', 'outcome', 
    data=clinical_data,
    chi2=True,           # Chi-square test
    exact=True,          # Fisher's exact test
    V=True,              # Cramér's V
    row=True,            # Row percentages
    expected=True,       # Expected frequencies
    missing=True         # Include missing values
)

print(result)
print(f"Chi-square: {result.statistics['chi2']['statistic']:.3f}")
print(f"p-value: {result.statistics['chi2']['p_value']:.3f}")
print(f"Cramér's V: {result.statistics['cramers_v']:.3f}")

Summary Analysis by Multiple Groups

# Income analysis by gender and education
result = pdtab.tabulate(
    'gender', 'education',
    data=data,
    summarize='income',
    means=True,
    standard=True,
    obs=True
)

Immediate Analysis of Published Data

# Analyze a 2×3 contingency table from literature
published_data = """
    45 55 60 \\
    30 40 35
"""

result = pdtab.tabi(published_data, chi2=True, exact=True, V=True)
print("Published data analysis:")
print(result)

Stata Comparison

pdtab aims for 100% compatibility with Stata's tabulate command:

Stata Command	pdtab Equivalent
`tabulate gender`	`pdtab.tabulate('gender', data=df)`
`tabulate gender education, chi2`	`pdtab.tabulate('gender', 'education', data=df, chi2=True)`
`tabulate gender, summarize(income)`	`pdtab.tabulate('gender', data=df, summarize='income')`
`tab1 gender education region`	`pdtab.tab1(['gender', 'education', 'region'], data=df)`
`tab2 gender education region`	`pdtab.tab2(['gender', 'education', 'region'], data=df)`
`tabi 30 18 \ 38 14, exact`	`pdtab.tabi("30 18 \\ 38 14", exact=True)`

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/brycewang-stanford/pdtab.git
cd pdtab
pip install -e ".[dev]"
pytest

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Stata Corporation for the original tabulate command design
Pandas Development Team for the excellent data manipulation library
SciPy Community for statistical computing tools

Related Projects

pdtab is part of a broader ecosystem of econometric and statistical tools:

PyStataR - Comprehensive Python package bridging Stata and R functionality (pdtab will be integrated into this project)
StasPAI - AI-powered econometric analysis toolkit combining statistical methods with machine learning

Support

Documentation: https://pdtab.readthedocs.io
Issues: GitHub Issues
Discussions: GitHub Discussions

pdtab - Bringing Stata's tabulation power to the Python ecosystem! 🐍

Release history

0.1.1 (this version): Aug 1, 2025
0.1.0: Aug 1, 2025

Download files

Source Distribution

pdtab-0.1.1.tar.gz (40.9 kB), uploaded Aug 1, 2025

Built Distribution

pdtab-0.1.1-py3-none-any.whl (31.0 kB), uploaded Aug 1, 2025, Python 3

File details

Details for the file pdtab-0.1.1.tar.gz.

File metadata

Download URL: pdtab-0.1.1.tar.gz
Upload date: Aug 1, 2025
Size: 40.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pdtab-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`0d0e37d549702b24250d636d80d6ac92b1bde4e97bbf2c46b18d039e4880a961`
MD5	`3a9d6f7cb6727462a04bce4ffb8c3f1d`
BLAKE2b-256	`34634c779b3cb184c6762335484150833250c12fe10deeb4c0ae7d0644faf767`


File details

Details for the file pdtab-0.1.1-py3-none-any.whl.

File metadata

Download URL: pdtab-0.1.1-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 31.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pdtab-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d6722e1ac72f161272fcc2fcc2dabb4a800c929c79f247ca05db453046ce453d`
MD5	`096045d1711483c11ebc7bb7093c3454`
BLAKE2b-256	`4cab2808c556328865494a3ec428201897f04bc298d67a9330d53d98056e30f2`
