Skip to main content

Comprehensive Python package providing Stata-equivalent commands for pandas DataFrames

Project description

PyStataR

Python Version PyPI Version License Downloads

The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python ๐Ÿš€

๐Ÿšจ IMPORTANT: Version 0.1.0+ Import Changes

PyStataR v0.1.0+ introduces simplified import syntax for better usability:

# โœ… NEW (v0.1.0+) - Direct function imports
from pystatar import tabulate, reghdfe, winsor2
from pystatar import rank, rowmean  # Individual functions

# Use directly
result = tabulate(df['education'])
regression = reghdfe(data, 'y', ['x1', 'x2'])
# โŒ OLD (v0.0.x) - Module-style imports (deprecated)
from pystatar import tabulate
result = tabulate.tabulate(df, 'education')  # No longer works

Migration Guide: Update your import statements to use the new direct import syntax. All examples below use the v0.1.0+ syntax.

Project Vision & Goals

PyStataR aims to recreate and significantly enhance the top 20 most frequently used Stata commands in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to expand and improve upon it, leveraging Python's ecosystem to create superior research tools.

Why This Project Matters

  • Bridge the Gap: Seamless transition from Stata to Python for researchers
  • Enhanced Functionality: Each command will be significantly expanded beyond Stata's original capabilities
  • Modern Research Tools: Built for today's data science and research needs
  • Community-Driven: Open source development with academic researchers in mind

Target Commands (20 Most Used in Academic Research)

โœ… tabulate - Cross-tabulation and frequency analysis
โœ… egen - Extended data generation and manipulation
โœ… reghdfe - High-dimensional fixed effects regression
โœ… winsor2 - Data winsorizing and trimming
๐Ÿ”„ Coming Soon: summarize, describe, merge, reshape, collapse, keep/drop, generate, replace, sort, by, if/in, reg, logit, probit, ivregress, xtreg

Want to see a specific command implemented?

  • Create an issue to request a command
  • Contribute to help us complete this project faster
  • โญ Star this repo to show your support!

Core Modules Overview

tabulate - Advanced Cross-tabulation and Frequency Analysis

  • Beyond Stata: Enhanced statistical tests, multi-dimensional tables, and publication-ready output
  • Key Features: Chi-square tests, Fisher's exact test, Cramรฉr's V, Kendall's tau, gamma coefficients
  • Use Cases: Survey analysis, categorical data exploration, market research

egen - Extended Data Generation and Manipulation

  • Beyond Stata: Advanced ranking algorithms, robust statistical functions, and vectorized operations
  • Key Features: Group operations, ranking with tie-breaking, row statistics, percentile calculations
  • Use Cases: Data preprocessing, feature engineering, panel data construction

reghdfe - High-Dimensional Fixed Effects Regression

  • Beyond Stata: Memory-efficient algorithms, advanced clustering options, and diagnostic tools
  • Key Features: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics
  • Use Cases: Panel data analysis, causal inference, economic research

winsor2 - Advanced Outlier Detection and Treatment

  • Beyond Stata: Multiple detection methods, group-specific treatment, and comprehensive diagnostics
  • Key Features: IQR-based detection, percentile methods, group-wise operations, flexible trimming
  • Use Cases: Data cleaning, outlier analysis, robust statistical modeling

Advanced Features & Performance

Performance Optimizations

  • Vectorized Operations: All functions leverage NumPy and pandas for maximum speed
  • Memory Efficiency: Optimized for large datasets common in academic research
  • Parallel Processing: Multi-core support for computationally intensive operations
  • Lazy Evaluation: Smart caching and delayed computation when beneficial

Research-Grade Features

  • Publication Ready: LaTeX and HTML output for academic papers
  • Reproducible Research: Comprehensive logging and version tracking
  • Missing Data Handling: Multiple imputation and robust missing value treatment
  • Bootstrapping: Built-in bootstrap methods for confidence intervals
  • Cross-Validation: Integrated CV methods for model validation

Quick Installation

pip install pystatar

Comprehensive Usage Examples

tabulate - Advanced Cross-tabulation

The tabulate module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.

Basic One-way Tabulation

import pandas as pd
import numpy as np
# v0.1.0+ Import Syntax - Direct function imports
from pystatar import tabulate

# Create sample dataset
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
    'income': np.random.normal(50000, 15000, 600),
    'age': np.random.randint(22, 65, 600),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})

# Simple frequency table - Direct function call (v0.1.0+)
result = tabulate(df['education'])
print(result)

Advanced Two-way Cross-tabulation with Statistics

# Two-way tabulation with comprehensive statistics (v0.1.0+)
result = tabulate(
    df['gender'], df['education'],
    chi2=True,              # Chi-square test
    exact=True,             # Fisher's exact test
    cramers_v=True,         # Cramรฉr's V
    missing=True,           # Include missing values
    row_percent=True,       # Row percentages
    col_percent=True,       # Column percentages
    cell_percent=True       # Cell percentages
)

# Access different components
print("Frequency Table:")
print(result.table)
print(f"\nChi-square p-value: {result.chi2_pvalue:.4f}")
print(f"Cramรฉr's V: {result.cramers_v:.4f}")

Multi-way Tabulation

# Note: Three-way tabulation is planned for future versions
# For now, you can create separate two-way tables for each industry

# Get unique industries
industries = df['industry'].unique()

print("=== Cross-tabulation by Industry ===")
for industry in industries:
    # Filter data for this industry
    industry_df = df[df['industry'] == industry]
    
    # Create two-way table for this subset
    if len(industry_df) > 0:
        result = tabulate(
            industry_df['gender'], 
            industry_df['education'],
            chi2=True
        )
        print(f"\n=== {industry} ===")
        print(result.table)
        if hasattr(result, 'chi2_pvalue'):
            print(f"Chi-square p-value: {result.chi2_pvalue:.4f}")

egen - Extended Data Generation

The egen module provides powerful data manipulation functions that extend Stata's egen capabilities.

Ranking and Percentile Functions

# v0.1.0+ Import Syntax - Direct function imports
from pystatar import rank

# Advanced ranking with tie-breaking options
df['income_rank'] = rank(df['income'], method='average')  # Handle ties

# Group-specific rankings (Note: group-by functionality planned for future release)
# For now, you can use pandas groupby with rank
df['rank_within_industry'] = df.groupby('industry')['income'].rank(method='average')

# Basic percentile calculations using pandas
df['income_90th'] = df['income'].quantile(0.9)
df['income_iqr'] = df['income'].quantile(0.75) - df['income'].quantile(0.25)

Row Operations

# v0.1.0+ Import Syntax - Direct function imports
from pystatar import rowtotal, rowmean, rowmin, rowmax, rowsd, rowcount

# Create test scores dataset
scores_df = pd.DataFrame({
    'student': range(1, 101),
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100),
    'history': np.random.normal(82, 9, 100)
})

# Row statistics (v0.1.0+)
scores_df['total_score'] = rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = rowmin(scores_df, ['math', 'english', 'science', 'history'])
scores_df['max_score'] = rowmax(scores_df, ['math', 'english', 'science', 'history'])
scores_df['score_sd'] = rowsd(scores_df, ['math', 'english', 'science', 'history'])

# Count non-missing values per row
scores_df['subjects_taken'] = rowcount(scores_df, ['math', 'english', 'science', 'history'])

Group Statistics and Operations

# v0.1.0+ Import Syntax - Group functions
from pystatar import mean, sd, count, tag

# Group summary statistics
df['mean_income_by_education'] = mean(df['income'], by=df['education'])
df['sd_income_by_gender'] = sd(df['income'], by=df['gender'])

# Group identification and counting
df['education_group_size'] = count(df['education'])
df['first_in_group'] = tag(df, ['education', 'gender'])  # First observation in group

# Advanced group operations using pandas (median not yet implemented in pystatar)
df['median_income_by_industry'] = df.groupby('industry')['income'].transform('median')
df['group_sequence'] = df.groupby('education').cumcount() + 1  # Sequence within group

# Advanced group operations
df['income_rank_in_education'] = df.groupby('education')['income'].rank(method='average')
df['above_group_median'] = (df['income'] > df.groupby('education')['income'].transform('median')).astype(int)

reghdfe - Advanced Fixed Effects Regression

The reghdfe module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.

Basic Fixed Effects Regression

# v0.1.0+ Import Syntax
from pystatar import reghdfe

# Create panel dataset
np.random.seed(42)
n_firms, n_years = 100, 10
n_obs = n_firms * n_years

panel_df = pd.DataFrame({
    'firm_id': np.repeat(range(n_firms), n_years),
    'year': np.tile(range(2010, 2020), n_firms),
    'log_sales': np.random.normal(10, 1, n_obs),
    'log_employment': np.random.normal(4, 0.5, n_obs),
    'log_capital': np.random.normal(8, 0.8, n_obs),
    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})

# Basic regression with firm and year fixed effects (v0.1.0+)
result = reghdfe(
    data=panel_df,
    y='log_sales',
    x=['log_employment', 'log_capital'],
    fe=['firm_id', 'year']
)

print(result.summary())
print(f"R-squared: {result.r2:.4f}")
print(f"Number of observations: {result.N}")

Advanced Regression with Clustering and Instruments

# Add instrumental variables
panel_df['instrument1'] = np.random.normal(0, 1, n_obs)
panel_df['instrument2'] = np.random.normal(0, 1, n_obs)

# Regression with clustering and multiple fixed effects (v0.1.0+)
result = reghdfe(
    data=panel_df,
    y='log_sales',
    x=['log_employment', 'log_capital'],
    fe=['firm_id', 'year', 'industry'],  # Multiple fixed effects
    cluster='firm_id',                    # Clustered standard errors
    weights='log_employment',             # Weighted regression (using existing variable)
)

# Access detailed results
print("Coefficient Table:")
print(result.coef_table)
print(f"\nFixed Effects absorbed: {result.absorbed_fe}")
print(f"Clusters: {result.n_clusters}")

Instrumental Variables with High-Dimensional FE

# Note: IV functionality is planned for future versions
# For now, reghdfe provides standard fixed effects regression
# IV estimation can be performed using statsmodels or other packages

result = reghdfe(
    data=panel_df,
    y='log_sales',
    x=['log_capital'],                   # Exogenous controls only for now
    fe=['firm_id', 'year'],
    cluster='firm_id'
)

print("Standard Fixed Effects Results:")
print(result.summary())
print(f"R-squared: {result.r2:.4f}")

winsor2 - Advanced Outlier Treatment

The winsor2 module provides comprehensive outlier detection and treatment methods.

Basic Winsorizing

# v0.1.0+ Import Syntax
from pystatar import winsor2

# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),  # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})

# Basic winsorizing at 1st and 99th percentiles (v0.1.0+)
result = winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")

Group-wise Winsorizing

# Winsorize within groups (v0.1.0+)
result = winsor2(
    outlier_df, 
    ['income'],
    by=outlier_df['industry'], # Winsorize within each industry
    cuts=(5, 95),              # Use 5th and 95th percentiles
    suffix='_clean'            # Custom suffix
)

# Compare distributions by group
for industry in outlier_df['industry'].unique():
    mask = outlier_df['industry'] == industry
    original = outlier_df.loc[mask, 'income']
    winsorized = result.loc[mask, 'income_clean']
    print(f"\n{industry}:")
    print(f"  Original: {original.describe()}")
    print(f"  Winsorized: {winsorized.describe()}")

Trimming vs Winsorizing Comparison

# Compare different outlier treatment methods (v0.1.0+)
trim_result = winsor2(
    outlier_df, 
    ['income'],
    trim=True,              # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)       # Trim 2.5% from each tail
)

winsor_result = winsor2(
    outlier_df, 
    ['income'],
    trim=False,             # Winsorize (cap) outliers
    cuts=(2.5, 97.5)
)

print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")

Advanced Outlier Detection

# Multiple variable winsorizing with custom thresholds (v0.1.0+)
multi_result = winsor2(
    outlier_df,
    ['income', 'age'],
    cuts=(1, 99),              # Different cuts for different variables
    by=outlier_df['industry'], # Group-specific treatment
    replace=True,              # Replace original variables
    label=True                 # Add descriptive labels
)

# Generate outlier indicators using pandas and numpy (outlier_indicator planned for future release)
import numpy as np

# IQR method for outlier detection
Q1 = outlier_df['income'].quantile(0.25)
Q3 = outlier_df['income'].quantile(0.75)
IQR = Q3 - Q1
outlier_df['income_outlier'] = ((outlier_df['income'] < (Q1 - 1.5 * IQR)) | 
                                (outlier_df['income'] > (Q3 + 1.5 * IQR))).astype(int)

# Percentile method for outlier detection
p1 = outlier_df['income'].quantile(0.01)
p99 = outlier_df['income'].quantile(0.99)
outlier_df['extreme_outlier'] = ((outlier_df['income'] < p1) | 
                                 (outlier_df['income'] > p99)).astype(int)

print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")

Project Structure

pystatar/
โ”œโ”€โ”€ __init__.py              # Main package initialization
โ”œโ”€โ”€ tabulate/               # Cross-tabulation module
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ core.py
โ”‚   โ”œโ”€โ”€ results.py
โ”‚   โ””โ”€โ”€ stats.py
โ”œโ”€โ”€ egen/                   # Extended generation module
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ””โ”€โ”€ core.py
โ”œโ”€โ”€ reghdfe/               # High-dimensional FE regression
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ core.py
โ”‚   โ”œโ”€โ”€ estimation.py
โ”‚   โ””โ”€โ”€ utils.py
โ”œโ”€โ”€ winsor2/               # Winsorizing module
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ core.py
โ”‚   โ””โ”€โ”€ utils.py
โ”œโ”€โ”€ utils/                 # Shared utilities
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ””โ”€โ”€ common.py
โ””โ”€โ”€ tests/                 # Test suite
    โ”œโ”€โ”€ test_tabulate.py
    โ”œโ”€โ”€ test_egen.py
    โ”œโ”€โ”€ test_reghdfe.py
    โ””โ”€โ”€ test_winsor2.py

Key Features

  • Familiar Syntax: Stata-like command structure and parameters
  • Pandas Integration: Seamless integration with pandas DataFrames
  • High Performance: Optimized implementations using pandas and NumPy
  • Comprehensive Coverage: Most commonly used Stata commands
  • Statistical Rigor: Proper statistical tests and robust standard errors
  • Flexible Output: Multiple output formats and customization options
  • Missing Value Handling: Configurable treatment of missing data

Documentation

Each module comes with comprehensive documentation and examples:

Contributing to the Project

We're building the future of academic research tools in Python! Here's how you can help:

Priority Commands Needed

Help us implement the remaining 16 high-priority commands:

Data Management: summarize, describe, merge, reshape, collapse, keep, drop, generate, replace, sort

Statistical Analysis: reg, logit, probit, ivregress, xtreg, anova

How to Contribute

  1. Request a Command: Open an issue with the command you need
  2. Implement a Command: Check our contribution guidelines and submit a PR
  3. Report Bugs: Help us improve existing functionality
  4. Improve Documentation: Add examples, tutorials, or clarifications
  5. Spread the Word: Star the repo and share with fellow researchers

Recognition

All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.

Academic Collaboration

We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!

Community & Support

Comparison with Stata

Feature Stata PyStataR Advantage
Speed Base performance 2-10x faster* Vectorized operations
Memory Limited by system Efficient pandas backend Better large dataset handling
Extensibility Ado files Python ecosystem Unlimited customization
Cost $$$$ Free & Open Source Accessible to all researchers
Integration Standalone Python data science stack Seamless workflow
Output Limited formats Multiple (LaTeX, HTML, etc.) Publication ready

*Performance comparison based on typical academic datasets (1M+ observations)

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

This package builds upon the excellent work of:

  • pandas - The backbone of our data manipulation
  • numpy - Powering our numerical computations
  • scipy - Statistical functions and algorithms
  • statsmodels - Statistical modeling foundations
  • pyhdfe - High-dimensional fixed effects algorithms
  • The entire Stata community - For decades of statistical innovation that inspired this project

Future Roadmap

Version 1.0 Goals (Target: End of 2025)

  • Core 4 commands implemented
  • Additional 16 high-priority commands
  • Comprehensive test suite (>95% coverage)
  • Complete documentation with tutorials
  • Performance benchmarks vs Stata

Version 2.0 Vision (2026)

  • Machine learning integration
  • R integration for cross-platform compatibility
  • Web interface for non-programmers
  • Jupyter notebook extensions

๐Ÿ“ˆ Project Statistics

GitHub stars GitHub forks GitHub issues GitHub pull requests

Contact & Collaboration

Created by Bryce Wang - Stanford University

Academic Partnerships Welcome!

  • Course integration and teaching materials
  • Research collaborations and citations
  • Institutional licensing and support
  • Student contributor programs

โญ Love this project? Give it a star and help us reach more researchers! โญ

Together, we're building the future of academic research in Python

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pystatar-0.1.3.tar.gz (52.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pystatar-0.1.3-py3-none-any.whl (43.5 kB view details)

Uploaded Python 3

File details

Details for the file pystatar-0.1.3.tar.gz.

File metadata

  • Download URL: pystatar-0.1.3.tar.gz
  • Upload date:
  • Size: 52.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystatar-0.1.3.tar.gz
Algorithm Hash digest
SHA256 6a9ff19213c16c84cf2748aa90011eef6d672393da42011e8c84f61d261b0e8b
MD5 562f97fb74ff19d5e38165ba8c378e63
BLAKE2b-256 c582334cb975a0ddc30f2ac73196254c686d0c143839c254ce1c6b6e89080cf8

See more details on using hashes here.

File details

Details for the file pystatar-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: pystatar-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 43.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystatar-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 848880a5017d9806f52b27587e87b20a469f71451a5ecff6e204f8917c336b2e
MD5 e3d9756aae7d71f4ed34a32354189d9b
BLAKE2b-256 678db861fa885bd9a428121e2e4a76f4aaa128e015a9700afc33ccac4c2c6c2c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page