Comprehensive Python package providing Stata-equivalent commands for pandas DataFrames

These details have not been verified by PyPI

Project links

Project description

PyStataR

The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python 🚀

Project Vision & Goals

PyStataR aims to recreate and significantly enhance the top 20 most frequently used Stata commands in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to expand and improve upon it, leveraging Python's ecosystem to create superior research tools.

Why This Project Matters

Bridge the Gap: Seamless transition from Stata to Python for researchers
Enhanced Functionality: Each command will be significantly expanded beyond Stata's original capabilities
Modern Research Tools: Built for today's data science and research needs
Community-Driven: Open source development with academic researchers in mind

Target Commands (20 Most Used in Academic Research)

✅ tabulate - Cross-tabulation and frequency analysis
✅ egen - Extended data generation and manipulation
✅ reghdfe - High-dimensional fixed effects regression
✅ winsor2 - Data winsorizing and trimming
🔄 Coming Soon: summarize, describe, merge, reshape, collapse, keep/drop, generate, replace, sort, by, if/in, reg, logit, probit, ivregress, xtreg

Want to see a specific command implemented?

Create an issue to request a command
Contribute to help us complete this project faster
⭐ Star this repo to show your support!

Core Modules Overview

tabulate - Advanced Cross-tabulation and Frequency Analysis

Beyond Stata: Enhanced statistical tests, multi-dimensional tables, and publication-ready output
Key Features: Chi-square tests, Fisher's exact test, Cramér's V, Kendall's tau, gamma coefficients
Use Cases: Survey analysis, categorical data exploration, market research

egen - Extended Data Generation and Manipulation

Beyond Stata: Advanced ranking algorithms, robust statistical functions, and vectorized operations
Key Features: Group operations, ranking with tie-breaking, row statistics, percentile calculations
Use Cases: Data preprocessing, feature engineering, panel data construction

reghdfe - High-Dimensional Fixed Effects Regression

Beyond Stata: Memory-efficient algorithms, advanced clustering options, and diagnostic tools
Key Features: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics
Use Cases: Panel data analysis, causal inference, economic research

winsor2 - Advanced Outlier Detection and Treatment

Beyond Stata: Multiple detection methods, group-specific treatment, and comprehensive diagnostics
Key Features: IQR-based detection, percentile methods, group-wise operations, flexible trimming
Use Cases: Data cleaning, outlier analysis, robust statistical modeling

Advanced Features & Performance

Performance Optimizations

Vectorized Operations: All functions leverage NumPy and pandas for maximum speed
Memory Efficiency: Optimized for large datasets common in academic research
Parallel Processing: Multi-core support for computationally intensive operations
Lazy Evaluation: Smart caching and delayed computation when beneficial

Research-Grade Features

Publication Ready: LaTeX and HTML output for academic papers
Reproducible Research: Comprehensive logging and version tracking
Missing Data Handling: Multiple imputation and robust missing value treatment
Bootstrapping: Built-in bootstrap methods for confidence intervals
Cross-Validation: Integrated CV methods for model validation

Quick Installation

pip install pystatar

Comprehensive Usage Examples

`tabulate` - Advanced Cross-tabulation

The tabulate module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.

Basic One-way Tabulation

import pandas as pd
```python
from pystatar import tabulate

# Create sample dataset
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
    'income': np.random.normal(50000, 15000, 600),
    'age': np.random.randint(22, 65, 600),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})

# Simple frequency table
result = tabulate.tabulate(df, 'education')
print(result)

Advanced Two-way Cross-tabulation with Statistics

# Two-way tabulation with comprehensive statistics
result = tabulate.tabulate(
    df, 
    'gender', 'education',
    chi2=True,           # Chi-square test
    exact=True,          # Fisher's exact test
    gamma=True,          # Gamma coefficient
    taub=True,           # Kendall's tau-b
    V=True,              # Cramér's V
    missing=True,        # Include missing values
    row=True,            # Row percentages
    col=True,            # Column percentages
    cell=True            # Cell percentages
)

# Access different components
print("Frequency Table:")
print(result.table)
print(f"\nChi-square p-value: {result.chi2_pvalue:.4f}")
print(f"Cramér's V: {result.cramers_v:.4f}")

Multi-way Tabulation

# Three-way tabulation with layering
result = tabulate.tabulate(
    df, 
    'gender', 'education',
    by='industry',       # Layer by industry
    chi2=True
)

# Access results by layer
for industry, table_result in result.by_results.items():
    print(f"\n=== {industry} ===")
    print(table_result.table)

`egen` - Extended Data Generation

The egen module provides powerful data manipulation functions that extend Stata's egen capabilities.

Ranking and Percentile Functions

from pystatar import egen

# Advanced ranking with tie-breaking options
df['income_rank'] = egen.rank(df['income'], method='average')  # Handle ties
df['income_pctile'] = egen.xtile(df['income'], nquantiles=10)  # Deciles

# Group-specific rankings
df['rank_within_industry'] = egen.rank(df['income'], by='industry')

# Percentile calculations
df['income_90th'] = egen.pctile(df['income'], 90)
df['income_iqr'] = egen.pctile(df['income'], 75) - egen.pctile(df['income'], 25)

Row Operations

# Create test scores dataset
scores_df = pd.DataFrame({
    'student': range(1, 101),
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100),
    'history': np.random.normal(82, 9, 100)
})

# Row statistics
scores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])
scores_df['max_score'] = egen.rowmax(scores_df, ['math', 'english', 'science', 'history'])
scores_df['score_sd'] = egen.rowsd(scores_df, ['math', 'english', 'science', 'history'])

# Count non-missing values per row
scores_df['subjects_taken'] = egen.rownonmiss(scores_df, ['math', 'english', 'science', 'history'])

Group Statistics and Operations

# Group summary statistics
df['mean_income_by_education'] = egen.mean(df['income'], by='education')
df['median_income_by_industry'] = egen.median(df['income'], by='industry')
df['sd_income_by_gender'] = egen.sd(df['income'], by='gender')

# Group identification and counting
df['education_group_size'] = egen.count(df, by='education')
df['first_in_group'] = egen.tag(df, ['education', 'gender'])  # First observation in group
df['group_sequence'] = egen.seq(df, by='education')          # Sequence within group

# Advanced group operations
df['income_rank_in_education'] = egen.rank(df['income'], by='education')
df['above_group_median'] = (df['income'] > egen.median(df['income'], by='education')).astype(int)

`reghdfe` - Advanced Fixed Effects Regression

The reghdfe module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.

Basic Fixed Effects Regression

from pystatar import reghdfe

# Create panel dataset
np.random.seed(42)
n_firms, n_years = 100, 10
n_obs = n_firms * n_years

panel_df = pd.DataFrame({
    'firm_id': np.repeat(range(n_firms), n_years),
    'year': np.tile(range(2010, 2020), n_firms),
    'log_sales': np.random.normal(10, 1, n_obs),
    'log_employment': np.random.normal(4, 0.5, n_obs),
    'log_capital': np.random.normal(8, 0.8, n_obs),
    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})

# Basic regression with firm and year fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year']
)

print(result.summary())
print(f"R-squared: {result.r2:.4f}")
print(f"Number of observations: {result.N}")

Advanced Regression with Clustering and Instruments

# Add instrumental variables
panel_df['instrument1'] = np.random.normal(0, 1, n_obs)
panel_df['instrument2'] = np.random.normal(0, 1, n_obs)

# Regression with clustering and multiple fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year', 'industry'],  # Multiple fixed effects
    cluster='firm_id',                        # Clustered standard errors
    weights='employment',                     # Weighted regression
    subset=(panel_df['year'] >= 2012)        # Conditional estimation
)

# Access detailed results
print("Coefficient Table:")
print(result.coef_table)
print(f"\nFixed Effects absorbed: {result.absorbed_fe}")
print(f"Clusters: {result.n_clusters}")

Instrumental Variables with High-Dimensional FE

# IV regression with fixed effects
iv_result = reghdfe.ivreghdfe(
    data=panel_df,
    depvar='log_sales',
    endogenous=['log_employment'],           # Endogenous variable
    instruments=['instrument1', 'instrument2'],  # Instruments
    exogenous=['log_capital'],               # Exogenous controls
    absorb=['firm_id', 'year'],
    cluster='firm_id'
)

print("First Stage Results:")
print(iv_result.first_stage)
print(f"\nWeak instruments test (F-stat): {iv_result.first_stage_fstat:.2f}")
print(f"Overidentification test (Hansen J): {iv_result.hansen_j:.4f}")

`winsor2` - Advanced Outlier Treatment

The winsor2 module provides comprehensive outlier detection and treatment methods.

Basic Winsorizing

from pystatar import winsor2

# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),  # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})

# Basic winsorizing at 1st and 99th percentiles
result = winsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")

Group-wise Winsorizing

# Winsorize within groups
result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    by='industry',          # Winsorize within each industry
    cuts=(5, 95),          # Use 5th and 95th percentiles
    suffix='_clean'        # Custom suffix
)

# Compare distributions by group
for industry in outlier_df['industry'].unique():
    mask = outlier_df['industry'] == industry
    original = outlier_df.loc[mask, 'income']
    winsorized = result.loc[mask, 'income_clean']
    print(f"\n{industry}:")
    print(f"  Original: {original.describe()}")
    print(f"  Winsorized: {winsorized.describe()}")

Trimming vs Winsorizing Comparison

# Compare different outlier treatment methods
trim_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=True,              # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)       # Trim 2.5% from each tail
)

winsor_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=False,             # Winsorize (cap) outliers
    cuts=(2.5, 97.5)
)

print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")

Advanced Outlier Detection

# Multiple variable winsorizing with custom thresholds
multi_result = winsor2.winsor2(
    outlier_df,
    ['income', 'age'],
    cuts=(1, 99),           # Different cuts for different variables
    by='industry',          # Group-specific treatment
    replace=True,           # Replace original variables
    label=True              # Add descriptive labels
)

# Generate outlier indicators
outlier_df['income_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'], 
    method='iqr',           # Use IQR method
    factor=1.5              # 1.5 * IQR threshold
)

outlier_df['extreme_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'],
    method='percentile',    # Use percentile method
    cuts=(1, 99)
)

print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")

Project Structure

pystatar/
├── __init__.py              # Main package initialization
├── tabulate/               # Cross-tabulation module
│   ├── __init__.py
│   ├── core.py
│   ├── results.py
│   └── stats.py
├── egen/                   # Extended generation module
│   ├── __init__.py
│   └── core.py
├── reghdfe/               # High-dimensional FE regression
│   ├── __init__.py
│   ├── core.py
│   ├── estimation.py
│   └── utils.py
├── winsor2/               # Winsorizing module
│   ├── __init__.py
│   ├── core.py
│   └── utils.py
├── utils/                 # Shared utilities
│   ├── __init__.py
│   └── common.py
└── tests/                 # Test suite
    ├── test_tabulate.py
    ├── test_egen.py
    ├── test_reghdfe.py
    └── test_winsor2.py

Key Features

Familiar Syntax: Stata-like command structure and parameters
Pandas Integration: Seamless integration with pandas DataFrames
High Performance: Optimized implementations using pandas and NumPy
Comprehensive Coverage: Most commonly used Stata commands
Statistical Rigor: Proper statistical tests and robust standard errors
Flexible Output: Multiple output formats and customization options
Missing Value Handling: Configurable treatment of missing data

Documentation

Each module comes with comprehensive documentation and examples:

tabulate Documentation - Cross-tabulation and frequency analysis
egen Documentation - Extended data generation functions
reghdfe Documentation - High-dimensional fixed effects regression
winsor2 Documentation - Data winsorizing and trimming

Contributing to the Project

We're building the future of academic research tools in Python! Here's how you can help:

Priority Commands Needed

Help us implement the remaining 16 high-priority commands:

Data Management: summarize, describe, merge, reshape, collapse, keep, drop, generate, replace, sort

Statistical Analysis: reg, logit, probit, ivregress, xtreg, anova

How to Contribute

Request a Command: Open an issue with the command you need
** Implement a Command**: Check our contribution guidelines and submit a PR
** Report Bugs**: Help us improve existing functionality
** Improve Documentation**: Add examples, tutorials, or clarifications
** Spread the Word**: Star the repo and share with fellow researchers

Recognition

All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.

Academic Collaboration

We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!

Community & Support

Documentation: https://pystatar.readthedocs.io
Discussions: GitHub Discussions
Issues: Bug Reports & Feature Requests
** Email**: brycew6m@stanford.edu for academic collaborations

Comparison with Stata

Feature	Stata	PyStataR	Advantage
Speed	Base performance	2-10x faster*	Vectorized operations
Memory	Limited by system	Efficient pandas backend	Better large dataset handling
Extensibility	Ado files	Python ecosystem	Unlimited customization
Cost	$$$$	Free & Open Source	Accessible to all researchers
Integration	Standalone	Python data science stack	Seamless workflow
Output	Limited formats	Multiple (LaTeX, HTML, etc.)	Publication ready

*Performance comparison based on typical academic datasets (1M+ observations)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This package builds upon the excellent work of:

pandas - The backbone of our data manipulation
numpy - Powering our numerical computations
scipy - Statistical functions and algorithms
statsmodels - Statistical modeling foundations
pyhdfe - High-dimensional fixed effects algorithms
The entire Stata community - For decades of statistical innovation that inspired this project

Future Roadmap

Version 1.0 Goals (Target: End of 2025)

Core 4 commands implemented
Additional 16 high-priority commands
Comprehensive test suite (>95% coverage)
Complete documentation with tutorials
Performance benchmarks vs Stata

Version 2.0 Vision (2026)

Machine learning integration
R integration for cross-platform compatibility
Web interface for non-programmers
Jupyter notebook extensions

📈 Project Statistics

Contact & Collaboration

Created by Bryce Wang - Stanford University

Email: brycew6m@stanford.edu
GitHub: @brycewang-stanford
Academic: Stanford Graduate School of Business
LinkedIn: Connect with me

Academic Partnerships Welcome!

Course integration and teaching materials
Research collaborations and citations
Institutional licensing and support
Student contributor programs

⭐ Love this project? Give it a star and help us reach more researchers! ⭐

Together, we're building the future of academic research in Python

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4.2

Apr 14, 2026

0.4.1

Apr 14, 2026

0.4.0

Aug 2, 2025

0.3.1

Aug 1, 2025

0.3.0

Aug 1, 2025

0.2.0

Aug 1, 2025

0.1.4

Jul 27, 2025

0.1.3

Jul 27, 2025

0.1.2

Jul 27, 2025

0.1.1

Jul 27, 2025

This version

0.1.0

Jul 27, 2025

0.0.0

Jul 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pystatar-0.1.0.tar.gz (51.0 kB view details)

Uploaded Jul 27, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pystatar-0.1.0-py3-none-any.whl (43.0 kB view details)

Uploaded Jul 27, 2025 Python 3

File details

Details for the file pystatar-0.1.0.tar.gz.

File metadata

Download URL: pystatar-0.1.0.tar.gz
Upload date: Jul 27, 2025
Size: 51.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystatar-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a15dc3967bf391472017168cb7b7fd2ef5f5db44df6741aa7c47be29b3a743ca`
MD5	`77d26d78422a75edba114182994c11de`
BLAKE2b-256	`d28b68982a2a4a47e67a9e5ef8df6854342d2dbc82ea4c1bbb5079e3758b3dd2`

See more details on using hashes here.

File details

Details for the file pystatar-0.1.0-py3-none-any.whl.

File metadata

Download URL: pystatar-0.1.0-py3-none-any.whl
Upload date: Jul 27, 2025
Size: 43.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystatar-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5850a4dc3b4ce7e6a7092cdfbd3e6f68ffdcac77c3b991ce3e8c0be1bae69013`
MD5	`663741f45ceda74eefafc00a1e795567`
BLAKE2b-256	`b5377c08afd97338867080bf32ef3d847ceafda4b40d484ec6470bb7e47507f2`

See more details on using hashes here.

pystatar 0.1.0

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

PyStataR

Project Vision & Goals

Why This Project Matters

Target Commands (20 Most Used in Academic Research)

Core Modules Overview

tabulate - Advanced Cross-tabulation and Frequency Analysis

egen - Extended Data Generation and Manipulation

reghdfe - High-Dimensional Fixed Effects Regression

winsor2 - Advanced Outlier Detection and Treatment

Advanced Features & Performance

Performance Optimizations

Research-Grade Features

Quick Installation

Comprehensive Usage Examples

tabulate - Advanced Cross-tabulation

Basic One-way Tabulation

Advanced Two-way Cross-tabulation with Statistics

Multi-way Tabulation

egen - Extended Data Generation

Ranking and Percentile Functions

Row Operations

Group Statistics and Operations

reghdfe - Advanced Fixed Effects Regression

Basic Fixed Effects Regression

Advanced Regression with Clustering and Instruments

Instrumental Variables with High-Dimensional FE

winsor2 - Advanced Outlier Treatment

Basic Winsorizing

Group-wise Winsorizing

Trimming vs Winsorizing Comparison

Advanced Outlier Detection

Project Structure

Key Features

Documentation

Contributing to the Project

Priority Commands Needed

How to Contribute

Recognition

Academic Collaboration

Community & Support

Comparison with Stata

📄 License

🙏 Acknowledgments

Future Roadmap

Version 1.0 Goals (Target: End of 2025)

Version 2.0 Vision (2026)

📈 Project Statistics

Contact & Collaboration

Academic Partnerships Welcome!

⭐ Love this project? Give it a star and help us reach more researchers! ⭐

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`tabulate` - Advanced Cross-tabulation

`egen` - Extended Data Generation

`reghdfe` - Advanced Fixed Effects Regression

`winsor2` - Advanced Outlier Treatment