Comprehensive Python package providing Stata-equivalent commands for pandas DataFrames
Project description
PyStataR
The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python ๐
Project Vision & Goals
PyStataR aims to recreate and significantly enhance the top 20 most frequently used Stata commands in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to expand and improve upon it, leveraging Python's ecosystem to create superior research tools.
Why This Project Matters
- Bridge the Gap: Seamless transition from Stata to Python for researchers
- Enhanced Functionality: Each command will be significantly expanded beyond Stata's original capabilities
- Modern Research Tools: Built for today's data science and research needs
- Community-Driven: Open source development with academic researchers in mind
Target Commands (20 Most Used in Academic Research)
โ
tabulate - Cross-tabulation and frequency analysis
โ
egen - Extended data generation and manipulation
โ
reghdfe - High-dimensional fixed effects regression
โ
winsor2 - Data winsorizing and trimming
๐ Coming Soon: summarize, describe, merge, reshape, collapse, keep/drop, generate, replace, sort, by, if/in, reg, logit, probit, ivregress, xtreg
Want to see a specific command implemented?
- Create an issue to request a command
- Contribute to help us complete this project faster
- โญ Star this repo to show your support!
Core Modules Overview
tabulate - Advanced Cross-tabulation and Frequency Analysis
- Beyond Stata: Enhanced statistical tests, multi-dimensional tables, and publication-ready output
- Key Features: Chi-square tests, Fisher's exact test, Cramรฉr's V, Kendall's tau, gamma coefficients
- Use Cases: Survey analysis, categorical data exploration, market research
egen - Extended Data Generation and Manipulation
- Beyond Stata: Advanced ranking algorithms, robust statistical functions, and vectorized operations
- Key Features: Group operations, ranking with tie-breaking, row statistics, percentile calculations
- Use Cases: Data preprocessing, feature engineering, panel data construction
reghdfe - High-Dimensional Fixed Effects Regression
- Beyond Stata: Memory-efficient algorithms, advanced clustering options, and diagnostic tools
- Key Features: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics
- Use Cases: Panel data analysis, causal inference, economic research
winsor2 - Advanced Outlier Detection and Treatment
- Beyond Stata: Multiple detection methods, group-specific treatment, and comprehensive diagnostics
- Key Features: IQR-based detection, percentile methods, group-wise operations, flexible trimming
- Use Cases: Data cleaning, outlier analysis, robust statistical modeling
Advanced Features & Performance
Performance Optimizations
- Vectorized Operations: All functions leverage NumPy and pandas for maximum speed
- Memory Efficiency: Optimized for large datasets common in academic research
- Parallel Processing: Multi-core support for computationally intensive operations
- Lazy Evaluation: Smart caching and delayed computation when beneficial
Research-Grade Features
- Publication Ready: LaTeX and HTML output for academic papers
- Reproducible Research: Comprehensive logging and version tracking
- Missing Data Handling: Multiple imputation and robust missing value treatment
- Bootstrapping: Built-in bootstrap methods for confidence intervals
- Cross-Validation: Integrated CV methods for model validation
Quick Installation
pip install pystatar
Comprehensive Usage Examples
tabulate - Advanced Cross-tabulation
The tabulate module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.
Basic One-way Tabulation
import pandas as pd
```python
from pystatar import tabulate
# Create sample dataset
df = pd.DataFrame({
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
'income': np.random.normal(50000, 15000, 600),
'age': np.random.randint(22, 65, 600),
'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})
# Simple frequency table
result = tabulate.tabulate(df, 'education')
print(result)
Advanced Two-way Cross-tabulation with Statistics
# Two-way tabulation with comprehensive statistics
result = tabulate.tabulate(
df,
'gender', 'education',
chi2=True, # Chi-square test
exact=True, # Fisher's exact test
gamma=True, # Gamma coefficient
taub=True, # Kendall's tau-b
V=True, # Cramรฉr's V
missing=True, # Include missing values
row=True, # Row percentages
col=True, # Column percentages
cell=True # Cell percentages
)
# Access different components
print("Frequency Table:")
print(result.table)
print(f"\nChi-square p-value: {result.chi2_pvalue:.4f}")
print(f"Cramรฉr's V: {result.cramers_v:.4f}")
Multi-way Tabulation
# Three-way tabulation with layering
result = tabulate.tabulate(
df,
'gender', 'education',
by='industry', # Layer by industry
chi2=True
)
# Access results by layer
for industry, table_result in result.by_results.items():
print(f"\n=== {industry} ===")
print(table_result.table)
egen - Extended Data Generation
The egen module provides powerful data manipulation functions that extend Stata's egen capabilities.
Ranking and Percentile Functions
from pystatar import egen
# Advanced ranking with tie-breaking options
df['income_rank'] = egen.rank(df['income'], method='average') # Handle ties
df['income_pctile'] = egen.xtile(df['income'], nquantiles=10) # Deciles
# Group-specific rankings
df['rank_within_industry'] = egen.rank(df['income'], by='industry')
# Percentile calculations
df['income_90th'] = egen.pctile(df['income'], 90)
df['income_iqr'] = egen.pctile(df['income'], 75) - egen.pctile(df['income'], 25)
Row Operations
# Create test scores dataset
scores_df = pd.DataFrame({
'student': range(1, 101),
'math': np.random.normal(75, 10, 100),
'english': np.random.normal(80, 12, 100),
'science': np.random.normal(78, 11, 100),
'history': np.random.normal(82, 9, 100)
})
# Row statistics
scores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])
scores_df['max_score'] = egen.rowmax(scores_df, ['math', 'english', 'science', 'history'])
scores_df['score_sd'] = egen.rowsd(scores_df, ['math', 'english', 'science', 'history'])
# Count non-missing values per row
scores_df['subjects_taken'] = egen.rownonmiss(scores_df, ['math', 'english', 'science', 'history'])
Group Statistics and Operations
# Group summary statistics
df['mean_income_by_education'] = egen.mean(df['income'], by='education')
df['median_income_by_industry'] = egen.median(df['income'], by='industry')
df['sd_income_by_gender'] = egen.sd(df['income'], by='gender')
# Group identification and counting
df['education_group_size'] = egen.count(df, by='education')
df['first_in_group'] = egen.tag(df, ['education', 'gender']) # First observation in group
df['group_sequence'] = egen.seq(df, by='education') # Sequence within group
# Advanced group operations
df['income_rank_in_education'] = egen.rank(df['income'], by='education')
df['above_group_median'] = (df['income'] > egen.median(df['income'], by='education')).astype(int)
reghdfe - Advanced Fixed Effects Regression
The reghdfe module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.
Basic Fixed Effects Regression
from pystatar import reghdfe
# Create panel dataset
np.random.seed(42)
n_firms, n_years = 100, 10
n_obs = n_firms * n_years
panel_df = pd.DataFrame({
'firm_id': np.repeat(range(n_firms), n_years),
'year': np.tile(range(2010, 2020), n_firms),
'log_sales': np.random.normal(10, 1, n_obs),
'log_employment': np.random.normal(4, 0.5, n_obs),
'log_capital': np.random.normal(8, 0.8, n_obs),
'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})
# Basic regression with firm and year fixed effects
result = reghdfe.reghdfe(
data=panel_df,
depvar='log_sales',
regressors=['log_employment', 'log_capital'],
absorb=['firm_id', 'year']
)
print(result.summary())
print(f"R-squared: {result.r2:.4f}")
print(f"Number of observations: {result.N}")
Advanced Regression with Clustering and Instruments
# Add instrumental variables
panel_df['instrument1'] = np.random.normal(0, 1, n_obs)
panel_df['instrument2'] = np.random.normal(0, 1, n_obs)
# Regression with clustering and multiple fixed effects
result = reghdfe.reghdfe(
data=panel_df,
depvar='log_sales',
regressors=['log_employment', 'log_capital'],
absorb=['firm_id', 'year', 'industry'], # Multiple fixed effects
cluster='firm_id', # Clustered standard errors
weights='employment', # Weighted regression
subset=(panel_df['year'] >= 2012) # Conditional estimation
)
# Access detailed results
print("Coefficient Table:")
print(result.coef_table)
print(f"\nFixed Effects absorbed: {result.absorbed_fe}")
print(f"Clusters: {result.n_clusters}")
Instrumental Variables with High-Dimensional FE
# IV regression with fixed effects
iv_result = reghdfe.ivreghdfe(
data=panel_df,
depvar='log_sales',
endogenous=['log_employment'], # Endogenous variable
instruments=['instrument1', 'instrument2'], # Instruments
exogenous=['log_capital'], # Exogenous controls
absorb=['firm_id', 'year'],
cluster='firm_id'
)
print("First Stage Results:")
print(iv_result.first_stage)
print(f"\nWeak instruments test (F-stat): {iv_result.first_stage_fstat:.2f}")
print(f"Overidentification test (Hansen J): {iv_result.hansen_j:.4f}")
winsor2 - Advanced Outlier Treatment
The winsor2 module provides comprehensive outlier detection and treatment methods.
Basic Winsorizing
from pystatar import winsor2
# Create dataset with outliers
outlier_df = pd.DataFrame({
'income': np.concatenate([
np.random.normal(50000, 10000, 950), # Normal observations
np.random.uniform(200000, 500000, 50) # Outliers
]),
'age': np.random.randint(18, 70, 1000),
'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})
# Basic winsorizing at 1st and 99th percentiles
result = winsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")
Group-wise Winsorizing
# Winsorize within groups
result = winsor2.winsor2(
outlier_df,
['income'],
by='industry', # Winsorize within each industry
cuts=(5, 95), # Use 5th and 95th percentiles
suffix='_clean' # Custom suffix
)
# Compare distributions by group
for industry in outlier_df['industry'].unique():
mask = outlier_df['industry'] == industry
original = outlier_df.loc[mask, 'income']
winsorized = result.loc[mask, 'income_clean']
print(f"\n{industry}:")
print(f" Original: {original.describe()}")
print(f" Winsorized: {winsorized.describe()}")
Trimming vs Winsorizing Comparison
# Compare different outlier treatment methods
trim_result = winsor2.winsor2(
outlier_df,
['income'],
trim=True, # Trim (remove) instead of winsorize
cuts=(2.5, 97.5) # Trim 2.5% from each tail
)
winsor_result = winsor2.winsor2(
outlier_df,
['income'],
trim=False, # Winsorize (cap) outliers
cuts=(2.5, 97.5)
)
print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")
Advanced Outlier Detection
# Multiple variable winsorizing with custom thresholds
multi_result = winsor2.winsor2(
outlier_df,
['income', 'age'],
cuts=(1, 99), # Different cuts for different variables
by='industry', # Group-specific treatment
replace=True, # Replace original variables
label=True # Add descriptive labels
)
# Generate outlier indicators
outlier_df['income_outlier'] = winsor2.outlier_indicator(
outlier_df['income'],
method='iqr', # Use IQR method
factor=1.5 # 1.5 * IQR threshold
)
outlier_df['extreme_outlier'] = winsor2.outlier_indicator(
outlier_df['income'],
method='percentile', # Use percentile method
cuts=(1, 99)
)
print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")
Project Structure
pystatar/
โโโ __init__.py # Main package initialization
โโโ tabulate/ # Cross-tabulation module
โ โโโ __init__.py
โ โโโ core.py
โ โโโ results.py
โ โโโ stats.py
โโโ egen/ # Extended generation module
โ โโโ __init__.py
โ โโโ core.py
โโโ reghdfe/ # High-dimensional FE regression
โ โโโ __init__.py
โ โโโ core.py
โ โโโ estimation.py
โ โโโ utils.py
โโโ winsor2/ # Winsorizing module
โ โโโ __init__.py
โ โโโ core.py
โ โโโ utils.py
โโโ utils/ # Shared utilities
โ โโโ __init__.py
โ โโโ common.py
โโโ tests/ # Test suite
โโโ test_tabulate.py
โโโ test_egen.py
โโโ test_reghdfe.py
โโโ test_winsor2.py
Key Features
- Familiar Syntax: Stata-like command structure and parameters
- Pandas Integration: Seamless integration with pandas DataFrames
- High Performance: Optimized implementations using pandas and NumPy
- Comprehensive Coverage: Most commonly used Stata commands
- Statistical Rigor: Proper statistical tests and robust standard errors
- Flexible Output: Multiple output formats and customization options
- Missing Value Handling: Configurable treatment of missing data
Documentation
Each module comes with comprehensive documentation and examples:
- tabulate Documentation - Cross-tabulation and frequency analysis
- egen Documentation - Extended data generation functions
- reghdfe Documentation - High-dimensional fixed effects regression
- winsor2 Documentation - Data winsorizing and trimming
Contributing to the Project
We're building the future of academic research tools in Python! Here's how you can help:
Priority Commands Needed
Help us implement the remaining 16 high-priority commands:
Data Management: summarize, describe, merge, reshape, collapse, keep, drop, generate, replace, sort
Statistical Analysis: reg, logit, probit, ivregress, xtreg, anova
How to Contribute
- Request a Command: Open an issue with the command you need
- ** Implement a Command**: Check our contribution guidelines and submit a PR
- ** Report Bugs**: Help us improve existing functionality
- ** Improve Documentation**: Add examples, tutorials, or clarifications
- ** Spread the Word**: Star the repo and share with fellow researchers
Recognition
All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.
Academic Collaboration
We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!
Community & Support
- Documentation: https://pystatar.readthedocs.io
- Discussions: GitHub Discussions
- Issues: Bug Reports & Feature Requests
- ** Email**: brycew6m@stanford.edu for academic collaborations
Comparison with Stata
| Feature | Stata | PyStataR | Advantage |
|---|---|---|---|
| Speed | Base performance | 2-10x faster* | Vectorized operations |
| Memory | Limited by system | Efficient pandas backend | Better large dataset handling |
| Extensibility | Ado files | Python ecosystem | Unlimited customization |
| Cost | $$$$ | Free & Open Source | Accessible to all researchers |
| Integration | Standalone | Python data science stack | Seamless workflow |
| Output | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |
*Performance comparison based on typical academic datasets (1M+ observations)
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
๐ Acknowledgments
This package builds upon the excellent work of:
- pandas - The backbone of our data manipulation
- numpy - Powering our numerical computations
- scipy - Statistical functions and algorithms
- statsmodels - Statistical modeling foundations
- pyhdfe - High-dimensional fixed effects algorithms
- The entire Stata community - For decades of statistical innovation that inspired this project
Future Roadmap
Version 1.0 Goals (Target: End of 2025)
- Core 4 commands implemented
- Additional 16 high-priority commands
- Comprehensive test suite (>95% coverage)
- Complete documentation with tutorials
- Performance benchmarks vs Stata
Version 2.0 Vision (2026)
- Machine learning integration
- R integration for cross-platform compatibility
- Web interface for non-programmers
- Jupyter notebook extensions
๐ Project Statistics
Contact & Collaboration
Created by Bryce Wang - Stanford University
- Email: brycew6m@stanford.edu
- GitHub: @brycewang-stanford
- Academic: Stanford Graduate School of Business
- LinkedIn: Connect with me
Academic Partnerships Welcome!
- Course integration and teaching materials
- Research collaborations and citations
- Institutional licensing and support
- Student contributor programs
โญ Love this project? Give it a star and help us reach more researchers! โญ
Together, we're building the future of academic research in Python
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pystatar-0.1.0.tar.gz.
File metadata
- Download URL: pystatar-0.1.0.tar.gz
- Upload date:
- Size: 51.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a15dc3967bf391472017168cb7b7fd2ef5f5db44df6741aa7c47be29b3a743ca
|
|
| MD5 |
77d26d78422a75edba114182994c11de
|
|
| BLAKE2b-256 |
d28b68982a2a4a47e67a9e5ef8df6854342d2dbc82ea4c1bbb5079e3758b3dd2
|
File details
Details for the file pystatar-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pystatar-0.1.0-py3-none-any.whl
- Upload date:
- Size: 43.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5850a4dc3b4ce7e6a7092cdfbd3e6f68ffdcac77c3b991ce3e8c0be1bae69013
|
|
| MD5 |
663741f45ceda74eefafc00a1e795567
|
|
| BLAKE2b-256 |
b5377c08afd97338867080bf32ef3d847ceafda4b40d484ec6470bb7e47507f2
|