PyEgen

Python implementation of Stata's egen command for pandas DataFrames. This package provides Stata-style data manipulation functions, making it easier for researchers to transition from Stata to Python while maintaining familiar syntax and functionality.

Quick Start

pip install pyegen

import pandas as pd
import numpy as np
import pyegen as egen

# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15]
})

# Row-wise operations
df['first_nonmiss'] = egen.rowfirst(df, ['var1', 'var2', 'var3'])
df['row_median'] = egen.rowmedian(df, ['var1', 'var2', 'var3'])
df['missing_count'] = egen.rowmiss(df, ['var1', 'var2', 'var3'])

# Group-wise operations
df['group_mean'] = egen.mean(df['var1'], by=df['group'])
df['group_median'] = egen.median(df['var1'], by=df['group'])

# Ranking (no by= is passed here, so this is not a within-group rank)
df['overall_rank'] = egen.rank(df['var1'], method='min')

# Utility functions
df['has_value_1_or_2'] = egen.anymatch(df, ['var1', 'var2'], [1, 2])
df['concat_vars'] = egen.concat(df, ['group', 'var1'], punct='_')
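
To make the row-wise calls above concrete, here is a minimal pandas-only sketch of what Stata's rowfirst, rowmedian, and rowmiss compute on the same sample data. It does not call PyEgen and is not taken from its source; the variable names are illustrative, and PyEgen's actual return types may differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15],
})
# Work on a float copy so every column shares one dtype
values = df[['var1', 'var2', 'var3']].astype(float)

# rowfirst: first non-missing value, scanning columns left to right
first_nonmiss = values.bfill(axis=1).iloc[:, 0]

# rowmedian: median of the non-missing values in each row
row_median = values.median(axis=1, skipna=True)

# rowmiss: number of missing values in each row
missing_count = values.isna().sum(axis=1)

print(first_nonmiss.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(row_median.tolist())      # [5.5, 6.5, 5.0, 6.0, 7.0, 8.0]
print(missing_count.tolist())   # [1, 1, 0, 0, 0, 0]
```

Comparing output like this against the PyEgen columns is one quick way to sanity-check a migration from Stata.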

Available Functions

PyEgen provides 45 functions, which the project reports as 100% coverage of Stata's egen functions:

Row-wise Functions

  • rowmean(), rowtotal(), rowmax(), rowmin(), rowsd()
  • rowfirst(), rowlast(), rowmedian(), rowmiss(), rownonmiss(), rowpctile()

Statistical Functions

  • rank(), count(), mean(), sum(), max(), min(), sd()
  • median(), mode(), iqr(), kurt(), skew(), mad(), mdev()
  • pc(), pctile(), std(), total()
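
Among the functions listed above, rank() is the one where tie handling most visibly changes the result. The sketch below uses only pandas' Series.rank to show the four tie conventions that Stata's egen rank() offers (default, track, field, unique). The mapping between Stata options and pandas arguments is an assumption for illustration; which method names PyEgen's rank() accepts beyond method='min' is not confirmed here.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# Stata default: tied values share the average of their ranks
rank_default = s.rank(method='average')

# Stata "track": lowest value ranked 1, ties share the lowest rank
rank_track = s.rank(method='min')

# Stata "field": highest value ranked 1, ties share the lowest rank
rank_field = s.rank(method='min', ascending=False)

# Stata "unique": ranks 1..N with ties broken (here: by order of appearance)
rank_unique = s.rank(method='first')

print(rank_default.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(rank_track.tolist())    # [1.0, 2.0, 2.0, 4.0]
print(rank_field.tolist())    # [4.0, 2.0, 2.0, 1.0]
print(rank_unique.tolist())   # [1.0, 2.0, 3.0, 4.0]
```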

Utility Functions

  • tag(), group(), seq(), anycount(), anymatch(), anyvalue()
  • concat(), cut(), diff(), ends(), fill()

🎯 Key Features

  • Complete Stata Coverage: All 45 egen functions implemented
  • Pandas Integration: Works seamlessly with pandas DataFrames
  • Missing Value Handling: Consistent with Stata behavior
  • Group Operations: Full support for by-group operations with by parameter
  • Type Safety: Comprehensive input validation and error handling
  • Performance: Optimized for large datasets

📚 Complete Function Reference

Row-wise Functions

Function      Stata Equivalent                        Description
rowmean()     egen newvar = rowmean(varlist)          Row mean
rowtotal()    egen newvar = rowtotal(varlist)         Row sum
rowmax()      egen newvar = rowmax(varlist)           Row maximum
rowmin()      egen newvar = rowmin(varlist)           Row minimum
rowsd()       egen newvar = rowsd(varlist)            Row standard deviation
rowfirst()    egen newvar = rowfirst(varlist)         First non-missing value
rowlast()     egen newvar = rowlast(varlist)          Last non-missing value
rowmedian()   egen newvar = rowmedian(varlist)        Row median
rowmiss()     egen newvar = rowmiss(varlist)          Count of missing values
rownonmiss()  egen newvar = rownonmiss(varlist)       Count of non-missing values
rowpctile()   egen newvar = rowpctile(varlist), p(#)  Row percentile
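
The row functions in the table differ mainly in how they treat missing values. As a rough pandas-only illustration (not PyEgen's code; the column names a and b are made up), Stata's rowtotal treats missing values as 0, while rowmean averages only the non-missing values and is itself missing when every input is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 4.0, np.nan],
})

# rowtotal-style sum: NaN is skipped, so an all-missing row sums to 0
row_total = df[['a', 'b']].sum(axis=1)

# rowmean-style mean: NaN is skipped; an all-missing row stays NaN
row_mean = df[['a', 'b']].mean(axis=1)

# rownonmiss-style count of non-missing values per row
row_nonmiss = df[['a', 'b']].notna().sum(axis=1)

print(row_total.tolist())    # [3.0, 4.0, 0.0]
print(row_mean.tolist())     # [1.5, 4.0, nan]
print(row_nonmiss.tolist())  # [2, 1, 0]
```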

Statistical Functions (with grouping support)

Function  Stata Equivalent                      Description
count()   egen newvar = count(var), by(group)   Count observations
mean()    egen newvar = mean(var), by(group)    Mean
sum()     egen newvar = sum(var), by(group)     Sum
total()   egen newvar = total(var), by(group)   Total (treats missing as 0)
max()     egen newvar = max(var), by(group)     Maximum
min()     egen newvar = min(var), by(group)     Minimum
sd()      egen newvar = sd(var), by(group)      Standard deviation
median()  egen newvar = median(var), by(group)  Median
mode()    egen newvar = mode(var), by(group)    Mode
iqr()     egen newvar = iqr(var), by(group)     Interquartile range
kurt()    egen newvar = kurt(var), by(group)    Kurtosis
skew()    egen newvar = skew(var), by(group)    Skewness
mad()     egen newvar = mad(var), by(group)     Median absolute deviation
mdev()    egen newvar = mdev(var), by(group)    Mean absolute deviation
pctile()  egen newvar = pctile(var), p(#)       Percentile
pc()      egen newvar = pc(var), by(group)      Percent of total
std()     egen newvar = std(var), by(group)     Standardized values
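
Unlike a plain pandas aggregation, egen with by() returns one value per observation, with the group statistic repeated on every row of the group. A minimal sketch of that behavior, assuming plain pandas groupby().transform() as a stand-in (this is an illustration with made-up column names g and x, not PyEgen's implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'g': ['A', 'A', 'B', 'B'],
    'x': [1.0, np.nan, 2.0, 6.0],
})
grouped = df.groupby('g')['x']

# mean(var), by(group): group mean broadcast to every row, NaN ignored
group_mean = grouped.transform('mean')

# total(var), by(group): group sum with missing values contributing 0
group_total = grouped.transform('sum')

# count(var), by(group): number of non-missing values in the group
group_count = grouped.transform('count')

# pc(var), by(group): each value as a percentage of its group total
group_pc = df['x'] / group_total * 100

print(group_mean.tolist())   # [1.0, 1.0, 4.0, 4.0]
print(group_total.tolist())  # [1.0, 1.0, 8.0, 8.0]
print(group_count.tolist())  # [1, 1, 2, 2]
print(group_pc.tolist())     # [100.0, nan, 25.0, 75.0]
```

Note that the row with a missing x still receives the group mean and total, while its own percentage stays missing.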

Utility Functions

Function    Stata Equivalent                            Description
rank()      egen newvar = rank(var)                     Ranking with tie options
tag()       egen newvar = tag(varlist)                  Tag first obs in group
group()     egen newvar = group(varlist)                Create group identifiers
seq()       egen newvar = seq()                         Generate sequences
anycount()  egen newvar = anycount(varlist), v(values)  Count matching values
anymatch()  egen newvar = anymatch(varlist), v(values)  Check for matches
anyvalue()  egen newvar = anyvalue(var), v(values)      Return matching values
concat()    egen newvar = concat(varlist), punct()      Concatenate variables
cut()       egen newvar = cut(var), group(#)            Create categorical from continuous
diff()      egen newvar = diff(varlist)                 Check if variables differ
ends()      egen newvar = ends(strvar), head|last|tail  Extract string parts
fill()      egen newvar = fill(numlist)                 Create repeating patterns
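
For readers unfamiliar with the Stata originals, the following pandas-only sketch approximates what tag, group, anymatch, anycount, and concat produce. It is an illustration under stated assumptions (made-up columns grp, x, y; 0/1 integer flags as in Stata), not PyEgen's implementation, and PyEgen's exact output types may differ.

```python
import pandas as pd

df = pd.DataFrame({
    'grp': ['B', 'A', 'B', 'A'],
    'x': [1, 2, 3, 1],
    'y': [2, 2, 5, 7],
})

# tag: 1 on the first observation of each distinct group, 0 elsewhere
tag = (~df.duplicated(subset=['grp'])).astype(int)

# group: integer ids 1, 2, ... assigned in sorted order of the group values
group_id = df.groupby('grp', sort=True).ngroup() + 1

# anymatch: 1 if any listed variable equals one of the given values
any_match = df[['x', 'y']].isin([1, 2]).any(axis=1).astype(int)

# anycount: how many of the listed variables equal one of the given values
any_count = df[['x', 'y']].isin([1, 2]).sum(axis=1)

# concat: join variables as strings with a separator (punct in Stata)
concat = df[['grp', 'x']].astype(str).agg('_'.join, axis=1)

print(tag.tolist())        # [1, 1, 0, 0]
print(group_id.tolist())   # [2, 1, 2, 1]
print(any_match.tolist())  # [1, 1, 0, 1]
print(any_count.tolist())  # [2, 2, 0, 1]
print(concat.tolist())     # ['B_1', 'A_2', 'B_3', 'A_1']
```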

💡 Migration Recommendation

For new projects, we recommend the unified PyStataR package, which provides a comprehensive suite of Stata-equivalent commands:

pip install py-stata-commands
from py_stata_commands import egen
df['rank_var'] = egen.rank(df['income'])

Why Consider PyStataR?

  • Single installation for all Stata-equivalent commands (tabulate, egen, reghdfe, winsor2)
  • Consistent API across all modules
  • Enhanced documentation and examples
  • Active development and long-term support

PyStataR Repository: https://github.com/brycewang-stanford/PyStataR

Documentation & Examples

For comprehensive examples and function documentation, see:

📊 Function Coverage Status

  • ✅ Row-wise functions: 11/11 (100%)
  • ✅ Statistical functions: 17/17 (100%)
  • ✅ Utility functions: 12/12 (100%)
  • ✅ String functions: 2/2 (100%)
  • ✅ Sequence functions: 2/2 (100%)

Total: 45/45 functions (100% coverage)

🧪 Testing

# Run tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_core.py

🔧 Project Status

PyEgen will continue to be maintained for existing users, but new feature development will primarily focus on PyStataR. This ensures:

  • ✅ Bug fixes and compatibility updates for PyEgen
  • ✅ Stable API for existing codebases
  • 🚀 Enhanced features and new capabilities in PyStataR

Installation & Requirements

pip install pyegen

Requirements:

  • Python 3.7+
  • pandas >= 1.3.0
  • numpy >= 1.20.0

🤝 Contributing

We welcome contributions! For major changes, please consider contributing to PyStataR for maximum impact.

🔗 Stata Documentation Reference

This implementation follows the official Stata documentation for egen:

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Related Projects

  • PyStataR - Unified Stata-equivalent commands and R functions (recommended for new projects)
  • StatsPAI - StatsPAI = Stats + Econometrics + ML + AI + LLMs
