Python implementation of Stata's egen command for pandas DataFrames

PyEgen

Python implementation of Stata's egen command for pandas DataFrames. This package provides Stata-style data manipulation functions, making it easier for researchers to transition from Stata to Python while maintaining familiar syntax and functionality.

Quick Start

pip install pyegen

import pandas as pd
import numpy as np
import pyegen as egen

# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15]
})

# Row-wise operations
df['first_nonmiss'] = egen.rowfirst(df, ['var1', 'var2', 'var3'])
df['row_median'] = egen.rowmedian(df, ['var1', 'var2', 'var3'])
df['missing_count'] = egen.rowmiss(df, ['var1', 'var2', 'var3'])

# Group-wise operations
df['group_mean'] = egen.mean(df['var1'], by=df['group'])
df['group_median'] = egen.median(df['var1'], by=df['group'])

# Ranking (no by= is passed here, so ranks are computed over the whole column)
df['overall_rank'] = egen.rank(df['var1'], method='min')

# Utility functions
df['has_value_1_or_2'] = egen.anymatch(df, ['var1', 'var2'], [1, 2])
df['concat_vars'] = egen.concat(df, ['group', 'var1'], punct='_')
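
To check what the row-wise calls above are expected to return, the following sketch reproduces them in plain pandas on the same sample data. It assumes the usual Stata semantics (first non-missing value, median of the non-missing values, count of missing values) and is not PyEgen's internal implementation; the names `cols`, `first_nonmiss`, `row_median`, and `missing_count` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15],
})
cols = ['var1', 'var2', 'var3']

# First non-missing value per row: back-fill across columns, then take the first column.
first_nonmiss = df[cols].bfill(axis=1).iloc[:, 0]

# Median of the non-missing values in each row (NaN is skipped by default).
row_median = df[cols].median(axis=1)

# Number of missing values in each row.
missing_count = df[cols].isna().sum(axis=1)
```

On this data the first row has `var1 = 1` and a missing `var2`, so its first non-missing value is 1, its median is taken over 1 and 10, and its missing count is 1.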

Available Functions

PyEgen provides 45 functions, covering 100% of Stata's egen functions:

Row-wise Functions

  • rowmean(), rowtotal(), rowmax(), rowmin(), rowsd()
  • rowfirst(), rowlast(), rowmedian(), rowmiss(), rownonmiss(), rowpctile()

Statistical Functions

  • count(), mean(), sum(), max(), min(), sd()
  • median(), mode(), iqr(), kurt(), skew(), mad(), mdev()
  • pc(), pctile(), std(), total()

Utility Functions

  • rank(), tag(), group(), seq(), anycount(), anymatch(), anyvalue()
  • concat(), cut(), diff(), ends(), fill()

🎯 Key Features

  • Complete Stata Coverage: All 45 egen functions implemented
  • Pandas Integration: Works seamlessly with pandas DataFrames
  • Missing Value Handling: Consistent with Stata behavior
  • Group Operations: Full support for by-group operations with by parameter
  • Type Safety: Comprehensive input validation and error handling
  • Performance: Optimized for large datasets

📚 Complete Function Reference

Row-wise Functions

Function      Stata Equivalent                        Description
rowmean()     egen newvar = rowmean(varlist)          Row mean
rowtotal()    egen newvar = rowtotal(varlist)         Row sum
rowmax()      egen newvar = rowmax(varlist)           Row maximum
rowmin()      egen newvar = rowmin(varlist)           Row minimum
rowsd()       egen newvar = rowsd(varlist)            Row standard deviation
rowfirst()    egen newvar = rowfirst(varlist)         First non-missing value
rowlast()     egen newvar = rowlast(varlist)          Last non-missing value
rowmedian()   egen newvar = rowmedian(varlist)        Row median
rowmiss()     egen newvar = rowmiss(varlist)          Count of missing values
rownonmiss()  egen newvar = rownonmiss(varlist)       Count of non-missing values
rowpctile()   egen newvar = rowpctile(varlist), p(#)  Row percentile
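
The row-wise functions differ mainly in how they treat missing values. The sketch below shows the Stata behaviour for four of them using plain pandas, as a reference point for what the PyEgen equivalents should produce. The `scores` frame and its column names are made up for illustration, and the pandas expressions are assumed equivalents rather than PyEgen's own code.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'q1': [1.0, np.nan, np.nan],
    'q2': [2.0, 4.0, np.nan],
    'q3': [np.nan, 6.0, np.nan],
})

# rowtotal: missing values count as 0, so an all-missing row sums to 0.
row_total = scores.sum(axis=1, skipna=True)

# rowmean: averages the non-missing values; an all-missing row stays missing.
row_mean = scores.mean(axis=1, skipna=True)

# rownonmiss: how many values in the row are present.
row_nonmiss = scores.notna().sum(axis=1)

# rowlast: last non-missing value, found by forward-filling across columns.
row_last = scores.ffill(axis=1).iloc[:, -1]
```

The third row, which is entirely missing, gets a total of 0 but a missing mean, which is the distinction to keep in mind when choosing between the two.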

Statistical Functions (with grouping support)

Function  Stata Equivalent                      Description
count()   egen newvar = count(var), by(group)   Count observations
mean()    egen newvar = mean(var), by(group)    Mean
sum()     egen newvar = sum(var), by(group)     Sum
total()   egen newvar = total(var), by(group)   Total (treats missing as 0)
max()     egen newvar = max(var), by(group)     Maximum
min()     egen newvar = min(var), by(group)     Minimum
sd()      egen newvar = sd(var), by(group)      Standard deviation
median()  egen newvar = median(var), by(group)  Median
mode()    egen newvar = mode(var), by(group)    Mode
iqr()     egen newvar = iqr(var), by(group)     Interquartile range
kurt()    egen newvar = kurt(var), by(group)    Kurtosis
skew()    egen newvar = skew(var), by(group)    Skewness
mad()     egen newvar = mad(var), by(group)     Median absolute deviation
mdev()    egen newvar = mdev(var), by(group)    Mean absolute deviation
pctile()  egen newvar = pctile(var), p(#)       Percentile
pc()      egen newvar = pc(var), by(group)      Percent of total
std()     egen newvar = std(var), by(group)     Standardized values
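
A by-group egen function computes one statistic per group and writes it back to every row of that group. In plain pandas this corresponds to `groupby(...).transform(...)`, sketched below for count(), mean(), total(), and pc(). The sample frame and the column names `x_count`, `x_mean`, `x_total`, and `x_pc` are hypothetical, and the code illustrates the assumed semantics rather than PyEgen's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'x': [1.0, np.nan, 3.0, 5.0],
})
by_group = df.groupby('group')['x']

# count(): number of non-missing values in the group, repeated on every row.
df['x_count'] = by_group.transform('count')

# mean(): group mean of the non-missing values, also filled in on rows where x is missing.
df['x_mean'] = by_group.transform('mean')

# total(): group sum, with missing values contributing 0.
df['x_total'] = by_group.transform('sum')

# pc(): each value as a percentage of its group total (missing x stays missing).
df['x_pc'] = 100 * df['x'] / df['x_total']
```

Note that the second row of group A has a missing `x` but still receives the group's count, mean, and total, which matches how Stata fills by-group results.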

Utility Functions

Function    Stata Equivalent                            Description
rank()      egen newvar = rank(var)                     Ranking with tie options
tag()       egen newvar = tag(varlist)                  Tag first obs in group
group()     egen newvar = group(varlist)                Create group identifiers
seq()       egen newvar = seq()                         Generate sequences
anycount()  egen newvar = anycount(varlist), v(values)  Count matching values
anymatch()  egen newvar = anymatch(varlist), v(values)  Check for matches
anyvalue()  egen newvar = anyvalue(var), v(values)      Return matching values
concat()    egen newvar = concat(varlist), punct()      Concatenate variables
cut()       egen newvar = cut(var), group(#)            Create categorical from continuous
diff()      egen newvar = diff(varlist)                 Check if variables differ
ends()      egen newvar = ends(strvar), head|last|tail  Extract string parts
fill()      egen newvar = fill(numlist)                 Create repeating patterns
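
Several of the utility functions have short pandas counterparts, which can help when checking results during a migration. The sketch below covers tag(), group(), anymatch(), anycount(), diff(), and the default tie handling of rank(). The data and column names are invented for the example, and the expressions are assumed equivalents of the Stata definitions, not PyEgen's source.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'CA', 'TX', 'CA', 'NY'],
    'a': [1, 2, 3, 3, 5],
    'b': [1, 7, 3, 9, 5],
})

# tag(): 1 on the first row of each distinct state, 0 on later rows.
df['tag'] = (~df.duplicated(subset=['state'])).astype(int)

# group(): consecutive integer IDs (1, 2, ...) assigned in sorted order of state.
df['gid'] = df.groupby('state').ngroup() + 1

# anymatch(): 1 if any listed column equals one of the target values.
df['any_1_or_2'] = df[['a', 'b']].isin([1, 2]).any(axis=1).astype(int)

# anycount(): how many of the listed columns equal one of the target values.
df['count_1_or_2'] = df[['a', 'b']].isin([1, 2]).sum(axis=1)

# diff(): 1 if the listed columns are not all equal in that row.
df['differs'] = (df[['a', 'b']].nunique(axis=1) > 1).astype(int)

# rank(): tied values share the average of their ranks (Stata's default).
df['a_rank'] = df['a'].rank(method='average')
```

Summing the `tag` column gives the number of distinct states, which is the common reason for tagging one row per group.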

💡 Migration Recommendation

For new projects, we recommend the unified PyStataR package, which provides a comprehensive suite of Stata-equivalent commands:

pip install py-stata-commands

from py_stata_commands import egen
df['rank_var'] = egen.rank(df['income'])

Why Consider PyStataR?

  • Single installation for all Stata-equivalent commands (tabulate, egen, reghdfe, winsor2)
  • Consistent API across all modules
  • Enhanced documentation and examples
  • Active development and long-term support

PyStataR Repository: https://github.com/brycewang-stanford/PyStataR

Documentation & Examples

For comprehensive examples and function documentation, see:

📊 Function Coverage Status

  • ✅ Row-wise functions: 11/11 (100%)
  • ✅ Statistical functions: 17/17 (100%)
  • ✅ Utility functions: 12/12 (100%)
  • ✅ String functions: 2/2 (100%)
  • ✅ Sequence functions: 2/2 (100%)

Total: 45/45 functions (100% coverage)

🧪 Testing

# Run tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_core.py

🔧 Project Status

PyEgen will continue to be maintained for existing users, but new feature development will primarily focus on PyStataR. This ensures:

  • ✅ Bug fixes and compatibility updates for PyEgen
  • ✅ Stable API for existing codebases
  • 🚀 Enhanced features and new capabilities in PyStataR

Installation & Requirements

pip install pyegen

Requirements:

  • Python 3.7+
  • pandas >= 1.3.0
  • numpy >= 1.20.0
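
If an import fails or results look off in an older environment, it can be worth confirming that the installed versions meet these minimums. The helper below is a small standard-library sketch for that check; `version_tuple` and `meets_minimum` are hypothetical names and are not part of PyEgen.

```python
import sys

def version_tuple(version):
    """Convert a version string such as '1.3.0' or '2.1.0rc1' to a 3-tuple of ints."""
    parts = []
    for piece in version.split('.')[:3]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '2.0' so tuples compare element by element.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Minimum Python version taken from the requirements list above.
python_ok = sys.version_info >= (3, 7)
```

With pandas and numpy importable, `meets_minimum(pd.__version__, '1.3.0')` and `meets_minimum(np.__version__, '1.20.0')` report whether each library satisfies the listed requirement.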

🤝 Contributing

We welcome contributions! For major changes, please consider contributing to PyStataR for maximum impact.

🔗 Stata Documentation Reference

This implementation follows the official Stata documentation for egen:

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Related Projects

  • PyStataR - Unified Stata-equivalent commands and R functions (recommended for new projects)
  • StatsPAI - StatsPAI = Stats + Econometrics + ML + AI + LLMs
