PyEgen
Python implementation of Stata's egen command for pandas DataFrames. This package provides Stata-style data manipulation functions, making it easier for researchers to transition from Stata to Python while maintaining familiar syntax and functionality.
Quick Start
pip install pyegen
import pandas as pd
import numpy as np
import pyegen as egen
# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15]
})
# Row-wise operations
df['first_nonmiss'] = egen.rowfirst(df, ['var1', 'var2', 'var3'])
df['row_median'] = egen.rowmedian(df, ['var1', 'var2', 'var3'])
df['missing_count'] = egen.rowmiss(df, ['var1', 'var2', 'var3'])
# Group-wise operations
df['group_mean'] = egen.mean(df['var1'], by=df['group'])
df['group_median'] = egen.median(df['var1'], by=df['group'])
df['group_rank'] = egen.rank(df['var1'], method='min')
# Utility functions
df['has_value_1_or_2'] = egen.anymatch(df, ['var1', 'var2'], [1, 2])
df['concat_vars'] = egen.concat(df, ['group', 'var1'], punct='_')
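For readers comparing against plain pandas, the sketch below computes the same quantities as a few of the Quick Start calls without importing PyEgen. It assumes PyEgen follows Stata's semantics (missing values are skipped, and group statistics are broadcast to every row of the group); the variable names are illustrative and this is not the package's own implementation.

```python
import numpy as np
import pandas as pd

# Same sample data as the Quick Start.
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15],
})
cols = ['var1', 'var2', 'var3']
values = df[cols].astype(float)

# First non-missing value per row: back-fill across columns, keep the first column.
first_nonmiss = values.bfill(axis=1).iloc[:, 0]

# Row median over the non-missing values (pandas skips NaN by default).
row_median = values.median(axis=1)

# Number of missing values in each row.
missing_count = values.isna().sum(axis=1)

# Group mean of var1, broadcast back to every row of its group.
group_mean = df.groupby('group')['var1'].transform('mean')

print(pd.DataFrame({
    'first_nonmiss': first_nonmiss,
    'row_median': row_median,
    'missing_count': missing_count,
    'group_mean': group_mean,
}))
```

Comparing this output with the columns produced by the egen calls is a quick way to check that an installed version behaves as expected on your data.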
Available Functions
PyEgen provides 45 functions with 100% coverage of Stata's egen capabilities:
Row-wise Functions
rowmean(), rowtotal(), rowmax(), rowmin(), rowsd(), rowfirst(), rowlast(), rowmedian(), rowmiss(), rownonmiss(), rowpctile()
Statistical Functions
rank(), count(), mean(), sum(), max(), min(), sd(), median(), mode(), iqr(), kurt(), skew(), mad(), mdev(), pc(), pctile(), std(), total()
Utility Functions
tag(), group(), seq(), anycount(), anymatch(), anyvalue(), concat(), cut(), diff(), ends(), fill()
🎯 Key Features
- Complete Stata Coverage: All 45 egen functions implemented
- Pandas Integration: Works seamlessly with pandas DataFrames
- Missing Value Handling: Consistent with Stata behavior
- Group Operations: Full support for by-group operations with the by parameter
- Type Safety: Comprehensive input validation and error handling
- Performance: Optimized for large datasets
📚 Complete Function Reference
Row-wise Functions
| Function | Stata Equivalent | Description |
|---|---|---|
| rowmean() | egen newvar = rowmean(varlist) | Row mean |
| rowtotal() | egen newvar = rowtotal(varlist) | Row sum |
| rowmax() | egen newvar = rowmax(varlist) | Row maximum |
| rowmin() | egen newvar = rowmin(varlist) | Row minimum |
| rowsd() | egen newvar = rowsd(varlist) | Row standard deviation |
| rowfirst() | egen newvar = rowfirst(varlist) | First non-missing value |
| rowlast() | egen newvar = rowlast(varlist) | Last non-missing value |
| rowmedian() | egen newvar = rowmedian(varlist) | Row median |
| rowmiss() | egen newvar = rowmiss(varlist) | Count of missing values |
| rownonmiss() | egen newvar = rownonmiss(varlist) | Count of non-missing values |
| rowpctile() | egen newvar = rowpctile(varlist), p(#) | Row percentile |
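The main thing to watch in the row-wise functions is how missing values are treated. In Stata, rowtotal() treats missing values as 0, so a row with no data totals 0, while rowmean() returns missing for such a row. The sketch below reproduces that behavior in plain pandas on made-up data; the column names q1 to q3 are hypothetical, and whether PyEgen matches these results exactly is an assumption worth verifying against your installed version.

```python
import numpy as np
import pandas as pd

# Hypothetical survey items; the last row has no data at all.
scores = pd.DataFrame({
    'q1': [4.0, np.nan, np.nan],
    'q2': [np.nan, 2.0, np.nan],
    'q3': [6.0, 8.0, np.nan],
})

# Row sum with missing treated as 0: an all-missing row totals 0.
row_total = scores.sum(axis=1)

# Row mean over non-missing values: an all-missing row stays missing.
row_mean = scores.mean(axis=1)

# Row maximum over non-missing values.
row_max = scores.max(axis=1)

# Count of non-missing values per row.
row_nonmiss = scores.notna().sum(axis=1)

# Last non-missing value per row: forward-fill across columns, keep the last column.
row_last = scores.ffill(axis=1).iloc[:, -1]

print(pd.DataFrame({
    'row_total': row_total,
    'row_mean': row_mean,
    'row_max': row_max,
    'row_nonmiss': row_nonmiss,
    'row_last': row_last,
}))
```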
Statistical Functions (with grouping support)
| Function | Stata Equivalent | Description |
|---|---|---|
| count() | egen newvar = count(var), by(group) | Count observations |
| mean() | egen newvar = mean(var), by(group) | Mean |
| sum() | egen newvar = sum(var), by(group) | Sum |
| total() | egen newvar = total(var), by(group) | Total (treats missing as 0) |
| max() | egen newvar = max(var), by(group) | Maximum |
| min() | egen newvar = min(var), by(group) | Minimum |
| sd() | egen newvar = sd(var), by(group) | Standard deviation |
| median() | egen newvar = median(var), by(group) | Median |
| mode() | egen newvar = mode(var), by(group) | Mode |
| iqr() | egen newvar = iqr(var), by(group) | Interquartile range |
| kurt() | egen newvar = kurt(var), by(group) | Kurtosis |
| skew() | egen newvar = skew(var), by(group) | Skewness |
| mad() | egen newvar = mad(var), by(group) | Median absolute deviation |
| mdev() | egen newvar = mdev(var), by(group) | Mean absolute deviation |
| pctile() | egen newvar = pctile(var), p(#) | Percentile |
| pc() | egen newvar = pc(var), by(group) | Percent of total |
| std() | egen newvar = std(var), by(group) | Standardized values |
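Unlike a pandas aggregation, an egen statistic with by() returns one value per observation: the group result is written back to every row of the group. A rough pandas equivalent uses groupby(...).transform(...). The sketch below illustrates the count, total, pc, and std rows of the table on made-up data; the column names are hypothetical, and it does not call PyEgen, so treat it as an illustration of the descriptions above rather than the package's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'x': [1.0, 3.0, np.nan, 5.0],
})
by_group = df.groupby('group')['x']

# count: number of non-missing observations in each group.
df['n'] = by_group.transform('count')

# total: group sum with missing values contributing 0.
df['total'] = by_group.transform('sum')

# pc: each value as a percentage of its group total.
df['pc'] = 100 * df['x'] / df['total']

# std: values standardized within group (sample standard deviation, ddof=1).
df['z'] = (df['x'] - by_group.transform('mean')) / by_group.transform('std')

print(df)
```

Group B has only one non-missing value, so its sample standard deviation is undefined and the standardized values come out as NaN.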
Utility Functions
| Function | Stata Equivalent | Description |
|---|---|---|
| rank() | egen newvar = rank(var) | Ranking with tie options |
| tag() | egen newvar = tag(varlist) | Tag first obs in group |
| group() | egen newvar = group(varlist) | Create group identifiers |
| seq() | egen newvar = seq() | Generate sequences |
| anycount() | egen newvar = anycount(varlist), v(values) | Count matching values |
| anymatch() | egen newvar = anymatch(varlist), v(values) | Check for matches |
| anyvalue() | egen newvar = anyvalue(var), v(values) | Return matching values |
| concat() | egen newvar = concat(varlist), punct() | Concatenate variables |
| cut() | egen newvar = cut(var), group(#) | Create categorical from continuous |
| diff() | egen newvar = diff(varlist) | Check if variables differ |
| ends() | egen newvar = ends(strvar), head\|last\|tail | Extract string parts |
| fill() | egen newvar = fill(numlist) | Create repeating patterns |
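Several of the utility functions have short pandas counterparts, which can help when reading the descriptions in the table. The following sketch shows one plausible reading of tag(), group(), anycount(), anymatch(), and concat() on made-up data. The column names are hypothetical, the code uses only pandas, and details such as which observation PyEgen tags or how it orders group identifiers are assumptions to confirm against the package.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['CA', 'CA', 'NY', 'TX', 'NY'],
    'year': [2020, 2021, 2020, 2020, 2020],
    'q1': [1, 2, 3, 1, 2],
    'q2': [1, 1, 3, 2, 2],
})

# tag: 1 for the first observation of each state, 0 otherwise.
tag = (~df.duplicated('state')).astype(int)

# group: integer identifiers 1..k, assigned in sorted order of state.
group_id = df.groupby('state').ngroup() + 1

# anycount: how many of q1, q2 equal one of the listed values.
any_count = df[['q1', 'q2']].isin([1]).sum(axis=1)

# anymatch: 1 if any of q1, q2 equals one of the listed values, else 0.
any_match = df[['q1', 'q2']].isin([3]).any(axis=1).astype(int)

# concat: join variables as strings with a separator.
label = df['state'] + '_' + df['year'].astype(str)

print(pd.DataFrame({
    'tag': tag,
    'group_id': group_id,
    'any_count': any_count,
    'any_match': any_match,
    'label': label,
}))
```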
💡 Migration Recommendation
For new projects, we recommend the unified PyStataR package, which provides a comprehensive suite of Stata-equivalent commands:
pip install py-stata-commands
from py_stata_commands import egen
df['rank_var'] = egen.rank(df['income'])
Why Consider PyStataR?
- Single installation for all Stata-equivalent commands (tabulate, egen, reghdfe, winsor2)
- Consistent API across all modules
- Enhanced documentation and examples
- Active development and long-term support
PyStataR Repository: https://github.com/brycewang-stanford/PyStataR
Documentation & Examples
For comprehensive examples and function documentation, see:
📊 Function Coverage Status
- ✅ Row-wise functions: 11/11 (100%)
- ✅ Statistical functions: 17/17 (100%)
- ✅ Utility functions: 12/12 (100%)
- ✅ String functions: 2/2 (100%)
- ✅ Sequence functions: 2/2 (100%)
Total: 45/45 functions (100% coverage)
🧪 Testing
# Run tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_core.py
🔧 Project Status
PyEgen will continue to be maintained for existing users, but new feature development will primarily focus on PyStataR. This ensures:
- ✅ Bug fixes and compatibility updates for PyEgen
- ✅ Stable API for existing codebases
- 🚀 Enhanced features and new capabilities in PyStataR
Installation & Requirements
pip install pyegen
Requirements:
- Python 3.7+
- pandas >= 1.3.0
- numpy >= 1.20.0
🤝 Contributing
We welcome contributions! For major changes, please consider contributing to PyStataR for maximum impact.
🔗 Stata Documentation Reference
This implementation follows the official Stata documentation for egen.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.