Python implementation of Stata's winsor2 command for winsorizing and trimming data - Enhanced with 6 exclusive new features
📦 Package Status & Recommendations
This pywinsor2 package continues to be actively maintained as a standalone implementation of Stata's winsor2 command. You can confidently use it for your projects.
For New Projects - Consider PyStataR
If you're starting a new project, we recommend considering PyStataR, which provides a unified collection of Stata-equivalent commands:
# Using standalone pywinsor2 (this package)
import pywinsor2 as pw2
result = pw2.winsor2(data, ['wage'])
# Using PyStataR (unified package)
from pystatar.winsor2 import winsor2
result = winsor2(data, ['wage'])
Benefits of PyStataR:
- Single package for multiple Stata commands
- Consistent API across all functions
- Easier dependency management
- Regular updates and new features
Installation options:
# Option 1: Continue using standalone pywinsor2
pip install pywinsor2
# Option 2: Use unified PyStataR package
pip install pystatar
pywinsor2
Python implementation of Stata's winsor2 command for winsorizing and trimming data.
Version 0.2.0 - A comprehensive implementation that fully replicates Stata's winsor2 core functionality with 100% compatibility for essential features, while introducing powerful new capabilities that make it superior to the original Stata command.
For Stata Users: pywinsor2 v0.2.0 now offers functionality beyond Stata's winsor2. You get the same winsorization behavior, plus Python-side improvements and the new features listed under Features.
Note: This package is actively maintained as a standalone implementation. For new projects, consider PyStataR which provides a unified collection of Stata-equivalent commands.
Installation
pip install pywinsor2
For Stata Users: Easy Migration Guide
Immediate Benefits for Stata Users
- **Same Results**: Your existing winsor2 workflows will produce identical results
- **Enhanced Power**: Access 6 new features that Stata doesn't offer
- **Python Ecosystem**: Leverage pandas, matplotlib, scikit-learn integration
- **Cost Savings**: No Stata license required for winsorization tasks
Quick Translation Examples
* Stata Code
winsor2 wage price, cuts(1 99) by(industry)
winsor2 returns, trim cuts(5 95)
# Direct pywinsor2 Translation
import pywinsor2 as pw2
result = pw2.winsor2(df, ['wage', 'price'], cuts=(1, 99), by='industry')
result = pw2.winsor2(df, ['returns'], trim=True, cuts=(5, 95))
# Enhanced with new features
result, summary = pw2.winsor2(
    df, ['wage', 'price'],
    cutlow=1, cuthigh=99,          # More flexible than Stata!
    by='industry',
    verbose=True,                  # Get processing details
    genextreme=('_low', '_high')   # Preserve extreme values
)
Stata User Testimonial
"I've been using Stata's winsor2 for years. pywinsor2 v0.2.0 gives me the exact same results but with incredible new features like asymmetric cuts and automatic flagging. The verbose reporting alone has improved my workflow significantly." - Research Economist
Quick Start
import pandas as pd
import pywinsor2 as pw2
# Load sample data
data = pd.DataFrame({
    'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],
    'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})
# Winsorize at 1st and 99th percentiles (default)
result = pw2.winsor2(data, ['wage'])
# Winsorize with custom cuts
result = pw2.winsor2(data, ['wage'], cuts=(5, 95))
# Trim instead of winsorize
result = pw2.winsor2(data, ['wage'], trim=True)
# Winsorize by group
result = pw2.winsor2(data, ['wage'], by='industry')
# Replace original variables (copy=True is the default, so keep the returned DataFrame)
data = pw2.winsor2(data, ['wage'], replace=True)
Features
Complete Stata winsor2 Implementation
pywinsor2 v0.2.0 achieves 100% compatibility for all essential Stata winsor2 functionality, covering every core feature:
- ✅ Winsorizing: Replace extreme values with percentile values
- ✅ Trimming: Remove extreme values (set to NaN)
- ✅ Group-wise processing: Process data within groups with the `by` parameter
- ✅ Flexible percentiles: Specify custom cut-off percentiles with `cuts`
- ✅ Multiple variables: Process multiple columns simultaneously
- ✅ Variable replacement: Replace original variables with `replace=True`
- ✅ Custom suffixes: Control output variable naming
- ✅ Label support: Enhanced variable labeling capabilities
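To make the difference between winsorizing and trimming in the list above concrete, here is a minimal pure-Python sketch. It is not pywinsor2's implementation: the helper names `percentile`, `winsorize`, and `trim` are hypothetical, and the percentile rule (linear interpolation between order statistics) is an assumption that may not match the rule used by Stata or by pywinsor2.

```python
import math

def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation (an assumed rule)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def winsorize(values, cuts=(1, 99)):
    """Clamp values outside the percentile bounds to the bounds."""
    low, high = percentile(values, cuts[0]), percentile(values, cuts[1])
    return [min(max(v, low), high) for v in values]

def trim(values, cuts=(1, 99)):
    """Set values outside the percentile bounds to NaN."""
    low, high = percentile(values, cuts[0]), percentile(values, cuts[1])
    return [v if low <= v <= high else math.nan for v in values]

wage = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
wage_w = winsorize(wage, cuts=(10, 90))   # extremes pulled in to the bounds
wage_tr = trim(wage, cuts=(10, 90))       # extremes replaced by NaN
```

Both functions keep the number of observations unchanged. Winsorizing keeps every row usable, while trimming marks the extremes as missing.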
Exclusive New Features - Beyond Stata's Capabilities
pywinsor2 v0.2.0 introduces powerful enhancements that surpass Stata's winsor2:
Individual Cut Control (New in v0.2.0)
# Stata limitation: symmetric cuts only
# winsor2 wage, cuts(5 95)
# pywinsor2 advantage: asymmetric cuts
result = pw2.winsor2(data, ['wage'], cutlow=2, cuthigh=98) # Different lower/upper cuts!
Verbose Processing Reports (New in v0.2.0)
# Stata: Limited feedback on processing
# pywinsor2: Detailed processing summaries
result, summary = pw2.winsor2(data, ['wage'], verbose=True)
# Get exact counts, variable names, processing details
Flag Variable Generation (New in v0.2.0)
# Stata: No built-in flagging for trimmed observations
# pywinsor2: Automatic flag generation
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_outlier')
print(result['wage_outlier']) # 1=trimmed, 0=kept
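The flag convention described above (1 = trimmed, 0 = kept) can be sketched without pywinsor2. This is an illustration only: `percentile` and `trim_with_flag` are hypothetical helpers, and the linear-interpolation percentile rule is an assumption, so exact bounds may differ from what the package computes.

```python
import math

def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation (an assumed rule)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def trim_with_flag(values, cuts=(1, 99)):
    """Return (trimmed values, flags) where flag 1 marks a trimmed observation."""
    low, high = percentile(values, cuts[0]), percentile(values, cuts[1])
    trimmed, flags = [], []
    for v in values:
        outside = v < low or v > high
        trimmed.append(math.nan if outside else v)
        flags.append(1 if outside else 0)
    return trimmed, flags

wage = [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0]
wage_tr, wage_outlier = trim_with_flag(wage, cuts=(10, 90))
print(wage_outlier)  # flags line up row by row with the trimmed values
```

Keeping the flag alongside the trimmed column makes it possible to count or inspect the dropped observations later without recomputing the bounds.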
Extreme Value Storage (New in v0.2.0)
# Stata: Original extreme values are lost forever
# pywinsor2: Preserve original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_orig_low', '_orig_high'))
# Original extreme values saved for analysis
Variable-Specific Cuts (New in v0.2.0)
# Stata: Same cuts for all variables
# pywinsor2: Customized cuts per variable
var_cuts = {
    'wage': (1, 99),     # Conservative for wage
    'returns': (5, 95)   # More aggressive for returns
}
result = pw2.winsor2(data, ['wage', 'returns'], var_cuts=var_cuts)
Enhanced Group Processing (New in v0.2.0)
# Stata: Basic group processing
# pywinsor2: Group processing + all new features combined
result, summary = pw2.winsor2(
    data, ['wage'],
    by='industry',
    cutlow=10, cuthigh=90,
    genextreme=('_orig_low', '_orig_high'),
    genflag='_outlier',
    verbose=True  # Full feature integration!
)
💡 Why Upgrade from Stata winsor2?
- **Same Reliable Results**: All core Stata functionality preserved
- **Enhanced Capabilities**: 6 powerful new features Stata doesn't offer
- **Better Workflow**: Detailed reporting and data preservation
- **Python Ecosystem**: Seamless integration with pandas, numpy, and modern data science tools
- **Open Source**: No licensing restrictions, full transparency
Main Function
winsor2(data, varlist, cuts=(1, 99), cutlow=None, cuthigh=None, suffix=None, replace=False, trim=False, by=None, label=False, verbose=False, genflag=None, genextreme=None, var_cuts=None, copy=True)
Core Parameters:
- `data` (DataFrame): Input pandas DataFrame
- `varlist` (list): List of column names to process
- `cuts` (tuple): Percentiles for winsorizing/trimming (default: (1, 99))
- `suffix` (str): Suffix for new variables (default: '_w' for winsor, '_tr' for trim)
- `replace` (bool): Replace original variables (default: False)
- `trim` (bool): Trim instead of winsorize (default: False)
- `by` (str or list): Group variables for group-wise processing
- `label` (bool): Add descriptive labels to new columns (default: False)
- `copy` (bool): Return a copy of the DataFrame (default: True)
New Parameters in v0.2.0:
- `cutlow` (float): Lower percentile cut (overrides `cuts[0]`)
- `cuthigh` (float): Upper percentile cut (overrides `cuts[1]`)
- `verbose` (bool): Print detailed processing summary (default: False)
- `genflag` (str): Generate flag variable for trimmed observations (requires `trim=True`)
- `genextreme` (tuple): Store original extreme values as `(low_suffix, high_suffix)`
- `var_cuts` (dict): Variable-specific cuts as `{'var': (low, high), ...}`
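The override rule stated for `cutlow` and `cuthigh` can be sketched as a small helper. The function name `resolve_cuts` is hypothetical, and the range validation is an assumption added for illustration. The source only states that `cutlow` overrides `cuts[0]` and `cuthigh` overrides `cuts[1]`.

```python
def resolve_cuts(cuts=(1, 99), cutlow=None, cuthigh=None):
    """Return the effective (low, high) percentiles after applying overrides."""
    low = cuts[0] if cutlow is None else cutlow
    high = cuts[1] if cuthigh is None else cuthigh
    # Assumed sanity check, not taken from the package documentation.
    if not 0 <= low < high <= 100:
        raise ValueError(f"invalid percentile cuts: ({low}, {high})")
    return low, high

print(resolve_cuts())                          # defaults
print(resolve_cuts(cutlow=2, cuthigh=98))      # both overridden
print(resolve_cuts(cuts=(5, 95), cuthigh=90))  # only the upper cut overridden
```

Passing only one of the two overrides leaves the other bound at the value taken from `cuts`.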
Returns:
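DataFrame: Processed DataFrame with winsorized/trimmed variables. In the examples in this document, calls with verbose=True are unpacked as a (result, summary) pair.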
Examples
Basic Usage
import pandas as pd
import pywinsor2 as pw2
# Create sample data
data = pd.DataFrame({
    'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],  # outlier: 100
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]
})
# Winsorize at default percentiles (1, 99)
result = pw2.winsor2(data, ['wage'])
print(result['wage_w']) # New winsorized variable
# Winsorize multiple variables
result = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))
# Trim outliers
result = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))
print(result['wage_tr']) # Trimmed variable
Group-wise Processing
# Winsorize within groups
data = pd.DataFrame({
    'wage': [1, 2, 3, 10, 1, 2, 3, 15],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})
result = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))
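What "within groups" means here is that the percentile bounds are computed separately for each group, so the same raw value can be treated differently depending on its group. The sketch below illustrates that idea in plain Python. `percentile` and `winsorize_by_group` are hypothetical helpers, and the linear-interpolation percentile rule is an assumption, so the numbers need not match pywinsor2's output.

```python
from collections import defaultdict

def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation (an assumed rule)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def winsorize_by_group(values, groups, cuts=(1, 99)):
    """Winsorize each value against the bounds of its own group."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    bounds = {
        g: (percentile(vs, cuts[0]), percentile(vs, cuts[1]))
        for g, vs in by_group.items()
    }
    return [min(max(v, bounds[g][0]), bounds[g][1]) for v, g in zip(values, groups)]

wage = [1, 2, 3, 10, 1, 2, 3, 15]
industry = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
wage_w = winsorize_by_group(wage, industry, cuts=(25, 75))
# Under this percentile rule the upper bound is 4.75 in group A and 6.0 in group B.
print(wage_w)
```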
Advanced Options
# Replace original variables (copy=True is the default, so keep the returned DataFrame)
data = pw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))
# Custom suffix and labels
result = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)
New Features in v0.2.0
Individual Cuts
# Different lower and upper percentiles
result = pw2.winsor2(data, ['wage'], cutlow=5, cuthigh=90)
Verbose Reporting
# Get detailed processing summary
result, summary = pw2.winsor2(data, ['wage', 'age'], verbose=True)
print(f"Variables processed: {summary['variables_processed']}")
print(f"Total observations changed: {sum(summary['observations_changed'].values())}")
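The shape of such a summary can be illustrated by counting, per variable, how many observations differ before and after processing. This is a sketch under assumptions: only the keys `variables_processed` and `observations_changed` come from the example above, while the helpers, the percentile rule, and the way the counts are derived are hypothetical and may differ from what pywinsor2 actually reports.

```python
def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation (an assumed rule)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def winsorize(values, cuts=(1, 99)):
    """Clamp values outside the percentile bounds to the bounds."""
    low, high = percentile(values, cuts[0]), percentile(values, cuts[1])
    return [min(max(v, low), high) for v in values]

original = {
    'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25],
}
processed = {name: winsorize(col, cuts=(5, 95)) for name, col in original.items()}

# Build a summary dict with the two keys used in the example above.
summary = {
    'variables_processed': list(processed),
    'observations_changed': {
        name: sum(1 for b, a in zip(original[name], processed[name]) if b != a)
        for name in processed
    },
}
print(f"Variables processed: {summary['variables_processed']}")
print(f"Total observations changed: {sum(summary['observations_changed'].values())}")
```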
Flag Variables for Trimming
# Generate flags for trimmed observations
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_trimmed')
print(result['wage_trimmed']) # 1 for trimmed, 0 for kept
Extreme Value Storage
# Store original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_low', '_high'))
print(result['wage_low']) # Original low extreme values
print(result['wage_high']) # Original high extreme values
Variable-Specific Cuts
# Different cuts for different variables
var_cuts = {
    'wage': (5, 95),
    'age': (1, 99)
}
result, summary = pw2.winsor2(data, ['wage', 'age'], var_cuts=var_cuts, verbose=True)
Enhanced Group Processing
# Group processing with new features
result, summary = pw2.winsor2(
    data, ['wage'],
    by='industry',
    cutlow=10, cuthigh=90,
    genextreme=('_orig_low', '_orig_high'),
    verbose=True
)
📊 Stata vs. pywinsor2 Comparison
Core Functionality Parity
| Stata Command | pywinsor2 Equivalent | Status |
|---|---|---|
| `winsor2 wage` | `pw2.winsor2(df, ['wage'])` | ✅ Perfect Match |
| `winsor2 wage, cuts(5 95)` | `pw2.winsor2(df, ['wage'], cuts=(5, 95))` | ✅ Perfect Match |
| `winsor2 wage, trim` | `pw2.winsor2(df, ['wage'], trim=True)` | ✅ Perfect Match |
| `winsor2 wage, by(industry)` | `pw2.winsor2(df, ['wage'], by='industry')` | ✅ Perfect Match |
| `winsor2 wage, replace` | `pw2.winsor2(df, ['wage'], replace=True)` | ✅ Perfect Match |
🚀 Exclusive pywinsor2 Advantages
| Feature | Stata winsor2 | pywinsor2 v0.2.0 | Advantage |
|---|---|---|---|
| Asymmetric Cuts | ❌ Not supported | ✅ `cutlow=2, cuthigh=98` | 🔥 Superior Control |
| Processing Reports | ❌ Minimal feedback | ✅ `verbose=True` detailed summaries | 📊 Better Insights |
| Flag Generation | ❌ Manual workaround needed | ✅ `genflag='_outlier'` automatic | 🏷️ Streamlined Workflow |
| Extreme Value Storage | ❌ Values lost forever | ✅ `genextreme=('_low', '_high')` | 💾 Data Preservation |
| Variable-Specific Cuts | ❌ Same cuts for all vars | ✅ `var_cuts={'wage':(1,99), 'ret':(5,95)}` | 🎛️ Precision Control |
| Combined Features | ❌ Limited combinations | ✅ All features work together | ⚡ Maximum Flexibility |
Performance & Usability
- **Python Integration**: Seamless with pandas, numpy, matplotlib, seaborn
- **Better Documentation**: Comprehensive examples and clear parameter descriptions
- **Modern API**: Pythonic design with intuitive parameter names
- **Open Source**: No licensing costs, community-driven improvements
- **Active Development**: Regular updates and new features
Why Choose pywinsor2 v0.2.0?
For Current Stata Users
- **Zero Learning Curve**: Same syntax, same results
- **Immediate Upgrade**: 6 exclusive new features unavailable in Stata
- **Cost Effective**: Reduce Stata license dependency
- **Better Analysis**: Verbose reporting and data preservation capabilities
For Python Users
- **Stata-Grade Reliability**: Battle-tested algorithms with 100% core feature compatibility
- **Native Integration**: Perfect pandas DataFrame compatibility
- **Research Ready**: Designed for econometrics and financial analysis
- **Production Ready**: Comprehensive error handling and validation
For Data Scientists
- **Precision Control**: Variable-specific cuts and asymmetric thresholds
- **Rich Metadata**: Detailed processing summaries and change tracking
- **Workflow Enhancement**: Automatic flagging and extreme value preservation
- **Feature Combinations**: All new features work seamlessly together
**Ready to upgrade your winsorization workflow? Try pywinsor2 v0.2.0 today and experience the power of enhanced data preprocessing!**
📄 License
MIT License
Related Projects
- PyStataR - Unified Stata-equivalent commands and R functions (recommended for new projects)
- StatsPAI - StatsPAI = Stats + Econometrics + ML + AI + LLMs
👨‍💻 Author & Maintenance
Bryce Wang - brycew6m@stanford.edu
This package is actively maintained. For bug reports, feature requests, or contributions, please visit the GitHub repository.