pywinsor2
Python implementation of Stata's winsor2 command for winsorizing and trimming data.
Installation
pip install pywinsor2
Quick Start
import pandas as pd
import pywinsor2 as pw2
# Load sample data
data = pd.DataFrame({
    'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],
    'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})
# Winsorize at 1st and 99th percentiles (default)
result = pw2.winsor2(data, ['wage'])
# Winsorize with custom cuts
result = pw2.winsor2(data, ['wage'], cuts=(5, 95))
# Trim instead of winsorize
result = pw2.winsor2(data, ['wage'], trim=True)
# Winsorize by group
result = pw2.winsor2(data, ['wage'], by='industry')
# Replace original variables
pw2.winsor2(data, ['wage'], replace=True)
Features
- Winsorizing: Replace values beyond the cut-off percentiles with the values at those percentiles
- Trimming: Remove extreme values (set to NaN)
- Group-wise processing: Process data within groups
- Flexible percentiles: Specify custom cut-off percentiles
- Multiple variables: Process multiple columns simultaneously
- Stata compatibility: API designed to match Stata's `winsor2` command
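To make the difference between winsorizing and trimming concrete, here is a minimal sketch in plain pandas. It is not pywinsor2's implementation: the helper names `winsorize_series` and `trim_series` are hypothetical, and the percentiles come from pandas' default linear interpolation, which may not match the method Stata or pywinsor2 uses.

```python
import pandas as pd


def winsorize_series(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Clamp values outside the cut-off percentiles to the percentile values."""
    lower = s.quantile(cuts[0] / 100)
    upper = s.quantile(cuts[1] / 100)
    return s.clip(lower=lower, upper=upper)


def trim_series(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Set values outside the cut-off percentiles to NaN."""
    lower = s.quantile(cuts[0] / 100)
    upper = s.quantile(cuts[1] / 100)
    # Series.where keeps values where the condition holds, NaN elsewhere
    return s.where((s >= lower) & (s <= upper))


wage = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
winsorized = winsorize_series(wage, cuts=(10, 90))  # extremes pulled inward
trimmed = trim_series(wage, cuts=(10, 90))          # extremes become NaN
```

Winsorizing keeps every observation and only changes the extreme values, while trimming leaves missing values where the extremes were.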
Main Function
winsor2(data, varlist, cuts=(1, 99), suffix=None, replace=False, trim=False, by=None, label=False)
Parameters:
- `data` (DataFrame): Input pandas DataFrame
- `varlist` (list): List of column names to process
- `cuts` (tuple): Percentiles for winsorizing/trimming (default: (1, 99))
- `suffix` (str): Suffix for new variables (default: '_w' for winsor, '_tr' for trim)
- `replace` (bool): Replace original variables (default: False)
- `trim` (bool): Trim instead of winsorize (default: False)
- `by` (str or list): Group variables for group-wise processing
- `label` (bool): Add descriptive labels to new columns (default: False)
Returns:
DataFrame: Processed DataFrame with winsorized/trimmed variables
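The naming rules described for `suffix`, `replace`, and `trim` can be summarized in a small sketch. The helper `output_column` is hypothetical and not part of pywinsor2. It also assumes that `replace=True` takes precedence over `suffix`, which the description above does not confirm.

```python
def output_column(var, suffix=None, replace=False, trim=False):
    """Return the column name that would hold the processed values."""
    if replace:
        # Assumption: replacing overwrites the original column name
        return var
    if suffix is None:
        # Documented defaults: '_w' for winsorizing, '_tr' for trimming
        suffix = "_tr" if trim else "_w"
    return f"{var}{suffix}"


default_name = output_column("wage")                   # 'wage_w'
trimmed_name = output_column("wage", trim=True)        # 'wage_tr'
custom_name = output_column("wage", suffix="_clean")   # 'wage_clean'
replaced_name = output_column("wage", replace=True)    # 'wage'
```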
Examples
Basic Usage
import pandas as pd
import pywinsor2 as pw2
# Create sample data
data = pd.DataFrame({
    'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],  # outlier: 100
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]
})
# Winsorize at default percentiles (1, 99)
result = pw2.winsor2(data, ['wage'])
print(result['wage_w']) # New winsorized variable
# Winsorize multiple variables
result = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))
# Trim outliers
result = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))
print(result['wage_tr']) # Trimmed variable
Group-wise Processing
# Winsorize within groups
data = pd.DataFrame({
    'wage': [1, 2, 3, 10, 1, 2, 3, 15],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})
result = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))
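
Group-wise winsorizing computes the cut-off percentiles separately inside each group. The sketch below illustrates the idea with plain pandas `groupby`. It is not pywinsor2's implementation: the helper `winsorize_by_group` is hypothetical, and the percentiles use pandas' default linear interpolation, so the values pywinsor2 returns may differ.

```python
import pandas as pd


def winsorize_by_group(df, column, by, cuts=(1, 99)):
    """Clamp `column` to per-group percentiles; adds a '<column>_w' column."""
    lo, hi = cuts[0] / 100, cuts[1] / 100
    grouped = df.groupby(by)[column]
    # transform broadcasts each group's percentile back onto its rows
    lower = grouped.transform(lambda s: s.quantile(lo))
    upper = grouped.transform(lambda s: s.quantile(hi))
    out = df.copy()
    out[f"{column}_w"] = df[column].clip(lower=lower, upper=upper)
    return out


sample = pd.DataFrame({
    'wage': [1.0, 2.0, 3.0, 10.0, 1.0, 2.0, 3.0, 15.0],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})
grouped_result = winsorize_by_group(sample, 'wage', by='industry', cuts=(25, 75))
```

In this sketch the upper bound differs between the two industries, because each group's 75th percentile is computed from its own wages.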
Advanced Options
# Replace original variables
pw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))
# Custom suffix and labels
result = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)
Comparison with Stata
| Stata Command | Python Equivalent |
|---|---|
| `winsor2 wage` | `pw2.winsor2(df, ['wage'])` |
| `winsor2 wage, cuts(5 95)` | `pw2.winsor2(df, ['wage'], cuts=(5, 95))` |
| `winsor2 wage, trim` | `pw2.winsor2(df, ['wage'], trim=True)` |
| `winsor2 wage, by(industry)` | `pw2.winsor2(df, ['wage'], by='industry')` |
| `winsor2 wage, replace` | `pw2.winsor2(df, ['wage'], replace=True)` |
License
MIT License
Author
Bryce Wang - brycew6m@stanford.edu
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.