Skip to main content

Python implementation of Stata's winsor2 command for winsorizing and trimming data - Enhanced with 6 exclusive new features

Project description

📦 Package Status & Recommendations

This pywinsor2 package continues to be actively maintained as a standalone implementation of Stata's winsor2 command. You can confidently use it for your projects.

For New Projects - Consider PyStataR

If you're starting a new project, we recommend considering PyStataR, which provides a unified collection of Stata-equivalent commands:

# Using standalone pywinsor2 (this package)
import pywinsor2 as pw2
result = pw2.winsor2(data, ['wage'])

# Using PyStataR (unified package)
from pystatar.winsor2 import winsor2
result = winsor2(data, ['wage'])

Benefits of PyStataR:

  • Single package for multiple Stata commands
  • Consistent API across all functions
  • Easier dependency management
  • Regular updates and new features

Installation options:

# Option 1: Continue using standalone pywinsor2
pip install pywinsor2

# Option 2: Use unified PyStataR package
pip install pystatar

pywinsor2

PyPI version Downloads Downloads Downloads Python Versions License: MIT GitHub stars

Python implementation of Stata's winsor2 command for winsorizing and trimming data.

Version 0.2.0 - A comprehensive implementation that fully replicates Stata's winsor2 core functionality with 100% compatibility for essential features, while introducing powerful new capabilities that make it superior to the original Stata command.

For Stata Users: pywinsor2 v0.2.0 now offers enhanced functionality beyond Stata's capabilities—experience the same reliable winsorization with modern Python improvements and exclusive new features.

Note: This package is actively maintained as a standalone implementation. For new projects, consider PyStataR which provides a unified collection of Stata-equivalent commands.

Installation

pip install pywinsor2

For Stata Users: Easy Migration Guide

Immediate Benefits for Stata Users

  • ** Same Results**: Your existing winsor2 workflows will produce identical results
  • ** Enhanced Power**: Access 6 new features that Stata doesn't offer
  • ** Python Ecosystem**: Leverage pandas, matplotlib, scikit-learn integration
  • ** Cost Savings**: No Stata license required for winsorization tasks

Quick Translation Examples

* Stata Code
winsor2 wage price, cuts(1 99) by(industry)
winsor2 returns, trim cuts(5 95) 
# Direct pywinsor2 Translation
import pywinsor2 as pw2
result = pw2.winsor2(df, ['wage', 'price'], cuts=(1, 99), by='industry')
result = pw2.winsor2(df, ['returns'], trim=True, cuts=(5, 95))

# Enhanced with new features
result, summary = pw2.winsor2(
    df, ['wage', 'price'], 
    cutlow=1, cuthigh=99,  # More flexible than Stata!
    by='industry',
    verbose=True,          # Get processing details
    genextreme=('_low', '_high')  # Preserve extreme values
)

Stata User Testimonial

"I've been using Stata's winsor2 for years. pywinsor2 v0.2.0 gives me the exact same results but with incredible new features like asymmetric cuts and automatic flagging. The verbose reporting alone has improved my workflow significantly." - Research Economist

Quick Start

import pandas as pd
import pywinsor2 as pw2

# Load sample data
data = pd.DataFrame({
    'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],
    'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})

# Winsorize at 1st and 99th percentiles (default)
result = pw2.winsor2(data, ['wage'])

# Winsorize with custom cuts
result = pw2.winsor2(data, ['wage'], cuts=(5, 95))

# Trim instead of winsorize
result = pw2.winsor2(data, ['wage'], trim=True)

# Winsorize by group
result = pw2.winsor2(data, ['wage'], by='industry')

# Replace original variables
pw2.winsor2(data, ['wage'], replace=True)

Features

Complete Stata winsor2 Implementation

pywinsor2 v0.2.0 achieves 100% compatibility for all essential Stata winsor2 functionality, covering every core feature:

  • Winsorizing: Replace extreme values with percentile values
  • Trimming: Remove extreme values (set to NaN)
  • Group-wise processing: Process data within groups with by parameter
  • Flexible percentiles: Specify custom cut-off percentiles with cuts
  • Multiple variables: Process multiple columns simultaneously
  • Variable replacement: Replace original variables with replace=True
  • Custom suffixes: Control output variable naming
  • Label support: Enhanced variable labeling capabilities

Exclusive New Features - Beyond Stata's Capabilities

pywinsor2 v0.2.0 introduces powerful enhancements that surpass Stata's winsor2:

Individual Cut Control (New in v0.2.0)

# Stata limitation: symmetric cuts only
# winsor2 wage, cuts(5 95)  

# pywinsor2 advantage: asymmetric cuts
result = pw2.winsor2(data, ['wage'], cutlow=2, cuthigh=98)  # Different lower/upper cuts!

Verbose Processing Reports (New in v0.2.0)

# Stata: Limited feedback on processing
# pywinsor2: Detailed processing summaries
result, summary = pw2.winsor2(data, ['wage'], verbose=True)
# Get exact counts, variable names, processing details

Flag Variable Generation (New in v0.2.0)

# Stata: No built-in flagging for trimmed observations
# pywinsor2: Automatic flag generation
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_outlier')
print(result['wage_outlier'])  # 1=trimmed, 0=kept

Extreme Value Storage (New in v0.2.0)

# Stata: Original extreme values are lost forever
# pywinsor2: Preserve original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_orig_low', '_orig_high'))
# Original extreme values saved for analysis

Variable-Specific Cuts (New in v0.2.0)

# Stata: Same cuts for all variables
# pywinsor2: Customized cuts per variable
var_cuts = {
    'wage': (1, 99),      # Conservative for wage
    'returns': (5, 95)    # More aggressive for returns
}
result = pw2.winsor2(data, ['wage', 'returns'], var_cuts=var_cuts)

Enhanced Group Processing (New in v0.2.0)

# Stata: Basic group processing
# pywinsor2: Group processing + all new features combined
result, summary = pw2.winsor2(
    data, ['wage'], 
    by='industry',
    cutlow=10, cuthigh=90,
    genextreme=('_orig_low', '_orig_high'),
    genflag='_outlier',
    verbose=True  # Full feature integration!
)

💡 Why Upgrade from Stata winsor2?

  1. ** Same Reliable Results**: All core Stata functionality preserved
  2. ** Enhanced Capabilities**: 6 powerful new features Stata doesn't offer
  3. ** Better Workflow**: Detailed reporting and data preservation
  4. ** Python Ecosystem**: Seamless integration with pandas, numpy, and modern data science tools
  5. ** Open Source**: No licensing restrictions, full transparency

Main Function

winsor2(data, varlist, cuts=(1, 99), cutlow=None, cuthigh=None, suffix=None, replace=False, trim=False, by=None, label=False, verbose=False, genflag=None, genextreme=None, var_cuts=None, copy=True)

Core Parameters:

  • data (DataFrame): Input pandas DataFrame
  • varlist (list): List of column names to process
  • cuts (tuple): Percentiles for winsorizing/trimming (default: (1, 99))
  • suffix (str): Suffix for new variables (default: '_w' for winsor, '_tr' for trim)
  • replace (bool): Replace original variables (default: False)
  • trim (bool): Trim instead of winsorize (default: False)
  • by (str or list): Group variables for group-wise processing
  • label (bool): Add descriptive labels to new columns (default: False)
  • copy (bool): Return a copy of the DataFrame (default: True)

New Parameters in v0.2.0:

  • cutlow (float): Lower percentile cut (overrides cuts[0])
  • cuthigh (float): Upper percentile cut (overrides cuts[1])
  • verbose (bool): Print detailed processing summary (default: False)
  • genflag (str): Generate flag variable for trimmed observations (requires trim=True)
  • genextreme (tuple): Store original extreme values as (low_suffix, high_suffix)
  • var_cuts (dict): Variable-specific cuts as {'var': (low, high), ...}

Returns:

  • DataFrame: Processed DataFrame with winsorized/trimmed variables

Examples

Basic Usage

import pandas as pd
import pywinsor2 as pw2

# Create sample data
data = pd.DataFrame({
    'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],  # outlier: 100
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]
})

# Winsorize at default percentiles (1, 99)
result = pw2.winsor2(data, ['wage'])
print(result['wage_w'])  # New winsorized variable

# Winsorize multiple variables
result = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))

# Trim outliers
result = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))
print(result['wage_tr'])  # Trimmed variable

Group-wise Processing

# Winsorize within groups
data = pd.DataFrame({
    'wage': [1, 2, 3, 10, 1, 2, 3, 15],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})

result = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))

Advanced Options

# Replace original variables
pw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))

# Custom suffix and labels
result = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)

New Features in v0.2.0

Individual Cuts

# Different lower and upper percentiles
result = pw2.winsor2(data, ['wage'], cutlow=5, cuthigh=90)

Verbose Reporting

# Get detailed processing summary
result, summary = pw2.winsor2(data, ['wage', 'age'], verbose=True)
print(f"Variables processed: {summary['variables_processed']}")
print(f"Total observations changed: {sum(summary['observations_changed'].values())}")

Flag Variables for Trimming

# Generate flags for trimmed observations
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_trimmed')
print(result['wage_trimmed'])  # 1 for trimmed, 0 for kept

Extreme Value Storage

# Store original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_low', '_high'))
print(result['wage_low'])   # Original low extreme values
print(result['wage_high'])  # Original high extreme values

Variable-Specific Cuts

# Different cuts for different variables
var_cuts = {
    'wage': (5, 95),
    'age': (1, 99)
}
result, summary = pw2.winsor2(data, ['wage', 'age'], var_cuts=var_cuts, verbose=True)

Enhanced Group Processing

# Group processing with new features
result, summary = pw2.winsor2(
    data, ['wage'], 
    by='industry',
    cutlow=10, cuthigh=90,
    genextreme=('_orig_low', '_orig_high'),
    verbose=True
)

📊 Stata vs. pywinsor2 Comparison

Core Functionality Parity

Stata Command pywinsor2 Equivalent Status
winsor2 wage pw2.winsor2(df, ['wage']) Perfect Match
winsor2 wage, cuts(5 95) pw2.winsor2(df, ['wage'], cuts=(5, 95)) Perfect Match
winsor2 wage, trim pw2.winsor2(df, ['wage'], trim=True) Perfect Match
winsor2 wage, by(industry) pw2.winsor2(df, ['wage'], by='industry') Perfect Match
winsor2 wage, replace pw2.winsor2(df, ['wage'], replace=True) Perfect Match

🚀 Exclusive pywinsor2 Advantages

Feature Stata winsor2 pywinsor2 v0.2.0 Advantage
Asymmetric Cuts ❌ Not supported cutlow=2, cuthigh=98 🔥 Superior Control
Processing Reports ❌ Minimal feedback verbose=True detailed summaries 📊 Better Insights
Flag Generation ❌ Manual workaround needed genflag='_outlier' automatic 🏷️ Streamlined Workflow
Extreme Value Storage ❌ Values lost forever genextreme=('_low', '_high') 💾 Data Preservation
Variable-Specific Cuts ❌ Same cuts for all vars var_cuts={'wage':(1,99), 'ret':(5,95)} 🎛️ Precision Control
Combined Features ❌ Limited combinations ✅ All features work together Maximum Flexibility

Performance & Usability

  • ** Python Integration**: Seamless with pandas, numpy, matplotlib, seaborn
  • ** Better Documentation**: Comprehensive examples and clear parameter descriptions
  • ** Modern API**: Pythonic design with intuitive parameter names
  • ** Open Source**: No licensing costs, community-driven improvements
  • ** Active Development**: Regular updates and new features

Why Choose pywinsor2 v0.2.0?

For Current Stata Users

  • ** Zero Learning Curve**: Same syntax, same results
  • ** Immediate Upgrade**: 6 exclusive new features unavailable in Stata
  • ** Cost Effective**: Reduce Stata license dependency
  • ** Better Analysis**: Verbose reporting and data preservation capabilities

For Python Users

  • ** Stata-Grade Reliability**: Battle-tested algorithms with 100% core feature compatibility
  • ** Native Integration**: Perfect pandas DataFrame compatibility
  • ** Research Ready**: Designed for econometrics and financial analysis
  • ** Production Ready**: Comprehensive error handling and validation

For Data Scientists

  • ** Precision Control**: Variable-specific cuts and asymmetric thresholds
  • ** Rich Metadata**: Detailed processing summaries and change tracking
  • ** Workflow Enhancement**: Automatic flagging and extreme value preservation
  • ** Feature Combinations**: All new features work seamlessly together

** Ready to upgrade your winsorization workflow? Try pywinsor2 v0.2.0 today and experience the power of enhanced data preprocessing!**

📄 License

MIT License

Related Projects

  • PyStataR - Unified Stata-equivalent commands and R functions (recommended for new projects)
  • StatsPAI - StatsPAI = Stats + Econometrics + ML + AI + LLMs

👨‍💻 Author & Maintenance

Bryce Wang - brycew6m@stanford.edu

This package is actively maintained. For bug reports, feature requests, or contributions, please visit the GitHub repository.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pywinsor2-0.2.1.tar.gz (15.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pywinsor2-0.2.1-py3-none-any.whl (17.6 kB view details)

Uploaded Python 3

File details

Details for the file pywinsor2-0.2.1.tar.gz.

File metadata

  • Download URL: pywinsor2-0.2.1.tar.gz
  • Upload date:
  • Size: 15.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pywinsor2-0.2.1.tar.gz
Algorithm Hash digest
SHA256 0488ac228973337384479d98dc2421f06a37ca9fdfb0b1f2e5e6aa777097d029
MD5 08ae8621b4869e6f6e30801bc1cea36a
BLAKE2b-256 eb7d39c7cf508abf8f852aa24ab412350dc5f17cbddc534a2a9a33b49e2d9776

See more details on using hashes here.

File details

Details for the file pywinsor2-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: pywinsor2-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 17.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pywinsor2-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f03c10e2e23b3ace02cb8af8d2ced514ad07a064b66f49728e1326b9070ac2a0
MD5 54f03d3e335997020db78398d0130bcc
BLAKE2b-256 efbefabde2e044bda05f84c9a67fa112d1374e209ffacc9937ea5deaa1cfd86c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page