pywinsor2

Python implementation of Stata's winsor2 command for winsorizing and trimming data.

Installation

pip install pywinsor2

Quick Start

import pandas as pd
import pywinsor2 as pw2

# Build sample data
data = pd.DataFrame({
    'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],
    'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})

# Winsorize at 1st and 99th percentiles (default)
result = pw2.winsor2(data, ['wage'])

# Winsorize with custom cuts
result = pw2.winsor2(data, ['wage'], cuts=(5, 95))

# Trim instead of winsorize
result = pw2.winsor2(data, ['wage'], trim=True)

# Winsorize by group
result = pw2.winsor2(data, ['wage'], by='industry')

# Replace original variables
pw2.winsor2(data, ['wage'], replace=True)

Features

  • Winsorizing: Replace values beyond the cut-off percentiles with the values at those percentiles
  • Trimming: Drop values beyond the cut-off percentiles by setting them to NaN
  • Group-wise processing: Apply the cut-offs separately within each group
  • Flexible percentiles: Specify custom cut-off percentiles
  • Multiple variables: Process multiple columns simultaneously
  • Stata compatibility: API designed to match Stata's winsor2 command
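To make the difference between winsorizing and trimming concrete, here is a minimal sketch in plain pandas and NumPy. It is not pywinsor2's implementation: the helper names `winsorize_series` and `trim_series` are hypothetical, and the percentiles use NumPy's default linear interpolation, which may differ from the definition pywinsor2 or Stata uses.

```python
import numpy as np
import pandas as pd


def winsorize_series(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Clamp values outside the cut-off percentiles to those percentiles."""
    low, high = np.nanpercentile(s, cuts)
    return s.clip(lower=low, upper=high)


def trim_series(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Set values outside the cut-off percentiles to NaN."""
    low, high = np.nanpercentile(s, cuts)
    return s.where((s >= low) & (s <= high))


wage = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# With NumPy's linear interpolation the 10th and 90th percentiles
# of this series are 1.9 and 18.1.
winsorized = winsorize_series(wage, cuts=(10, 90))  # 1 -> 1.9, 100 -> 18.1
trimmed = trim_series(wage, cuts=(10, 90))          # 1 and 100 -> NaN

print(pd.DataFrame({'wage': wage, 'winsorized': winsorized, 'trimmed': trimmed}))
```

Both results keep the original length and index; winsorizing changes the extreme values, while trimming marks them as missing.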

Main Function

winsor2(data, varlist, cuts=(1, 99), suffix=None, replace=False, trim=False, by=None, label=False)

Parameters:

  • data (DataFrame): Input pandas DataFrame
  • varlist (list): List of column names to process
  • cuts (tuple): Lower and upper percentiles for winsorizing/trimming (default: (1, 99))
  • suffix (str): Suffix for new variables (default: None, which resolves to '_w' for winsor and '_tr' for trim)
  • replace (bool): Replace original variables (default: False)
  • trim (bool): Trim instead of winsorize (default: False)
  • by (str or list): Group variables for group-wise processing (default: None)
  • label (bool): Add descriptive labels to new columns (default: False)

Returns:

  • DataFrame: Processed DataFrame with winsorized/trimmed variables
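The interplay of suffix, replace, and trim can be summarized with a small sketch of how an output column name might be resolved. The helper `output_column` is hypothetical and not part of pywinsor2. It follows the documented defaults; giving replace priority over an explicit suffix is an assumption.

```python
def output_column(var, suffix=None, replace=False, trim=False):
    """Hypothetical helper: name of the column that would hold the result."""
    if replace:
        # Assumption: replace overwrites the original column and ignores suffix.
        return var
    if suffix is None:
        # Documented defaults: '_w' when winsorizing, '_tr' when trimming.
        suffix = '_tr' if trim else '_w'
    return var + suffix


print(output_column('wage'))                   # wage_w
print(output_column('wage', trim=True))        # wage_tr
print(output_column('wage', suffix='_clean'))  # wage_clean
print(output_column('wage', replace=True))     # wage
```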

Examples

Basic Usage

import pandas as pd
import pywinsor2 as pw2

# Create sample data
data = pd.DataFrame({
    'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],  # outlier: 100
    'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]
})

# Winsorize at default percentiles (1, 99)
result = pw2.winsor2(data, ['wage'])
print(result['wage_w'])  # New winsorized variable

# Winsorize multiple variables
result = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))

# Trim outliers
result = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))
print(result['wage_tr'])  # Trimmed variable

Group-wise Processing

# Winsorize within groups
data = pd.DataFrame({
    'wage': [1, 2, 3, 10, 1, 2, 3, 15],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})

result = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))
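For intuition about what group-wise winsorizing does, the same idea can be sketched with a plain pandas groupby. This is an illustration under assumptions, not pywinsor2's code: `clip_to_percentiles` is a hypothetical helper, and the percentiles use NumPy's default linear interpolation, so the exact cut-off values may differ from what winsor2 returns.

```python
import numpy as np
import pandas as pd

# Same values as the example above, stored as floats so clipping
# to fractional percentiles does not change the dtype.
data = pd.DataFrame({
    'wage': [1.0, 2.0, 3.0, 10.0, 1.0, 2.0, 3.0, 15.0],
    'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})


def clip_to_percentiles(s, cuts=(25, 75)):
    """Clamp a group's values to that group's own percentiles."""
    low, high = np.nanpercentile(s, cuts)
    return s.clip(lower=low, upper=high)


# transform applies the helper to each industry separately and
# returns a result aligned with the original rows.
data['wage_w'] = data.groupby('industry')['wage'].transform(clip_to_percentiles)
print(data)
```

Each industry gets its own cut-offs: with this percentile definition the upper bound is 4.75 in group A and 6.0 in group B, so the outliers 10 and 15 are clamped to different values.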

Advanced Options

# Replace original variables
pw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))

# Custom suffix and labels
result = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)

Comparison with Stata

| Stata Command | Python Equivalent |
|---|---|
| winsor2 wage | pw2.winsor2(df, ['wage']) |
| winsor2 wage, cuts(5 95) | pw2.winsor2(df, ['wage'], cuts=(5, 95)) |
| winsor2 wage, trim | pw2.winsor2(df, ['wage'], trim=True) |
| winsor2 wage, by(industry) | pw2.winsor2(df, ['wage'], by='industry') |
| winsor2 wage, replace | pw2.winsor2(df, ['wage'], replace=True) |

License

MIT License

Author

Bryce Wang - brycew6m@stanford.edu

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
