A package for outlier detection in phenome datasets
phenome-outlier-analysis
OutlierDetector Class Documentation
Overview
The OutlierDetector class detects outliers in datasets after normalizing each analyte column with one of two methods: double MAD (Median Absolute Deviation) or z-score. It supports both context-specific (segmented) and global outlier detection strategies.
Class Initialization
OutlierDetector(df, analyte_columns, segment_columns=['sex'])
Parameters:
- df (pandas.DataFrame): The input DataFrame containing the data to be analyzed.
- analyte_columns (list): A list of column names to be analyzed for outliers.
- segment_columns (list, optional): A list of column names used for segmentation in context-specific outlier detection. Defaults to ['sex'].
Main Methods
1. perform_outlier_detection
perform_outlier_detection(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False)
This is the primary method to perform outlier detection on the given DataFrame.
Parameters:
- lower_percentile (float): Lower percentile for cutoff calculation. Default is 0.01.
- upper_percentile (float): Upper percentile for cutoff calculation. Default is 0.99.
- method (str): Normalization method. Can be 'double_mad' or 'zscore'. Default is 'double_mad'.
- take_log (bool): Whether to apply log transformation before normalization. Default is False.
Returns:
A tuple containing two dictionaries:
- Context-specific results
- Super-global results
2. context_specific_outlier_detection
context_specific_outlier_detection(method='double_mad', take_log=False)
Performs context-specific outlier detection by segmenting the DataFrame based on the segment_columns.
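The segmentation step can be pictured as a groupby over each segment column. The sketch below is an illustration under assumptions, not the package's implementation: the helper name iter_segments and the sample data are hypothetical, and the (column, value) key shape is inferred from the usage example further down.

```python
import pandas as pd


def iter_segments(df, segment_columns):
    """Yield ((column, value), subset) for every value of every segment column.

    Hypothetical helper for illustration; not part of OutlierDetector.
    """
    for column in segment_columns:
        # Grouping by a single column name yields scalar group keys.
        for value, subset in df.groupby(column):
            yield (column, value), subset


# Made-up sample data.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "glucose": [4.8, 5.1, 5.0, 9.7, 5.2],
})

segments = dict(iter_segments(df, ["sex"]))
for key, subset in segments.items():
    print(key, len(subset))
```

Each subset could then be normalized and thresholded on its own, so a value is judged against its own segment rather than against the whole dataset.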
3. super_global_outlier_detection
super_global_outlier_detection(method='double_mad', take_log=False)
Evaluates outliers on a global scale, considering all data points together.
Helper Methods
calculate_double_mad
Calculates left and right Median Absolute Deviations (MADs) from the median.
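As a rough illustration of the idea, the following sketch computes a left and a right MAD with NumPy. It assumes one common convention, in which points equal to the median count toward both halves and no consistency constant is applied; the package may use a different convention, and the function name double_mad is hypothetical.

```python
import numpy as np


def double_mad(values):
    """Return (left_mad, right_mad) for a 1-D collection of numbers.

    Assumption: the median itself is included in both halves.
    """
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                 # ignore missing values
    median = np.median(x)
    abs_dev = np.abs(x - median)
    left_mad = np.median(abs_dev[x <= median])
    right_mad = np.median(abs_dev[x >= median])
    return left_mad, right_mad


# Right-skewed sample: the right MAD is much larger than the left MAD.
left, right = double_mad([1, 2, 3, 4, 10, 20, 30])
print(left, right)
```

Using two separate MADs lets the spread on each side of the median be measured independently, which is the usual motivation for this approach on skewed data.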
normalize_series
Normalizes a series using the specified method (double_mad or zscore).
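A minimal sketch of what the two normalization modes could look like follows. The function name normalize, the handling of values equal to the median, and the absence of a scaling constant are all assumptions made for illustration; this is not the package's code.

```python
import numpy as np
import pandas as pd


def normalize(series, method="double_mad"):
    """Return a normalized copy of `series` (hypothetical helper)."""
    if method == "zscore":
        # Classic z-score using the sample standard deviation.
        return (series - series.mean()) / series.std()
    if method == "double_mad":
        median = series.median()
        deviation = series - median
        left_mad = deviation[series <= median].abs().median()
        right_mad = deviation[series >= median].abs().median()
        # Scale each point by the MAD of its own side of the median.
        # Note: a MAD of zero would produce inf or NaN here.
        scale = pd.Series(
            np.where(series <= median, left_mad, right_mad),
            index=series.index,
        )
        return deviation / scale
    raise ValueError(f"unknown method: {method!r}")


s = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0])
z = normalize(s, method="zscore")
d = normalize(s, method="double_mad")
print(d.round(3).tolist())
```

When take_log is enabled, a log transform such as np.log would presumably be applied to the series before a step like this; which logarithm the package uses is not stated in this documentation.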
calculate_percentile_cutoffs
Calculates global percentile cutoffs based on the specified columns of a DataFrame.
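One plausible reading of "global" is that values from all specified columns are pooled before the percentiles are taken. The sketch below follows that reading as an assumption; the function name percentile_cutoffs is hypothetical and the package may compute cutoffs differently (for example, per column).

```python
import numpy as np
import pandas as pd


def percentile_cutoffs(df, columns, lower_percentile=0.01, upper_percentile=0.99):
    """Return (lower, upper) cutoffs over the pooled values of `columns`.

    Assumption: all columns share one pair of cutoffs.
    """
    pooled = df[columns].to_numpy(dtype=float).ravel()
    # nanquantile ignores missing values in the pooled array.
    lower, upper = np.nanquantile(pooled, [lower_percentile, upper_percentile])
    return float(lower), float(upper)


# Made-up data: the pooled values are the integers 0 through 9.
df = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 6.0, 7.0, 8.0, 9.0],
})
lower, upper = percentile_cutoffs(df, ["a", "b"], 0.1, 0.9)
print(lower, upper)
```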
create_binary_matrix
Creates a binary matrix indicating outliers based on specified cutoffs.
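To make the idea concrete, the sketch below marks every value outside a pair of cutoffs with 1 and everything else with 0. The function name binary_matrix, the strict inequalities, and the sample values are assumptions for illustration only.

```python
import pandas as pd


def binary_matrix(normalized, lower_cutoff, upper_cutoff):
    """Return a 0/1 DataFrame with 1 where a value lies outside the cutoffs.

    Missing values compare as False and are therefore marked 0.
    """
    outside = (normalized < lower_cutoff) | (normalized > upper_cutoff)
    return outside.astype(int)


# Made-up normalized values.
normalized = pd.DataFrame({
    "a": [-3.2, 0.1, 0.4],
    "b": [0.0, 2.9, -0.5],
})
matrix = binary_matrix(normalized, lower_cutoff=-2.5, upper_cutoff=2.5)
print(matrix.sum())  # number of flagged values per column
```

Summing the matrix column-wise, as the usage example does with result['binary_matrix'].sum(), gives the number of flagged values per analyte.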
normalize_dataframe
Normalizes specified columns in a DataFrame.
detect_outliers
Detects outliers in the specified columns of a DataFrame.
get_global_cutoffs
Gets global cutoffs for outlier detection.
Usage Example
import pandas as pd
from outlier_detection import OutlierDetector

# Load your data
df = pd.read_csv('your_data.csv')

# Define columns
analyte_columns = ['column1', 'column2', 'column3']
segment_columns = ['sex', 'age_group']

# Create OutlierDetector instance
detector = OutlierDetector(df, analyte_columns, segment_columns)

# Perform outlier detection
context_results, global_results = detector.perform_outlier_detection(
    lower_percentile=0.01,
    upper_percentile=0.99,
    method='double_mad',
    take_log=True
)

# Analyze results: number of flagged values per analyte in each segment
for (segment, value), result in context_results.items():
    print(f"Outliers for {segment}={value}:")
    print(result['binary_matrix'].sum())

# Number of flagged values per analyte across the whole dataset
print("Global outliers:")
print(global_results[('global', 'global')]['binary_matrix'].sum())
Notes
- The class uses logging to provide information and warnings during the outlier detection process.
- The tqdm library is used to show progress bars for long-running operations.
- The class can handle both context-specific (segmented) and global outlier detection.
- Two normalization methods are supported: 'double_mad' (double Median Absolute Deviation) and 'zscore'.
- Log transformation can be applied before normalization if needed.
Download files
Source Distribution
Hashes for phenome_outlier_analysis-0.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | fadd5fbd5befc06f2e8f97c232dd3ae003b7da4afc58eded04181cda165fb0b5
MD5 | bf68e33662e2a91f0f211f18369251c5
BLAKE2b-256 | da67a4edc5c168a8fdd90d80c40708c66ad12443a9441cab2fc80458031d38ab
Built Distribution
Hashes for phenome_outlier_analysis-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 37fce37970dc8e0aa6de056acba5719c8c578750b53aa975d0091dbbaf309f78
MD5 | aaf8e8ca3974fb754999df418537654c
BLAKE2b-256 | 564b9ccddb69fcf58ca09bc5de16dfd9d7328efdb4afcd16c6a18df799300d5a