A Python package for data manipulation and analysis utilities
📚 Python Script Documentation for main.py
Welcome to the documentation for the main.py file. This file contains a series of utility functions designed to manipulate, transform, and analyze pandas DataFrames. The main modules used in this script are pandas, numpy, itertools, and matplotlib.
Installation
pip install phenome-utils
OR
pip install git+https://git.phenome.health/trent.leslie/phenome-utils
📑 Index
- sum_and_sort_columns
- binary_threshold_matrix_by_col
- binary_threshold_matrix_by_row
- concatenate_csvs_in_directory
- aggregate_function
- aggregate_duplicates
- generate_subcategories
- generate_subcategories_with_proportions
- bin_continuous
- load_latest_yyyymmdd_file
- remove_rows_with_na_threshold
- impute_na_in_columns
1. sum_and_sort_columns
Description
Sum the numerical columns of a DataFrame, remove columns with a sum of zero, and sort the columns in descending order based on their sum. Optionally, plot a histogram of the non-zero column sums.
Parameters
Parameter | Type | Description
---|---|---
df | DataFrame | The input DataFrame.
plot_histogram | bool | Whether to plot a histogram of the non-zero column sums. Defaults to False.
Returns
Type | Description
---|---
DataFrame | A DataFrame with columns sorted in descending order based on their sum, zero-sum columns removed, and non-numeric columns preserved as the first columns.
Example Usage
# Example code demonstrating usage
sorted_df = sum_and_sort_columns(df, plot_histogram=True)
📊 Visualization
If plot_histogram is set to True, a histogram of the non-zero column sums will be displayed.
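A slightly fuller sketch, assuming the function is importable from a top-level phenome_utils module (adjust the import if your install exposes it differently):
import pandas as pd
from phenome_utils import sum_and_sort_columns  # assumed import path

df = pd.DataFrame({
    "label": ["a", "b", "c"],  # non-numeric column, expected to be kept first
    "x": [1, 2, 3],            # sum = 6
    "y": [0, 0, 0],            # zero-sum column, expected to be dropped
    "z": [10, 20, 30],         # sum = 60, expected to sort ahead of "x"
})
sorted_df = sum_and_sort_columns(df, plot_histogram=False)
print(sorted_df.columns.tolist())  # expected: ['label', 'z', 'x']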
2. binary_threshold_matrix_by_col
Description
Convert a DataFrame's numerical columns to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
Parameters
Parameter | Type | Description
---|---|---
df | DataFrame | The input DataFrame.
lower_threshold | int | The lower percentile threshold. Defaults to 1.
upper_threshold | int | The upper percentile threshold. Defaults to 99.
second_df | DataFrame | A second DataFrame with 'subcategory' and 'decimal_proportion' columns.
decimal_proportion_threshold | float | Threshold for filtering the second DataFrame. Defaults to 0.1.
filter_column | str | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame.
Returns
Type | Description
---|---
DataFrame | A binary matrix where numerical column values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved.
Example Usage
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
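A sketch of a full call, including the second DataFrame; the column names and values are illustrative only, and the import path is assumed:
import pandas as pd
from phenome_utils import binary_threshold_matrix_by_col  # assumed import path

df = pd.DataFrame({
    "category": ["alpha", "alpha", "beta", "beta"],
    "value": [0.1, 50.0, 55.0, 999.0],
})
# Per the parameter docs, second_df needs 'subcategory' and 'decimal_proportion' columns.
second_df = pd.DataFrame({
    "subcategory": ["alpha", "beta"],
    "decimal_proportion": [0.6, 0.4],
})
binary_df = binary_threshold_matrix_by_col(
    df,
    lower_threshold=5,
    upper_threshold=95,
    second_df=second_df,
    decimal_proportion_threshold=0.1,
    filter_column="category",
)
The row-wise variant below, binary_threshold_matrix_by_row, documents the same parameters and can be called in the same way.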
3. binary_threshold_matrix_by_row
Description
Convert a DataFrame's numerical rows to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
Parameters
Parameter | Type | Description
---|---|---
df | DataFrame | The input DataFrame.
lower_threshold | int | The lower percentile threshold. Defaults to 1.
upper_threshold | int | The upper percentile threshold. Defaults to 99.
second_df | DataFrame | A second DataFrame with 'subcategory' and 'decimal_proportion' columns.
decimal_proportion_threshold | float | Threshold for filtering the second DataFrame. Defaults to 0.1.
filter_column | str | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame.
Returns
Type | Description
---|---
DataFrame | A binary matrix where numerical row values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved.
Example Usage
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_row(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
4. concatenate_csvs_in_directory
Description
Concatenate CSV files from a root directory and its subdirectories.
Parameters
Parameter | Type | Description
---|---|---
root_dir | str | The root directory to start the search.
filter_string | str | A string that must be in the filename for it to be included. Defaults to None.
file_extension | str | The file extension to search for. Defaults to "csv".
csv_filename | str | The filename to save the concatenated CSV. If not provided, returns the DataFrame.
Returns
Type | Description
---|---
DataFrame or None | A concatenated DataFrame of all the CSVs if csv_filename is not provided. Otherwise, saves the DataFrame and returns None.
Example Usage
# Example code demonstrating usage
concatenated_df = concatenate_csvs_in_directory('/path/to/directory', filter_string='data', csv_filename='output.csv')
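Both documented modes, sketched with an assumed import path and placeholder paths:
from phenome_utils import concatenate_csvs_in_directory  # assumed import path

# Return the combined DataFrame in memory.
combined = concatenate_csvs_in_directory(
    "/path/to/directory",
    filter_string="data",
    file_extension="csv",
)

# Or write the concatenated result to disk; per the docs this returns None.
concatenate_csvs_in_directory(
    "/path/to/directory",
    filter_string="data",
    csv_filename="output.csv",
)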
5. aggregate_function
Description
General-purpose aggregation function for a single pandas Series.
Parameters
Parameter | Type | Description
---|---|---
x | pd.Series | Input series.
numeric_method | str | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'.
substitute | any | Value to substitute when all values are NaN or the mode is empty. Default is np.nan.
Returns
Type | Description
---|---
any | Aggregated value.
Example Usage
# Example code demonstrating usage
aggregated_value = aggregate_function(pd.Series([1, 2, 3, np.nan]), numeric_method='mean')
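A sketch of the substitute fallback described above (import path assumed):
import numpy as np
import pandas as pd
from phenome_utils import aggregate_function  # assumed import path

all_nan = pd.Series([np.nan, np.nan])
# Every value is NaN, so the documented behavior is to return the substitute.
fallback = aggregate_function(all_nan, numeric_method="median", substitute=-1)
print(fallback)  # expected: -1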
6. aggregate_duplicates
Description
Aggregates duplicates in a dataframe based on specified grouping columns and a chosen aggregation method for numeric types.
Parameters
Parameter | Type | Description
---|---|---
df | pd.DataFrame | Input dataframe.
group_columns | list | List of column names to group by.
numeric_method | str | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'.
substitute | any | Value to substitute when all values are NaN or the mode is empty. Default is np.nan.
Returns
Type | Description
---|---
pd.DataFrame | Dataframe with aggregated duplicates.
Example Usage
# Example code demonstrating usage
aggregated_df = aggregate_duplicates(df, group_columns=['category'], numeric_method='mean')
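A sketch with a small DataFrame containing duplicates (import path assumed):
import pandas as pd
from phenome_utils import aggregate_duplicates  # assumed import path

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "value": [10, 20, 30],
})
# The two 'a' rows should collapse into one row with value 15.0 under 'mean'.
aggregated_df = aggregate_duplicates(df, group_columns=["category"], numeric_method="mean")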
7. generate_subcategories
Description
Generate subcategories by combining values from specified columns.
Parameters
Parameter | Type | Description
---|---|---
df | pd.DataFrame | The input dataframe.
columns | list | List of columns to generate subcategories from.
col_separator | str | Separator to use between column names for new columns. Defaults to '_'.
val_separator | str | Separator to use between values when combining. Defaults to ' '.
missing_val | str | Value to replace missing data in specified columns. Defaults to 'NA'.
Returns
Type | Description
---|---
pd.DataFrame | DataFrame with new subcategory columns.
list | List of all column names (original + generated).
Example Usage
# Example code demonstrating usage
df, all_columns = generate_subcategories(df, columns=['col1', 'col2'])
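A sketch using toy data; per the defaults, column names are joined with '_' and values with a space, and missing values become 'NA' (import path assumed):
import pandas as pd
from phenome_utils import generate_subcategories  # assumed import path

df = pd.DataFrame({
    "site": ["north", "south"],
    "sex": ["F", None],  # missing value expected to be replaced with 'NA'
})
df, all_columns = generate_subcategories(df, columns=["site", "sex"])
print(all_columns)  # original columns plus generated ones such as 'site_sex'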
8. generate_subcategories_with_proportions
Description
Generate subcategories by combining values from specified columns and calculate their proportions.
Parameters
Parameter | Type | Description
---|---|---
df | pd.DataFrame | The input dataframe.
columns | list | List of columns to generate subcategories from.
solo_columns | list | List of columns to be considered on their own.
col_separator | str | Separator to use between column names for new columns. Defaults to '_'.
val_separator | str | Separator to use between values when combining. Defaults to ' '.
missing_val | str | Value to replace missing data in specified columns. Defaults to 'NA'.
overall_category_name | str | Name for the overall category. Defaults to 'overall'.
Returns
Type | Description
---|---
pd.DataFrame | DataFrame with new subcategory columns.
list | List of all column names (original + generated).
pd.DataFrame | DataFrame with each subcategory and its decimal proportion.
Example Usage
# Example code demonstrating usage
df, all_columns, proportions_df = generate_subcategories_with_proportions(df, columns=['col1', 'col2'], solo_columns=['col3'])
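A sketch highlighting the third return value; its 'subcategory' / 'decimal_proportion' columns appear to match the second_df argument expected by the binary-threshold functions above (import path assumed):
import pandas as pd
from phenome_utils import generate_subcategories_with_proportions  # assumed import path

df = pd.DataFrame({
    "site": ["north", "north", "south"],
    "sex": ["F", "M", "F"],
})
df, all_columns, proportions_df = generate_subcategories_with_proportions(
    df,
    columns=["site", "sex"],
    solo_columns=["sex"],
)
print(proportions_df)  # one row per subcategory with its decimal proportion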
9. bin_continuous
Description
Bins continuous data in a specified column of a dataframe.
Parameters
Parameter | Type | Description
---|---|---
dataframe | pd.DataFrame | The input dataframe.
column_name | str | The name of the column containing continuous data to be binned.
bin_size | int | The size of each bin. Default is 10.
range_start | int | The starting value of the range for binning. Default is 0.
Returns
Type | Description
---|---
pd.DataFrame | The dataframe with an additional column for binned data.
Example Usage
# Example code demonstrating usage
binned_df = bin_continuous(df, column_name='age', bin_size=5)
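A sketch with the defaults spelled out; with bin_size=10 and range_start=0, ages 0-9 should fall in the first bin, 10-19 in the second, and so on (import path assumed):
import pandas as pd
from phenome_utils import bin_continuous  # assumed import path

df = pd.DataFrame({"age": [3, 17, 42, 68]})
binned_df = bin_continuous(df, column_name="age", bin_size=10, range_start=0)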
10. load_latest_yyyymmdd_file
Description
Load the latest file from the specified directory, based on the YYYYMMDD date in its filename.
Parameters
Parameter | Type | Description
---|---|---
directory | str | Path to the directory containing the files.
base_filename | str | Base name of the file.
file_extension | str | File extension including the dot (e.g., '.csv').
na_values | list or dict | Additional strings to recognize as NA/NaN.
Returns
Type | Description
---|---
pd.DataFrame | The loaded data.
Example Usage
# Example code demonstrating usage
df = load_latest_yyyymmdd_file("/path/to/directory", "data_", ".csv")
print(df.head())
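The function name suggests files carry a YYYYMMDD date stamp in their names (e.g., data_20240101.csv vs. data_20240630.csv, a hypothetical pattern) and the most recent one is loaded; the call below simply follows the documented signature with an assumed import path:
from phenome_utils import load_latest_yyyymmdd_file  # assumed import path

df = load_latest_yyyymmdd_file(
    "/path/to/directory",
    base_filename="data_",
    file_extension=".csv",
    na_values=["NA", "missing"],  # extra strings to treat as NaN
)
print(df.head())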
11. remove_rows_with_na_threshold
Description
Removes rows from the dataframe that have a fraction of NA values greater than the specified threshold.
Parameters
Parameter | Type | Description
---|---|---
df | pd.DataFrame | The input dataframe.
stringified_ids | list | List of column names to consider for NA value calculation.
threshold | float | The NA fraction above which a row is removed. Default is 0.5.
save_starting_df | bool | Whether to save the initial dataframe to a CSV file. Default is False.
Returns
Type | Description
---|---
pd.DataFrame | The dataframe with rows removed based on the threshold.
Example Usage
# Example code demonstrating usage
cleaned_df = remove_rows_with_na_threshold(df, stringified_ids=['col1', 'col2'], threshold=0.3)
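A sketch of the threshold behavior (import path assumed): the second row below is missing both tracked columns, so its NA fraction of 1.0 exceeds the 0.5 threshold and it should be dropped, while a row at exactly 0.5 is kept.
import numpy as np
import pandas as pd
from phenome_utils import remove_rows_with_na_threshold  # assumed import path

df = pd.DataFrame({
    "id": [1, 2, 3],
    "col1": [0.5, np.nan, 1.2],
    "col2": [0.9, np.nan, np.nan],
})
cleaned_df = remove_rows_with_na_threshold(df, stringified_ids=["col1", "col2"], threshold=0.5)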
12. impute_na_in_columns
Description
Imputes NA values in columns with either the minimum or median value of the column.
Parameters
Parameter | Type | Description
---|---|---
df | pd.DataFrame | The input dataframe.
method | str | Method for imputation. Either 'min' or 'median'. Default is 'median'.
Returns
Type | Description
---|---
pd.DataFrame | The dataframe with NA values imputed.
Example Usage
# Example code demonstrating usage
imputed_df = impute_na_in_columns(df, method='median')
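A sketch of median imputation (import path assumed): the NaN in the column below should be replaced by 2.0, the median of the observed values.
import numpy as np
import pandas as pd
from phenome_utils import impute_na_in_columns  # assumed import path

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 3.0]})
imputed_df = impute_na_in_columns(df, method="median")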
Each function is meticulously crafted to handle specific tasks related to data manipulation, transformation, and analysis. This documentation provides a comprehensive understanding of the capabilities and usage of each function within the main.py script. Happy coding! 🎉
ph_utils is a Python package that provides utility functions for data manipulation and analysis, particularly focused on working with pandas DataFrames.
Features
- Sum and sort DataFrame columns
- Generate binary threshold matrices
- Concatenate CSV files from directories
- Aggregate duplicates in DataFrames
- Generate subcategories from DataFrame columns
- Bin continuous data
- Load latest files based on date in filename
- Remove rows with NA values above a threshold
- Impute NA values in DataFrame columns
Installation
You can install ph_utils using pip:
pip install ph_utils
Usage
Here are some examples of how to use ph_utils:
import pandas as pd
from ph_utils import sum_and_sort_columns, binary_threshold_matrix_by_col, aggregate_duplicates
# Example 1: Sum and sort columns
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [0, 0, 0],
'C': [4, 5, 6],
'D': ['x', 'y', 'z']
})
result = sum_and_sort_columns(df)
print(result)
# Example 2: Create a binary threshold matrix
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=25, upper_threshold=75)
print(binary_df)
# Example 3: Aggregate duplicates
aggregated_df = aggregate_duplicates(df, group_columns=['D'], numeric_method='mean')
print(aggregated_df)
For more detailed information on each function, please refer to the function docstrings in the source code.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.