A Python package for data manipulation and analysis utilities
📚 Python Script Documentation for main.py
Welcome to the documentation for main.py. This file contains utility functions for manipulating, transforming, and analyzing pandas DataFrames. It relies mainly on pandas, numpy, itertools, and matplotlib.
Installation
pip install phenome-utils
OR
pip install git+https://git.phenome.health/trent.leslie/phenome-utils
📑 Index
- sum_and_sort_columns
- binary_threshold_matrix_by_col
- binary_threshold_matrix_by_row
- concatenate_csvs_in_directory
- aggregate_function
- aggregate_duplicates
- generate_subcategories
- generate_subcategories_with_proportions
- bin_continuous
- load_latest_yyyymmdd_file
- remove_rows_with_na_threshold
- impute_na_in_columns
1. sum_and_sort_columns
Description
Sum the numerical columns of a DataFrame, remove columns with a sum of zero, and sort the columns in descending order based on their sum. Optionally, plot a histogram of the non-zero column sums.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | DataFrame | The input DataFrame. |
| plot_histogram | bool | Whether to plot a histogram of the non-zero column sums. Defaults to False. |

Returns

| Type | Description |
|---|---|
| DataFrame | A DataFrame with columns sorted in descending order based on their sum, zero-sum columns removed, and non-numeric columns preserved as the first columns. |
Example Usage
# Example code demonstrating usage
sorted_df = sum_and_sort_columns(df, plot_histogram=True)
📊 Visualization
If plot_histogram is set to True, a histogram of the non-zero column sums will be displayed.
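The behavior described above can be approximated in a few lines of pandas. This is only a sketch of the documented behavior, not the package's source; the name `sum_and_sort_sketch` is hypothetical, and plotting is left out.

```python
import pandas as pd


def sum_and_sort_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for sum_and_sort_columns (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    non_numeric = df.select_dtypes(exclude="number")
    sums = numeric.sum()
    # Drop zero-sum columns, then order the rest from largest to smallest sum
    ordered = sums[sums != 0].sort_values(ascending=False).index
    # Non-numeric columns are placed first, as the Returns table describes
    return pd.concat([non_numeric, numeric[ordered]], axis=1)


df = pd.DataFrame({"A": [1, 2, 3], "B": [0, 0, 0], "C": [4, 5, 6], "D": ["x", "y", "z"]})
result = sum_and_sort_sketch(df)
print(list(result.columns))  # ['D', 'C', 'A']
```

Column B is dropped because it sums to zero, and C precedes A because 15 > 6.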
2. binary_threshold_matrix_by_col
Description
Convert a DataFrame's numerical columns to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | DataFrame | The input DataFrame. |
| lower_threshold | int | The lower percentile threshold. Defaults to 1. |
| upper_threshold | int | The upper percentile threshold. Defaults to 99. |
| second_df | DataFrame | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| decimal_proportion_threshold | float | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| filter_column | str | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |

Returns

| Type | Description |
|---|---|
| DataFrame | A binary matrix where numerical column values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
Example Usage
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
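To make the percentile thresholding concrete, here is a minimal sketch of the per-column step only. It assumes strict comparisons at the boundaries and skips the `second_df` filtering entirely; `binary_threshold_by_col_sketch` is a hypothetical name, not part of the package.

```python
import numpy as np
import pandas as pd


def binary_threshold_by_col_sketch(df, lower=1, upper=99):
    """Flag values outside each column's [lower, upper] percentile range."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        lo, hi = np.nanpercentile(df[col], [lower, upper])
        # Assumption: values exactly on a threshold count as "within"
        out[col] = ((df[col] < lo) | (df[col] > hi)).astype(int)
    return out


df = pd.DataFrame({"value": [1, 2, 3, 4, 100], "label": list("abcde")})
binary = binary_threshold_by_col_sketch(df, lower=25, upper=75)
print(binary["value"].tolist())  # [1, 0, 0, 0, 1]
```

For this column the 25th and 75th percentiles are 2 and 4, so only 1 and 100 are flagged.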
3. binary_threshold_matrix_by_row
Description
Convert a DataFrame's numerical rows to a binary matrix based on percentile thresholds and filter based on a second DataFrame.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | DataFrame | The input DataFrame. |
| lower_threshold | int | The lower percentile threshold. Defaults to 1. |
| upper_threshold | int | The upper percentile threshold. Defaults to 99. |
| second_df | DataFrame | A second DataFrame with 'subcategory' and 'decimal_proportion' columns. |
| decimal_proportion_threshold | float | Threshold for filtering the second DataFrame. Defaults to 0.1. |
| filter_column | str | Column name in the original DataFrame to filter based on 'subcategory' from the second DataFrame. |

Returns

| Type | Description |
|---|---|
| DataFrame | A binary matrix where numerical row values outside the thresholds are 1 and within the thresholds are 0. Object columns are preserved. |
Example Usage
# Example code demonstrating usage
binary_df = binary_threshold_matrix_by_row(df, lower_threshold=5, upper_threshold=95, second_df=second_df, filter_column='category')
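The row-wise variant differs only in the axis along which percentiles are computed. The sketch below illustrates that idea under the same assumptions as before (strict comparisons, no `second_df` filtering); the helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def binary_threshold_by_row_sketch(df, lower=1, upper=99):
    """Flag values outside each row's [lower, upper] percentile range."""
    out = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    values = df[num_cols].to_numpy(dtype=float)
    # One pair of thresholds per row, kept as column vectors for broadcasting
    lo = np.nanpercentile(values, lower, axis=1, keepdims=True)
    hi = np.nanpercentile(values, upper, axis=1, keepdims=True)
    out[num_cols] = ((values < lo) | (values > hi)).astype(int)
    return out


df = pd.DataFrame({
    "a": [1, 10], "b": [2, 20], "c": [3, 30], "d": [4, 40], "e": [100, 50],
    "id": ["r1", "r2"],
})
binary = binary_threshold_by_row_sketch(df, lower=25, upper=75)
print(binary[["a", "b", "c", "d", "e"]].to_numpy().tolist())
# [[1, 0, 0, 0, 1], [1, 0, 0, 0, 1]]
```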
4. concatenate_csvs_in_directory
Description
Concatenate CSV files from a root directory and its subdirectories.
Parameters
| Parameter | Type | Description |
|---|---|---|
| root_dir | str | The root directory to start the search. |
| filter_string | str | A string that must be in the filename to be included. Defaults to None. |
| file_extension | str | The file extension to search for. Defaults to "csv". |
| csv_filename | str | The filename to save the concatenated CSV. If not provided, returns the DataFrame. |

Returns

| Type | Description |
|---|---|
| DataFrame or None | A concatenated DataFrame of all the CSVs if csv_filename is not provided. Otherwise, saves the DataFrame and returns None. |
Example Usage
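# Return the concatenated DataFrame (csv_filename omitted)
concatenated_df = concatenate_csvs_in_directory('/path/to/directory', filter_string='data')
# Or save to disk instead; the function then returns None
concatenate_csvs_in_directory('/path/to/directory', filter_string='data', csv_filename='output.csv')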
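A recursive search of this kind is commonly written with `os.walk`. The following is a sketch of that approach under the parameter semantics documented above, not the package's implementation; `concatenate_csvs_sketch` is a hypothetical name and the save-to-file branch is omitted.

```python
import os
import tempfile

import pandas as pd


def concatenate_csvs_sketch(root_dir, filter_string=None, file_extension="csv"):
    """Read every matching file under root_dir and stack the rows."""
    frames = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith("." + file_extension):
                continue
            if filter_string is not None and filter_string not in name:
                continue
            frames.append(pd.read_csv(os.path.join(dirpath, name)))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Build a small throwaway directory tree to exercise the sketch
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    pd.DataFrame({"x": [1]}).to_csv(os.path.join(root, "data_a.csv"), index=False)
    pd.DataFrame({"x": [2]}).to_csv(os.path.join(root, "sub", "data_b.csv"), index=False)
    pd.DataFrame({"x": [3]}).to_csv(os.path.join(root, "other.csv"), index=False)
    combined = concatenate_csvs_sketch(root, filter_string="data")

print(sorted(combined["x"].tolist()))  # [1, 2]
```

The file `other.csv` is skipped because its name does not contain the filter string.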
5. aggregate_function
Description
General-purpose aggregation function.
Parameters
| Parameter | Type | Description |
|---|---|---|
| x | pd.Series | Input series |
| numeric_method | str | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| substitute | any | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |

Returns

| Type | Description |
|---|---|
| any | Aggregated value |
Example Usage
# Example code demonstrating usage
aggregated_value = aggregate_function(pd.Series([1, 2, 3, np.nan]), numeric_method='mean')
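The parameter descriptions suggest logic along the following lines. This is a sketch only: the treatment of non-numeric series (falling back to the mode) is an assumption, and `aggregate_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def aggregate_sketch(x, numeric_method="median", substitute=np.nan):
    """Reduce a Series to one value, ignoring NaN entries."""
    values = x.dropna()
    if values.empty:
        return substitute  # all values were NaN
    # Assumption: non-numeric data is always summarized by its mode
    if numeric_method == "mode" or not pd.api.types.is_numeric_dtype(values):
        modes = values.mode()
        return modes.iloc[0] if not modes.empty else substitute
    if numeric_method == "mean":
        return values.mean()
    return values.median()


mean_value = aggregate_sketch(pd.Series([1, 2, 3, np.nan]), numeric_method="mean")
empty_value = aggregate_sketch(pd.Series([np.nan, np.nan]), substitute=0)
text_value = aggregate_sketch(pd.Series(["a", "b", "a"]))
print(mean_value, empty_value, text_value)  # 2.0 0 a
```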
6. aggregate_duplicates
Description
Aggregates duplicates in a dataframe based on specified grouping columns and a chosen aggregation method for numeric types.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | Input dataframe |
| group_columns | list | List of column names to group by |
| numeric_method | str | Method to aggregate numeric data. Supports 'median', 'mean', and 'mode'. Default is 'median'. |
| substitute | any | Value to substitute when all values are NaN or mode is empty. Default is np.nan. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | Dataframe with aggregated duplicates |
Example Usage
# Example code demonstrating usage
aggregated_df = aggregate_duplicates(df, group_columns=['category'], numeric_method='mean')
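For intuition, collapsing duplicates in this way is roughly equivalent to a pandas `groupby` followed by a per-column aggregation. The sketch below uses plain pandas rather than the package; taking the first mode for the text column is an assumption about how non-numeric data is handled.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "value": [1.0, 3.0, 5.0],
    "note": ["x", "x", "y"],
})


def first_mode(series):
    """Most frequent value, or None if the group has no non-missing values."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None


# Numeric column: mean; text column: most frequent value (assumed behavior)
aggregated = df.groupby("category", as_index=False).agg(
    {"value": "mean", "note": first_mode}
)
print(aggregated)
```

The two rows for category `a` collapse into one row with `value` 2.0 and `note` `x`.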
7. generate_subcategories
Description
Generate subcategories by combining values from specified columns.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The input dataframe. |
| columns | list | List of columns to generate subcategories from. |
| col_separator | str | Separator to use between column names for new columns. Defaults to '_'. |
| val_separator | str | Separator to use between values when combining. Defaults to ' '. |
| missing_val | str | Value to replace missing data in specified columns. Defaults to 'NA'. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | DataFrame with new subcategory columns. |
| list | List of all column names (original + generated). |
Example Usage
# Example code demonstrating usage
df, all_columns = generate_subcategories(df, columns=['col1', 'col2'])
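One plausible reading of "combining values from specified columns" is that every combination of two or more columns gets its own joined column. The sketch below implements that reading with `itertools.combinations`; the combination sizes and the helper name `generate_subcategories_sketch` are assumptions, not confirmed package behavior.

```python
from itertools import combinations

import pandas as pd


def generate_subcategories_sketch(df, columns, col_separator="_",
                                  val_separator=" ", missing_val="NA"):
    """Add one joined column per combination of the given columns."""
    out = df.copy()
    out[columns] = out[columns].fillna(missing_val).astype(str)
    new_cols = []
    # Assumption: combinations of size 2 up to len(columns)
    for size in range(2, len(columns) + 1):
        for combo in combinations(columns, size):
            name = col_separator.join(combo)
            out[name] = out[list(combo)].agg(val_separator.join, axis=1)
            new_cols.append(name)
    return out, list(columns) + new_cols


df = pd.DataFrame({"sex": ["F", "M", None], "site": ["north", None, "south"]})
sub_df, all_columns = generate_subcategories_sketch(df, ["sex", "site"])
print(all_columns)                   # ['sex', 'site', 'sex_site']
print(sub_df["sex_site"].tolist())   # ['F north', 'M NA', 'NA south']
```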
8. generate_subcategories_with_proportions
Description
Generate subcategories by combining values from specified columns and calculate their proportions.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The input dataframe. |
| columns | list | List of columns to generate subcategories from. |
| solo_columns | list | List of columns to be considered on their own. |
| col_separator | str | Separator to use between column names for new columns. Defaults to '_'. |
| val_separator | str | Separator to use between values when combining. Defaults to ' '. |
| missing_val | str | Value to replace missing data in specified columns. Defaults to 'NA'. |
| overall_category_name | str | Name for the overall category. Defaults to 'overall'. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | DataFrame with new subcategory columns. |
| list | List of all column names (original + generated). |
| pd.DataFrame | DataFrame with subcategory and its decimal proportion. |
Example Usage
# Example code demonstrating usage
df, all_columns, proportions_df = generate_subcategories_with_proportions(df, columns=['col1', 'col2'], solo_columns=['col3'])
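The proportions table pairs each subcategory with the fraction of rows it covers. A minimal way to compute such a table in plain pandas is shown below; the column names 'subcategory' and 'decimal_proportion' come from the binary threshold functions' `second_df` description, while the input column `sex_site` is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"sex_site": ["F north", "F north", "M south", "F south"]})

# Fraction of rows per distinct value, as a two-column DataFrame
proportions = (
    df["sex_site"]
    .value_counts(normalize=True)
    .rename_axis("subcategory")
    .reset_index(name="decimal_proportion")
)
print(proportions)

# Subcategories at or above a 0.3 share of the rows
frequent = proportions[proportions["decimal_proportion"] >= 0.3]
print(frequent["subcategory"].tolist())  # ['F north']
```

A table of this shape is what `binary_threshold_matrix_by_col` and `binary_threshold_matrix_by_row` expect as `second_df`.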
9. bin_continuous
Description
Bins continuous data in a specified column of a dataframe.
Parameters
| Parameter | Type | Description |
|---|---|---|
| dataframe | pd.DataFrame | The input dataframe. |
| column_name | str | The name of the column containing continuous data to be binned. |
| bin_size | int | The size of each bin. Default is 10. |
| range_start | int | The starting value of the range for binning. Default is 0. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | The dataframe with an additional column for binned data. |
Example Usage
# Example code demonstrating usage
binned_df = bin_continuous(df, column_name='age', bin_size=5)
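Fixed-width binning like this is typically done with `pd.cut`. The sketch below shows one way to do it; the output column name (`<column>_binned`), the label format, and the left-closed intervals are all assumptions, and `bin_continuous_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def bin_continuous_sketch(dataframe, column_name, bin_size=10, range_start=0):
    """Add a column assigning each value to a fixed-width bin."""
    out = dataframe.copy()
    upper = out[column_name].max()
    # Enough bins to cover the maximum, even when it sits exactly on an edge
    n_bins = int((upper - range_start) // bin_size) + 1
    edges = range_start + bin_size * np.arange(n_bins + 1)
    labels = [f"{int(lo)}-{int(lo + bin_size)}" for lo in edges[:-1]]
    # right=False gives left-closed bins: [0, 10), [10, 20), ...
    out[f"{column_name}_binned"] = pd.cut(
        out[column_name], bins=edges, labels=labels, right=False
    )
    return out


df = pd.DataFrame({"age": [3, 12, 29, 30]})
binned = bin_continuous_sketch(df, "age", bin_size=10)
print(binned["age_binned"].astype(str).tolist())
# ['0-10', '10-20', '20-30', '30-40']
```

Values below `range_start` fall outside every bin in this sketch and come back as NaN.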
10. load_latest_yyyymmdd_file
Description
Load the latest file from the specified directory based on its date.
Parameters
| Parameter | Type | Description |
|---|---|---|
| directory | str | Path to the directory containing the files. |
| base_filename | str | Base name of the file. |
| file_extension | str | File extension including the dot (e.g., '.csv'). |
| na_values | list or dict | Additional strings to recognize as NA/NaN. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | The loaded data. |
Example Usage
# Example code demonstrating usage
df = load_latest_yyyymmdd_file("/path/to/directory", "data_", ".csv")
print(df.head())
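Selecting the newest file can be sketched as follows, assuming filenames have the form `<base_filename><YYYYMMDD><file_extension>` (for example `data_20240315.csv`). That naming pattern and the helper `find_latest_dated_file` are assumptions made for illustration; the package may parse names differently.

```python
import os
import re
import tempfile
from datetime import datetime

import pandas as pd


def find_latest_dated_file(directory, base_filename, file_extension):
    """Return the path of the file with the most recent YYYYMMDD stamp."""
    pattern = re.compile(
        re.escape(base_filename) + r"(\d{8})" + re.escape(file_extension) + r"$"
    )
    dated = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d")
        except ValueError:
            continue  # eight digits that are not a valid date
        dated.append((stamp, name))
    if not dated:
        raise FileNotFoundError(f"no dated files found in {directory}")
    return os.path.join(directory, max(dated)[1])


with tempfile.TemporaryDirectory() as directory:
    for stamp, rows in [("20240101", ["1", "2"]), ("20240315", ["3", "missing"])]:
        path = os.path.join(directory, f"data_{stamp}.csv")
        pd.DataFrame({"x": rows}).to_csv(path, index=False)
    latest_path = find_latest_dated_file(directory, "data_", ".csv")
    latest_name = os.path.basename(latest_path)
    # na_values is passed through to pandas, as the parameter table suggests
    loaded = pd.read_csv(latest_path, na_values=["missing"])

print(latest_name)  # data_20240315.csv
```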
11. remove_rows_with_na_threshold
Description
Removes rows from the dataframe that have a fraction of NA values greater than the specified threshold.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The input dataframe. |
| stringified_ids | list | List of column names to consider for NA value calculation. |
| threshold | float | The fraction of NA values for a row to be removed. Default is 0.5. |
| save_starting_df | bool | Whether to save the initial dataframe to a CSV file. Default is False. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | The dataframe with rows removed based on the threshold. |
Example Usage
# Example code demonstrating usage
cleaned_df = remove_rows_with_na_threshold(df, stringified_ids=['col1', 'col2'], threshold=0.3)
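The row filter described here reduces to computing a per-row NA fraction over the chosen columns and comparing it with the threshold. A minimal sketch in plain pandas follows; `remove_rows_sketch` is a hypothetical name and the `save_starting_df` option is omitted.

```python
import numpy as np
import pandas as pd


def remove_rows_sketch(df, columns, threshold=0.5):
    """Drop rows whose NA fraction across `columns` exceeds `threshold`."""
    na_fraction = df[columns].isna().mean(axis=1)
    return df[na_fraction <= threshold]


df = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan],
    "col2": [1.0, 2.0, np.nan],
    "col3": [1.0, np.nan, np.nan],
})
cleaned = remove_rows_sketch(df, ["col1", "col2", "col3"], threshold=0.5)
print(cleaned.index.tolist())  # [0]
```

The rows have NA fractions of 0, 2/3, and 1, so only the first survives a threshold of 0.5.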
12. impute_na_in_columns
Description
Imputes NA values in columns with either the minimum or median value of the column.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame | The input dataframe. |
| method | str | Method for imputation. Either 'min' or 'median'. Default is 'median'. |

Returns

| Type | Description |
|---|---|
| pd.DataFrame | The dataframe with NA values imputed. |
Example Usage
# Example code demonstrating usage
imputed_df = impute_na_in_columns(df, method='median')
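Column-wise imputation of this kind can be sketched as below. Restricting the fill to numeric columns is an assumption, and `impute_sketch` is a hypothetical name rather than the package's function.

```python
import numpy as np
import pandas as pd


def impute_sketch(df, method="median"):
    """Fill NA values in each numeric column with its min or median."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        fill = out[col].min() if method == "min" else out[col].median()
        out[col] = out[col].fillna(fill)
    return out


df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 10.0, 20.0]})
by_median = impute_sketch(df, method="median")
by_min = impute_sketch(df, method="min")
print(by_median["a"].tolist(), by_median["b"].tolist())  # [1.0, 2.0, 3.0] [15.0, 10.0, 20.0]
print(by_min["a"].tolist(), by_min["b"].tolist())        # [1.0, 1.0, 3.0] [10.0, 10.0, 20.0]
```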
Each function handles a specific data manipulation, transformation, or analysis task. This documentation covers the purpose, parameters, return values, and usage of each function in main.py. Happy coding! 🎉
ph_utils is a Python package that provides utility functions for data manipulation and analysis, particularly focused on working with pandas DataFrames.
Features
- Sum and sort DataFrame columns
- Generate binary threshold matrices
- Concatenate CSV files from directories
- Aggregate duplicates in DataFrames
- Generate subcategories from DataFrame columns
- Bin continuous data
- Load latest files based on date in filename
- Remove rows with NA values above a threshold
- Impute NA values in DataFrame columns
Installation
You can install the package from PyPI using pip (the distribution is published as phenome-utils):
pip install phenome-utils
Usage
Here are some examples of how to use ph_utils:
import pandas as pd
from ph_utils import sum_and_sort_columns, binary_threshold_matrix_by_col, aggregate_duplicates
# Example 1: Sum and sort columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [0, 0, 0],
    'C': [4, 5, 6],
    'D': ['x', 'y', 'z']
})
result = sum_and_sort_columns(df)
print(result)
# Example 2: Create a binary threshold matrix
binary_df = binary_threshold_matrix_by_col(df, lower_threshold=25, upper_threshold=75)
print(binary_df)
# Example 3: Aggregate duplicates
aggregated_df = aggregate_duplicates(df, group_columns=['D'], numeric_method='mean')
print(aggregated_df)
For more detailed information on each function, please refer to the function docstrings in the source code.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file phenome_utils-0.7.0.tar.gz.
File metadata
- Download URL: phenome_utils-0.7.0.tar.gz
- Upload date:
- Size: 14.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e1ccb5b4402ab4bfd97042312d40b2f6097f9b79f185ffc481f333c01523dc5 |
| MD5 | 16379023de754281dd0d5b7966b14583 |
| BLAKE2b-256 | 234e59a861a8fd2e439dca60c966bdc43a75313f93956c6664c943d6ec7bedbe |
File details
Details for the file phenome_utils-0.7.0-py3-none-any.whl.
File metadata
- Download URL: phenome_utils-0.7.0-py3-none-any.whl
- Upload date:
- Size: 17.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc866d2b88958dc786118befdb41ab2cb5096112752f519fda8f7f9a207d85a5 |
| MD5 | bd07797906f3215c25967d17248adc0b |
| BLAKE2b-256 | 06361e334d170517f7f70a5f6df8c2003816a0e43cbfc4501874f18e7030b96a |