AutoCorrFeatureSelection
Automatically select the most relevant features based on correlation.
How it works
The AutoCorrFeatureSelection class utilizes correlation analysis to automatically select relevant features from a given dataset. Here's a step-by-step overview of how it works:
- Correlation Matrix:
The first step is to calculate the correlation matrix, which measures the pairwise correlation between all features in the dataset. The correlation matrix provides insight into the relationships between the features.
| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| sepal.length | 1.0 | -0.11 | 0.87 | 0.81 | 0.72 |
| sepal.width | -0.11 | 1.0 | -0.42 | -0.36 | -0.42 |
| petal.length | 0.87 | -0.42 | 1.0 | 0.96 | 0.94 |
| petal.width | 0.81 | -0.36 | 0.96 | 1.0 | 0.95 |
| variety | 0.72 | -0.42 | 0.94 | 0.95 | 1.0 |
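As a rough illustration of this step, the sketch below computes a pairwise Pearson correlation matrix with pandas. It assumes the dataset is a pandas DataFrame with numeric columns; the column names and values are toy data invented for the example, and this is not necessarily how AutoCorrFeatureSelection computes the matrix internally.

```python
import pandas as pd

# Toy data for illustration only (not the iris values behind the table above).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],  # nearly proportional to "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],  # tends to fall as "a" rises
})

# Pairwise Pearson correlation between all numeric columns.
# The result is a symmetric DataFrame with 1.0 on the diagonal.
corr_matrix = df.corr(method="pearson")
print(corr_matrix.round(2))
```

A categorical column such as `variety` would need to be encoded as numbers before it can appear in a matrix like the one above.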
- Threshold-based Selection:
Next, the class applies a threshold (for example, 0.85) to the correlation matrix and keeps only the entries whose correlation is above it. The diagonal, where each feature is compared with itself, is left out. Columns that retain at least one entry are considered highly correlated and may contain redundant or similar information.
| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| sepal.length | | | 0.87 | | |
| sepal.width | | | | | |
| petal.length | 0.87 | | | 0.96 | 0.94 |
| petal.width | | | 0.96 | | 0.95 |
| variety | | | 0.94 | 0.95 | |
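The thresholding step can be sketched with pandas as follows, using the correlation values from the table above. The strict `>` comparison and the exclusion of the diagonal are assumptions made to reproduce the table; the library's own rule (for example, whether it uses absolute values) may differ.

```python
import numpy as np
import pandas as pd

cols = ["sepal.length", "sepal.width", "petal.length", "petal.width", "variety"]
values = [
    [1.0, -0.11, 0.87, 0.81, 0.72],
    [-0.11, 1.0, -0.42, -0.36, -0.42],
    [0.87, -0.42, 1.0, 0.96, 0.94],
    [0.81, -0.36, 0.96, 1.0, 0.95],
    [0.72, -0.42, 0.94, 0.95, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

threshold = 0.85

# Boolean mask that is False on the diagonal (self-correlation) and True elsewhere.
off_diagonal = ~np.eye(len(cols), dtype=bool)

# Keep entries above the threshold; everything else becomes NaN.
high = corr.where((corr > threshold) & off_diagonal)

# Columns with at least one surviving entry.
selected = [c for c in high.columns if high[c].notna().any()]
print(selected)
```

With this data, `sepal.width` is the only column that has no entry above 0.85.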
- Selected Columns and Relationships:
The selected columns are then shown in a diagram of the relationships between the highly correlated features, which makes it easy to see how these features are interconnected.
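The relationships behind such a diagram can be expressed as a simple adjacency mapping. The sketch below is illustrative only: it hard-codes the pairs above 0.85 from the table in this README and prints each feature with its highly correlated neighbours, without claiming to match how the library draws its diagram.

```python
from collections import defaultdict

# Pairs above the 0.85 threshold, read off the thresholded table in this README.
high_pairs = {
    ("sepal.length", "petal.length"): 0.87,
    ("petal.length", "petal.width"): 0.96,
    ("petal.length", "variety"): 0.94,
    ("petal.width", "variety"): 0.95,
}

# Build an undirected adjacency mapping: feature -> set of correlated features.
neighbors = defaultdict(set)
for (a, b) in high_pairs:
    neighbors[a].add(b)
    neighbors[b].add(a)

for feature in sorted(neighbors):
    print(feature, "->", sorted(neighbors[feature]))
```

Features with no highly correlated partner, such as `sepal.width` here, do not appear in the mapping at all.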
By following these steps, the AutoCorrFeatureSelection class automates the process of feature selection based on correlation analysis, enabling you to identify and focus on the most informative and non-redundant features in your dataset.
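One common way to turn a correlation matrix into a non-redundant feature set is a greedy filter: rank features by their correlation with a target column, then keep a feature only if it is not highly correlated with one already kept. The sketch below shows that general technique on the matrix from this README. The function `prune_redundant` and the choice of `variety` as the target are hypothetical, introduced for illustration; they are not part of the AutoCorrFeatureSelection API.

```python
import pandas as pd

cols = ["sepal.length", "sepal.width", "petal.length", "petal.width", "variety"]
values = [
    [1.0, -0.11, 0.87, 0.81, 0.72],
    [-0.11, 1.0, -0.42, -0.36, -0.42],
    [0.87, -0.42, 1.0, 0.96, 0.94],
    [0.81, -0.36, 0.96, 1.0, 0.95],
    [0.72, -0.42, 0.94, 0.95, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)


def prune_redundant(corr, target, threshold):
    """Hypothetical helper: greedy correlation-based pruning.

    Visits features from most to least correlated with the target and keeps
    a feature only if its absolute correlation with every feature kept so
    far is at or below the threshold.
    """
    features = [c for c in corr.columns if c != target]
    ranked = sorted(features, key=lambda c: abs(corr.loc[c, target]), reverse=True)
    kept = []
    for feature in ranked:
        if all(abs(corr.loc[feature, k]) <= threshold for k in kept):
            kept.append(feature)
    return kept


kept = prune_redundant(corr, target="variety", threshold=0.85)
print(kept)
```

Here `petal.length` is dropped because its correlation with the already kept `petal.width` (0.96) exceeds the threshold.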
Example
Examples can be found in examples/.
# set up the selector on the dataset `df`
auto_corr = AutoCorrFeatureSelection(df)
# select the columns whose correlation is above the threshold
selected_columns = auto_corr.select_columns_above_threshold(threshold=0.85)
# keep only the selected columns
filtered_df = df[selected_columns]