This library automates data summarization, feature selection, and model evaluation, allowing you to focus on building and refining your models.
Project description
PrismaML
PrismaML is a Python package designed to streamline and simplify the machine learning model building and data pre-processing process.
Explore the docs »
Report Bug
.
Request Feature
Table Of Contents
About The Project
[!Warning]
Package is Unstable:
This is a beta release of PrismaML. As such, you may encounter issues or incomplete functionalities. Some features may not work as expected, and others might not be fully implemented yet. We appreciate your patience and feedback as we continue to improve the package. If you encounter any problems or have suggestions for improvements, please report them through the Report Bug link.
[!Important]
Tests are Yet to Be Added:
This is a beta release of PrismaML. As such, Unit tests have not yet been added as of this release. However, unit tests are a neccessity and will be added in the future.
PrismaML is a comprehensive package designed to streamline and simplify the machine learning model building process. With features for data analysis, feature selection, model evaluation, and visualization, PrismaML empowers data scientists and machine learning practitioners to build and refine their models more efficiently.
Key Features
- Data Summarization: Automatically generate detailed summaries of your datasets, including metadata, statistical measures, and visualizations.
- Feature Selection: Select the best features for your models using various feature selection techniques, enhancing model performance and interpretability.
- Model Evaluation: Evaluate machine learning models with robust metrics tailored for both classification and regression tasks.
- Visualization: Create powerful visualizations to communicate insights effectively, including categorical and numerical data distributions and algorithm performance comparisons.
Built With
Getting Started
Prerequisites
Before you begin, ensure you have met the following requirements:
-
Python: PrismaML requires Python 3.6 or later. You can download the latest version of Python from the official website.
-
pip: Ensure you have the latest version of
pip
installed. You can upgradepip
using the following command:python -m pip install --upgrade pip
-
Poetry: If you prefer using Poetry for package management, you can install it by following the instructions on the Poetry website.
Installation
You can install PrismaML using either pip or Poetry.
To install PrismaML, use pip:
pip install prisma_ml
Or using Poetry
poetry add prisma-ml
Usage
Class: Plotting
The Plotting class provides functionalities for visualizing categorical and numerical data in a pandas DataFrame, as well as comparing the performance of different algorithms using bar charts.
Initialization:
Plotting(df)
draw_categorical_plots()
Description:
This method identifies all categorical columns in the DataFrame and generates count plots for each column. The plots are arranged in a grid layout with a specified number of columns per row.
Parameters:
n_cols (int)
: Number of columns per row in the plot grid. Default is 3.
Returns:
None
: This method displays the plot directly.
Example Usage:
df = pd.read_csv("your_dataset.csv")
plotter = Plotting(df)
plotter.draw_categorical_plots(n_cols=3)
draw_numerical_plots()
Description:
This method identifies all numerical columns in the DataFrame and generates histograms for each column. The plots are arranged in a grid layout with a specified number of columns per row.
Parameters:
n_cols (int)
: Number of columns per row in the plot grid. Default is 3.
Returns:
None
: This method displays the plot directly.
Example Usage:
df = pd.read_csv("your_dataset.csv")
plotter = Plotting(df)
plotter.draw_numerical_plots(n_cols=3)
plot_algorithm_comparison()
Description:
This method generates a bar chart comparing the performance of different algorithms based on various scores. Each algorithm is represented by a different color, and the scores are displayed as bars.
Parameters:
algorithm_scores (dict)
: A dictionary where keys are algorithm names and values are dictionaries containing score names as keys and scores as values.bar_width (float)
: Width of the bars in the plot. Default is 0.2.
Returns:
None
: This method displays the plot directly.
Example Usage:
scores = {
'Algorithm1': {'Accuracy': 0.95, 'Precision': 0.90, 'Recall': 0.92},
'Algorithm2': {'Accuracy': 0.93, 'Precision': 0.88, 'Recall': 0.91}
}
plotter = Plotting(df)
plotter.plot_algorithm_comparison(scores, bar_width=0.2)
Class: MachineLearning
The MachineLearning class provides functionalities for feature selection, model evaluation, and visualization of model performance using scikit-learn models.
Initialization:
MachineLearning()
select_best_features()
Description:
This method iteratively applies a feature selection strategy to the dataset, fits the specified machine learning model, and evaluates its performance. It tracks performance metrics for each number of features used and identifies the optimal number of features that yield the best performance score. Additionally, it returns the names of the best features along with the fitted feature selector for the best model.
Parameters:
X (pd.DataFrame)
: The input features DataFrame.y (pd.Series)
: The target variable Series.model (sklearn estimator)
: An instance of a scikit-learn regression or classification model.feature_selector (function)
: A function that returns a feature selector object from scikit-learn (e.g., SelectKBest). The object must havefit_transform
andget_support
methods.
Returns:
int
: The optimal number of features yielding the best performance score.pd.DataFrame
: A DataFrame containing the performance metrics for each number of features.list
: The names of the best features selected by the feature selection method.selector
: The fitted feature selector object for the best model.
Example Usage:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
ml = MachineLearning()
model = LinearRegression()
feature_selector = lambda k: SelectKBest(f_regression, k=k)
best_features, results, best_feature_names, best_selector = ml.select_best_features(X, y, model, feature_selector)
print("Best number of features:", best_features)
print("Best features:", best_feature_names)
plot_accuracy_vs_features()
Description:
This method should be run after the select_best_features
method. It uses the results from select_best_features
to create the plot. The method automatically detects the model type (classification or regression) based on the metrics available in the results and plots the relevant metric.
Parameters:
save_path (str, optional)
: Absolute path to save the plot. If None, the plot is not saved to a file.
Returns:
None
: This method displays the plot directly.
Example Usage:
ml = MachineLearning()
ml.plot_accuracy_vs_features(save_path='/absolute/path/to/save/plot.png')
evaluate_model()
Description:
This method evaluates the given model over multiple train-test splits and computes average performance metrics, which are different for classifiers and regressors. It returns the metrics recorded for each iteration, the average performance metrics, and the model fitted in the last iteration.
Parameters:
X (pd.DataFrame)
: The input features DataFrame.y (pd.Series)
: The target variable Series.test_size (float)
: The proportion of the dataset to include in the test split.iterations_number (int)
: The number of iterations to repeat the train-test split and evaluation.model (sklearn estimator)
: An instance of a scikit-learn regression or classification model.
Returns:
pd.DataFrame
: DataFrame containing the metrics recorded for each iteration.dict
: Average performance metrics over all iterations, rounded to three decimal places.model
: The model fitted in the last iteration.
Example Usage:
ml = MachineLearning()
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3) # or any classifier
iteration_metrics, average_metrics, final_model = ml.evaluate_model(X, y, 0.3, 30, model)
print(average_metrics)
plot_iteration_metrics()
Description:
This method plots boxplots for the metrics recorded in each iteration by the evaluate_model
method. It requires that evaluate_model
has been run beforehand to generate iteration metrics.
Returns:
None
: This method displays the plot directly.
Example Usage:
ml = MachineLearning()
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)
_, _, iteration_metrics = ml.evaluate_model(X, y, 0.3, 30, model)
ml.plot_iteration_metrics()
Class: DatasetInformation
The DatasetInformation class provides functionalities for summarizing and visualizing various aspects of a pandas DataFrame.
Initialization:
DatasetInformation(df)
Parameters:
df (pandas.DataFrame)
: The DataFrame to be summarized.
dataframe_summary()
Description:
This method provides a detailed summary of the DataFrame, including the shape, column data types, count of null values, and duplicated rows. If column metadata is provided, it is included in the summary.
Parameters:
col_metadata (dict, optional)
: A dictionary with column names as keys and their descriptions as values. If provided, these descriptions are included in the summary.
Returns:
None
: This method displays the summary directly.
Example Usage:
df = pd.read_csv("your_dataset.csv")
col_metadata = {
'column1': 'Description for column1',
'column2': 'Description for column2',
# ... other columns
}
info = DatasetInformation(df)
info.dataframe_summary(col_metadata)
categorical_summary()
Description:
This method summarizes categorical columns in the DataFrame, showing the number of unique values, the most frequent value, and its percentage for each column. For columns with fewer than 20 unique values, it also displays a detailed list of unique values and their distributions.
Returns:
None
: This method displays the summary directly.
Example Usage:
df = pd.read_csv("your_dataset.csv")
info = DatasetInformation(df)
info.categorical_summary()
numerical_summary()
Description:
This method summarizes numerical columns in the DataFrame by providing statistical measures such as mean, median, mode, standard deviation, variance, range, minimum, and maximum for each column. It also displays a correlation matrix and a heatmap of the correlations.
Returns:
None
: This method displays the summary directly.
Example Usage:
df = pd.read_csv("your_dataset.csv")
info = DatasetInformation(df)
info.numerical_summary()
Contributing
We welcome contributions to PrismaML! Whether you want to report a bug, request a feature, or contribute code, your input is valuable to us. Here's how you can get started:
Reporting Bugs
If you encounter any bugs while using PrismaML, please let us know by creating a new issue. Provide as much detail as possible to help us understand and address the issue. Report a Bug
Requesting Features
Have an idea for a new feature? We’d love to hear it! Use the link below to request a new feature. Request a feature
Submitting Pull Requests
We appreciate your contributions! Follow these steps to submit a pull request:
-
Fork the Repository: Click the "Fork" button at the top right of this page to create a copy of the repository on your GitHub account.
-
Clone Your Fork: Clone your fork to your local machine.
git clone https://github.com/yourusername/PrismaML.git cd PrismaML
-
Create a New Branch: Create a new branch for your changes.
git checkout -b feature-or-bugfix-description
-
Make Your Changes: Make your changes to the code.
-
Commit Your Changes: Commit your changes with a descriptive commit message.
git add . git commit -m "Description of your changes"
-
Push to Your Fork: Push your changes to your fork on GitHub.
git push origin feature-or-bugfix-description
-
Open a Pull Request: Go to the original repository and click the "New Pull Request" button. Provide a clear description of your changes and why they are needed.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Authors
- Yousinator - AI Engineer - Yousinator - Wrote the codes and README\
Acknowledgements
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for prisma_ml-0.1.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6da89e2bb611b02a8524fb7fa14ea4a6c559bb2070bd9452b6366ead6ee4e824 |
|
MD5 | 80307f673d201ac12a5f47544011edf2 |
|
BLAKE2b-256 | 53e7a28f8490849f1785073fe74497f3655280fb1b08f33225de482c8c3d9b0a |