Mini Scikit Learn
Mini Sklearn Implementation Project
Overview
This project implements a mini version of the scikit-learn (sklearn) library, providing essential machine learning tools and algorithms built from scratch. It replicates key sklearn functionality in a lightweight, customizable toolkit for data analysis and modeling.
Motivation
- Learning Objective: To gain a deeper understanding of the inner workings of machine learning algorithms by implementing them from scratch.
- Skill Enhancement: To improve programming and problem-solving skills.
- Custom Solution: To provide a lightweight and customizable alternative to the comprehensive sklearn library.
Goals
- Core Implementations: Implement core machine learning algorithms and utilities from scratch.
- Efficiency and Reliability: Ensure the implementations are efficient and reliable.
- Modular and Extensible Codebase: Create a modular and extensible codebase for future enhancements.
Modules Implemented
Cluster
- KMeans: Clustering algorithm that partitions data into k clusters.
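A minimal usage sketch for KMeans is shown below. The minisklearn.cluster import path, the n_clusters parameter, and the fit/predict interface are assumptions based on the scikit-learn conventions this project mirrors, not confirmed details of the package:

```python
import numpy as np
from minisklearn.cluster import KMeans  # assumed module path

# Two well-separated blobs of points.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

km = KMeans(n_clusters=2)   # parameter name assumed (sklearn-style)
km.fit(X)
print(km.predict(X))        # cluster label (0 or 1) for each sample
```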
Covariance
- EmpiricalCovariance: Computes the covariance matrix of a dataset.
Decomposition
- PCA (Principal Component Analysis): Dimensionality reduction technique that identifies principal components.
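A sketch of how PCA might be used, assuming a scikit-learn-style fit/transform API and a minisklearn.decomposition module path (both assumptions):

```python
import numpy as np
from minisklearn.decomposition import PCA  # assumed module path

X = np.random.rand(100, 5)          # 100 samples, 5 features

pca = PCA(n_components=2)           # parameter name assumed
pca.fit(X)
X_reduced = pca.transform(X)        # project onto the first 2 principal components
print(X_reduced.shape)              # (100, 2)
```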
Feature Extraction
- SimpleCountVectorizer: Converts text documents into a matrix of token counts.
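The following sketch illustrates the intended behaviour, assuming SimpleCountVectorizer lives in a minisklearn.feature_extraction module and follows a fit/transform interface (assumptions):

```python
from minisklearn.feature_extraction import SimpleCountVectorizer  # assumed path

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

vec = SimpleCountVectorizer()
vec.fit(docs)                 # build the vocabulary from the corpus
counts = vec.transform(docs)  # one row per document, one column per token
print(counts)
```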
Feature Selection
- FeatureSelector: Utility for selecting a subset of features based on certain criteria.
Impute
- MissingIndicator: Detects missing values and encodes their presence as binary indicators.
- SimpleImputer: Handles missing data by imputing values using specified statistics.
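A sketch combining the two imputation utilities, assuming a minisklearn.impute module path, a mean-imputation strategy parameter, and a fit/transform interface (all assumptions):

```python
import numpy as np
from minisklearn.impute import MissingIndicator, SimpleImputer  # assumed paths

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

indicator = MissingIndicator()
indicator.fit(X)
print(indicator.transform(X))   # boolean mask marking the missing entries

imputer = SimpleImputer(strategy="mean")  # strategy name assumed
imputer.fit(X)
print(imputer.transform(X))     # missing values replaced by column means
```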
Linear Model
- LinearRegression: Predicts target variables as a linear combination of input features.
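A minimal sketch of fitting the linear model, assuming a minisklearn.linear_model module path and sklearn-style fit/predict methods (assumptions):

```python
import numpy as np
from minisklearn.linear_model import LinearRegression  # assumed path

# Noise-free data generated from y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)
print(model.predict(np.array([[12.0]])))  # expected to be close to 25.0
```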
Metrics
- Classification Metrics: accuracy_score, average_precision_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score, roc_curve
- Error Metrics: mean_absolute_error, mean_squared_error, median_absolute_error, r2_score
- Similarity Metrics: cosine_similarity, cosine_distances, euclidean_distances
- Kernel Functions: linear_kernel, rbf_kernel
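The metric names listed above come from the project itself; the sketch below shows how two of them might be called, assuming they are importable from a minisklearn.metrics module (the module path is an assumption):

```python
import numpy as np
from minisklearn.metrics import accuracy_score, mean_squared_error  # assumed path

# Classification: fraction of correctly predicted labels.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
print(accuracy_score(y_true, y_pred))          # 0.75

# Regression: average squared error.
y_true_r = np.array([2.5, 0.0, 2.0])
y_pred_r = np.array([3.0, -0.5, 2.0])
print(mean_squared_error(y_true_r, y_pred_r))  # ~0.167
```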
Neighbors
- KNeighborsClassifier: Classification based on the majority class of nearest neighbors.
- KNeighborsRegressor: Regression based on the average value of nearest neighbors.
- NearestCentroid: Assigns each data point to the class whose centroid is nearest.
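A sketch of nearest-neighbour classification, assuming a minisklearn.neighbors module path and an n_neighbors parameter in the scikit-learn style (assumptions):

```python
import numpy as np
from minisklearn.neighbors import KNeighborsClassifier  # assumed path

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # parameter name assumed
knn.fit(X, y)
print(knn.predict(np.array([[1.5], [10.5]])))  # majority vote -> [0 1]
```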
Model Selection
- KFold: Cross-validation technique that splits data into k folds.
- GridSearchCV: Hyperparameter tuning method that searches through a specified parameter grid.
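A sketch of k-fold splitting; GridSearchCV would evaluate each point of a parameter grid over such splits. The minisklearn.model_selection path, the n_splits parameter, and the split() generator interface are assumptions:

```python
import numpy as np
from minisklearn.model_selection import KFold  # assumed path

X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # parameter name assumed
for train_idx, test_idx in kf.split(X):
    # Each of the 5 folds is held out once as the test set.
    print("train:", train_idx, "test:", test_idx)
```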
Naive Bayes
- GaussianNB: Naive Bayes classifier assuming features follow a Gaussian distribution.
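A minimal sketch, assuming a minisklearn.naive_bayes module path and fit/predict methods (assumptions):

```python
import numpy as np
from minisklearn.naive_bayes import GaussianNB  # assumed path

# One Gaussian per class is fitted to each feature.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.5], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB()
nb.fit(X, y)
print(nb.predict(np.array([[1.1], [5.2]])))  # expected [0 1]
```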
Preprocessing
- Scalers: MaxAbsScaler, MinMaxScaler, Normalizer, RobustScaler, StandardScaler
- Encoders: Binarizer, KBinsDiscretizer, LabelBinarizer, LabelEncoder
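A sketch of two preprocessing utilities from the lists above, assuming a minisklearn.preprocessing module path and fit/transform methods (assumptions):

```python
import numpy as np
from minisklearn.preprocessing import StandardScaler, LabelEncoder  # assumed paths

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)
print(scaler.transform(X))   # each column rescaled to zero mean, unit variance

le = LabelEncoder()
le.fit(["cat", "dog", "cat"])
print(le.transform(["cat", "dog", "cat"]))  # e.g. [0 1 0]
```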
SVM
- LinearSVC: Linear support vector classifier for binary classification tasks.
Tree
- DecisionTreeClassifier: Classification algorithm that builds a decision tree.
- DecisionTreeRegressor: Regression algorithm that builds a decision tree to predict continuous target variables.
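A sketch of the classifier on an XOR-style dataset that a linear model cannot separate, assuming a minisklearn.tree module path and fit/predict methods (assumptions):

```python
import numpy as np
from minisklearn.tree import DecisionTreeClassifier  # assumed path

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

tree = DecisionTreeClassifier()
tree.fit(X, y)
print(tree.predict(X))  # a depth-2 tree can recover [0 1 1 0]
```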
Technical Stack
- Programming Language: Python
- Libraries and Tools: NumPy, matplotlib, scikit-learn (as reference)
- Development Environment: Jupyter Notebooks
Architecture
- Modular Design: Follows a structure similar to sklearn, with modules for different machine learning tasks.
- Components: Each module contains classes and functions for specific tasks.
- Interconnection: Modules follow shared conventions so they can be combined into end-to-end workflows.
Implementation Details
- Key Algorithms: KMeans, PCA, Linear Regression, and various metrics.
- Code Structure: Organized into modules with hierarchical class structures, promoting code reusability and maintainability.
- Documentation: Comprehensive documentation provided for classes and functions.
Challenges and Solutions
- Challenge: Ensuring efficiency and scalability.
- Solution: Employing vectorization and algorithmic optimizations.
Testing
- Test Cases: Designed comprehensive test cases covering various scenarios.
- Testing Framework: Used pytest for automated testing to ensure correctness and consistency across implementations.
- Results: Outputs closely match those of the corresponding scikit-learn implementations.
Packaging
- Distribution: The project is packaged and distributed using pip. To install, use the following command: pip install minisklearn
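After installing, a quick sanity check might look like the following; the top-level package name is assumed to match the pip distribution name, and the __version__ attribute is an assumption:

```python
# Install first:  pip install minisklearn
import minisklearn

# Fall back to a plain message if the package does not expose __version__.
print(getattr(minisklearn, "__version__", "minisklearn imported successfully"))
```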
Future Work
- Enhancements: Addition of new algorithms and optimization techniques.
- Scalability: Further scaling of the project to handle larger datasets and more complex tasks.
Conclusion
- Summary: Successfully implemented core machine learning algorithms and utilities, creating a modular and extensible codebase.
- Learnings: Gained valuable insights into algorithm design and modular programming.
- Acknowledgements: Special thanks to collaborators, mentors, and resources that supported this project.