
Mini Scikit Learn

Project description

Mini Sklearn Implementation Project

Overview

This project implements a mini version of the sklearn library, providing essential machine learning tools and algorithms written from scratch. It replicates key sklearn functionality, offering a customizable and scalable toolkit for data analysis and modeling.

Motivation

  • Learning Objective: To gain a deeper understanding of the inner workings of machine learning algorithms by implementing them from scratch.
  • Skill Enhancement: To improve programming and problem-solving skills.
  • Custom Solution: To provide a lightweight and customizable alternative to the comprehensive sklearn library.

Goals

  1. Core Implementations:
    • To implement core machine learning algorithms and utilities from scratch.
  2. Efficiency and Reliability:
    • To ensure the implementations are efficient and reliable.
  3. Modular and Extensible Codebase:
    • To create a modular and extensible codebase for future enhancements.

Modules Implemented

Cluster

  • KMeans: Clustering algorithm that partitions data into k clusters.
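
As an illustration of the underlying technique (not necessarily the library's exact code), a minimal NumPy sketch of Lloyd's iteration for k-means looks like this:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points
            # (keep the old centroid if a cluster happens to be empty).
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels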

Covariance

  • EmpiricalCovariance: Computes the covariance matrix of a dataset.
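
The empirical covariance is just the centered second moment of the data; an illustrative sketch of the computation (using the maximum-likelihood estimate, which divides by n):

    import numpy as np

    def empirical_covariance(X):
        """Covariance matrix of the rows of X (divides by the number of samples)."""
        Xc = X - X.mean(axis=0)
        return Xc.T @ Xc / X.shape[0]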

Decomposition

  • PCA (Principal Component Analysis): Dimensionality reduction technique that identifies principal components.
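
A compact sketch of the technique itself, via the eigendecomposition of the covariance matrix (the library's internal implementation may differ, e.g. by using an SVD):

    import numpy as np

    def pca(X, n_components):
        """Project X onto its top principal components."""
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / (len(X) - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:n_components]  # top components first
        components = eigvecs[:, order]                    # columns are principal axes
        return Xc @ components                            # transformed data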

Feature Extraction

  • SimpleCountVectorizer: Converts text documents into a matrix of token counts.
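
For illustration, a minimal sketch of the counting step, assuming whitespace tokenization:

    import numpy as np

    def count_vectorize(docs):
        """Build a vocabulary and a document-term count matrix."""
        vocab = {}
        for doc in docs:
            for tok in doc.lower().split():
                vocab.setdefault(tok, len(vocab))
        counts = np.zeros((len(docs), len(vocab)), dtype=int)
        for i, doc in enumerate(docs):
            for tok in doc.lower().split():
                counts[i, vocab[tok]] += 1
        return counts, vocab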

Feature Selection

  • FeatureSelector: Utility for selecting a subset of features based on certain criteria.

Impute

  • MissingIndicator: Detects missing values and encodes their presence as binary indicators.
  • SimpleImputer: Handles missing data by imputing values using specified statistics.
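
A small NumPy sketch of both ideas (mean imputation shown; the actual classes support the strategies listed above):

    import numpy as np

    X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])

    # MissingIndicator: binary mask marking which entries are missing.
    indicator = np.isnan(X).astype(int)

    # SimpleImputer (mean strategy): replace NaNs with the column mean.
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(np.isnan(X), col_means, X)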

Linear Model

  • LinearRegression: Predicts target variables as a linear combination of input features.
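
The core of the technique is ordinary least squares; an illustrative sketch with an intercept term:

    import numpy as np

    def fit_linear_regression(X, y):
        """Ordinary least squares with an intercept, solved via np.linalg.lstsq."""
        Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend a bias column
        coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        intercept, weights = coef[0], coef[1:]
        return intercept, weights

    def predict(X, intercept, weights):
        return intercept + X @ weights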

Metrics

  • Classification Metrics: accuracy_score, average_precision_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score, roc_curve
  • Regression Metrics: mean_absolute_error, mean_squared_error, median_absolute_error, r2_score
  • Similarity Metrics: cosine_similarity, cosine_distances, euclidean_distances
  • Kernel Functions: linear_kernel, rbf_kernel
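
A few of these metrics reduce to one-line NumPy expressions; an illustrative sketch of their definitions:

    import numpy as np

    def accuracy_score(y_true, y_pred):
        # Fraction of predictions that match the true labels.
        return np.mean(np.asarray(y_true) == np.asarray(y_pred))

    def mean_squared_error(y_true, y_pred):
        # Average squared difference between predictions and targets.
        return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))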

Neighbors

  • KNeighborsClassifier: Classification based on the majority class of nearest neighbors.
  • KNeighborsRegressor: Regression based on the average value of nearest neighbors.
  • NearestCentroid: Classifies data points by assigning them to the class whose centroid is nearest.
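
To illustrate the KNeighborsClassifier idea, a minimal majority-vote sketch (the regressor would average the neighbours' target values instead):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, X_test, k=3):
        """Predict each test point's label by majority vote among its k nearest neighbours.

        y_train is assumed to be a 1-D NumPy array of labels.
        """
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)
            nearest = np.argsort(dists)[:k]
            preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
        return np.array(preds)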

Model Selection

  • KFold: Cross-validation technique that splits data into k folds.
  • GridSearchCV: Hyperparameter tuning method that searches through a specified parameter grid.
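
A minimal sketch of the k-fold splitting idea (GridSearchCV then simply loops over the parameter grid, scoring each candidate on these folds):

    import numpy as np

    def kfold_indices(n_samples, n_splits=5):
        """Yield (train_idx, test_idx) pairs that partition the data into k folds."""
        indices = np.arange(n_samples)
        for fold in np.array_split(indices, n_splits):
            train = np.setdiff1d(indices, fold)
            yield train, fold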

Naive Bayes

  • GaussianNB: Naive Bayes classifier assuming features follow a Gaussian distribution.
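
An illustrative sketch of the technique: each class is summarised by per-feature means and variances, and prediction picks the class with the highest Gaussian log-likelihood plus log prior.

    import numpy as np

    def gaussian_nb_predict(X_train, y_train, X_test):
        classes = np.unique(y_train)
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            mean, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # small epsilon for stability
            log_prior = np.log(len(Xc) / len(X_train))
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
            scores.append(log_prior + log_lik)
        return classes[np.argmax(scores, axis=0)]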

Preprocessing

  • Scalers: MaxAbsScaler, MinMaxScaler, Normalizer, RobustScaler, StandardScaler
  • Encoders and Discretizers: Binarizer, KBinsDiscretizer, LabelBinarizer, LabelEncoder
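
A few of these transformations, sketched directly in NumPy for illustration:

    import numpy as np

    X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

    # StandardScaler: zero mean, unit variance per column.
    X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

    # MinMaxScaler: rescale each column to [0, 1].
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # LabelEncoder: map class labels to integers 0..n_classes-1.
    labels = np.array(["cat", "dog", "cat"])
    classes, encoded = np.unique(labels, return_inverse=True)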

SVM

  • LinearSVC: Linear support vector classifier for binary classification tasks.

Tree

  • DecisionTreeClassifier: Classification algorithm that builds a decision tree.
  • DecisionTreeRegressor: Regression algorithm that builds a decision tree to predict continuous target variables.
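
The classifier's splits are typically chosen by minimising an impurity criterion such as Gini; a small illustrative sketch of that criterion (the regressor would use a variance-based criterion instead):

    import numpy as np

    def gini(y):
        """Gini impurity of a label array."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_quality(y_left, y_right):
        """Weighted impurity of a candidate split; the tree picks the split minimising this."""
        n = len(y_left) + len(y_right)
        return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n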

Technical Stack

  • Programming Language: Python
  • Libraries and Tools: NumPy, matplotlib, scikit-learn (as reference)
  • Development Environment: Jupyter Notebooks

Architecture

  • Modular Design: Follows a structure similar to sklearn, with modules for different machine learning tasks.
  • Components: Each module contains classes and functions for specific tasks.
  • Interconnection: Modules follow consistent conventions so they can be combined into end-to-end workflows.
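
A hedged sketch of the kind of estimator contract such a design implies; the actual base classes and method names in minisklearn may differ:

    class BaseEstimator:
        """Minimal estimator contract: learn from data, then predict or transform."""

        def fit(self, X, y=None):
            # Concrete estimators implement this and return self.
            raise NotImplementedError

    class TransformerMixin:
        def fit_transform(self, X, y=None):
            # Convenience shortcut shared by transformers; relies on fit returning self.
            return self.fit(X, y).transform(X)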

Implementation Details

  • Key Algorithms: KMeans, PCA, Linear Regression, and various metrics.
  • Code Structure: Organized into modules with hierarchical class structures, promoting code reusability and maintainability.
  • Documentation: Comprehensive documentation provided for classes and functions.

Challenges and Solutions

  • Challenge: Ensuring efficiency and scalability.
  • Solution: Employing vectorization and algorithmic optimizations.
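
As an example of the kind of optimisation involved, pairwise Euclidean distances can be computed with a broadcasted NumPy expression instead of Python-level loops:

    import numpy as np

    def euclidean_distances_loop(A, B):
        """Naive double loop: every pair is handled in interpreted Python."""
        out = np.empty((len(A), len(B)))
        for i, a in enumerate(A):
            for j, b in enumerate(B):
                out[i, j] = np.sqrt(((a - b) ** 2).sum())
        return out

    def euclidean_distances_vectorized(A, B):
        """Same result via broadcasting; the inner loops run in compiled NumPy code."""
        diff = A[:, None, :] - B[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))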

Testing

  • Test Cases: Designed comprehensive test cases covering various scenarios.
  • Testing Framework: Used pytest for automated testing to ensure correctness and consistency across implementations.
  • Results: Outputs closely match those of the corresponding sklearn implementations.
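
A hypothetical example of such a test, comparing against the scikit-learn reference; the minisklearn import path shown is assumed and may differ from the actual module layout:

    import numpy as np
    from sklearn.metrics import mean_squared_error as sk_mse
    # Hypothetical import path; check the package for the actual module structure.
    from minisklearn.metrics import mean_squared_error

    def test_mean_squared_error_matches_sklearn():
        rng = np.random.default_rng(0)
        y_true, y_pred = rng.normal(size=50), rng.normal(size=50)
        assert np.isclose(mean_squared_error(y_true, y_pred), sk_mse(y_true, y_pred))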

Packaging

  • Distribution: The project is packaged and published to PyPI, so it can be installed with pip: pip install minisklearn
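
After installation, usage is intended to mirror sklearn's estimator API; a hypothetical example (the exact import paths are assumed and may differ):

    import numpy as np
    # Hypothetical import; check the package for the actual module path.
    from minisklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([2.0, 4.0, 6.0])

    model = LinearRegression()
    model.fit(X, y)
    print(model.predict(np.array([[4.0]])))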

Future Work

  • Enhancements: Addition of new algorithms and optimization techniques.
  • Scalability: Further scaling of the project to handle larger datasets and more complex tasks.

Conclusion

  • Summary: Successfully implemented core machine learning algorithms and utilities, creating a modular and extensible codebase.
  • Learnings: Gained valuable insights into algorithm design and modular programming.
  • Acknowledgements: Special thanks to collaborators, mentors, and resources that supported this project.



Download files

Download the file for your platform.

Source Distribution

minisklearn-0.1.3.tar.gz (29.5 kB, Source)

Built Distribution

minisklearn-0.1.3-py3-none-any.whl (36.7 kB, Python 3)

File details

Details for the file minisklearn-0.1.3.tar.gz.

File metadata

  • Download URL: minisklearn-0.1.3.tar.gz
  • Upload date:
  • Size: 29.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.3

File hashes

Hashes for minisklearn-0.1.3.tar.gz:

  • SHA256: 5fe24eee8d1f7c93af0de15d2f822648e28f524864c023bd92fe5712203563de
  • MD5: 9fd147ab162af062e53a018d716b2d2c
  • BLAKE2b-256: f84360b88c47698933a5a57a359c8f18b39d84dc7a7bffd9db3a2e67d639ad53


File details

Details for the file minisklearn-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: minisklearn-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 36.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.3

File hashes

Hashes for minisklearn-0.1.3-py3-none-any.whl:

  • SHA256: 20ecf1bb8f61dd9d4416c90d6852a9e92255448ec3be0325e20297fb6935f11c
  • MD5: 53c880de261309b07db8e424e7ab0963
  • BLAKE2b-256: 6640d90a0aa1faa2040f9e135816c452a0622db6fba0246214e82812f9fb7be2

