A modular, object-oriented framework for machine learning and data preprocessing

Machine Learning & Data Preprocessing Library

Introduction

This is a simple machine learning library that implements Linear Regression, a KNN classifier, and several data preprocessing algorithms from scratch, built on NumPy and pandas.


Mapping Core Learning Outcomes

The project applies all six required patterns. Each one is explained below.

1. Object-Oriented Programming (OOP)

  • Where: In core.py and data.py.
  • How:
    • Inheritance & Abstraction: Employs abstract base classes (BaseAlgorithm, RegressionStrategy, DistanceMetric, DataLoader, DataCreator, ImputeStrategy, EncodingStrategy) to enforce blueprints.
    • Polymorphism: Concrete implementations stand in for their base classes at runtime. For example, LinearRegression delegates .train() to whichever regression strategy it was given, so the training algorithm can change without altering LinearRegression itself.
    • Encapsulation: Internal state is protected. In data.py, the raw dataframe is stored in the protected attribute self._data and exposed through a @property getter.
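
The three ideas above can be illustrated with a minimal sketch. It is not the library's code: the class names come from the list above, but the impute method name, the DataProcessor constructor, and the use of plain Python lists in place of a pandas dataframe are assumptions made to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from statistics import mean


class ImputeStrategy(ABC):
    """Abstract blueprint: subclasses must implement impute()."""

    @abstractmethod
    def impute(self, values):
        """Return a copy of values with missing entries (None) filled in."""


class MeanImputer(ImputeStrategy):
    def impute(self, values):
        # Fill every missing entry with the mean of the present ones.
        present = [v for v in values if v is not None]
        fill = mean(present)
        return [fill if v is None else v for v in values]


class DataProcessor:
    def __init__(self, data):
        # Assumed constructor; "protected" by naming convention only.
        self._data = data

    @property
    def data(self):
        # Getter without a setter: assigning to .data raises AttributeError.
        return self._data
```

Because ImputeStrategy declares an abstract method, instantiating it directly raises TypeError, which is how the base classes enforce their blueprints.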

2. Functional Programming

  • Where: In core.py and utils.py.
  • How:
    • Pure Functions & Lambda: evaluate_model does not modify external state and depends only on its input arguments; it computes the mean squared error with a lambda.
    • Higher-Order Functions & Map/Reduce:
      • reduce combined with lambda is used inside evaluate_model to sum squared errors.
      • map is used inside series_to_ndarray to cast the values of a pandas Series to floats.
      • apply_pipeline uses reduce to apply a list of transformation callables to the data in sequence (reduce(lambda d, func: func(d), transformations, data)).
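
A minimal sketch of these functional building blocks, under stated assumptions: the signatures of evaluate_model and apply_pipeline are guesses, and series_to_floats is a hypothetical stand-in for series_to_ndarray that returns a plain list so the example needs only the standard library.

```python
from functools import reduce


def evaluate_model(y_true, y_pred):
    """Mean squared error. Pure: reads only its arguments, mutates nothing."""
    squared_error_sum = reduce(
        lambda acc, pair: acc + (pair[0] - pair[1]) ** 2,
        zip(y_true, y_pred),
        0.0,
    )
    return squared_error_sum / len(y_true)


def series_to_floats(values):
    # Hypothetical stand-in for series_to_ndarray: map casts each value to float.
    return list(map(float, values))


def apply_pipeline(data, transformations):
    # Feed the output of each callable into the next one, left to right.
    return reduce(lambda d, func: func(d), transformations, data)
```

For instance, apply_pipeline(["1", "2"], [series_to_floats, sum]) first casts the strings to floats and then sums them.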

3. Concurrency (Multi-threading)

  • Where: Implemented in core.py inside the KNNClassifier class.
  • How:
    • Predicting classes for a large feature matrix one sample at a time is computationally expensive. The predict method therefore starts one threading.Thread per sample to be classified.
    • The _predict_single worker computes the prediction for a single row and writes it into a shared, pre-allocated numpy results array at that row's index (results[index]), so each thread writes to its own slot.
    • predict calls t.start() on every thread and then t.join() on each one, blocking the main thread until all predictions have finished.
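
The start/join pattern described above might look roughly like the sketch below. It is an illustration, not the library's implementation: the fit method, the constructor arguments, the manhattan helper, and the use of a plain list instead of a numpy results array are all assumptions.

```python
import threading


def manhattan(a, b):
    # Hypothetical helper standing in for a DistanceMetric object.
    return sum(abs(x - y) for x, y in zip(a, b))


class KNNClassifier:
    def __init__(self, k=3, distance=manhattan):
        self.k = k
        self.distance = distance

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def _predict_single(self, sample, index, results):
        # Rank training rows by distance to the sample and keep the k nearest.
        order = sorted(range(len(self.X)),
                       key=lambda i: self.distance(self.X[i], sample))
        labels = [self.y[i] for i in order[: self.k]]
        # Majority vote; each thread writes only to its own slot.
        results[index] = max(set(labels), key=labels.count)

    def predict(self, X):
        results = [None] * len(X)  # pre-allocated, shared between threads
        threads = [
            threading.Thread(target=self._predict_single, args=(row, i, results))
            for i, row in enumerate(X)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # block until every worker has finished
        return results
```

No lock is needed here because every worker writes to a distinct index. In standard CPython builds the global interpreter lock limits how much pure-Python work actually runs in parallel, so the benefit depends on how much time the workers spend inside NumPy.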

4. Recursion / Dynamic Programming

  • Where: In core.py inside the EuclideanDistance class.
  • How:
    • Distance metrics are usually computed with loops or high-level library functions. This implementation instead accumulates the element-wise squared differences with a custom recursive function, recursive_sum_sq(a, b, idx).
    • It adds the squared difference at each index until it reaches the base case (idx == len(a)); the square root of the accumulated sum is the Euclidean distance.
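
As a rough sketch of the recursion described above: the name recursive_sum_sq and its base case come from the text, while the default idx=0, the euclidean_distance wrapper, and taking the square root outside the recursion are assumptions about how the pieces fit together.

```python
import math


def recursive_sum_sq(a, b, idx=0):
    """Sum of squared differences between a and b from position idx onward."""
    if idx == len(a):  # base case: past the last element
        return 0.0
    return (a[idx] - b[idx]) ** 2 + recursive_sum_sq(a, b, idx + 1)


def euclidean_distance(a, b):
    # Hypothetical wrapper: square root of the recursively accumulated sum.
    return math.sqrt(recursive_sum_sq(a, b, 0))
```

Each element costs one stack frame, so this approach only suits vectors shorter than Python's recursion limit (1000 frames by default).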

5. SOLID Principles

  • Where: In core.py and data.py.
  • How:
    • Single Responsibility Principle (SRP): Each class has a single responsibility. CSVLoader only loads data; MeanImputer only fills in missing values; DataProcessor focuses on data manipulation.
    • Open/Closed Principle (OCP): The system is open for extension but closed for modification. Introducing a new distance metric (e.g., Cosine Distance) requires subclassing DistanceMetric without touching KNNClassifier.
    • Liskov Substitution Principle (LSP): Derived classes are interchangeable with their abstractions. Any encoder (LabelEncoder, OneHotEncoder, TargetEncoder) satisfies the interface that DataProcessor expects.
    • Interface Segregation Principle (ISP): Interfaces stay small and decoupled. RegressionStrategy requires a single method (train) and imposes no unrelated ones.
    • Dependency Inversion Principle (DIP): High-level classes depend on abstractions rather than on concrete low-level modules. LinearRegression depends only on the RegressionStrategy interface, which decouples the model from any specific training algorithm.
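
The Open/Closed example mentioned above (adding a Cosine Distance) can be sketched as follows. The compute method name and the nearest_index helper are assumptions; nearest_index stands in for the part of KNNClassifier that consumes a metric and is not the library's code.

```python
import math
from abc import ABC, abstractmethod


class DistanceMetric(ABC):
    @abstractmethod
    def compute(self, a, b):
        """Return the distance between vectors a and b."""


class ManhattanDistance(DistanceMetric):
    def compute(self, a, b):
        return sum(abs(x - y) for x, y in zip(a, b))


class CosineDistance(DistanceMetric):
    # New metric added purely by subclassing; assumes non-zero vectors.
    def compute(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)


def nearest_index(rows, sample, metric):
    # Hypothetical consumer: depends only on the DistanceMetric interface.
    return min(range(len(rows)), key=lambda i: metric.compute(rows[i], sample))
```

nearest_index never changes when a metric is added, yet its result can: for rows [[10, 10], [1, 0]] and sample [1, 1], the cosine metric picks the first row (same direction) while the Manhattan metric picks the second (smaller coordinate gap).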

6. Architectural & Design Patterns

  • Where: Throughout data.py and core.py.
  • How:
    • Pipeline Architecture: DataPipeline chains file checking, factory creation, loading, and feature preparation into a single linear API (run_default_preprocessing).
    • Strategy Pattern: Implemented multiple times to provide interchangeable components:
      • Optimization algorithms in LinearRegression via LeastSquaresStrategy and GradientDescentStrategy.
      • Distance formulations in KNNClassifier via EuclideanDistance and ManhattanDistance.
      • Data imputation in DataProcessor via MeanImputer, MedianImputer, and ModeImputer.
      • Variable transformations via LabelEncoder, OneHotEncoder, and TargetEncoder.
    • Factory Method Pattern: Creates the appropriate data loader without binding the pipeline to a concrete loader class. DataCreator is the creator interface and declares create_document(). The concrete creators CSVCreator and JSONCreator override it to instantiate and return a CSVLoader or JSONLoader respectively, keeping instantiation out of the main pipeline.
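
A minimal sketch of the Factory Method arrangement, assuming details the description does not give: the load and read methods, the creator_for helper, and loading from an in-memory string rather than a file path are all hypothetical, chosen so the example runs without any files on disk.

```python
import csv
import io
import json
import os
from abc import ABC, abstractmethod


class DataLoader(ABC):
    @abstractmethod
    def load(self, text):
        """Parse raw text into a list of records."""


class CSVLoader(DataLoader):
    def load(self, text):
        return list(csv.DictReader(io.StringIO(text)))


class JSONLoader(DataLoader):
    def load(self, text):
        return json.loads(text)


class DataCreator(ABC):
    @abstractmethod
    def create_document(self):
        """Factory method: return a concrete DataLoader."""

    def read(self, text):
        # Works with whichever loader the subclass decides to create.
        return self.create_document().load(text)


class CSVCreator(DataCreator):
    def create_document(self):
        return CSVLoader()


class JSONCreator(DataCreator):
    def create_document(self):
        return JSONLoader()


def creator_for(filename):
    # Hypothetical helper: choose a creator from the file extension.
    creators = {".csv": CSVCreator, ".json": JSONCreator}
    return creators[os.path.splitext(filename)[1].lower()]()
```

Calling code such as creator_for("data.csv").read(text) never names CSVLoader or JSONLoader, so supporting a new format means adding a loader and a creator rather than editing the pipeline.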
