Basic ML algorithms library built from scratch (KNN + Linear Regression)

CoreLearn

A lightweight Python machine learning library built from scratch using only NumPy.
Implements KNN classification and Linear Regression with a focus on software design, not just accuracy.


Installation

# Clone or download the project, then from the coreLearn/ directory:
pip install -e .

# Install all dependencies (including dev tools):
pip install -r requirements.txt

After installation, import from anywhere:

from coreLearn import KNNClassifier, LinearRegression, Evaluator

Quick Start

from coreLearn import KNNClassifier, LinearRegression, Evaluator, accuracy, mae

# X_train, y_train, X_test, y_test are your own data:
# class labels for KNN, numeric targets for regression.

# --- KNN Classification ---
knn = KNNClassifier(k=5, distance="euclidean", n_jobs=2)
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_test)
print(accuracy(y_test, knn_preds))

# --- Linear Regression ---
lr = LinearRegression(strategy="normal")
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
print(mae(y_test, lr_preds))

# --- Evaluator ---
print(Evaluator.evaluate_regression(y_test, lr_preds))
# {'mae': ..., 'mse': ..., 'rmse': ...}

print(Evaluator.evaluate_classification(y_test, knn_preds))
# {'accuracy': ..., 'precision': ..., 'recall': ..., 'f1': ...}

Package Structure

coreLearn/
├── __init__.py          ← Public API
├── base.py              ← Abstract base class — Template Method Pattern
├── distances.py         ← Distance metrics — Factory Pattern
├── knn.py               ← KNN Classifier — Recursion + Concurrency + OOP
├── linear_regression.py ← Linear Regression — Strategy Pattern + OOP
├── evaluator.py         ← Metric engine — Functional Programming
├── examples/
│   ├── demo_notebook.ipynb
│   ├── housing.csv
│   └── penguin.csv
└── tests/
    ├── test_knn.py
    ├── test_linear_regression.py
    ├── test_distances.py
    └── test_evaluator.py

Running Tests

cd coreLearn/
pytest coreLearn/tests/ -v

Learning Outcomes

1 — Object-Oriented Programming (OOP)

File: base.py, knn.py, linear_regression.py, distances.py

Abstract Base Class & Inheritance

BaseModel is an abstract class that defines the contract every model must follow.
KNNClassifier and LinearRegression both inherit from it:

# base.py
from abc import ABC, abstractmethod

class BaseModel(ABC):
    @abstractmethod
    def fit(self, X, y) -> "BaseModel": ...

    @abstractmethod
    def predict(self, X) -> list: ...

# knn.py
class KNNClassifier(BaseModel):   # ← inheritance
    def fit(self, X, y): ...
    def predict(self, X): ...

# linear_regression.py
class LinearRegression(BaseModel):  # ← inheritance
    def fit(self, X, y): ...
    def predict(self, X): ...

Polymorphism

Both models share the same interface — they can be used interchangeably:

for model in [KNNClassifier(k=3), LinearRegression()]:
    model.fit(X_train, y_train)   # same call, different behaviour
    model.predict(X_test)         # same call, different behaviour

Encapsulation

Internal state is hidden with _ prefixes. Users interact only through the public API:

# knn.py
self._metric = DistanceMetricFactory.create(distance)  # private
self._tree   = None                                     # private

# linear_regression.py — controlled read access via properties
@property
def coef_(self) -> np.ndarray:
    return self._weights[1:]

@property
def intercept_(self) -> float:
    return float(self._weights[0])

OptimizationStrategy, NormalEquationStrategy, and GradientDescentStrategy inside
linear_regression.py form an additional hierarchy demonstrating inheritance within the library.


2 — Functional Programming

File: evaluator.py

Functions as First-Class Objects

Metric functions are stored in dictionaries as values and called dynamically:

# evaluator.py
_regression_metrics: dict[str, callable] = {
    "mae":  mae,
    "mse":  mse,
    "rmse": rmse,
}

@classmethod
def evaluate_regression(cls, y_true, y_pred) -> dict:
    # applies every registered function — no if/elif chain
    return {name: fn(y_true, y_pred) for name, fn in cls._regression_metrics.items()}

Higher-Order Function — register()

Evaluator.register() accepts any callable and plugs it in at runtime.
This is the classic higher-order function pattern: a function (or method) that takes another function as an argument.

# Add a custom metric without modifying the Evaluator class
Evaluator.register(
    "max_error",
    lambda y_true, y_pred: max(abs(a - b) for a, b in zip(y_true, y_pred)),
    kind="regression",
)
result = Evaluator.evaluate_regression(y_test, y_pred)
print(result["max_error"])   # available immediately
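
The following is a minimal sketch of how a register() method of this kind could work. MiniEvaluator is a hypothetical stand-in written for illustration, not the library's actual source; only the "regression" and "classification" kinds and the dictionary-of-functions idea are taken from the text above.

```python
from typing import Callable, Dict, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]


def mae(y_true, y_pred):
    """Mean absolute error (assumed implementation)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


class MiniEvaluator:
    """Hypothetical stand-in for Evaluator, for illustration only."""

    _regression_metrics: Dict[str, Metric] = {"mae": mae}
    _classification_metrics: Dict[str, Metric] = {}

    @classmethod
    def register(cls, name: str, fn: Metric, kind: str = "regression") -> None:
        # Higher-order: `fn` is a function passed in as an argument.
        if not callable(fn):
            raise TypeError("fn must be callable")
        registries = {
            "regression": cls._regression_metrics,
            "classification": cls._classification_metrics,
        }
        if kind not in registries:
            raise ValueError(f"unknown kind: {kind!r}")
        registries[kind][name] = fn

    @classmethod
    def evaluate_regression(cls, y_true, y_pred) -> dict:
        # Apply every registered function; no if/elif chain needed.
        return {name: fn(y_true, y_pred)
                for name, fn in cls._regression_metrics.items()}


MiniEvaluator.register(
    "max_error",
    lambda y_true, y_pred: max(abs(a - b) for a, b in zip(y_true, y_pred)),
)
result = MiniEvaluator.evaluate_regression([1.0, 2.0, 3.0], [1.5, 2.0, 5.0])
print(result)
```

Because the registry is a plain dictionary, registering a name that already exists simply replaces the earlier function; whether the real Evaluator allows that is not stated in this README.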

Pure Functions

mae, mse, rmse, accuracy, precision, recall, f1_score are all pure functions:

  • No side effects
  • No mutation of inputs
  • Same inputs always produce the same output

from coreLearn import mae, accuracy
mae([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])   # → 0.5 (always)
accuracy([0, 1, 1], [0, 1, 0])           # ≈ 0.667, i.e. 2 of 3 correct (always)
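
To make the three properties concrete, here is a small sketch of what pure versions of two of these metrics might look like. The function bodies are assumptions written for illustration; the library's own implementations may differ (for example, they may use NumPy).

```python
def mae(y_true, y_pred):
    """Mean absolute error: reads its inputs, returns a new float."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.5, 3.5]
snapshot = (list(y_true), list(y_pred))   # copies, to check for mutation

first = mae(y_true, y_pred)
second = mae(y_true, y_pred)              # same inputs, same output

print(first, second)
print((y_true, y_pred) == snapshot)       # inputs were not modified
```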

3 — Concurrency

File: knn.py → KNNClassifier.predict()

KNNClassifier uses ProcessPoolExecutor to classify test samples in parallel across
multiple CPU processes. Unlike threads, each worker runs in its own process with its
own GIL — enabling true CPU-bound parallelism.

# knn.py
def predict(self, X) -> list:
    ...
    if self.n_jobs == 1:
        # sequential — no overhead for small datasets
        return [self._predict_one(x) for x in samples]

    # parallel — distribute samples across n_jobs worker processes
    args = [(self._tree, x, self.k, self._metric) for x in samples]
    with ProcessPoolExecutor(max_workers=self.n_jobs) as executor:
        return list(executor.map(_predict_worker, args))

Why no race conditions?
Each worker receives its own pickled copy of the KD-Tree and metric via ProcessPoolExecutor.
No shared memory is used, so no synchronization primitives are needed.

# n_jobs=1  → sequential (default, safe for notebooks)
knn = KNNClassifier(k=5, n_jobs=1)

# n_jobs=4  → 4 parallel worker processes
knn = KNNClassifier(k=5, n_jobs=4)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

Note: on Windows and macOS, where worker processes are started with the spawn method,
scripts that call predict() with n_jobs > 1 must do so under an if __name__ == "__main__":
guard. The n_jobs=1 default is safe everywhere.
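
As an illustration of the guard, here is a self-contained sketch of the same sequential/parallel split. It does not use coreLearn: _nearest_label and predict_all are hypothetical stand-ins for _predict_worker and KNNClassifier.predict(), and the worker does a brute-force 1-nearest-neighbour lookup instead of a KD-Tree search.

```python
from concurrent.futures import ProcessPoolExecutor


def _nearest_label(args):
    """Worker. Defined at module level so it can be pickled by reference."""
    train, x = args
    # Brute-force 1-NN using squared Euclidean distance.
    _, label = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )
    return label


def predict_all(train, samples, n_jobs=1):
    args = [(train, x) for x in samples]
    if n_jobs == 1:
        # Sequential path: no processes are started.
        return [_nearest_label(a) for a in args]
    # Parallel path: each task carries its own pickled copy of `train`.
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        return list(executor.map(_nearest_label, args))


TRAIN = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
SAMPLES = [(0.2, 0.1), (4.0, 6.0), (1.0, 1.0)]

if __name__ == "__main__":
    # Only the process that runs the script reaches this line; spawned
    # workers re-import the module and skip it.
    print(predict_all(TRAIN, SAMPLES, n_jobs=2))
```

Without the guard, a spawned worker would re-run the top-level pool creation while importing the script, which Python reports as a RuntimeError.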


4 — Recursion

File: knn.py → KDTree

The KD-Tree data structure is built and searched using direct recursion.
Both _build and _search call themselves with a strictly smaller subproblem each time.

_build — Recursive Tree Construction

Base case: empty data → return None.
Recursive case: split on the median, call _build on each half with depth + 1.

# knn.py
def _build(self, data: list, depth: int):
    if not data:          # ← base case
        return None
    axis = depth % len(data[0][0])
    data.sort(key=lambda item: item[0][axis])
    mid = len(data) // 2
    return KDNode(
        point = data[mid][0],
        label = data[mid][1],
        left  = self._build(data[:mid],     depth + 1),  # ← recursion
        right = self._build(data[mid + 1:], depth + 1),  # ← recursion
    )

_search — Recursive Nearest-Neighbour Search

Base case: node is None → return.
Recursive case: visit the near branch, then prune and optionally visit the far branch.

# knn.py
def _search(self, node, target, k, metric, depth, best):
    if node is None:      # ← base case
        return
    dist = metric(target, node.point)
    # update best list ...
    self._search(near, target, k, metric, depth + 1, best)  # ← recursion
    if len(best) < k or abs(diff) < best[-1][0]:
        self._search(far, target, k, metric, depth + 1, best)  # ← recursion (pruned)

Pruning: the abs(diff) < best[-1][0] condition skips the far branch when it cannot
contain a closer neighbour. On low-dimensional data this gives roughly O(log n)
average-case search; the worst case remains O(n).
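
The two excerpts above leave out how near, far, and diff are chosen and how the best list is maintained. The sketch below fills those gaps in one plausible way; it is not the library's code. Node, build, search, and predict_one are hypothetical names, Euclidean distance via math.dist is assumed, and best is kept as a list of (distance, label) pairs sorted by distance.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    point: tuple
    label: object
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(data, depth=0):
    """data is a list of (point, label) pairs."""
    if not data:                                  # base case
        return None
    axis = depth % len(data[0][0])                # cycle through dimensions
    data = sorted(data, key=lambda item: item[0][axis])  # sorted copy
    mid = len(data) // 2
    return Node(
        point=data[mid][0],
        label=data[mid][1],
        left=build(data[:mid], depth + 1),        # recursion
        right=build(data[mid + 1:], depth + 1),   # recursion
    )


def search(node, target, k, depth, best):
    """Collect the k nearest (distance, label) pairs into best."""
    if node is None:                              # base case
        return
    best.append((math.dist(target, node.point), node.label))
    best.sort(key=lambda pair: pair[0])
    del best[k:]                                  # keep only the k closest
    axis = depth % len(target)
    diff = target[axis] - node.point[axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    search(near, target, k, depth + 1, best)
    # The far side can only help if the splitting plane is closer than
    # the current k-th best distance (or fewer than k points were found).
    if len(best) < k or abs(diff) < best[-1][0]:
        search(far, target, k, depth + 1, best)


def predict_one(tree, target, k):
    best = []
    search(tree, target, k, 0, best)
    return Counter(label for _, label in best).most_common(1)[0][0]


points = [((2, 3), "a"), ((5, 4), "b"), ((9, 6), "a"),
          ((4, 7), "b"), ((8, 1), "a"), ((7, 2), "b")]
tree = build(points)
print(predict_one(tree, (9, 2), k=3))
```

The pruning test compares a per-axis difference with a full distance, which is valid for Euclidean and Manhattan distance because a single coordinate difference never exceeds either of them.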


5 — SOLID Principles

Files: all modules

S — Single Responsibility

Every class has exactly one reason to change:

Class                    Sole Responsibility
BaseModel                Define the common model contract
KDTree                   Spatial nearest-neighbour search
KNNClassifier            KNN classification logic
LinearRegression         Linear regression logic
NormalEquationStrategy   Closed-form weight computation
GradientDescentStrategy  Iterative gradient-based weight computation
DistanceMetricFactory    Instantiate distance metric objects by name
Evaluator                Compute and manage evaluation metrics

O — Open/Closed

Classes are open for extension, closed for modification.
New metrics and distance functions can be added without editing any existing class:

# Add a new metric — Evaluator source code untouched
Evaluator.register("r2", lambda t, p: ..., kind="regression")

# Add a new distance — KNNClassifier source code untouched
DistanceMetricFactory.register("chebyshev", ChebyshevDistance)
knn = KNNClassifier(k=3, distance="chebyshev")
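
The snippet above refers to a ChebyshevDistance class and an elided r2 lambda without showing either. Below is one possible version of each, as a sketch. The DistanceMetric base class is redefined here only so the example runs on its own; ChebyshevDistance and r2 are illustrative and not part of the library.

```python
from abc import ABC, abstractmethod


class DistanceMetric(ABC):
    """Stand-in for the interface in distances.py (single compute() method)."""

    @abstractmethod
    def compute(self, a: list, b: list) -> float: ...


class ChebyshevDistance(DistanceMetric):
    """Largest absolute difference along any single coordinate."""

    def compute(self, a: list, b: list) -> float:
        if len(a) != len(b):
            raise ValueError("points must have the same dimension")
        return float(max(abs(x - y) for x, y in zip(a, b)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError when y_true is constant.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(ChebyshevDistance().compute([1, 2, 3], [4, 0, 3]))
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

With coreLearn installed, these would be plugged in through the two register() calls shown above, passing r2 in place of the elided lambda.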

L — Liskov Substitution

Any BaseModel subclass can replace BaseModel without breaking callers:

def train_and_score(model: BaseModel, X_train, y_train, X_test, y_test):
    preds = model.fit_predict(X_train, y_train, X_test)
    return accuracy(y_test, preds)

train_and_score(KNNClassifier(k=3), ...)   # works
train_and_score(LinearRegression(), ...)   # also runs; use a regression metric such as mae for a meaningful score

I — Interface Segregation

DistanceMetric exposes only what is needed — a single compute() method.
Implementors are not forced to implement anything they do not use:

# distances.py
class DistanceMetric(ABC):
    @abstractmethod
    def compute(self, a: list, b: list) -> float: ...
    # nothing else required

D — Dependency Inversion

LinearRegression depends on the abstraction OptimizationStrategy,
not on any concrete strategy class:

# linear_regression.py
self._weights = self._strategy.fit(X_b, y)
#               ↑ OptimizationStrategy interface — concrete class unknown here

6 — Architectural & Design Patterns

Architecture: Layered

  • Core layer (base.py, distances.py): abstractions and shared contracts
  • Algorithm layer (knn.py, linear_regression.py): concrete ML algorithms
  • Evaluation layer (evaluator.py): metric computation
  • Public API (__init__.py): single entry point, re-exports everything

Pattern 1 — Template Method (base.py)

fit_predict defines the fixed skeleton (fit → predict).
Subclasses fill in each step without altering the sequence:

# base.py
def fit_predict(self, X_train, y_train, X_test) -> list:
    self.fit(X_train, y_train)   # ← step 1: implemented by subclass
    return self.predict(X_test)  # ← step 2: implemented by subclass

Every model gets fit_predict for free through inheritance.
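
A short sketch of what "for free" means in practice. BaseModel is reproduced from the excerpts above so the example is self-contained; MajorityClassifier is a hypothetical toy model that is not part of coreLearn.

```python
from abc import ABC, abstractmethod
from collections import Counter


class BaseModel(ABC):
    @abstractmethod
    def fit(self, X, y) -> "BaseModel": ...

    @abstractmethod
    def predict(self, X) -> list: ...

    def fit_predict(self, X_train, y_train, X_test) -> list:
        # Fixed skeleton: the order of the two steps never changes.
        self.fit(X_train, y_train)
        return self.predict(X_test)


class MajorityClassifier(BaseModel):
    """Toy model: always predicts the most common training label."""

    def fit(self, X, y):
        self._label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self._label for _ in X]


# fit_predict is never defined on MajorityClassifier; it is inherited.
preds = MajorityClassifier().fit_predict(
    [[0], [1], [2]], ["a", "b", "b"], [[5], [6]]
)
print(preds)
```

Instantiating BaseModel directly raises TypeError, because its abstract methods have no implementation.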

Pattern 2 — Strategy (linear_regression.py)

The optimisation algorithm is swapped at construction time.
LinearRegression.fit() never knows which concrete strategy it is using:

lr_ne = LinearRegression(strategy="normal")           # uses NormalEquationStrategy
lr_gd = LinearRegression(strategy="gradient_descent") # uses GradientDescentStrategy

# Both models have the same interface — caller code is identical
lr_ne.fit(X_train, y_train)
lr_gd.fit(X_train, y_train)

To add a third optimiser (e.g. Adam), only a new OptimizationStrategy subclass is needed.
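
For readers who want to see the shape of such a hierarchy, here is a minimal NumPy sketch. The class names match the ones mentioned in this README, but the bodies are assumptions: the closed form solves the normal equation directly, the iterative version uses plain batch gradient descent on mean squared error, and add_bias is a hypothetical helper that builds the X_b matrix with a leading column of ones.

```python
from abc import ABC, abstractmethod

import numpy as np


class OptimizationStrategy(ABC):
    @abstractmethod
    def fit(self, X_b: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Return the weight vector; weights[0] is the intercept."""


class NormalEquationStrategy(OptimizationStrategy):
    def fit(self, X_b, y):
        # Solve (X^T X) w = X^T y; raises LinAlgError if X^T X is singular.
        return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


class GradientDescentStrategy(OptimizationStrategy):
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs

    def fit(self, X_b, y):
        weights = np.zeros(X_b.shape[1])
        n = len(y)
        for _ in range(self.epochs):
            # Gradient of mean squared error with respect to the weights.
            gradient = (2.0 / n) * X_b.T @ (X_b @ weights - y)
            weights -= self.learning_rate * gradient
        return weights


def add_bias(X):
    """Hypothetical helper: prepend a column of ones for the intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


# Toy data on the line y = 1 + 2x.
X_b = add_bias([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

w_ne = NormalEquationStrategy().fit(X_b, y)
w_gd = GradientDescentStrategy(learning_rate=0.05, epochs=2000).fit(X_b, y)
print(w_ne, w_gd)
```

Both strategies return a weight vector of the same shape, so the calling code does not need to know which one produced it.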

Pattern 3 — Factory (distances.py)

DistanceMetricFactory centralises object creation.
KNNClassifier never imports EuclideanDistance or ManhattanDistance directly:

# distances.py
class DistanceMetricFactory:
    _registry = {"euclidean": EuclideanDistance, "manhattan": ManhattanDistance}

    @classmethod
    def create(cls, name: str) -> DistanceMetric:
        return cls._registry[name]()   # create and return

    @classmethod
    def register(cls, name: str, metric_class: type) -> None:
        cls._registry[name] = metric_class  # extend without modifying

# knn.py — only depends on the factory, not the concrete classes
self._metric = DistanceMetricFactory.create(distance)

API Reference

KNNClassifier

Parameter  Type  Default      Description
k          int   5            Number of neighbours
distance   str   "euclidean"  "euclidean" or "manhattan" (or any registered name)
n_jobs     int   1            Worker processes for prediction (1 = sequential)

LinearRegression

Parameter      Type   Default   Description
strategy       str    "normal"  "normal" (closed-form) or "gradient_descent"
learning_rate  float  0.01      Learning rate (gradient descent only)
epochs         int    1000      Iterations (gradient descent only)

Evaluator

Method                                   Description
evaluate_regression(y_true, y_pred)      Returns a dict with keys "mae", "mse", "rmse"
evaluate_classification(y_true, y_pred)  Returns a dict with keys "accuracy", "precision", "recall", "f1"
register(name, fn, kind)                 Add a custom metric at runtime

Standalone metric functions

from coreLearn import accuracy, mae, mse, rmse, precision, recall, f1_score

Dependencies

Package       Purpose
numpy         Matrix operations, vectorised arithmetic
pytest        Unit testing
scikit-learn  Datasets and preprocessing in examples only
pandas        Data loading in examples only
matplotlib    Visualisation in examples only
