Spam/ham classifier with an MLOps-style training pipeline.

Spam classifier

This project demonstrates how to package and train a simple spam/ham classifier with MLOps practices. It is designed for students learning how to structure ML code into modules, build training pipelines, configure via YAML, and add tests and CI.

Project structure

  • spam_classifier/ — package code (pipeline, training, inference)
  • data/ — raw and processed datasets
  • config.yaml — pipeline and training configuration
  • tests/ — pytest suite (unit + quality)
  • .github/workflows/ci.yml — GitHub Actions CI

Setup (uv)

uv venv --seed --python 3.13
uv pip install -e ".[dev]"

The minimum supported Python version is 3.11. You can use a plain venv instead of uv, but the project CI and Makefile expect uv.

Data

Download and prepare the dataset:

make download_data
make process_data

make process_data builds data/processed/train.csv and data/processed/test.csv. The holdout split is controlled by:

  • data.test_size in config.yaml (default 0.1)
  • training.use_holdout (True/False)
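
As a rough illustration of how these two settings might interact, the sketch below splits a list of rows into train and holdout parts using only the standard library. The function name `split_holdout`, the `seed` parameter, and the plain shuffle are assumptions made for this example; the project's actual `make process_data` step may work differently (for example, with stratification).

```python
import random


def split_holdout(rows, test_size=0.1, use_holdout=True, seed=42):
    """Split rows into (train, test); hypothetical helper, not the project's API."""
    if not use_holdout:
        # No holdout requested: everything is used for training.
        return list(rows), []
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"message {i}" for i in range(100)]
train, test = split_holdout(rows, test_size=0.1)
print(len(train), len(test))
```

Fixing the seed keeps the split stable between runs, which matters when holdout metrics are compared across model versions.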

Training

Train with cross-validation and optional holdout evaluation:

make train

Training behavior is controlled in config.yaml:

  • training.cv_folds — number of CV folds
  • training.metrics — metrics to log (accuracy/precision/recall/f1/roc_auc)
  • training.use_holdout — evaluate on test.csv if True
  • training.run_validation — run CV if True
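
One possible way to validate these keys after loading config.yaml is sketched below. The `TrainingConfig` class, its default values, and the validation rules are hypothetical and not taken from the project; the dict stands in for the parsed `training` section.

```python
from dataclasses import dataclass, field

ALLOWED_METRICS = {"accuracy", "precision", "recall", "f1", "roc_auc"}


@dataclass
class TrainingConfig:
    """Hypothetical container mirroring the `training` keys listed above."""

    cv_folds: int = 5  # assumed default, not taken from the project
    metrics: list = field(default_factory=lambda: ["f1"])  # assumed default
    use_holdout: bool = True
    run_validation: bool = True

    def __post_init__(self):
        # Cross-validation needs at least two folds to be meaningful.
        if self.run_validation and self.cv_folds < 2:
            raise ValueError("cv_folds must be at least 2 when run_validation is True")
        unknown = set(self.metrics) - ALLOWED_METRICS
        if unknown:
            raise ValueError(f"unknown metrics: {sorted(unknown)}")


# A dict shaped like the parsed `training` section of config.yaml.
raw = {
    "cv_folds": 5,
    "metrics": ["accuracy", "f1"],
    "use_holdout": True,
    "run_validation": True,
}
config = TrainingConfig(**raw)
print(config)
```

Failing early on an unknown metric name or an invalid fold count surfaces configuration mistakes before any training time is spent.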

Versioned artifacts

Package version is stored in spam_classifier/_VERSION. Model and log filenames include this version:

  • Model: spam_classifier/models/spam_classifier_vX.Y.Z.pkl
  • Logs: spam_classifier/logs/logs_X.Y.Z.log
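
The naming scheme above can be sketched as a small helper. The function `artifact_paths` and its `package_dir` parameter are assumptions for illustration; only the filename patterns come from this README.

```python
from pathlib import Path


def artifact_paths(version, package_dir="spam_classifier"):
    """Build versioned model and log paths; hypothetical helper."""
    version = version.strip()  # a version file often ends with a newline
    root = Path(package_dir)
    model = root / "models" / f"spam_classifier_v{version}.pkl"
    log = root / "logs" / f"logs_{version}.log"
    return model, log


model_path, log_path = artifact_paths("0.2.0\n")
print(model_path.as_posix())
print(log_path.as_posix())
```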

Inference

Single message:

uv run python -m spam_classifier.predict "Free prize! Call now"

Batch inference from file (one message per line):

uv run python -m spam_classifier.predict data/processed/test.csv -o results/preds.csv

Options:

  • -o/--output — output CSV path (default: project root)
  • --no-message — exclude message text from output CSV
  • --model-path — path to a trained .pkl model (overrides default)
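
For readers new to command-line parsing, the options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the `build_parser` function, the positional argument name `input`, and the `None` defaults are hypothetical, and the real `spam_classifier.predict` module may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="spam_classifier.predict")
    parser.add_argument("input", help="a single message or a path to a file of messages")
    parser.add_argument("-o", "--output", default=None, help="output CSV path")
    parser.add_argument(
        "--no-message",
        action="store_true",
        help="exclude message text from the output CSV",
    )
    parser.add_argument("--model-path", default=None, help="path to a trained .pkl model")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["data/processed/test.csv", "-o", "results/preds.csv", "--no-message"]
)
print(args.input, args.output, args.no_message, args.model_path)
```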

If you installed the package from PyPI, you must train a model or pass --model-path because no weights are bundled with the package by default.

If you have activated the virtual environment, you can omit uv run and call python directly.

Tests

Run the full test suite:

uv run pytest tests

Run only the quality tests (these require a trained model and holdout data):

uv run pytest -m quality

If you have activated the virtual environment, you can omit uv run for pytest as well.

CI

GitHub Actions runs the following checks on pull requests to main and develop:

  • black --check
  • flake8
  • mypy
  • pytest tests

Pre-commit

Install and run pre-commit hooks:

pre-commit install
pre-commit run --all-files

Hooks included: black, flake8, mypy.

Publishing

TestPyPI (manual)

  1. Update spam_classifier/_VERSION
  2. Trigger the Publish workflow manually:
    • Go to Actions → Publish → Run workflow
    • Select testpypi
  3. The package is built and published to TestPyPI

PyPI (release)

  1. Update spam_classifier/_VERSION
  2. Create a GitHub Release (tag should match the version, e.g. v0.1.0)
  3. The Publish workflow will build and upload to PyPI
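
Because the release tag should match the package version, a small pre-release check can catch mismatches. The helper below is a hypothetical sketch; it assumes tags carry a leading `v` and that the version string is read from spam_classifier/_VERSION.

```python
def tag_matches_version(tag, version):
    """Return True if a release tag such as 'v0.1.0' matches the package version."""
    # Strip surrounding whitespace and the leading "v" before comparing.
    return tag.strip().removeprefix("v") == version.strip()


print(tag_matches_version("v0.1.0", "0.1.0\n"))
print(tag_matches_version("v0.1.1", "0.1.0"))
```

`str.removeprefix` requires Python 3.9 or newer, which is within the project's supported range.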

Trusted publishing

This project uses GitHub Actions OIDC (trusted publishing). You must configure the trusted publisher on PyPI and TestPyPI to allow the Publish workflow from this repository to upload packages.
