Spam/ham classifier with an MLOps-style training pipeline.

Spam classifier

This project demonstrates how to package and train a simple spam/ham classifier with MLOps practices. It is designed for students learning how to structure ML code into modules, build training pipelines, configure via YAML, and add tests and CI.

Project structure

  • spam_classifier/ — package code (pipeline, training, inference)
  • data/ — raw and processed datasets
  • config.yaml — pipeline and training configuration
  • tests/ — pytest suite (unit + quality)
  • .github/workflows/ci.yml — GitHub Actions CI

Setup (uv)

uv venv --seed --python 3.13
uv pip install -e ".[dev]"

The minimum supported Python version is 3.11. You can use the standard venv module instead of uv, but the project CI and Makefile expect uv.

Data

Download and prepare the dataset:

make download_data
make process_data

make process_data builds data/processed/train.csv and data/processed/test.csv. The holdout split is controlled by:

  • data.test_size in config.yaml (default 0.1)
  • training.use_holdout (True/False)
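As an illustration of how these two settings could interact, here is a minimal sketch of a holdout split. The function name holdout_split, the seed parameter, and the choice to send every row to training when use_holdout is false are assumptions made for this example, not the project's actual implementation.

```python
import random


def holdout_split(rows, test_size=0.1, use_holdout=True, seed=42):
    """Split rows into (train, test).

    Hypothetical helper: when use_holdout is False, all rows go to the
    training set and the test set is empty (an assumption for this sketch).
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    rows = list(rows)
    if not use_holdout:
        return rows, []
    shuffled = rows[:]
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data standing in for (message, label) rows.
messages = [(f"message {i}", "spam" if i % 5 == 0 else "ham") for i in range(100)]
train, test = holdout_split(messages, test_size=0.1)
print(len(train), len(test))
```

A real pipeline would likely also stratify by label so that the spam/ham ratio is similar in both files; this sketch does not.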

Training

Train with cross-validation and optional holdout evaluation:

make train
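For readers new to cross-validation, the sketch below shows one way the fold indices could be generated. It is a plain-Python illustration with a hypothetical function name (k_fold_indices); the project itself may rely on a library splitter, possibly with shuffling or stratification.

```python
def k_fold_indices(n_samples, cv_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, near-equal folds."""
    if cv_folds < 2 or cv_folds > n_samples:
        raise ValueError("cv_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, cv_folds)
    start = 0
    for fold in range(cv_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each sample lands in exactly one validation fold, so every message is used for validation once and for training in the remaining folds.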

Training behavior is controlled in config.yaml:

  • training.cv_folds — number of CV folds
  • training.metrics — metrics to log (accuracy/precision/recall/f1/roc_auc)
  • training.use_holdout — evaluate on test.csv if True
  • training.run_validation — run CV if True
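The keys above could be validated after loading config.yaml along the following lines. This is a sketch only: the TrainingConfig class, its default values, and the validation rules are assumptions for illustration, and the dictionary stands in for the result of parsing the YAML file.

```python
from dataclasses import dataclass, field

# Metric names taken from the list in this README.
ALLOWED_METRICS = {"accuracy", "precision", "recall", "f1", "roc_auc"}


@dataclass
class TrainingConfig:
    """Hypothetical container for the training section of config.yaml."""

    cv_folds: int = 5  # assumed default, not taken from the project
    metrics: list = field(default_factory=lambda: ["accuracy", "f1"])
    use_holdout: bool = True
    run_validation: bool = True

    def __post_init__(self):
        unknown = set(self.metrics) - ALLOWED_METRICS
        if unknown:
            raise ValueError(f"unknown metrics: {sorted(unknown)}")
        if self.run_validation and self.cv_folds < 2:
            raise ValueError("cv_folds must be at least 2 when run_validation is true")


# Stand-in for the dictionary a YAML parser would return.
raw = {
    "training": {
        "cv_folds": 5,
        "metrics": ["accuracy", "f1", "roc_auc"],
        "use_holdout": True,
        "run_validation": True,
    }
}
config = TrainingConfig(**raw["training"])
print(config)
```

Failing early on an unknown metric name or an invalid fold count is cheaper than discovering the problem after a long training run.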

Versioned artifacts

The package version is stored in spam_classifier/_VERSION. Model and log filenames include this version:

  • Model: spam_classifier/models/spam_classifier_vX.Y.Z.pkl
  • Logs: spam_classifier/logs/logs_X.Y.Z.log
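The naming scheme above can be expressed as a small helper. The function name artifact_paths is hypothetical, and the sketch assumes that _VERSION is a plain-text file containing only the version string.

```python
from pathlib import Path


def artifact_paths(package_dir, version):
    """Return (model_path, log_path) for a given package version.

    Hypothetical helper that mirrors the filename patterns listed above.
    """
    package_dir = Path(package_dir)
    version = version.strip()  # tolerate a trailing newline in the version file
    model_path = package_dir / "models" / f"spam_classifier_v{version}.pkl"
    log_path = package_dir / "logs" / f"logs_{version}.log"
    return model_path, log_path


model_path, log_path = artifact_paths("spam_classifier", "0.1.0\n")
print(model_path.as_posix())
print(log_path.as_posix())
```

In practice the version argument would come from reading the file, for example Path("spam_classifier/_VERSION").read_text(), under the same plain-text assumption.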

Inference

Single message:

uv run python -m spam_classifier.predict "Free prize! Call now"

Batch inference from file (one message per line):

uv run python -m spam_classifier.predict data/processed/test.csv -o results/preds.csv

Options:

  • -o/--output — output CSV path (default: project root)
  • --no-message — exclude message text from output CSV
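A command-line interface with these options could be declared with argparse as sketched below. This is an assumption about the shape of the CLI, not the project's actual code: the positional argument name input and the help texts are invented for the example.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented above (sketch)."""
    parser = argparse.ArgumentParser(prog="spam_classifier.predict")
    parser.add_argument(
        "input",
        help="a single message, or a path to a file with one message per line",
    )
    parser.add_argument("-o", "--output", default=None, help="output CSV path")
    parser.add_argument(
        "--no-message",
        action="store_true",
        help="exclude message text from the output CSV",
    )
    return parser


args = build_parser().parse_args(
    ["data/processed/test.csv", "-o", "results/preds.csv", "--no-message"]
)
print(args.input, args.output, args.no_message)
```

Note that argparse converts the --no-message flag into the attribute no_message on the parsed namespace.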

If you have activated the virtual environment, you can omit uv run and call python directly.

Tests

Run the full test suite:

uv run pytest tests

Quality tests (these require a trained model and holdout data):

uv run pytest -m quality

If you have activated the virtual environment, you can omit uv run for pytest as well.

CI

GitHub Actions runs the following checks on pull requests to main and develop:

  • black --check
  • flake8
  • mypy
  • pytest tests

Pre-commit

Install and run pre-commit hooks:

pre-commit install
pre-commit run --all-files

Hooks included: black, flake8, mypy.

Publishing

TestPyPI (manual)

  1. Update spam_classifier/_VERSION
  2. Trigger the Publish workflow manually:
    • Go to Actions → Publish → Run workflow
    • Select testpypi
  3. The package is built and published to TestPyPI

PyPI (release)

  1. Update spam_classifier/_VERSION
  2. Create a GitHub Release (tag should match the version, e.g. v0.1.0)
  3. The Publish workflow will build and upload to PyPI
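Because the release tag should match the package version, a quick check before creating the release can catch mismatches. The helper below is a hypothetical sketch; it assumes the tag is the version prefixed with "v", as in the v0.1.0 example above.

```python
def tag_matches_version(tag, version):
    """Return True when a release tag equals 'v' + version (sketch)."""
    return tag == f"v{version.strip()}"


print(tag_matches_version("v0.1.0", "0.1.0\n"))
print(tag_matches_version("v0.2.0", "0.1.0"))
```

The version string would typically be read from spam_classifier/_VERSION, assuming that file holds only the version.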

Trusted publishing

This project uses GitHub Actions OIDC (trusted publishing). You must configure the trusted publisher on PyPI and TestPyPI to allow the Publish workflow from this repository to upload packages.

Download files

Source Distribution

spam_classifier-0.1.0.tar.gz (13.9 kB)

Uploaded Source

Built Distribution

spam_classifier-0.1.0-py3-none-any.whl (12.1 kB)

Uploaded Python 3

File details

Details for the file spam_classifier-0.1.0.tar.gz.

File metadata

  • Download URL: spam_classifier-0.1.0.tar.gz
  • Upload date:
  • Size: 13.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spam_classifier-0.1.0.tar.gz:

  • SHA256: 81385354fc68bd0db5004d1051541cf281615aa5f3fa76a7a78eb0d9fcc802b7
  • MD5: e7ef326e6b01513d8facc399dc1ab368
  • BLAKE2b-256: 695ff00f703b1d315db4e8a3401e0c0c23379a8326b4ef6404c9cdec63f42481

Provenance

The following attestation bundles were made for spam_classifier-0.1.0.tar.gz:

Publisher: publish.yml on Emilien-mipt/spam_classifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file spam_classifier-0.1.0-py3-none-any.whl.

File hashes

Hashes for spam_classifier-0.1.0-py3-none-any.whl:

  • SHA256: 765ed13c98e18bebacd706f77f8dc6a27090ce9e065c0d561c8db6d455c2d375
  • MD5: 9eda2928904d0774d0fce49d1777f111
  • BLAKE2b-256: 58063a30e5e3ee07fdada431c1144c55ce74be3df4d05a3ae0377afc7c5b04f5

Provenance

The following attestation bundles were made for spam_classifier-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Emilien-mipt/spam_classifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
