
personalization


An end-to-end demo machine learning pipeline to provide an artifact for a real-time inference service

Aim

We want to create machine learning training code that, given input data, trains a model and saves it as an artifact.

Solution

Our implementation is the package 'personalization'. We chose Polars to read data: it is roughly 2-3 times faster than Pandas and offers a convenient API for aggregations and feature creation. For the model we chose LightGBM because of its speed, small artifact size (up to 50 MB on 300 million rows of search data), and explainability. LightGBM parameters should be chosen carefully; an example set of parameters is tested in notebooks/train.ipynb.
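
For illustration, here is a minimal sketch of this approach. The file names match the "How to run" section below, but the column names (session_id, venue_id, purchased) and the feature columns are assumptions, not the package's actual schema:

import joblib
import lightgbm as lgb
import polars as pl

# Read the raw data with Polars (hypothetical column names).
sessions = pl.read_csv("sessions.csv")
venues = pl.read_csv("venues.csv")

# Join and sort so each session's rows are contiguous: LambdaRank
# consumes per-query group sizes in row order.
data = sessions.join(venues, on="venue_id").sort("session_id")

feature_cols = ["venue_rating", "delivery_fee"]  # hypothetical features
X = data.select(feature_cols).to_pandas()
y = data["purchased"].to_pandas()                # hypothetical 0/1 relevance
groups = (
    data.group_by("session_id", maintain_order=True)
    .agg(pl.len())["len"]
    .to_list()
)

# Parameters mirror the CLI example in "How to run" below.
ranker = lgb.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    num_leaves=100,
    learning_rate=0.8,
    n_estimators=10,
)
ranker.fit(X, y, group=groups)
joblib.dump(ranker, "trained_model.joblib")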

Offline evaluation

The offline evaluation is done in notebooks/train.ipynb; we see a significant increase in NDCG across venues with our model compared to the baseline.
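
To illustrate what such a comparison looks like, here is a small sketch using scikit-learn's ndcg_score on toy numbers (the notebook's actual evaluation data and baseline are not reproduced here):

import numpy as np
from sklearn.metrics import ndcg_score

# Toy relevance labels and scores for a single session; in the notebook
# these come from held-out search sessions.
true_relevance = np.asarray([[3, 2, 0, 1, 0]])
baseline_scores = np.asarray([[0.5, 0.4, 0.3, 0.2, 0.1]])  # e.g. popularity
model_scores = np.asarray([[2.1, 1.7, 0.2, 0.9, 0.1]])     # LightGBM output

print("baseline NDCG@10:", ndcg_score(true_relevance, baseline_scores, k=10))
print("model NDCG@10:", ndcg_score(true_relevance, model_scores, k=10))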

CI/CD: code style and PyPI

The code is checked with pre-commit hooks, tested, and published to PyPI via GitHub Actions; current test coverage is around 80 percent.

The inference service code can be found at https://github.com/ra312/model-server.

How to run

  1. Obtain sessions.csv and venues.csv and move them to the root folder
  2. Install personalization:
    python -m pip install personalization
  3. Run the following command in a shell to train the pipeline and produce the model artifact:
python3 -m personalization \
    --sessions-bucket-path sessions.csv \
    --venues-bucket-path venues.csv \
    --objective lambdarank \
    --num_leaves 100 \
    --min_sum_hessian_in_leaf 10 \
    --metric ndcg --ndcg_eval_at 10 20 \
    --learning_rate 0.8 \
    --force_row_wise True \
    --num_iterations 10 \
    --trained-model-path trained_model.joblib
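
Once training finishes, the saved artifact can be loaded for scoring. A minimal sketch, assuming the same hypothetical feature columns as above (they must match the features the model was trained on):

import joblib
import polars as pl

# Load the artifact produced by the training command above.
model = joblib.load("trained_model.joblib")

# Score candidate venues for one session (hypothetical feature columns).
candidates = pl.DataFrame(
    {"venue_rating": [4.5, 3.9, 4.8], "delivery_fee": [1.9, 0.0, 2.5]}
)
scores = model.predict(candidates.to_pandas())
print(scores)  # higher score means the venue should rank higher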

TODO

Next steps:

  1. Scalability (e.g. use Flyte)
  2. Data: add support to ingest sessions and venues data from a database (see the sketch below)
  3. Versioning: add MLflow integration
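
For step 2, a sketch of what ingesting the same tables from a database could look like with Polars (the connection URI and table names are placeholders; reading via a URI requires a driver such as connectorx):

import polars as pl

# Placeholder connection string; replace with the real database URI.
uri = "postgresql://user:password@host:5432/analytics"

sessions = pl.read_database_uri("SELECT * FROM sessions", uri)
venues = pl.read_database_uri("SELECT * FROM venues", uri)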



Development

  • Clone this repository
  • Requirements: Poetry (the steps below assume it is installed)
  • Create a virtual environment and install the dependencies
poetry install
  • Activate the virtual environment
poetry shell

Testing

pytest

Pre-commit

Pre-commit hooks run all the auto-formatters (e.g. black, isort), linters (e.g. mypy, flake8), and other quality checks to make sure the changeset is in good shape before a commit/push happens.

You can install the hooks with (runs for each commit):

pre-commit install

Or if you want them to run only for each push:

pre-commit install -t pre-push

Or if you want to run all checks manually for all files:

pre-commit run --all-files

This project was generated using the wolt-python-package-cookiecutter template.
