A package using topological data analysis to achieve robust product recommendations.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

top-choice

INTRODUCTION

Ever misclick on an ad, only to receive similar ads again and again? Or maybe your friend asks you to order him some junk food on your online grocery account, where you normally only order healthy foods, and now you are served bad recommendations? Product recommendation systems can be oversensitive to outlier orders, which we can think of as noise in the data the system uses to find user preferences. This can lead to user frustration.

My goal is to develop a product recommendation system that is more robust to these outliers. My approach is based on topological data analysis.

The author of the repo is Brian Willett (bmwillett1 at gmail).

BACKGROUND

A product recommendation system is an algorithm which determines for each user a set of products or services which they would like to purchase or interact with. Typically these rely on a large dataset of users and products, for example, a history of past interactions between users and products. A common approach to this problem is to use a machine learning algorithm trained on this dataset. In particular, modern approaches often use neural networks and deep learning techniques to achieve impressive accuracy.

A potential pitfall of these techniques occurs when a user has some behavior that lies outside their normal preferences. The recommendation system may then be led to suggest products based on this behavior that are not desired by the user. These outliers can be interpreted as "noise" in the user-product dataset, and it is desirable for the recommendation algorithm to be somewhat robust against this noise.

Many techniques have been developed for improving the performance of machine-learning algorithms on noisy data. Among them, topological data analysis (TDA) uses mathematical analysis of patterns in data to extract features that are robust to small variations such as noise. We are particularly interested in the Mapper-Classifier algorithm (MCA) of arXiv:1910.08103, which uses the concept of a Mapper graph from TDA to achieve improved robustness in image classification. We apply this algorithm to determine which products a user is likely to be interested in.
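To make the Mapper graph idea more concrete, here is a minimal sketch in plain Python. It is not the MCA of arXiv:1910.08103 and not the code in lib/mapper_class.py; it only illustrates the usual Mapper construction (cover the range of a filter function with overlapping intervals, cluster the points in each interval, and link clusters that share points). The function names, the single-linkage clustering, and the parameters n_intervals, overlap and eps are all assumptions made for illustration.

```python
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def single_linkage(points, members, eps):
    """Split the indices in `members` into groups connected by hops of length <= eps."""
    clusters, unseen = [], set(members)
    while unseen:
        stack = [unseen.pop()]
        cluster = set(stack)
        while stack:
            j = stack.pop()
            near = {k for k in unseen if dist(points[j], points[k]) <= eps}
            unseen -= near
            cluster |= near
            stack.extend(near)
        clusters.append(cluster)
    return clusters


def mapper_graph(points, filter_fn, n_intervals=4, overlap=0.25, eps=1.0):
    """Toy Mapper graph: nodes are clusters of point indices, edges join
    clusters that have at least one point in common."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals or 1.0  # avoid zero-width intervals
    nodes = []
    for i in range(n_intervals):
        # Widen each interval on both sides so neighbouring intervals overlap.
        start = lo + i * length - overlap * length
        end = lo + (i + 1) * length + overlap * length
        members = [j for j, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage(points, members, eps))
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges


# Ten points on a line, filtered by their x coordinate.
points = [(float(x), 0.0) for x in range(10)]
nodes, edges = mapper_graph(points, filter_fn=lambda p: p[0], n_intervals=3)
```

In this toy example the three overlapping intervals each yield one cluster, and the shared points link them into a chain. Small perturbations of individual points tend to leave such a graph unchanged, which is the intuition behind using it for robustness.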

In this repo we implement some models with and without the MCA to evaluate the performance of the recommendation systems in the presence of noise. Concretely, we will focus on the Instacart dataset. Our task will be to predict what products a user reorders given their previous orders.

MODELS

Main models

We develop models in a modular fashion by first creating the following components:

User latent model: produces a latent vector depending only on the user data. In principle, this and the other latent models can be customized and reused in the full models below. Here we use an autoencoder based on collaborative filtering.
Product latent model: produces a latent vector depending only on the product data. Our main latent model applies word2vec to the sequence of orders to group products that are typically bought together, along with a tf-idf analysis of the words appearing in product names.
Feature model: directly extracts features from the order data, such as total product orders across all users, average position in cart, and so on.
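As an illustration of the kind of features the feature model extracts, the sketch below computes total orders and average cart position per product. The record layout (user_id, order_id, product_id, cart_position), the sample rows, and the function name product_features are hypothetical; the actual implementation lives in models/feature_models.py and may differ.

```python
from collections import defaultdict

# Hypothetical order records: (user_id, order_id, product_id, cart_position)
orders = [
    (1, 100, "banana", 1),
    (1, 100, "milk", 2),
    (1, 101, "banana", 2),
    (2, 200, "banana", 1),
    (2, 200, "chips", 3),
]


def product_features(rows):
    """Return per-product order counts and mean cart position across all users."""
    total = defaultdict(int)
    position_sum = defaultdict(int)
    for _user, _order, product, position in rows:
        total[product] += 1
        position_sum[product] += position
    return {
        product: {
            "total_orders": total[product],
            "avg_cart_position": position_sum[product] / total[product],
        }
        for product in total
    }


features = product_features(orders)
```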

From these models, we assemble our two main models:

Model A: We concatenate the outputs of the three models above and feed the result into a dense neural network. In our case, the network has two layers and a single output, which predicts for each user and product pair whether the user will purchase that product in their next order.
Model B: Similar to Model A, except that after concatenating the outputs we also feed the result into the Mapper-Classifier algorithm to obtain additional topological features. The result is then fed into a dense neural network, as in Model A.
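The data flow of Model A can be sketched as a forward pass in plain Python. This is only a schematic under stated assumptions: the vector sizes, the hidden width of 4, the ReLU and sigmoid activations, and the random untrained weights are all made up for illustration, and the real model in models/main_models.py is trained rather than randomly initialized.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_reorder(user_vec, product_vec, feature_vec, layers):
    """Concatenate the three component outputs and apply a two-layer network."""
    x = user_vec + product_vec + feature_vec  # list concatenation
    (w1, b1), (w2, b2) = layers
    hidden = dense(x, w1, b1, relu)
    return dense(hidden, w2, b2, sigmoid)[0]  # single output in (0, 1)


rng = random.Random(0)
user_vec, product_vec, feature_vec = [0.2, -0.1], [0.5, 0.3, 0.0], [12.0, 1.5]
n_in = len(user_vec) + len(product_vec) + len(feature_vec)
layers = [init_layer(rng, n_in, 4), init_layer(rng, 4, 1)]
score = predict_reorder(user_vec, product_vec, feature_vec, layers)
```

In this picture, Model B would differ only in the input: the topological features obtained from the Mapper-Classifier step would enter the network alongside the concatenated vector.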

Baseline models:

As baseline models for comparison, we consider simple models based on the following algorithms:

Random model (predicts reorders randomly)
Logistic regression
Random Forest
LightGBM (gradient boosting)

TESTS

We evaluate the models on basic classification metrics: accuracy, precision, recall, and F1 score. In addition, we use the normalized discounted cumulative gain (NDCG), which measures how high in the predicted rankings the actual products appear.
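For reference, these metrics can be computed as in the sketch below. It assumes binary relevance (a product was either reordered or not) and the common log2 discount for NDCG; the function names and sample products are illustrative and are not taken from the repo.

```python
import math


def precision_recall_f1(predicted, actual):
    """Set-based precision, recall and F1 for one user's predicted reorders."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def ndcg(ranked, actual):
    """NDCG with binary relevance: a hit at rank i (0-based) earns 1 / log2(i + 2)."""
    actual = set(actual)
    dcg = sum(1.0 / math.log2(i + 2) for i, p in enumerate(ranked) if p in actual)
    ideal_hits = min(len(actual), len(ranked))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["banana", "chips", "milk", "eggs"]   # model ranking, best first
actual = {"banana", "milk"}                    # products actually reordered

precision, recall, f1 = precision_recall_f1(ranked[:2], actual)
ranking_quality = ndcg(ranked, actual)
```

Here the top-2 prediction contains one of the two reordered products, so precision, recall and F1 are all 0.5, while NDCG is below 1 because "milk" appears at rank 3 instead of rank 2.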

In addition to computing these metrics on the original dataset, we test robustness, that is, performance in the presence of spurious products. We perform adversarial tests by randomly changing some products in a user's prior order history and measuring how much model performance degrades.
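A perturbation of this kind could look like the following sketch, which replaces a chosen fraction of a user's prior products with different products drawn at random from the catalog. The function name perturb_history, the fraction parameter, and the sample data are assumptions for illustration; the actual test procedure is in tests/runtests.py and may differ.

```python
import random


def perturb_history(history, catalog, fraction, seed=0):
    """Return a copy of `history` with a fraction of its products swapped
    for different products chosen at random from `catalog`."""
    rng = random.Random(seed)
    perturbed = list(history)
    n_swap = round(fraction * len(perturbed))
    for i in rng.sample(range(len(perturbed)), n_swap):
        # Exclude the current product so every swap is a real change.
        choices = [p for p in catalog if p != perturbed[i]]
        perturbed[i] = rng.choice(choices)
    return perturbed


catalog = ["banana", "milk", "eggs", "chips", "soda", "kale"]
history = ["banana", "milk", "kale", "eggs", "banana",
           "milk", "kale", "banana", "eggs", "milk"]
noisy_history = perturb_history(history, catalog, fraction=0.3)
```

Evaluating the same trained model on the original and the perturbed histories, for increasing values of the fraction, gives a curve of how quickly each metric degrades.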

RESULTS

The plot below shows the performance of the two main models and the gradient boosting baseline model on the robustness tests.

Figure: Robustness tests

Model B, which uses the topological encoding, performs slightly better than both the non-topological Model A and the gradient boosting baseline (GB model) as items are replaced in the robustness test. These results are suggestive, but further study is needed to determine conclusively how effective this approach is.

DIRECTORY STRUCTURE

See above for more details on the models.

.
├── README.md               - this readme file
├── requirements.txt        - package requirements
├── lib
│   ├── data_class.py       - definition of main dataset class
│   ├── mapper_class.py     - mapper classifier implementation
│   ├── process_data.py     - helper function to process data
│   └── tools.py            - general helper functions
├── models
│   ├── base_model.py       - base model class which others inherit
│   ├── baseline_models.py  - baseline models for comparison
│   ├── feature_models.py   - feature models
│   ├── latent_models.py    - user and product latent models
│   └── main_models.py      - main models
└── tests
    ├── mnist_tests.ipynb   - some tests of mapper classifier on MNIST dataset
    ├── run_unit_tests.py   - runs all unit tests
    └── runtests.py         - runs main accuracy and robustness tests

SETUP

To run the tests in this repo:

git clone https://github.com/bmwillett/topological-recommendations
cd topological-recommendations
pip install -r requirements.txt
python tests/runtests.py

(PACKAGE COMING SOON!)

To install the top-choice package, run the following in the command line (the distribution is published under the name top-choice-bmwillett):

pip install top-choice-bmwillett

Release history

0.0.3 (this version) - Jul 8, 2020
0.0.2 - Jul 8, 2020

Download files

Source Distribution

top-choice-bmwillett-0.0.3.tar.gz (25.9 kB)

Uploaded Jul 8, 2020 Source

Built Distribution

top_choice_bmwillett-0.0.3-py3-none-any.whl (28.5 kB)

Uploaded Jul 8, 2020 Python 3

File details

Details for the file top-choice-bmwillett-0.0.3.tar.gz.

File metadata

Download URL: top-choice-bmwillett-0.0.3.tar.gz
Upload date: Jul 8, 2020
Size: 25.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.1.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for top-choice-bmwillett-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`a4075d2cbe56082ab9a902fd404178aee301dcedd961a9c17748d89ac0318655`
MD5	`b89e4a6e6b95fbc4769bd2de9ada3384`
BLAKE2b-256	`29f418c1283b3b3bf16350b44523a79bbf9c8be0789132eba024ef894c4254ca`

File details

Details for the file top_choice_bmwillett-0.0.3-py3-none-any.whl.

File metadata

Download URL: top_choice_bmwillett-0.0.3-py3-none-any.whl
Upload date: Jul 8, 2020
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.1.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for top_choice_bmwillett-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ae8786fb336df362e3ad5fb5e97ce90eebd7bc65baf5318b6c67187a46a8cd7d`
MD5	`cf741f1a694f6b2a9722fae2cc98e5d8`
BLAKE2b-256	`16b457810176b413e1c0f239fda770528a393dfde5dcdc0a0c5e3947024f2aca`
