Skip to main content

Recommender System Utilities

Project description

Recommenders

Documentation Status

What's New (February 4, 2021)

We have a new relase Recommenders 2021.2!

It comes with lots of bug fixes, optimizations and 3 new algorithms, GeoIMC, Standard VAE and Multinomial VAE. We also added tools to facilitate the use of Microsoft News dataset (MIND). In addition, we publised our KDD2020 tutorial where we built a recommender of COVID papers using Microsoft Academic Graph.

We also changed the default branch from master to main. Now when you download the repo, you will get main branch.

See past announcements in NEWS.md.

Introduction

This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:

  • Prepare Data: Preparing and loading data for each recommender algorithm
  • Model: Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (ALS) or eXtreme Deep Factorization Machines (xDeepFM).
  • Evaluate: Evaluating algorithms with offline metrics
  • Model Select and Optimize: Tuning and optimizing hyperparameters for recommender models
  • Operationalize: Operationalizing models in a production environment on Azure

Several utilities are provided in reco_utils to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are included for self-study and customization in your own applications. See the reco_utils documentation.

For a more detailed overview of the repository, please see the documents on the wiki page.

Getting Started

Please see the setup guide for more details on setting up your machine locally, on a data science virtual machine (DSVM) or on Azure Databricks.

To setup on your local machine:

  1. Install Anaconda with Python >= 3.6. Miniconda is a quick way to get started.

  2. Clone the repository

git clone https://github.com/Microsoft/Recommenders
  1. Run the generate conda file script to create a conda environment: (This is for a basic python environment, see SETUP.md for PySpark and GPU environment setup)
cd Recommenders
python tools/generate_conda_file.py
conda env create -f reco_base.yaml  
  1. Activate the conda environment and register it with Jupyter:
conda activate reco_base
python -m ipykernel install --user --name reco_base --display-name "Python (reco)"
  1. Start the Jupyter notebook server
jupyter notebook
  1. Run the SAR Python CPU MovieLens notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".

NOTE - The Alternating Least Squares (ALS) notebooks require a PySpark environment to run. Please follow the steps in the setup guide to run these notebooks in a PySpark environment. For the deep learning algorithms, it is recommended to use a GPU machine.

Algorithms

The table below lists the recommender algorithms currently available in the repository. Notebooks are linked under the Environment column when different implementations are available.

Algorithm Environment Type Description
Alternating Least Squares (ALS) PySpark Collaborative Filtering Matrix factorization algorithm for explicit or implicit feedback in large datasets, optimized by Spark MLLib for scalability and distributed computing capability
Attentive Asynchronous Singular Value Decomposition (A2SVD)* Python CPU / Python GPU Collaborative Filtering Sequential-based algorithm that aims to capture both long and short-term user preferences using attention mechanism
Cornac/Bayesian Personalized Ranking (BPR) Python CPU Collaborative Filtering Matrix factorization algorithm for predicting item ranking with implicit feedback
Convolutional Sequence Embedding Recommendation (Caser) Python CPU / Python GPU Collaborative Filtering Algorithm based on convolutions that aim to capture both user’s general preferences and sequential patterns
Deep Knowledge-Aware Network (DKN)* Python CPU / Python GPU Content-Based Filtering Deep learning algorithm incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations
Extreme Deep Factorization Machine (xDeepFM)* Python CPU / Python GPU Hybrid Deep learning based algorithm for implicit and explicit feedback with user/item features
FastAI Embedding Dot Bias (FAST) Python CPU / Python GPU Collaborative Filtering General purpose algorithm with embeddings and biases for users and items
LightFM/Hybrid Matrix Factorization Python CPU Hybrid Hybrid matrix factorization algorithm for both implicit and explicit feedbacks
LightGBM/Gradient Boosting Tree* Python CPU / PySpark Content-Based Filtering Gradient Boosting Tree algorithm for fast training and low memory usage in content-based problems
LightGCN Python CPU / Python GPU Collaborative Filtering Deep learning algorithm which simplifies the design of GCN for predicting implicit feedback
GeoIMC Python CPU Hybrid Matrix completion algorithm that has into account user and item features using Riemannian conjugate gradients optimization and following a geometric approach.
GRU4Rec Python CPU / Python GPU Collaborative Filtering Sequential-based algorithm that aims to capture both long and short-term user preferences using recurrent neural networks
Multinomial VAE Python CPU / Python GPU Collaborative Filtering Generative Model for predicting user/item interactions
Neural Recommendation with Long- and Short-term User Representations (LSTUR)* Python CPU / Python GPU Content-Based Filtering Neural recommendation algorithm with long- and short-term user interest modeling
Neural Recommendation with Attentive Multi-View Learning (NAML)* Python CPU / Python GPU Content-Based Filtering Neural recommendation algorithm with attentive multi-view learning
Neural Collaborative Filtering (NCF) Python CPU / Python GPU Collaborative Filtering Deep learning algorithm with enhanced performance for implicit feedback
Neural Recommendation with Personalized Attention (NPA)* Python CPU / Python GPU Content-Based Filtering Neural recommendation algorithm with personalized attention network
Neural Recommendation with Multi-Head Self-Attention (NRMS)* Python CPU / Python GPU Content-Based Filtering Neural recommendation algorithm with multi-head self-attention
Next Item Recommendation (NextItNet) Python CPU / Python GPU Collaborative Filtering Algorithm based on dilated convolutions and residual network that aims to capture sequential patterns
Restricted Boltzmann Machines (RBM) Python CPU / Python GPU Collaborative Filtering Neural network based algorithm for learning the underlying probability distribution for explicit or implicit feedback
Riemannian Low-rank Matrix Completion (RLRMC)* Python CPU Collaborative Filtering Matrix factorization algorithm using Riemannian conjugate gradients optimization with small memory consumption.
Simple Algorithm for Recommendation (SAR)* Python CPU Collaborative Filtering Similarity-based algorithm for implicit feedback dataset
Short-term and Long-term preference Integrated Recommender (SLi-Rec)* Python CPU / Python GPU Collaborative Filtering Sequential-based algorithm that aims to capture both long and short-term user preferences using attention mechanism, a time-aware controller and a content-aware controller
Standard VAE Python CPU / Python GPU Collaborative Filtering Generative Model for predicting user/item interactions
Surprise/Singular Value Decomposition (SVD) Python CPU Collaborative Filtering Matrix factorization algorithm for predicting explicit rating feedback in datasets that are not very large
Term Frequency - Inverse Document Frequency (TF-IDF) Python CPU Content-Based Filtering Simple similarity-based algorithm for content-based recommendations with text datasets
Vowpal Wabbit (VW)* Python CPU (online training) Content-Based Filtering Fast online learning algorithms, great for scenarios where user features / context are constantly changing
Wide and Deep Python CPU / Python GPU Hybrid Deep learning algorithm that can memorize feature interactions and generalize user features
xLearn/Factorization Machine (FM) & Field-Aware FM (FFM) Python CPU Content-Based Filtering Quick and memory efficient algorithm to predict labels with user/item features

NOTE: * indicates algorithms invented/contributed by Microsoft.

Independent or incubating algorithms and utilities are candidates for the contrib folder. This will house contributions which may not easily fit into the core repository or need time to refactor or mature the code and add necessary tests.

Algorithm Environment Type Description
SARplus * PySpark Collaborative Filtering Optimized implementation of SAR for Spark

Preliminary Comparison

We provide a benchmark notebook to illustrate how different algorithms could be evaluated and compared. In this notebook, the MovieLens dataset is split into training/test sets at a 75/25 ratio using a stratified split. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature here. For ranking metrics we use k=10 (top 10 recommended items). We run the comparison on a Standard NC6s_v2 Azure DSVM (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode. In this table we show the results on Movielens 100k, running the algorithms for 15 epochs.

Algo MAP nDCG@k Precision@k Recall@k RMSE MAE R2 Explained Variance
ALS 0.004732 0.044239 0.048462 0.017796 0.965038 0.753001 0.255647 0.251648
BPR 0.105365 0.389948 0.349841 0.181807 N/A N/A N/A N/A
FastAI 0.025503 0.147866 0.130329 0.053824 0.943084 0.744337 0.285308 0.287671
LightGCN 0.088526 0.419846 0.379626 0.144336 N/A N/A N/A N/A
NCF 0.107720 0.396118 0.347296 0.180775 N/A N/A N/A N/A
SAR 0.110591 0.382461 0.330753 0.176385 1.253805 1.048484 -0.569363 0.030474
SVD 0.012873 0.095930 0.091198 0.032783 0.938681 0.742690 0.291967 0.291971

Contributing

This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Build Status

These tests are the nightly builds, which compute the smoke and integration tests. main is our principal branch and staging is our development branch. We use pytest for testing python utilities in reco_utils and papermill for the notebooks. For more information about the testing pipelines, please see the test documentation.

DSVM Build Status

The following tests run on a Windows and Linux DSVM daily. These machines run 24/7.

Build Type Branch Status Branch Status
Linux CPU main Build Status staging Build Status
Linux GPU main Build Status staging Build Status
Linux Spark main Build Status staging Build Status

Related projects

Reference papers

  • A. Argyriou, M. González-Fierro, and L. Zhang, "Microsoft Recommenders: Best Practices for Production-Ready Recommendation Systems", WWW 2020: International World Wide Web Conference Taipei, 2020. Available online: https://dl.acm.org/doi/abs/10.1145/3366424.3382692
  • L. Zhang, T. Wu, X. Xie, A. Argyriou, M. González-Fierro and J. Lian, "Building Production-Ready Recommendation System at Scale", ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2019 (KDD 2019), 2019.
  • S. Graham, J.K. Min, T. Wu, "Microsoft recommenders: tools to accelerate developing recommender systems", RecSys '19: Proceedings of the 13th ACM Conference on Recommender Systems, 2019. Available online: https://dl.acm.org/doi/10.1145/3298689.3346967

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pre_reco_utils-2021.2.17.tar.gz (148.0 kB view details)

Uploaded Source

Built Distribution

pre_reco_utils-2021.2.17-py3-none-any.whl (192.4 kB view details)

Uploaded Python 3

File details

Details for the file pre_reco_utils-2021.2.17.tar.gz.

File metadata

  • Download URL: pre_reco_utils-2021.2.17.tar.gz
  • Upload date:
  • Size: 148.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.9.2

File hashes

Hashes for pre_reco_utils-2021.2.17.tar.gz
Algorithm Hash digest
SHA256 1417e22c1b3b98ee7f07cce437b2013cbe4cd22ae8438263d794d880bc15d18c
MD5 502e5c7a4507849dc9209ae6642827a7
BLAKE2b-256 3fa619e49672cf9fbfade0021286a87c4daeecf3f4e6051cb4f2ea4b4b3dc8e5

See more details on using hashes here.

File details

Details for the file pre_reco_utils-2021.2.17-py3-none-any.whl.

File metadata

  • Download URL: pre_reco_utils-2021.2.17-py3-none-any.whl
  • Upload date:
  • Size: 192.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.9.2

File hashes

Hashes for pre_reco_utils-2021.2.17-py3-none-any.whl
Algorithm Hash digest
SHA256 c2405b689aefc07895ca3a073407f1b128b00e935fa4e38238e1a809fa074fed
MD5 b2c7efdf59608c49be6b9376a358bb29
BLAKE2b-256 84a4f1d43d1f74cf33aea6a9333dcfed20965636408e1ce7ba04a2956fe1028a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page