
A Rust-powered library for scoring data importance in sequential kNN-based recommender systems


Illoominate - Data Importance for Recommender Systems

Illoominate is a scalable library designed to compute data importance scores for interaction data in recommender systems. It supports the computation of Data Shapley values (DSV) and leave-one-out (LOO) scores, offering insights into the relevance and quality of data in large-scale sequential kNN-based recommendation models. This library is tailored for sequential kNN-based algorithms including session-based recommendation and next-basket recommendation tasks, and it efficiently handles real-world datasets with millions of interactions.
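
For reference, both scores can be written in terms of a utility function. The formulation below follows the standard definitions from the Data Shapley literature. The symbols are notation assumed here, not taken from the library: D is the training set of n data points (for example, sessions) and U(S) is the evaluation metric obtained when the model uses only the subset S.

```latex
\phi_i^{\mathrm{LOO}} = U(D) - U(D \setminus \{i\})

\phi_i^{\mathrm{DSV}} = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left[ U(S \cup \{i\}) - U(S) \right]
```

LOO measures the effect of removing a single data point from the full training set. The Data Shapley value averages a data point's marginal contribution over all subsets, which is why it is usually estimated by sampling rather than computed exactly.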

Illoominate Framework

This repository contains the code for the Illoominate framework, which accompanies a scientific manuscript that is currently under review.

Important Notice

This code is made available exclusively for peer review purposes.

  • Upon acceptance of the manuscript, this repository including all code and experiment results described in the paper will be released under the open-source Apache License 2.0.

Copyright

© 2024 Barrie Kersbergen. All rights reserved.

Overview

Illoominate is implemented in Rust with a Python frontend. It is optimized to scale to datasets containing millions of interactions, as commonly found in real-world recommender systems. The library includes the kNN-based models VMIS-kNN and TIFU-kNN, used for session-based recommendation and next-basket recommendation respectively.

By leveraging the Data Shapley value, Illoominate helps data scientists and engineers:

  • Debug potentially corrupted data
  • Improve recommendation quality by identifying impactful data points
  • Prune training data for sustainable item recommendations

Installation

  • Python >= 3.10

pip install illoominate

Example Use Case: Computing Data Shapley Values

To compute Data Shapley values for a specific recommendation model (e.g., VMIS-kNN), use the illoominate.data_shapley_values() function. This function calculates the impact of each data point in your training data based on a specified recommendation model and evaluation metric.

import illoominate
import matplotlib.pyplot as plt
import pandas as pd

# Load training and validation datasets
train_df = pd.read_csv("data/nowplaying1m/train.csv", sep='\t')
validation_df = pd.read_csv("data/nowplaying1m/valid.csv", sep='\t')

# Compute Data Shapley values
shapley_values = illoominate.data_shapley_values(
    train_df=train_df,
    validation_df=validation_df,
    model='vmis',  # Model to be used (e.g., 'vmis' for VMIS-kNN)
    metric='mrr@20',  # Evaluation metric (e.g., Mean Reciprocal Rank at 20)
    params={'m':100, 'k':100, 'seed': 42},  # Model-specific parameters
)

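# Sessions with a negative Data Shapley value lower the validation metric
negative = shapley_values[shapley_values.score < 0]
# Join the scores back onto the interactions to inspect the suspicious sessions
corrupt_sessions = train_df.merge(negative, on='session_id')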

# Visualize the distribution of Data Shapley values
plt.hist(shapley_values['score'], density=False, bins=100)
plt.title('Distribution of Data Shapley Values')
plt.yscale('log')
plt.ylabel('Frequency')
plt.xlabel('Data Shapley Values')
plt.savefig('images/shapley.png', dpi=300)
plt.show()

Sample Output

The distribution of Data Shapley values can be visualized or used for further analysis.

[Figure: Distribution of Data Shapley Values]

print(corrupt_sessions)

    session_id	item_id	timestamp	score
0	5076	64	1585507853	-2.931978e-05
1	13946	119	1584189394	-2.606203e-05
2	13951	173	1585417176	-6.515507e-06
3	3090	199	1584196605	-2.393995e-05
4	5076	205	1585507872	-2.931978e-05
...	...	...	...	...
956	13951	5860	1585416925	-6.515507e-06
957	447	3786	1584448579	-5.092383e-06
958	7573	14467	1584450303	-7.107826e-07
959	5123	47	1584808576	-4.295939e-07
960	11339	4855	1585391332	-1.579517e-06
961 rows × 4 columns
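
As a follow-up to the output above, the scores can also be used to prune the training data. The snippet below is a minimal sketch on small made-up DataFrames. It assumes only the column names visible above (session_id, item_id, timestamp, score) and that the scores frame holds one row per session. The toy values are illustrative and not taken from any dataset.

```python
import pandas as pd

# Toy stand-ins for train_df and shapley_values (made-up values, for illustration only)
train_df = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 13],
    "timestamp": [100, 101, 200, 201, 300],
})
shapley_values = pd.DataFrame({
    "session_id": [1, 2, 3],
    "score": [2.5e-05, -1.0e-05, 0.0],
})

# Sessions whose estimated contribution to the validation metric is negative
harmful_ids = shapley_values.loc[shapley_values["score"] < 0, "session_id"]

# Keep only interactions from sessions that are not flagged as harmful
pruned_df = train_df[~train_df["session_id"].isin(harmful_ids)]

print(f"kept {len(pruned_df)} of {len(train_df)} interactions")
```

Whether a strict threshold of zero is the right cut-off depends on the dataset. Retraining on the pruned data and re-checking the metric on held-out data is a sensible sanity check.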

Key Features

  • Scalable: Optimized for large datasets with millions of interactions.
  • Efficient Computation: Uses the KMC-Shapley algorithm to speed up the estimation of Data Shapley values, making it suitable for real-world sequential kNN-based recommendation systems.
  • Customizable: Supports multiple recommendation models, including VMIS-kNN (session-based) and TIFU-kNN (next-basket), and popular metrics such as MRR, NDCG, Recall, and F1.
  • Visualization: Easily visualize the distribution of Data Shapley values to analyze data quality and identify potential issues.
  • Real-World Application: Focuses on practical use cases, including debugging, data pruning, and improving sustainability in recommendations.
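
To illustrate what a metric such as 'mrr@20' from the example measures, here is a minimal, library-independent sketch of Mean Reciprocal Rank at a cutoff k. The function name and the input layout are hypothetical and not part of Illoominate's API.

```python
def mrr_at_k(recommendations, next_items, k=20):
    """Mean Reciprocal Rank at cutoff k.

    recommendations: one ranked list of item ids per evaluated session
    next_items: the item that actually followed, one per session
    """
    if not next_items:
        return 0.0
    total = 0.0
    for ranked, target in zip(recommendations, next_items):
        top_k = ranked[:k]
        if target in top_k:
            # Reciprocal of the 1-based position of the correct item
            total += 1.0 / (top_k.index(target) + 1)
        # A target outside the top k contributes 0
    return total / len(next_items)


recs = [[5, 3, 9], [7, 1, 2], [4, 8, 6]]
truth = [3, 7, 0]
print(mrr_at_k(recs, truth, k=20))  # (1/2 + 1/1 + 0) / 3 = 0.5
```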

How KMC-Shapley Optimizes DSV Estimation

KMC-Shapley (K-nearest Monte Carlo Shapley) is a scalable variant of Data Shapley value estimation, designed specifically for sequential kNN-based recommendation systems. It improves on the efficiency of Truncated Monte Carlo Shapley (TMC-Shapley) by exploiting the sparsity and nearest-neighbor characteristics of the data, which makes it feasible to apply Data Shapley values in large-scale recommender systems. KMC-Shapley computes the utility change only when necessary (i.e., when adding a neighbor changes the top-k set) and skips the remaining redundant computations, significantly reducing computation time.
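
The skipping idea can be illustrated with a toy example. The sketch below is not Illoominate's implementation. It assumes a single validation query, a made-up utility (the mean label of the k most similar training points), and distinct similarity values. The function names are hypothetical, and the truncation step of TMC-Shapley is left out. It samples random permutations and evaluates the utility only when a newly added point actually enters the top-k set.

```python
import random


def utility(top_k):
    """Toy utility: mean label of the current top-k neighbours (0.0 if empty)."""
    if not top_k:
        return 0.0
    return sum(label for _, label in top_k) / len(top_k)


def mc_shapley_knn(points, k, num_permutations, seed=42):
    """Permutation-sampling Shapley estimate with a kNN skip.

    points: list of (similarity, label) pairs for one validation query,
            with distinct similarities (an assumption of this toy setup).
    Returns (estimates, number_of_utility_evaluations).
    """
    rng = random.Random(seed)
    n = len(points)
    totals = [0.0] * n
    evaluations = 0
    order = list(range(n))

    for _ in range(num_permutations):
        rng.shuffle(order)
        top_k = []      # current k nearest neighbours, as (similarity, label)
        current = 0.0   # utility of the empty set
        for i in order:
            sim, label = points[i]
            if len(top_k) == k and sim < min(top_k)[0]:
                # Point i does not enter the top-k set: the utility cannot
                # change, so its marginal contribution is 0 and we skip it.
                continue
            top_k.append((sim, label))
            top_k.sort(reverse=True)
            del top_k[k:]  # keep only the k most similar points
            new = utility(top_k)
            evaluations += 1
            totals[i] += new - current
            current = new

    return [t / num_permutations for t in totals], evaluations


points = [(0.9, 1.0), (0.8, 0.0), (0.5, 1.0), (0.3, 0.0), (0.1, 1.0)]
estimates, evaluations = mc_shapley_knn(points, k=2, num_permutations=200)
print(estimates)
print(f"{evaluations} utility evaluations instead of {200 * len(points)}")
```

Because the marginal contributions within each permutation telescope, the estimates sum to the utility of the full dataset, which gives a quick correctness check for a sketch like this.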

Development Installation

To get started with developing Illoominate or conducting the experiments from the paper, follow these steps:

Requirements:

  • Rust >= 1.82
  • Python >= 3.10
  1. Clone the repository:
git clone https://github.com/bkersbergen/illoominate.git
cd illoominate
  2. Build the Python wheel:
pip install -r requirements.txt
maturin develop --release

Conduct experiments from paper

The experiments from the paper are implemented in Rust.

Prepare a config file for a dataset, describing the model, model parameters and the evaluation metric.

$ cat config.toml
[model]
name = "vmis"

[hpo]
k = 50
m = 500

[metric]
name="MRR"
length=20

The software expects the config file for the experiment in the same directory as the data files.

DATA_LOCATION=data/tafeng/processed CONFIG_FILENAME=config.toml cargo run --release --bin removal_impact
