A Rust-powered library for scoring data importance in sequential kNN-based recommender systems
Illoominate - Data Importance for Recommender Systems
Illoominate is a scalable library for computing data importance scores for interaction data in recommender systems. It supports the computation of Data Shapley values (DSV) and leave-one-out (LOO) errors, which offer insight into the relevance and quality of the data behind large-scale recommendation models. The library is tailored to sequential kNN-based algorithms, covering session-based recommendation and next-basket recommendation tasks, and it efficiently handles real-world datasets with millions of interactions.
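For reference, the two scores can be written as follows. This is the standard textbook formulation, not taken from the library; the notation is chosen here for illustration, with $D$ the training set, $i$ a single data point (for example one session), and $U(S)$ the utility, i.e. the evaluation metric reached on the validation data by a model trained on the subset $S$:

```latex
\mathrm{LOO}(i) = U(D) - U(D \setminus \{i\})

\phi_i = \frac{1}{|D|} \sum_{S \subseteq D \setminus \{i\}}
         \binom{|D| - 1}{|S|}^{-1} \bigl[ U(S \cup \{i\}) - U(S) \bigr]
```

LOO measures the effect of removing a point from the full dataset only, whereas the Data Shapley value $\phi_i$ averages the marginal contribution of the point over subsets of every size.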
Illoominate Framework
This repository contains the code for the Illoominate framework, which accompanies a scientific manuscript that is currently under review.
Important Notice
This code is made available exclusively for peer review purposes.
- Any use, reproduction, or modification of this code is prohibited without explicit written permission from the authors.
- Upon acceptance of the manuscript, the repository will be updated to include a standalone, pip-installable version of the library. All code and experiments described in the paper will be released under the open-source Apache License 2.0.
Copyright
© 2024 Barrie Kersbergen. All rights reserved.
Overview
Illoominate is implemented in Rust with a Python frontend. It is optimized to scale to datasets with millions of interactions, as commonly found in real-world recommender systems. The library includes the kNN-based models VMIS-kNN and TIFU-kNN, used for session-based recommendation and next-basket recommendation respectively.
By leveraging the Data Shapley value, Illoominate helps data scientists and engineers:
- Debug potentially corrupted data
- Improve recommendation quality by identifying impactful data points
- Prune training data for sustainable item recommendations
Example Use Case: Computing Data Shapley Values
To compute Data Shapley values for a specific recommendation model (e.g., VMIS-kNN), use the illoominate.data_shapley_values() function. This function calculates the impact of each data point in your training data based on a specified recommendation model and evaluation metric.
import illoominate
import matplotlib.pyplot as plt
import pandas as pd
# Load training and validation datasets
train_df = pd.read_csv("data/largersample/train.csv", sep='\t')
validation_df = pd.read_csv("data/largersample/valid.csv", sep='\t')
# Compute Data Shapley values
shapley_values = illoominate.data_shapley_values(
    train_df=train_df,
    validation_df=validation_df,
    model='vmis',  # Model to be used (e.g., 'vmis' for VMIS-kNN)
    metric='mrr@20',  # Evaluation metric (e.g., Mean Reciprocal Rank at 20)
    params={'m': 100, 'k': 100, 'seed': 42},  # Model-specific parameters
)
# Sessions with a negative score reduce the validation metric on average and may be corrupted
negative = shapley_values[shapley_values.score < 0]
# Attach the scores to the training interactions of those sessions
corrupt_sessions = train_df.merge(negative, on='session_id')
# Visualize the distribution of Data Shapley values
plt.hist(shapley_values['score'], density=False, bins=100)
plt.title('Distribution of Data Shapley Values')
plt.yscale('log')
plt.ylabel('Frequency')
plt.xlabel('Data Shapley Values')
plt.savefig('images/shapley.png', dpi=300)
plt.show()
Sample Output
The Data Shapley values can be visualized, as in the histogram produced by the code, or used for further analysis, for example by inspecting the sessions with negative scores:
print(corrupt_sessions)
     session_id  item_id   timestamp         score
0          5076       64  1585507853 -2.931978e-05
1         13946      119  1584189394 -2.606203e-05
2         13951      173  1585417176 -6.515507e-06
3          3090      199  1584196605 -2.393995e-05
4          5076      205  1585507872 -2.931978e-05
...         ...      ...         ...           ...
956       13951     5860  1585416925 -6.515507e-06
957         447     3786  1584448579 -5.092383e-06
958        7573    14467  1584450303 -7.107826e-07
959        5123       47  1584808576 -4.295939e-07
960       11339     4855  1585391332 -1.579517e-06
961 rows × 4 columns
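One way to act on such scores is to prune the sessions with negative values before retraining. The following is a minimal sketch under the assumption that the result frame has one row per session with the columns `session_id` and `score`, as the sample output suggests. The toy frames and the names `harmful_ids` and `pruned_train_df` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames; column names follow the sample output.
train_df = pd.DataFrame({
    "session_id": [1, 1, 2, 3, 3],
    "item_id": [10, 11, 10, 12, 13],
    "timestamp": [100, 101, 200, 300, 301],
})
shapley_values = pd.DataFrame({
    "session_id": [1, 2, 3],
    "score": [2.0e-05, -1.5e-05, 4.0e-06],
})

# Sessions whose estimated contribution to the validation metric is negative.
harmful_ids = shapley_values.loc[shapley_values["score"] < 0, "session_id"]

# Keep only the interactions of sessions that were not flagged.
pruned_train_df = train_df[~train_df["session_id"].isin(harmful_ids)]

print(len(train_df), "->", len(pruned_train_df), "interactions after pruning")
```

Whether pruning at a threshold of zero actually improves the model depends on the dataset, so it is worth re-evaluating the metric on held-out data after pruning.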
Key Features
- Scalable: Optimized for large datasets with millions of interactions.
- Efficient Computation: Uses the KMC-Shapley algorithm to speed up the estimation of Data Shapley values, making it suitable for real-world recommender systems.
- Customizable: Supports multiple recommendation models, including VMIS-kNN (session-based) and TIFU-kNN (next-basket), and popular metrics such as MRR, NDCG, Recall, and F1.
- Visualization: Easily visualize the distribution of Data Shapley values to analyze data quality and identify potential issues.
- Real-World Application: Focuses on practical use cases, including debugging, data pruning, and improving sustainability in recommendations.
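To make the metric behind a setting such as `mrr@20` concrete, here is a minimal sketch of Mean Reciprocal Rank at a cutoff k, using the standard definition. The function names are hypothetical and the library's own implementation may differ in details, such as how the target item of a session is chosen.

```python
def reciprocal_rank_at_k(recommended, target, k=20):
    """Return 1/rank of `target` within the first k recommendations, else 0.0."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0


def mrr_at_k(recommendation_lists, targets, k=20):
    """Average the reciprocal ranks over all evaluated sessions."""
    scores = [
        reciprocal_rank_at_k(recommended, target, k)
        for recommended, target in zip(recommendation_lists, targets)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Three toy sessions: target found at rank 2, at rank 1, and not at all.
recs = [[5, 3, 9], [7, 2, 4], [1, 8, 6]]
targets = [3, 7, 0]
mrr = mrr_at_k(recs, targets, k=20)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```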
How KMC-Shapley Optimizes DSV Estimation
KMC-Shapley (K-nearest Monte Carlo Shapley) is a scalable variant of Data Shapley value estimation, designed specifically for sequential kNN-based recommender systems. It improves the efficiency of Truncated Monte Carlo Shapley (TMC-Shapley) by exploiting the sparsity and nearest-neighbor structure of the data, which makes Data Shapley values feasible for large-scale recommender systems. KMC-Shapley recomputes the utility only when necessary, i.e., when adding a data point changes the top-k neighbor set, and skips the redundant computations otherwise, which substantially reduces computation time.
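The skipping idea can be illustrated on a toy problem. The sketch below is not the KMC-Shapley implementation from the library. It is a plain Monte Carlo permutation estimator with a one-dimensional kNN utility borrowed from classification, and it leaves out the tolerance-based truncation of TMC-Shapley. All names (`mc_shapley_knn`, `utility`, the toy points) are assumptions made for illustration. The only part it shares with the description above is the shortcut: a point that cannot enter the current top-k set has a marginal contribution of zero, so no utility is recomputed for it.

```python
import random


def utility(neighbors, query_label, k):
    """Toy utility: share of the k nearest neighbors that carry the query's label."""
    return sum(1 for _, label in neighbors if label == query_label) / k


def mc_shapley_knn(points, query, query_label, k=2, permutations=200, seed=42):
    """Monte Carlo Shapley estimate that skips points unable to enter the top-k.

    points: list of (position, label) tuples; distance is |position - query|.
    Returns (values, evaluations), where evaluations counts utility recomputations.
    """
    rng = random.Random(seed)
    n = len(points)
    values = [0.0] * n
    evaluations = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        top_k = []      # (distance, label) pairs of the current nearest neighbors
        current = 0.0   # utility of the empty set
        for idx in order:
            position, label = points[idx]
            dist = abs(position - query)
            # Skip: the top-k set is full and this point is no closer than its
            # farthest member, so the utility and the marginal gain are unchanged.
            if len(top_k) == k and dist >= top_k[-1][0]:
                continue
            top_k = sorted(top_k + [(dist, label)])[:k]
            updated = utility(top_k, query_label, k)
            evaluations += 1
            values[idx] += updated - current
            current = updated
    return [v / permutations for v in values], evaluations


# Five toy training points with distinct distances to the query at 0.0.
points = [(0.1, "a"), (0.4, "b"), (0.9, "a"), (2.0, "b"), (3.5, "a")]
values, evaluations = mc_shapley_knn(points, query=0.0, query_label="a")
print(values)
print(evaluations, "utility evaluations instead of", 200 * len(points))
```

Because every permutation ends with the same top-k set, the estimated values sum to the utility of the full dataset, which is a quick sanity check for this kind of estimator.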
Installation
Requirements:
- Rust >= 1.82
- Python >= 3.10
To get started with Illoominate, follow these steps:
- Clone the repository:
git clone https://github.com/bkersbergen/illoominate.git
cd illoominate
- Install the required dependencies and build the Rust extension for Python:
pip install -r requirements.txt
maturin develop --release
Reproducing the experiments from the paper
The experiments from the paper are implemented in Rust.
Prepare a config file for a dataset, describing the model, model parameters and the evaluation metric.
$ cat config.toml
[model]
name = "vmis"
[hpo]
k = 50
m = 500
[metric]
name="MRR"
length=20
The software expects the config file for the experiment in the same directory as the data files.
DATA_LOCATION=data/convergence_11306 CONFIG_FILENAME=config.toml cargo run --release --bin mc_convergence_experiment
File details
Details for the file illoominate-0.1.0.tar.gz.
File metadata
- Download URL: illoominate-0.1.0.tar.gz
- Upload date:
- Size: 188.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1abf614cfa1505256a0460ad32c2cd6666189537198740c6935dabef1db76caf |
| MD5 | c7fe80429d3916812d8f3e9e01f37a5e |
| BLAKE2b-256 | aa37fead942a6bc77acc2113a5e537f465e174324219d62446a20c1abe6cc33a |
File details
Details for the file illoominate-0.1.0-cp310-none-win_amd64.whl.
File metadata
- Download URL: illoominate-0.1.0-cp310-none-win_amd64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.7.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e208bccf0413a5a6b8cc807096f540cd374b515f9c1ce512c90021520114b17e |
| MD5 | fa8dcdcb1dc9763f3f1f8916b54e0157 |
| BLAKE2b-256 | 060a80871762b2d4ef9e5488abbd5eb520df8ea7953593123d121f4c3e888b44 |