Recommender Systems Dataset from FINN.no containing the presented items and whether and what the user clicked on.
FINN.no Slate Dataset for Recommender Systems
Data and helper functions for the FINN.no slate dataset, containing both viewed items and clicks from the FINN.no second-hand marketplace.
We release the FINN.no slate dataset to improve recommender systems research. The dataset includes both search and recommendation interactions between users and the platform over a 30-day period. It logs both exposures and clicks, including interactions where the user did not click on any of the items in the slate. To our knowledge, no other large-scale dataset of this kind exists, and we hope this contribution helps researchers construct better models and improve offline evaluation metrics.
For each user u and interaction step t we recorded all items in the visible slate (up to the scroll length) and the user's click response. The dataset consists of 37.4 million interactions, |U| ≈ 2.3 million users and |I| ≈ 1.3 million items that belong to one of G = 290 item groups. For a detailed description of the data, please see the paper.
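To make the logged quantities concrete, here is a minimal sketch of one way to picture a single interaction: the slate shown to a user and the click response, which may be empty. The `Interaction` class, its field names, and the toy ids are illustrative assumptions and do not reflect the actual file layout of the dataset. The last line only does arithmetic on the totals reported above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interaction:
    """One logged step: what was shown and what (if anything) was clicked."""
    user_id: int
    slate: List[int]        # item ids visible to the user, in display order
    click: Optional[int]    # clicked item id, or None for a no-click step


def no_click_share(interactions: List[Interaction]) -> float:
    """Fraction of interactions where the user clicked nothing in the slate."""
    if not interactions:
        return 0.0
    return sum(i.click is None for i in interactions) / len(interactions)


# Toy log with made-up ids (hypothetical, not taken from the dataset).
toy_log = [
    Interaction(user_id=1, slate=[10, 11, 12], click=11),
    Interaction(user_id=1, slate=[13, 14], click=None),
    Interaction(user_id=2, slate=[10, 15, 16, 17], click=17),
    Interaction(user_id=2, slate=[11, 12], click=None),
]

toy_no_click = no_click_share(toy_log)

# Back-of-the-envelope figure from the reported totals:
# 37.4 million interactions spread over roughly 2.3 million users.
avg_interactions_per_user = 37.4e6 / 2.3e6
```

With the reported totals, this works out to roughly 16 interactions per user on average.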
FINN.no is the leading marketplace in the Norwegian classifieds market and provides users with a platform to buy and sell general merchandise, cars, real estate, as well as house rentals and job offerings. For questions, email simen.eide@finn.no or file an issue.
Install
pip install recsys_slates_dataset
How to use
To download the generic numpy data files:
from recsys_slates_dataset import datahelper
datahelper.download_data_files(data_dir="data")
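After the download, it can be useful to check what actually landed in the data directory. The sketch below is not part of recsys_slates_dataset: `list_data_files` is a hypothetical helper that uses only the standard library and makes no assumption about the names of the downloaded files.

```python
from pathlib import Path
from typing import Dict


def list_data_files(data_dir: str) -> Dict[str, int]:
    """Map each regular file under data_dir to its size in bytes."""
    root = Path(data_dir)
    if not root.is_dir():
        # Nothing downloaded yet (or a wrong path): report an empty listing.
        return {}
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


if __name__ == "__main__":
    # "data" matches the data_dir used in the download call above.
    for name, size in list_data_files("data").items():
        print(f"{size:>14,d} bytes  {name}")
```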
Download and prepare data into ready-to-use pytorch dataloaders:
from recsys_slates_dataset import dataset_torch
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="data")
Organization
The repository is organized as follows:
- The dataset is placed in data/ and stored using git-lfs. We also provide an automatic download function in the pip package (preferred usage).
- The code open sourced from the article "Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling" is found in code_eide_et_al21/. However, we are in the process of making the data more generally available, which makes the code incompatible with the current (newer) version of the data. Please use the v1.0 release of the repository for a compatible version of the code and dataset.
Quickstart dataset
We provide a quickstart Jupyter notebook that runs on Google Colab (quickstart-finn-recsys-slate-data.ipynb) and includes all the steps above. It gives a quick introduction to using the dataset.
Citations
This repository accompanies the paper "Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling" by Simen Eide, David S. Leslie and Arnoldo Frigessi. The article is under review, and the pre-print can be obtained here.
If you use the code, data or paper, please consider citing the paper.
@article{eide2021dynamic,
title={Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling},
author={Simen Eide and David S. Leslie and Arnoldo Frigessi},
year={2021},
eprint={2104.15046},
archivePrefix={arXiv},
primaryClass={stat.ML}
}
Todo
This repository is currently a work in progress, and we will provide descriptions and tutorials. Suggestions and contributions that make the material more accessible are welcome. These are the features of the repository that we are working on:
- Release the dataset as numpy objects instead of pytorch arrays. This will help non-pytorch users utilize the data more easily.
- Maintain a pytorch dataset for easy usage.
- Create a pip package for easier installation and usage. The package should download the dataset using a function.
- Make the quickstart guide compatible with the pip package and numpy format.
- git-lfs is currently broken because some lines in .gitattributes that conflicted with nbdev were removed. The dataset is still usable through the built-in download functions, as they use a different source. However, we should fix this. An issue has been posted on nbdev.
- Add easy-to-use functions that compute relevant metrics such as hit rate, log-likelihood, etc.
- Distribute the data on other platforms such as Kaggle.
- Add a short description of the data directly in the readme.md.
Because the project is at an early stage, the changes above may not be backward compatible. However, they should be completed within the next couple of months.
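The metric helpers mentioned in the list are not yet part of the package. As a rough illustration of what a hit-rate function could look like, here is a minimal sketch. The name `hit_rate_at_k`, its signature, and the choice to skip no-click interactions are assumptions made for this example, not the authors' implementation.

```python
from typing import Optional, Sequence


def hit_rate_at_k(
    ranked_items: Sequence[Sequence[int]],
    clicked_items: Sequence[Optional[int]],
    k: int,
) -> float:
    """Share of click interactions whose clicked item is in the top-k ranking.

    No-click interactions (clicked item is None) are skipped. That is a
    choice made for this sketch, not something the dataset prescribes.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_items) != len(clicked_items):
        raise ValueError("inputs must have the same length")
    hits = 0
    total = 0
    for ranking, clicked in zip(ranked_items, clicked_items):
        if clicked is None:
            continue
        total += 1
        if clicked in ranking[:k]:
            hits += 1
    return hits / total if total else 0.0


# Made-up rankings and clicks, one entry per interaction.
toy_rankings = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
toy_clicks = [1, 6, None]
```

On this toy input, the first click falls within the top 2 and the second does not, so `hit_rate_at_k(toy_rankings, toy_clicks, k=2)` evaluates to 0.5.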