RecSys Library
RePlay
RePlay is a library providing tools for all stages of creating a recommendation system, from data preprocessing to model evaluation and comparison.
RePlay can use PySpark to handle big data.
You can:
- Filter and split data
- Train models
- Optimize hyperparameters
- Evaluate predictions with metrics
- Combine predictions from different models
- Create a two-level model
Documentation is available here.
Installation
Installation via the pip package manager is recommended by default:
pip install replay-rec
This installs the core package without the PySpark and PyTorch dependencies.
The experimental submodule is not installed either.
To install the experimental submodule, specify a version with the rc0 suffix.
For example:
pip install replay-rec==XX.YY.ZZrc0
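After installing, it can be useful to confirm which kind of build is present. The sketch below is only an illustration, not part of RePlay: the helper names `is_rc0_version` and `installed_replay_is_rc0` are hypothetical, and it assumes the installed distribution reports a version string of the form XX.YY.ZZrc0 when the experimental build was requested.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def is_rc0_version(version_string: str) -> bool:
    """Return True if the version looks like XX.YY.ZZrc0 (hypothetical helper)."""
    return re.fullmatch(r"\d+\.\d+\.\d+rc0", version_string) is not None


def installed_replay_is_rc0(dist_name: str = "replay-rec"):
    """Check the installed distribution's version string.

    Returns None when the distribution is not installed at all.
    Assumption: the experimental build keeps the rc0 suffix in its metadata.
    """
    try:
        return is_rc0_version(version(dist_name))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_replay_is_rc0())
```

A result of `False` would suggest the core package is installed without the experimental submodule; `None` means the package was not found.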
Extras
In addition to the core package, several extras are also provided, including:
- [spark]: Install PySpark functionality
- [torch]: Install PyTorch and Lightning functionality
- [all]: [spark] [torch]
Example:
# Install core package with PySpark dependency
pip install replay-rec[spark]
# Install package with experimental submodule and PySpark dependency
pip install replay-rec[spark]==XX.YY.ZZrc0
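Because the core package comes without PySpark and PyTorch, code that relies on them may want to check for their presence first. The following is a minimal sketch using only the standard library; the function name `detect_optional_backends` and the extra-to-module mapping are assumptions made for illustration, and the check covers only the top-level `pyspark` and `torch` modules (it does not look for Lightning).

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level module it provides.
OPTIONAL_MODULES = {"spark": "pyspark", "torch": "torch"}


def detect_optional_backends(modules=OPTIONAL_MODULES):
    """Return {extra_name: bool} telling whether each module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, present in detect_optional_backends().items():
        status = "available" if present else "missing"
        print(f"[{extra}] {status}")
```

If an extra is reported as missing, installing the matching extra with pip as shown above should provide it.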
To build RePlay from source, please follow the build instructions.
If you encounter an error during RePlay installation, check the troubleshooting guide.
Quickstart
from rs_datasets import MovieLens
from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType
from replay.data.dataset_utils import DatasetLabelEncoder
from replay.metrics import HitRate, NDCG, Experiment
from replay.models import ItemKNN
from replay.utils import convert2spark
from replay.utils.session_handler import State
from replay.splitters import RatioSplitter
spark = State().session
ml_1m = MovieLens("1m")
K = 10
# data preprocessing
interactions = convert2spark(ml_1m.ratings)
# data splitting
splitter = RatioSplitter(
    test_size=0.3,
    divide_column="user_id",
    query_column="user_id",
    item_column="item_id",
    timestamp_column="timestamp",
    drop_cold_items=True,
    drop_cold_users=True,
)
train, test = splitter.split(interactions)
# dataset creating
feature_schema = FeatureSchema(
    [
        FeatureInfo(
            column="user_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.QUERY_ID,
        ),
        FeatureInfo(
            column="item_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.ITEM_ID,
        ),
        FeatureInfo(
            column="rating",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.RATING,
        ),
        FeatureInfo(
            column="timestamp",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.TIMESTAMP,
        ),
    ]
)
train_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=train,
)
test_dataset = Dataset(
    feature_schema=feature_schema,
    interactions=test,
)
# data encoding
encoder = DatasetLabelEncoder()
train_dataset = encoder.fit_transform(train_dataset)
test_dataset = encoder.transform(test_dataset)
# model training
model = ItemKNN()
model.fit(train_dataset)
# model inference
encoded_recs = model.predict(
    dataset=train_dataset,
    k=K,
    queries=test_dataset.query_ids,
    filter_seen_items=True,
)
recs = encoder.query_and_item_id_encoder.inverse_transform(encoded_recs)
# model evaluation
metrics = Experiment(
    [NDCG(K), HitRate(K)],
    test,
    query_column="user_id",
    item_column="item_id",
    rating_column="rating",
)
metrics.add_result("ItemKNN", recs)
print(metrics.results)
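The quickstart reports NDCG and HitRate at K. To make those numbers easier to interpret, here is a small pure-Python sketch of both metrics on toy data. It is not RePlay's implementation: it assumes binary relevance (an item is either in the user's test set or not), and the function names and toy data are hypothetical. Library results may differ in details such as how users without recommendations are handled.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the first position gets 1/log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_metric(metric, recs_by_user, truth_by_user, k):
    """Average a per-user metric over all users that have ground truth."""
    users = list(truth_by_user)
    total = sum(metric(recs_by_user.get(u, []), truth_by_user[u], k) for u in users)
    return total / len(users)


if __name__ == "__main__":
    # Toy data: ranked recommendations and held-out test items per user.
    recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
    truth = {"u1": {"b"}, "u2": {"x"}}
    print("HitRate@3:", mean_metric(hit_rate_at_k, recs, truth, 3))
    print("NDCG@3:", round(mean_metric(ndcg_at_k, recs, truth, 3), 4))
```

In this toy case one of the two users gets a hit, so HitRate@3 is 0.5, while NDCG@3 is lower because the hit sits in the second position rather than the first.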
Resources
Usage examples
- 01_replay_basics.ipynb - get started with RePlay.
- 02_models_comparison.ipynb - reproducible models comparison on MovieLens-1M dataset.
- 03_features_preprocessing_and_lightFM.ipynb - LightFM example with PySpark for feature preprocessing.
- 04_splitters.ipynb - an example of using RePlay data splitters.
- 05_feature_generators.ipynb - Feature generation with RePlay.
Videos and papers
- Video guides:
- Research papers:
  - Yan-Martin Tamm, Rinchin Damdinov, Alexey Vasilev. Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?
Contributing to RePlay
We welcome community contributions. For details please check our contributing guidelines.
Hashes for replay_rec-0.13.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | fd1abaa654844033c19937997e7a70652a516f8873fa5af07e7eb5d36b60299f
MD5 | 87d354caf4cf4e3140f65716ae49c177
BLAKE2b-256 | 6f4d488eaf02c18143576a093042d9c523144fd4935f35ef23a0bdf79a6576ab