Rsdiv: Diversity improvement framework for recommender systems

rsdiv is a Python package that measures and improves the diversity of recommender system results.

Some of its features include:

various metrics that quantify the diversity of recommender system results.
various implementations of diversification algorithms and models.
various implementations of core recommender algorithms.
benchmarks for comparison and further analysis.
hyperparameter optimization based on Optuna.

Installation

You can install the pre-built package with:

$ pip install rsdiv

Or build from source, assuming the source tree is in a local rsdiv directory:

$ cd rsdiv && pip install .

Basic Usage

Prepare a benchmark dataset

Load a benchmark, for example the MovieLens 1M dataset. It is a tabular benchmark dataset that contains about 1 million ratings from about 6,000 users on about 4,000 movies.

>>> import rsdiv as rs
>>> loader = rs.MovieLens1MDownLoader()

Get the user-item interactions (ratings):

>>> ratings = loader.read_ratings()

	userId	movieId	rating	timestamp
0	1	1193	5	2000-12-31 22:12:40
1	1	661	3	2000-12-31 22:35:09
...	...	...	...	...
1000207	6040	1096	4	2000-04-26 02:20:48
1000208	6040	1097	4	2000-04-26 02:19:29

Get the users' information:

>>> users = loader.read_users()

	userId	gender	age	occupation	zipcode
0	1	F	1	10	48067
1	2	M	56	16	70072
...	...	...	...	...	...
6038	6039	F	45	0	01060
6039	6040	M	25	6	11106

Get the items' information:

>>> items = loader.read_items()

	movieId	title	genres	release_date
0	1	Toy Story	['Animation', "Children's", 'Comedy']	1995
1	2	Jumanji	['Adventure', "Children's", 'Fantasy']	1995
...	...	...	...	...
3881	3951	Two Family House	['Drama']	2000
3882	3952	Contender, The	['Drama', 'Thriller']	2000

Evaluate the results in various aspects

Load the evaluator to analyse the results, for example with the Gini coefficient:

>>> metrics = rs.DiversityMetrics()
>>> metrics.gini_coefficient(ratings['movieId'])
0.6335616301416965

Nested input (List[List[str]]-like) is also accepted. This is especially useful for evaluating diversity at the topic level:

>>> metrics.gini_coefficient(items['genres'])
0.5158655846858095
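To make the metric concrete, here is a minimal sketch of a Gini coefficient over category frequencies that accepts both flat and one-level nested input. It is not rsdiv's implementation: the helper names `flatten` and `gini_coefficient` are illustrative, only categories that actually occur are counted, and the values need not match the figures above.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def flatten(values: Iterable) -> List[Hashable]:
    """Flatten one level of nesting, e.g. a column of genre lists."""
    flat: List[Hashable] = []
    for value in values:
        if isinstance(value, (list, tuple, set)):
            flat.extend(value)
        else:
            flat.append(value)
    return flat


def gini_coefficient(values: Iterable) -> float:
    """Gini coefficient of category frequencies (0 means perfectly even)."""
    counts = sorted(Counter(flatten(values)).values())  # ascending frequencies
    n = len(counts)
    if n == 0:
        raise ValueError("need at least one observation")
    total = sum(counts)
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * count for rank, count in enumerate(counts, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini_coefficient([1193, 661, 1193, 1193]))            # flat item IDs
print(gini_coefficient([["Drama"], ["Drama", "Comedy"]]))   # nested genre lists
```

A value near 0 means interactions are spread evenly over categories, while a value near 1 means a few categories dominate.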

Shannon Index and Effective Catalog Size are also available with the same usage.
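For reference, the two metrics can be sketched from their usual textbook definitions. This is an assumption about what is being computed, not a copy of rsdiv's code: the Shannon index is taken as H = -sum(p_i ln p_i), and the effective catalog size as 2 * sum(p_i * i) - 1 with the shares p_i sorted in decreasing order. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, List


def proportions(values: Iterable[Hashable]) -> List[float]:
    """Share of each category, sorted from most to least frequent."""
    counts = Counter(values)
    total = sum(counts.values())
    return sorted((count / total for count in counts.values()), reverse=True)


def shannon_index(values: Iterable[Hashable]) -> float:
    """H = -sum(p_i * ln(p_i)); larger means more evenly spread."""
    return -sum(p * math.log(p) for p in proportions(values))


def effective_catalog_size(values: Iterable[Hashable]) -> float:
    """ECS = 2 * sum(p_i * i) - 1, with i the popularity rank starting at 1."""
    ranked = enumerate(proportions(values), start=1)
    return 2 * sum(p * rank for rank, p in ranked) - 1


sample = ["a", "b", "c", "d"]
print(shannon_index(sample), effective_catalog_size(sample))
```

With N equally popular categories the effective catalog size equals N, and it shrinks toward 1 as a single category takes over.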

Show the distribution of a given data source

The imbalance of the data distribution can be shown both as a bar plot and as a sorted dataframe:

>>> distribution = metrics.get_distribution(items['genres'])

>>> distribution

	category	percentage
0	Drama	0.250156
1	Comedy	0.187266
2	Action	0.0784956
...	...	...
16	Western	0.0106117
17	Film-Noir	0.00686642
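A table like the one above can be approximated with the standard library alone. The sketch below assumes that each genre label is counted once per item it appears on and that shares are taken over all labels; `category_distribution` is a hypothetical helper, not part of rsdiv.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def category_distribution(rows: Iterable[Sequence[str]]) -> List[Tuple[str, float]]:
    """Return (category, share) pairs sorted from most to least common."""
    counts = Counter(label for row in rows for label in row)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Toy data standing in for the 'genres' column.
genres = [["Drama"], ["Drama", "Comedy"], ["Comedy", "Romance"], ["Drama"]]
for category, share in category_distribution(genres):
    print(f"{category}\t{share:.3f}")
```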

Draw a Lorenz curve graph for insights

A Lorenz curve is a graphical representation of a distribution in which the cumulative proportion of species is plotted against the cumulative proportion of individuals. rsdiv supports this plot to help practitioners with their analysis.

>>> metrics.get_lorenz_curve(ratings['movieId'])
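The points behind such a curve can be computed as below. This is a sketch under stated assumptions rather than rsdiv's plotting code: categories (here, items) are sorted from least to most frequent, the x value is the cumulative share of categories, and the y value is the cumulative share of observations. The name `lorenz_points` is illustrative.

```python
from collections import Counter
from itertools import accumulate
from typing import Hashable, Iterable, List, Tuple


def lorenz_points(values: Iterable[Hashable]) -> List[Tuple[float, float]]:
    """Points (share of categories, share of observations), least frequent first."""
    counts = sorted(Counter(values).values())
    n, total = len(counts), sum(counts)
    points = [(0.0, 0.0)]
    for rank, running in enumerate(accumulate(counts), start=1):
        points.append((rank / n, running / total))
    return points


for x, y in lorenz_points(["a", "b", "c", "c"]):
    print(f"{x:.3f}\t{y:.3f}")
```

The further the curve sags below the diagonal from (0, 0) to (1, 1), the more unequal the distribution.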

Evaluate the imbalance by location

rsdiv provides encoders, including a geographic encoding function, to make results easier for practitioners to interpret. To start, plot random values:

>>> import numpy as np
>>> geo = rs.GeoEncoder()
>>> df = geo.read_source()
>>> rng = np.random.RandomState(42)
>>> df['random_values'] = rng.rand(len(df))
>>> geo.graw_geo_graph(df, 'random_values')

Train a recommender

rsdiv provides various implementations of core recommender algorithms. To start with, a wrapper for LightFM is available:

>>> rc = rs.FMRecommender(ratings, items, 0.3).fit()

Here 30% of the interactions are held out as the test set. The precision at top 5 can be calculated with:

>>> rc.precision_at_top_k(5)
0.15639074
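As a reminder of what the number means, precision at k is the fraction of a user's top k recommendations that appear in the held-out interactions, averaged over users. The sketch below assumes that users without held-out items are skipped. The function names and data layout are illustrative and are not rsdiv's or LightFM's API.

```python
from typing import Dict, Hashable, Sequence, Set


def precision_at_k(recommended: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the first k recommended items that are in the held-out set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(
    recs: Dict[Hashable, Sequence[Hashable]],
    held_out: Dict[Hashable, Set[Hashable]],
    k: int,
) -> float:
    """Average precision@k over users with at least one held-out item."""
    users = [user for user in recs if held_out.get(user)]
    if not users:
        raise ValueError("no user has held-out items")
    return sum(precision_at_k(recs[u], held_out[u], k) for u in users) / len(users)


print(precision_at_k([1, 2, 3, 4, 5], {2, 5, 9}, k=5))
```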

The top 100 unseen recommended items for an arbitrary user, say userId 1024, are given by:

>>> rc.predict_top_n_item(1024, 100)

	itemId	scores	title	genres	release_date
0	916	1.77356	Roman Holiday	['Comedy', 'Romance']	1953
1	1296	1.74696	Room with a View	['Drama', 'Romance']	1986
...	...	...	...	...	...
98	3079	0.371897	Mansfield Park	['Drama']	1999
99	2570	0.369199	Walk on the Moon	['Drama', 'Romance']	1999
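Conceptually, this step scores every item for the user, drops the items the user has already interacted with, and keeps the n highest scores. A minimal sketch, assuming the scores are already available as a plain dictionary (`top_n_unseen` is a hypothetical helper, not rsdiv's method):

```python
import heapq
from typing import Dict, Hashable, List, Set, Tuple


def top_n_unseen(
    scores: Dict[Hashable, float], seen: Set[Hashable], n: int
) -> List[Tuple[Hashable, float]]:
    """Return the n highest-scoring (item, score) pairs the user has not seen."""
    candidates = ((item, score) for item, score in scores.items() if item not in seen)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(top_n_unseen(scores, seen={"a"}, n=2))
```

`heapq.nlargest` avoids sorting the whole catalogue when n is much smaller than the number of items.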

Improve the diversity

rsdiv supports not only categorical labels but also item embeddings. For example, the pretrained 300-dimensional fastText embedding based on wiki_en can be imported as:

>>> emb = rs.FastTextEmbedder()
>>> emb.embedding_list(['Comedy', 'Romance'])
array([-0.02061814,  0.06264187,  0.00729847, -0.04322025,  0.04619966, ...])

TODO

implement the Maximal Marginal Relevance (MMR) diversification algorithm
implement the Bounded Greedy Selection (BGS) diversification algorithm
implement the Determinantal Point Process (DPP) diversification algorithm
implement the Modified Gram-Schmidt (MGS) diversification algorithm
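To illustrate the first item on this list, here is a generic sketch of greedy MMR re-ranking. It is not rsdiv's implementation, which the list above marks as still to be done. It assumes relevance scores and cosine similarities on comparable scales; the names `mmr_rerank`, `cosine`, and `lam` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def mmr_rerank(
    relevance: Dict[Hashable, float],
    vectors: Dict[Hashable, Sequence[float]],
    k: int,
    lam: float = 0.5,
) -> List[Hashable]:
    """Greedily pick k items, trading relevance against similarity to earlier picks."""
    # Iterate candidates in descending relevance so ties resolve deterministically.
    remaining = sorted(relevance, key=relevance.get, reverse=True)
    selected: List[Hashable] = []
    while remaining and len(selected) < k:

        def mmr_score(item: Hashable) -> float:
            # Redundancy is the highest similarity to any item already selected.
            redundancy = max(
                (cosine(vectors[item], vectors[chosen]) for chosen in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 1.0, "b": 0.9, "c": 0.5}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
print(mmr_rerank(relevance, vectors, k=2, lam=0.5))
```

With `lam=1.0` the result is the plain relevance ranking. Lowering `lam` penalizes items that resemble those already chosen, so in the toy data the dissimilar item "c" overtakes the near-duplicate "b".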

Hyperparameter optimization

TODO

make the hyperparameter search compatible with Optuna.

For developers

Contributions welcome! Please contact us.

During development, make sure you have pre-commit installed in your local environment:

$ pip install pre-commit
$ pre-commit install

Release history

0.2.7.1 (Jan 10, 2023)
0.2.6 (Sep 19, 2022)
0.2.5 (Sep 10, 2022)
0.2.4 (Aug 16, 2022)
0.2.3 (Aug 7, 2022)
0.2.2 (Jul 20, 2022)
0.2.0 (Jul 18, 2022)
0.1.10 (Jul 18, 2022, this version)
0.1.9 (Jul 17, 2022)
0.1.8 (Jul 16, 2022)
0.1.7 (Jul 13, 2022)
0.1.6 (Jul 13, 2022)
0.1.5 (Jul 11, 2022)
0.1.4 (Jul 11, 2022)
0.1.3 (Jul 10, 2022)
0.1.2 (Jul 7, 2022)
0.1.1 (Jul 5, 2022)
0.1.0 (Jul 5, 2022)

Download files

Source Distribution

rsdiv-0.1.10.tar.gz (944.2 kB)

Uploaded Jul 18, 2022 Source

Built Distribution

rsdiv-0.1.10-py3-none-any.whl (947.6 kB)

Uploaded Jul 18, 2022 Python 3

Hashes for rsdiv-0.1.10.tar.gz
Algorithm	Hash digest
SHA256	`8defc820b643913b2ff06d71e9603e987abdbb6af993c19b363b41ea6a9ae914`
MD5	`a709e3417733f1bcba5e5674b2fdee18`
BLAKE2b-256	`59df077edbbb312afa849905c7e7f5b9df32921d2c8214e78a9e37f1137b907b`

Hashes for rsdiv-0.1.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bbd30291190884b651b6883eed2041092e0d7d09f3be328c1ffc4037cacd1d14`
MD5	`381987661712563de7b8750ed51503bf`
BLAKE2b-256	`edc0ea8ce1d16f1394019fc36a61d47c8a765e7d0faba3ac5cd102a2e40ca755`