
Incremental collaborative filtering algorithms for recommender systems


CF STEP - Incremental Collaborative Filtering

Incremental learning for recommender systems

CF STEP is an open-source library, written in Python, that enables fast implementation of incremental learning recommender systems. The library is a by-product of the research project CloudDBAppliance.

Install

Run pip install cf_step to install the library in your environment.

How to use

For this example, we will use the popular MovieLens dataset. It contains rating data collected from the MovieLens web site over various periods of time, depending on the size of the set.

First, let us load the data into a pandas DataFrame. We assume that the reader has downloaded the MovieLens 1M dataset and unzipped it in the /tmp folder.

To avoid creating separate user and movie vocabularies, we turn each user and movie ID into a categorical feature and use pandas' convenient cat attribute to get the codes.
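The snippets in this walkthrough assume the following imports. The pandas, NumPy, PyTorch, tqdm, and matplotlib imports are standard; the cf_step module paths shown are assumptions and may differ, so check the library's documentation for the exact locations of SimpleCF, Step, recall_at_k, and moving_avg.

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

from torch.optim import SGD
from torch.utils.data import Dataset, DataLoader

# Assumption: the exact module paths inside cf_step may differ.
from cf_step.networks import SimpleCF
from cf_step.step import Step
from cf_step.metrics import recall_at_k
from cf_step.utils import moving_avg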

# local

# load the data
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('/tmp/ratings.dat', delimiter='::', names=col_names, engine='python')

# transform users and movies to categorical features
ratings_df['user_id'] = ratings_df['user_id'].astype('category')
ratings_df['movie_id'] = ratings_df['movie_id'].astype('category')

# use the codes to avoid creating separate vocabularies
ratings_df['user_code'] = ratings_df['user_id'].cat.codes.astype(int)
ratings_df['movie_code'] = ratings_df['movie_id'].cat.codes.astype(int)

ratings_df.head()
   user_id  movie_id  rating  timestamp  user_code  movie_code
0        1      1193       5  978300760          0        1104
1        1       661       3  978302109          0         639
2        1       914       3  978301968          0         853
3        1      3408       4  978300275          0        3177
4        1      2355       5  978824291          0        2162

Using the codes we can see how many users and movies are in the dataset.

# local
n_users = ratings_df['user_code'].max() + 1
n_movies = ratings_df['movie_code'].max() + 1

print(f'There are {n_users} unique users and {n_movies} unique movies in the movielens dataset.')
There are 6040 unique users and 3706 unique movies in the movielens dataset.

We will sort the data by timestamp so as to simulate streaming events.

# local
data_df = ratings_df.sort_values(by='timestamp')

The Step model supports only positive feedback. Thus, we will treat a rating of 5 as positive feedback and discard the rest: likes are identified with 1 and everything else with 0.

# local
# more than 4 -> 1, less than 5 -> 0
data_df['preference'] = np.where(data_df['rating'] > 4, 1, 0)
# keep only ones and discard the others
data_df_cleaned = data_df.loc[data_df['preference'] == 1]

data_df_cleaned.head()
         user_id  movie_id  rating  timestamp  user_code  movie_code  preference
999873      6040       593       5  956703954       6039         579           1
1000192     6040      2019       5  956703977       6039        1839           1
999920      6040       213       5  956704056       6039         207           1
999967      6040      3111       5  956704056       6039        2895           1
999971      6040      2503       5  956704191       6039        2309           1

Next, let us initialize our model. We create the network, define a simple objective function, and use an SGD optimizer with a learning rate of 0.06; for everything else the Step trainer uses its defaults.

# local
net = SimpleCF(n_users, n_movies, factors=1024, mean=0., std=.1)
objective = lambda pred, targ: targ - pred
optimizer = SGD(net.parameters(), lr=0.06)

model = Step(net, objective, optimizer)

Now, let us take the first 20% of the data to bootstrap the model, and then create the PyTorch Dataset that we will use.

# local
pct = int(data_df_cleaned.shape[0] * .2)
bootstrapping_data = data_df_cleaned[:pct]

Sub-classing the PyTorch Dataset class, we create a dataset from our DataFrame. For each example we extract four elements:

  • The user code
  • The movie code
  • The rating
  • The preference
# local
class MovieLens(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        user = self.df['user_code'].iloc[idx]
        item = self.df['movie_code'].iloc[idx]
        rating = self.df['rating'].iloc[idx] 
        preference = self.df['preference'].iloc[idx] 
        return (user, item, rating, preference)

Create the PyTorch Dataset and DataLoader that we will use for bootstrapping. For this batch-fitting phase we can use a larger batch size; during online training, the batch size should always be 1.

# local
data_set = MovieLens(bootstrapping_data)
data_loader = DataLoader(data_set, batch_size=512, shuffle=False)

Let us now use the batch_fit() method of the Step trainer to bootstrap our model.

# local
model.batch_fit(data_loader)
100%|██████████| 89/89 [00:07<00:00, 11.86it/s]

Then, to simulate streaming, we take the remaining data and create a new dataset and DataLoader with a batch size of 1.

# local
data_df_step = data_df_cleaned.drop(bootstrapping_data.index)
data_df_step = data_df_step.reset_index(drop=True)
data_df_step.head()

# create the DataLoader
stream_data_set = MovieLens(data_df_step)
stream_data_loader = DataLoader(stream_data_set, batch_size=1, shuffle=False)

Simulate the stream...

# local
k = 10 # we keep only the top 10 recommendations
recalls = []
known_users = []

with tqdm(total=len(stream_data_loader)) as pbar:
    for idx, (user, item, rtng, pref) in enumerate(stream_data_loader):
        itr = idx + 1
        if user.item() in known_users:
            predictions = model.predict(user, k)
            recall = recall_at_k(predictions.tolist(), item.tolist(), k)
            recalls.append(recall)
            model.step(user, item, rtng, pref)
        else:
            model.step(user, item, rtng, pref)

        known_users.append(user.item())
        pbar.update(1)
100%|██████████| 181048/181048 [1:07:02<00:00, 45.01it/s]
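For reference, recall_at_k above receives the top-k predicted item list and the (single-element) list of the observed item. A minimal illustrative version, assuming that semantics rather than reproducing the library's exact implementation, could look like this:

def recall_at_k(predictions, targets, k):
    # Share of the relevant items that appear in the top-k predictions;
    # with a single observed item this is 1.0 on a hit and 0.0 otherwise.
    top_k = predictions[:k]
    hits = len(set(top_k) & set(targets))
    return hits / len(targets)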

Last but not least, we visualize the results of the recall@10 metric, using a moving average window of 5k elements.
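The moving_avg helper is presumably provided by the library; a simple NumPy equivalent, shown here only as an assumption about its behavior, is a sliding-window mean:

def moving_avg(values, window):
    # Simple moving average over a sliding window of the given size.
    weights = np.ones(window) / window
    return np.convolve(values, weights, mode='valid')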

# local
avgs = moving_avg(recalls, 5000)

plt.title('Recall@10')
plt.xlabel('Iterations')
plt.ylabel('Metric')
plt.ylim(0., .1)
plt.plot(avgs)
[Figure: recall@10 moving average over the streaming iterations]

Finally, save the model's weights.

# local
model.save(os.path.join('artefacts', 'positive_step.pt'))
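Note that model.save presumably delegates to torch.save (an assumption), which does not create missing directories, so make sure the artefacts folder exists before saving:

os.makedirs('artefacts', exist_ok=True)  # create the target directory if it does not exist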
