
# Itemset embeddings

This is yet another variation of the well-known word2vec method, proposed by Mikolov et al., applied to unordered sequences, commonly referred to as itemsets. The contribution of itembed is twofold:

1. Modifying the base algorithm to handle unordered sequences, which has an impact on the definition of context windows;
2. Using the two embedding sets introduced in word2vec for supervised learning.

A similar philosophy is described by Wu et al. in StarSpace and by Barkan and Koenigstein in item2vec. itembed uses Numba to achieve high performance.

## Installation

Install from PyPI:

```
pip install itembed
```

Or install the latest version from the git repository:

```
pip install git+https://gitlab.com/jojolebarjos/itembed.git
```


## Getting started

Itemsets must be provided as so-called packed arrays, i.e. a pair of integer arrays describing indices and offsets. The index array is defined as the concatenation of all N itemsets. The offset array contains the N+1 boundaries.

```python
import numpy as np

indices = np.array([
    0, 1, 4, 7,
    0, 1, 6,
    2, 3, 5, 6, 7,
], dtype=np.int32)

offsets = np.array([
    0, 4, 7, 12,
], dtype=np.int32)
```

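In other words, itemset `i` is the slice `indices[offsets[i]:offsets[i + 1]]`. As a quick illustration (plain NumPy, not itembed-specific):

```python
# Each itemset i is the slice between two consecutive offsets
for i in range(len(offsets) - 1):
    print(i, indices[offsets[i]:offsets[i + 1]])
```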

This is similar to compressed sparse matrices:

```python
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
], dtype=np.int32)

sparse = csr_matrix(dense)

assert (indices == sparse.indices).all()
assert (offsets == sparse.indptr).all()
```

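Conversely, if itemsets are already available as lists of integers, the packed representation can be built by hand; a minimal sketch using plain NumPy:

```python
raw_itemsets = [[0, 1, 4, 7], [0, 1, 6], [2, 3, 5, 6, 7]]

# Concatenate all itemsets into a single index array
indices = np.concatenate(raw_itemsets).astype(np.int32)

# Offsets are cumulative itemset lengths, with a leading zero
lengths = [len(itemset) for itemset in raw_itemsets]
offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int32)
```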

Training methods do not handle other data types. Also note that:

- indices start at 0;
- item order within an itemset does not matter;
- an itemset can contain duplicated items;
- the order of itemsets does not matter;
- there are no weights associated with items or itemsets.

However, a small helper is provided for simple cases:

```python
from itembed import pack_itemsets

itemsets = [
    ["apple", "sugar", "flour"],
    ["pear", "sugar", "flour", "butter"],
    ["apple", "pear", "sugar", "butter", "cinnamon"],
    ["salt", "flour", "oil"],
    # ...
]

labels, indices, offsets = pack_itemsets(itemsets, min_count=2, min_length=2)
num_label = len(labels)
```

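The returned `labels` map item indices back to the original values, so a packed itemset can be decoded again; for instance (keeping in mind that items filtered out by `min_count` are absent):

```python
# Decode the first packed itemset back to its original labels
first_itemset = [labels[i] for i in indices[offsets[0]:offsets[1]]]
print(first_itemset)
```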

The next step is to define at least one task. For now, let us stick to the unsupervised case, where co-occurrence is used as the knowledge source. This is similar to the continuous bag-of-words and continuous skip-gram tasks defined in word2vec.

First, two embedding sets must be allocated. Both capture the same information, one being the complement of the other. This aspect of word2vec is not well documented, but empirical results have shown that using two distinct sets works better than reusing the same set twice.

```python
from itembed import initialize_syn

num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
```


Second, define a task object that holds all the descriptors:

```python
from itembed import UnsupervisedTask

task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
```

Third, the `do_batch` method must be invoked multiple times, until convergence. Another helper is provided to handle the training loop. Note that, due to a different sampling strategy, a larger number of iterations is needed than in word2vec:

```python
from itembed import train

train(task, num_epoch=100)
```

The full code is therefore as follows:

```python
import numpy as np

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
)

itemsets = [
    ["apple", "sugar", "flour"],
    ["pear", "sugar", "flour", "butter"],
    ["apple", "pear", "sugar", "butter", "cinnamon"],
    ["salt", "flour", "oil"],
    # ...
]

# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(itemsets, min_count=2, min_length=2)
num_label = len(labels)

# Initialize embedding sets from uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)

# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)

# Do training
# Note: due to a different sampling strategy, more epochs than word2vec are needed
train(task, num_epoch=100)

# Both embedding sets are equivalent, just choose one of them
syn = syn0
```


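The chosen embedding set can then be used like any other dense vectors. As an illustration, here is a cosine-similarity nearest-neighbor lookup in plain NumPy (not part of itembed's API; "sugar" is assumed to have survived the `min_count` filter):

```python
# Map labels back to item indices
label_index = {label: i for i, label in enumerate(labels)}

# Normalize embeddings, so that dot products become cosine similarities
normalized = syn / np.linalg.norm(syn, axis=1, keepdims=True)

# Rank items by similarity to a query item
query = label_index["sugar"]
similarities = normalized @ normalized[query]
for i in np.argsort(-similarities)[:5]:
    print(labels[i], similarities[i])
```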
More examples can be found in ./example/. See the documentation for more detailed information.

## Performance improvement

As suggested in Numba's documentation, Intel's short vector math library (SVML) can be used to improve performance:

```
conda install -c numba icc_rt
```
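To check whether Numba actually picked up SVML, inspect its system report:

```
numba -s
```

The report contains an SVML section stating whether the library was found and is in use.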


## Citation

If you use this software in your work, please consider citing it, for instance using the following BibTeX entry:

```bibtex
@software{itembed,
    author = {Johan Berdat},
    title = {itembed},
    url = {https://gitlab.com/jojolebarjos/itembed},
    version = {0.5.0},
    date = {2020-06-24},
}
```


## References

1. Efficient Estimation of Word Representations in Vector Space, 2013, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, https://arxiv.org/abs/1301.3781
2. StarSpace: Embed All The Things!, 2017, Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston, https://arxiv.org/abs/1709.03856
3. Item2Vec: Neural Item Embedding for Collaborative Filtering, 2016, Oren Barkan, Noam Koenigstein, https://arxiv.org/abs/1603.04259
4. Numba: a LLVM-based Python JIT compiler, 2015, Siu Kwan Lam, Antoine Pitrou, Stanley Seibert, https://doi.org/10.1145/2833157.2833162

## Changelog

- 0.5.0 - 2020-06-24
  - Improve documentation and examples
  - Bug fixes
- 0.4.2 - 2020-06-10
  - Clean documentation and examples
  - Bug fixes
- 0.4.1 - 2020-05-13
  - Clean and rename, to avoid confusion
- 0.4.0 - 2020-05-04
  - Refactor to make training task explicit
- 0.3.0 - 2020-03-26
  - Complete refactor to increase performance and reusability
- 0.2.1 - 2020-03-24
  - Allow keyboard interruption
  - Fix label count argument
  - Fix learning rate issue
  - Add optimization flags to Numba JIT
- 0.2.0 - 2019-11-08
  - Clean and refactor
  - Allow training from plain arrays
- 0.1.0 - 2019-09-13
  - Initial version
