word2vec for itemsets
This is yet another variation of the well-known word2vec method, proposed by Mikolov et al., applied to unordered sequences, commonly referred to as itemsets. The contribution of itembed is twofold:
- Modifying the base algorithm to handle unordered sequences, which has an impact on the definition of context windows;
- Using the two embedding sets introduced in word2vec for supervised learning.
Install from PyPI:

```sh
pip install itembed
```

Or install from source, to get the latest version:

```sh
pip install git+https://gitlab.com/jojolebarjos/itembed.git
```
Itemsets must be provided as so-called packed arrays, i.e. a pair of integer arrays describing indices and offsets. The index array is defined as the concatenation of all N itemsets. The offset array contains the N+1 boundaries.
```python
import numpy as np

indices = np.array([
    0, 1, 4, 7,
    0, 1, 6,
    2, 3, 5, 6, 7,
], dtype=np.int32)

offsets = np.array([0, 4, 7, 12], dtype=np.int32)
```
This is similar to compressed sparse matrices:
```python
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
], dtype=np.int32)
sparse = csr_matrix(dense)

assert (indices == sparse.indices).all()
assert (offsets == sparse.indptr).all()
```
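For illustration, the packed representation can also be built from plain Python lists with NumPy alone. This sketch mirrors what the helper shown further below does, minus the label mapping and count filtering:

```python
import numpy as np

# Three itemsets, already mapped to integer indices
itemsets = [[0, 1, 4, 7], [0, 1, 6], [2, 3, 5, 6, 7]]

# The index array is the concatenation of all itemsets
indices = np.concatenate(itemsets).astype(np.int32)

# The offset array holds the N+1 boundaries, so that itemset i
# is indices[offsets[i]:offsets[i+1]]
lengths = [len(itemset) for itemset in itemsets]
offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int32)
```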
Training methods do not handle other data types. Also note that:
- indices start at 0;
- item order in an itemset is not important;
- an itemset can contain duplicated items;
- the order of itemsets is not important;
- no weights are associated with items or itemsets.
However, a small helper is provided for simple cases:
```python
from itembed import pack_itemsets

itemsets = [
    ['apple', 'sugar', 'flour'],
    ['pear', 'sugar', 'flour', 'butter'],
    ['apple', 'pear', 'sugar', 'butter', 'cinnamon'],
    ['salt', 'flour', 'oil'],
    # ...
]
labels, indices, offsets = pack_itemsets(itemsets, min_count=2)
num_label = len(labels)
```
The next step is to define at least one task. For now, let us stick to the unsupervised case, where co-occurrence is used as the knowledge source. This is similar to the continuous bag-of-words and continuous skip-gram tasks defined in word2vec.
First, two embedding sets must be allocated. Both capture the same information, each being the complement of the other. This aspect of word2vec is not well documented, but empirical results have shown that using two distinct sets works better than reusing the same set twice.
```python
from itembed import initialize_syn

num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
```
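The two sets play the roles of word2vec's input and output embeddings. After training, either can be used on its own; averaging them is another common post-processing choice. A small sketch, not an itembed API call, using random stand-ins for trained arrays:

```python
import numpy as np

# Stand-ins for trained embedding sets of shape (num_label, num_dimension)
rng = np.random.default_rng(0)
syn0 = rng.uniform(-0.5, 0.5, size=(8, 64)).astype(np.float32)
syn1 = rng.uniform(-0.5, 0.5, size=(8, 64)).astype(np.float32)

# Element-wise mean combines both views into a single embedding set
syn = (syn0 + syn1) / 2
```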
Second, define a task object that holds all the descriptors:
```python
from itembed import UnsupervisedTask

task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
```
Third, the `do_batch` method must be invoked multiple times, until convergence. Another helper is provided to handle the training loop. Note that, due to a different sampling strategy, more iterations are needed than in word2vec:

```python
from itembed import train

train(task, num_epoch=100)
```
The full code is therefore as follows:

```python
import numpy as np

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
)

# Get your own itemsets
itemsets = [
    ['apple', 'sugar', 'flour'],
    ['pear', 'sugar', 'flour', 'butter'],
    ['apple', 'pear', 'sugar', 'butter', 'cinnamon'],
    ['salt', 'flour', 'oil'],
    # ...
]

# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(itemsets, min_count=2)
num_label = len(labels)

# Initialize embedding sets from uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)

# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)

# Do training
# Note: due to a different sampling strategy, more epochs are needed than in word2vec
train(task, num_epoch=100)

# Both embedding sets are equivalent; just choose one of them
syn = syn0
```
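Once an embedding set is chosen, it can be queried like any word2vec output, for instance with a cosine-similarity nearest-neighbor lookup. A sketch; the labels and random vectors below are placeholders for the trained `labels` and `syn` above:

```python
import numpy as np

# Placeholder labels and embeddings; in practice, use `labels` from
# pack_itemsets and `syn` from training
labels = ['apple', 'sugar', 'flour', 'pear', 'butter']
rng = np.random.default_rng(1)
syn = rng.normal(size=(len(labels), 64)).astype(np.float32)

def most_similar(label, labels, syn, k=3):
    """Return the k labels closest to `label` by cosine similarity."""
    i = labels.index(label)
    normed = syn / np.linalg.norm(syn, axis=1, keepdims=True)
    scores = normed @ normed[i]
    scores[i] = -np.inf  # exclude the query itself
    top = np.argsort(-scores)[:k]
    return [(labels[j], float(scores[j])) for j in top]

print(most_similar('apple', labels, syn))
```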
More examples can be found in
As suggested in Numba's documentation, Intel's short vector math library can be used to improve performance:

```sh
conda install -c numba icc_rt
```
- Efficient Estimation of Word Representations in Vector Space, 2013, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, https://arxiv.org/abs/1301.3781
- StarSpace: Embed All The Things!, 2017, Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston, https://arxiv.org/abs/1709.03856
- Item2Vec: Neural Item Embedding for Collaborative Filtering, 2016, Oren Barkan, Noam Koenigstein, https://arxiv.org/abs/1603.04259
- Numba: a LLVM-based Python JIT compiler, 2015, Siu Kwan Lam, Antoine Pitrou, Stanley Seibert, https://doi.org/10.1145/2833157.2833162
- 0.4.1 - 2020-05-13
- Clean and rename, to avoid confusion
- 0.4.0 - 2020-05-04
- Refactor to make training task explicit
- Add supervised task
- 0.3.0 - 2020-03-26
- Complete refactor to improve performance and reusability
- 0.2.1 - 2020-03-24
- Allow keyboard interruption
- Fix label count argument
- Fix learning rate issue
- Add optimization flags to Numba JIT
- 0.2.0 - 2019-11-08
- Clean and refactor
- Allow training from plain arrays
- 0.1.0 - 2019-09-13
- Initial version