word2vec for itemsets

Itemset embeddings

This is yet another variation of the well-known word2vec method, proposed by Mikolov et al., applied to unordered sequences, which are commonly referred to as itemsets. The contribution of itembed is twofold:

  1. Modifying the base algorithm to handle unordered sequences, which has an impact on the definition of context windows;
  2. Using the two embedding sets introduced in word2vec for supervised learning.

A similar philosophy is described by Wu et al. in StarSpace and by Barkan and Koenigstein in item2vec. itembed uses Numba to achieve high performance.

Installation

Install from PyPI:

pip install itembed

Or install from source to get the latest version:

pip install git+https://gitlab.com/jojolebarjos/itembed.git

Getting started

Itemsets must be provided as so-called packed arrays, i.e. a pair of integer arrays describing indices and offsets. The index array is defined as the concatenation of all N itemsets. The offset array contains the N+1 boundaries.

import numpy as np

indices = np.array([
    0, 1, 4, 7,
    0, 1, 6,
    2, 3, 5, 6, 7,
], dtype=np.int32)

offsets = np.array([
    0, 4, 7, 12
])

This is similar to compressed sparse matrices:

from scipy.sparse import csr_matrix

dense = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
], dtype=np.int32)

sparse = csr_matrix(dense)

assert (indices == sparse.indices).all()
assert (offsets == sparse.indptr).all()
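To make the layout concrete, each itemset `i` can be recovered as the slice `indices[offsets[i]:offsets[i + 1]]`. A small sketch, reusing the arrays above:

```python
import numpy as np

indices = np.array([
    0, 1, 4, 7,
    0, 1, 6,
    2, 3, 5, 6, 7,
], dtype=np.int32)

offsets = np.array([0, 4, 7, 12])

# Each itemset i spans indices[offsets[i]:offsets[i + 1]].
itemsets = [
    indices[offsets[i]:offsets[i + 1]].tolist()
    for i in range(len(offsets) - 1)
]
print(itemsets)  # [[0, 1, 4, 7], [0, 1, 6], [2, 3, 5, 6, 7]]
```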

Training methods do not handle other data types. Also note that:

  • indices start at 0;
  • item order in an itemset is not important;
  • an itemset can contain duplicated items;
  • itemsets order is not important;
  • there are no weights associated with items or itemsets.

However, a small helper is provided for simple cases:

from itembed import pack_itemsets

itemsets = [
    ['apple', 'sugar', 'flour'],
    ['pear', 'sugar', 'flour', 'butter'],
    ['apple', 'pear', 'sugar', 'butter', 'cinnamon'],
    ['salt', 'flour', 'oil'],
    # ...
]

labels, indices, offsets = pack_itemsets(itemsets, min_count=2)
num_label = len(labels)
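For intuition about `min_count`, here is a sketch of the frequency filtering it implies: items seen fewer than `min_count` times are dropped from the vocabulary. This only mimics the filtering; the exact label ordering returned by `pack_itemsets` may differ.

```python
from collections import Counter

itemsets = [
    ["apple", "sugar", "flour"],
    ["pear", "sugar", "flour", "butter"],
    ["apple", "pear", "sugar", "butter", "cinnamon"],
    ["salt", "flour", "oil"],
]

# Count item occurrences across all itemsets, then keep frequent ones.
counts = Counter(item for itemset in itemsets for item in itemset)
kept = {item for item, count in counts.items() if count >= 2}
print(sorted(kept))  # ['apple', 'butter', 'flour', 'pear', 'sugar']
```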

The next step is to define at least one task. For now, let us stick to the unsupervised case, where co-occurrence is used as the knowledge source. This is similar to the continuous bag-of-words and continuous skip-gram tasks defined in word2vec.

First, two embedding sets must be allocated. Both capture the same information, each being the complement of the other. This aspect of word2vec is not well documented, but empirical results have shown that using two distinct sets works better than reusing the same set twice.

from itembed import initialize_syn

num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
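Each set is a `(num_label, num_dimension)` array. As a rough mental model, here is a minimal stand-in for `initialize_syn`, assuming the uniform initialization used by word2vec; the exact scaling used by itembed may differ.

```python
import numpy as np

def initialize_syn_sketch(num_label, num_dimension, seed=None):
    # Uniform values in [-0.5, 0.5), scaled by dimension, as in word2vec.
    rng = np.random.default_rng(seed)
    values = rng.uniform(-0.5, 0.5, size=(num_label, num_dimension))
    return (values / num_dimension).astype(np.float32)

syn0 = initialize_syn_sketch(1000, 64, seed=0)
print(syn0.shape)  # (1000, 64)
```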

Second, define a task object that holds all the descriptors:

from itembed import UnsupervisedTask

task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)

Third, the do_batch method must be invoked multiple times, until convergence. Another helper is provided to handle the training loop. Note that, due to a different sampling strategy, a larger number of iterations is needed than for word2vec.

from itembed import train

train(task, num_epoch=100)

The full code is therefore as follows:

import numpy as np

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
)

# Get your own itemsets
itemsets = [
    ['apple', 'sugar', 'flour'],
    ['pear', 'sugar', 'flour', 'butter'],
    ['apple', 'pear', 'sugar', 'butter', 'cinnamon'],
    ['salt', 'flour', 'oil'],
    # ...
]

# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(itemsets, min_count=2)
num_label = len(labels)

# Initialize embedding sets from uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)

# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)

# Do training
# Note: due to a different sampling strategy, more epochs than word2vec are needed
train(task, num_epoch=100)

# Both embedding sets are equivalent, just choose one of them
syn = syn0
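Once trained, the embeddings are plain NumPy arrays, so nearest neighbors can be queried with cosine similarity. A sketch, where a random array stands in for the trained `syn` and the `labels` vocabulary is hypothetical:

```python
import numpy as np

# Stand-ins for the outputs of pack_itemsets and train.
rng = np.random.default_rng(0)
labels = ["apple", "sugar", "flour", "pear", "butter"]
syn = rng.normal(size=(len(labels), 64)).astype(np.float32)

def most_similar(label, k=3):
    # Cosine similarity between one item embedding and all others.
    norms = np.linalg.norm(syn, axis=1)
    query = syn[labels.index(label)]
    scores = syn @ query / (norms * np.linalg.norm(query))
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order if labels[i] != label][:k]

print(most_similar("apple"))
```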

More examples can be found in ./example/.

Performance improvement

As suggested in Numba's documentation, Intel's short vector math library (SVML) can be used to improve performance:

conda install -c numba icc_rt

References

  1. Efficient Estimation of Word Representations in Vector Space, 2013, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, https://arxiv.org/abs/1301.3781
  2. StarSpace: Embed All The Things!, 2017, Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston, https://arxiv.org/abs/1709.03856
  3. Item2Vec: Neural Item Embedding for Collaborative Filtering, 2016, Oren Barkan, Noam Koenigstein, https://arxiv.org/abs/1603.04259
  4. Numba: a LLVM-based Python JIT compiler, 2015, Siu Kwan Lam, Antoine Pitrou, Stanley Seibert, https://doi.org/10.1145/2833157.2833162

Changelog

  • 0.4.1 - 2020-05-13
    • Clean and rename, to avoid confusion
  • 0.4.0 - 2020-05-04
    • Refactor to make training task explicit
    • Add supervised task
  • 0.3.0 - 2020-03-26
    • Complete refactor to increase performances and reusability
  • 0.2.1 - 2020-03-24
    • Allow keyboard interruption
    • Fix label count argument
    • Fix learning rate issue
    • Add optimization flags to Numba JIT
  • 0.2.0 - 2019-11-08
    • Clean and refactor
    • Allow training from plain arrays
  • 0.1.0 - 2019-09-13
    • Initial version
