word2vec for itemsets
Itemset embeddings
This is yet another variation of the well-known word2vec method, proposed by Mikolov et al., applied to unordered sequences, which are commonly referred to as itemsets. The contribution of itembed is twofold:
- Modifying the base algorithm to handle unordered sequences, which has an impact on the definition of context windows (see the sketch below);
- Using the two embedding sets introduced in word2vec for supervised learning.
A similar philosophy is described by Wu et al. in StarSpace and by Barkan and Koenigstein in item2vec. itembed uses Numba to achieve high performance.
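For intuition, word2vec defines the context of a word as a window of neighboring words in the sentence; in an unordered itemset there is no such notion of distance, so a natural interpretation is to treat every other item of the same itemset as context. The sketch below illustrates this idea in plain Python; it is only an illustration of the principle, not itembed's actual sampling code.

def iter_context_pairs(itemset):
    """Yield (target, context) pairs, taking all other items of the same
    itemset as context, since there is no order to define a window."""
    for i, target in enumerate(itemset):
        for j, context in enumerate(itemset):
            if i != j:
                yield target, context

# For example, ["apple", "sugar", "flour"] yields 6 pairs:
# ("apple", "sugar"), ("apple", "flour"), ("sugar", "apple"), ...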
Installation
Install from PyPI:
pip install itembed
Or install from source, to get the latest version:
pip install git+https://gitlab.com/jojolebarjos/itembed.git
Getting started
Itemsets must be provided as so-called packed arrays, i.e. a pair of integer arrays describing indices and offsets. The index array is defined as the concatenation of all N itemsets. The offset array contains the N+1 boundaries.
import numpy as np

indices = np.array([
    0, 1, 4, 7,
    0, 1, 6,
    2, 3, 5, 6, 7,
], dtype=np.int32)

offsets = np.array([
    0, 4, 7, 12,
])
This is similar to compressed sparse matrices:
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
], dtype=np.int32)
sparse = csr_matrix(dense)

assert (indices == sparse.indices).all()
assert (offsets == sparse.indptr).all()
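In this representation, the i-th itemset is simply the slice of the index array between two consecutive offsets:

# Recover the second itemset (i = 1) from the packed arrays
i = 1
itemset = indices[offsets[i]:offsets[i + 1]]
assert (itemset == [0, 1, 6]).all()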
Training methods do not handle other data types. Also note that:
- indices start at 0;
- the order of items within an itemset is not important;
- an itemset can contain duplicated items;
- the order of itemsets is not important;
- there are no weights associated with items, nor with itemsets.
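These constraints are easy to get wrong when building the arrays by hand; a quick sanity check can be sketched with a few assertions (plain NumPy, not part of itembed):

# Minimal consistency checks for a packed representation
assert offsets[0] == 0
assert offsets[-1] == len(indices)
assert (np.diff(offsets) >= 0).all()  # offsets must be non-decreasing
assert (indices >= 0).all()           # item indices start at 0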
However, a small helper is provided for simple cases:
from itembed import pack_itemsets

itemsets = [
    ["apple", "sugar", "flour"],
    ["pear", "sugar", "flour", "butter"],
    ["apple", "pear", "sugar", "butter", "cinnamon"],
    ["salt", "flour", "oil"],
    # ...
]

labels, indices, offsets = pack_itemsets(itemsets, min_count=2, min_length=2)
num_label = len(labels)
The next step is to define at least one task. For now, let us stick to the unsupervised case, where co-occurrence is used as the knowledge source. This is similar to the continuous bag-of-words and continuous skip-gram tasks defined in word2vec.
First, two embedding sets must be allocated. Both capture the same information, one being the complement of the other. This aspect of word2vec is not well documented, but empirical results have shown that using two distinct sets works better than reusing the same set twice.
from itembed import initialize_syn
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
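To see how the two sets interact, here is a schematic skip-gram-with-negative-sampling update written with plain NumPy, where syn0 plays the role of the input embeddings and syn1 the output embeddings. This is only an illustration of the underlying word2vec principle, not itembed's actual Numba-compiled kernels.

def sgns_update(syn0, syn1, target, context, negatives, learning_rate=0.025):
    """Push syn0[target] toward syn1[context] and away from syn1[negatives]."""
    x = syn0[target]
    gradient = np.zeros_like(x)
    for index, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        y = syn1[index]
        score = 1.0 / (1.0 + np.exp(-x @ y))  # sigmoid of the dot product
        delta = learning_rate * (label - score)
        gradient += delta * y
        syn1[index] += delta * x
    syn0[target] += gradient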
Second, define a task object that holds all the descriptors:
from itembed import UnsupervisedTask
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
Third, the do_batch method must be invoked repeatedly, until convergence. Another helper is provided to handle the training loop. Note that, due to a different sampling strategy, more iterations are needed than with word2vec.
from itembed import train
train(task, num_epoch=100)
The full code is therefore as follows:
import numpy as np

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
)

# Get your own itemsets
itemsets = [
    ["apple", "sugar", "flour"],
    ["pear", "sugar", "flour", "butter"],
    ["apple", "pear", "sugar", "butter", "cinnamon"],
    ["salt", "flour", "oil"],
    # ...
]

# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(itemsets, min_count=2, min_length=2)
num_label = len(labels)

# Initialize embedding sets from uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)

# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)

# Do training
# Note: due to a different sampling strategy, more epochs than word2vec are needed
train(task, num_epoch=100)

# Both embedding sets are equivalent, just choose one of them
syn = syn0
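The trained embeddings are plain NumPy arrays, so they can be used directly. For instance, the most similar items can be ranked by cosine similarity; this is a minimal sketch using NumPy only (not part of the itembed API), assuming that labels maps each row of syn back to its item name:

# Map item names back to their integer indices
label_to_index = {label: i for i, label in enumerate(labels)}

# Normalize rows, so that cosine similarity reduces to a dot product
norm = syn / np.linalg.norm(syn, axis=1, keepdims=True)

# Rank all items by similarity to a query item, e.g. "sugar"
query = label_to_index["sugar"]
similarity = norm @ norm[query]
for i in np.argsort(-similarity)[:5]:
    print(labels[i], similarity[i])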
More examples can be found in ./example/. See the documentation for more detailed information.
Performance improvement
As suggested in Numba's documentation, Intel's short vector math library can be used to improve performance:
conda install -c numba icc_rt
Citation
If you use this software in your work, please consider citing it, for instance using the following BibTeX reference:
@software{itembed,
    author = {Johan Berdat},
    title = {itembed},
    url = {https://gitlab.com/jojolebarjos/itembed},
    version = {0.5.0},
    date = {2020-06-24},
}
References
- Efficient Estimation of Word Representations in Vector Space, 2013, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, https://arxiv.org/abs/1301.3781
- StarSpace: Embed All The Things!, 2017, Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston, https://arxiv.org/abs/1709.03856
- Item2Vec: Neural Item Embedding for Collaborative Filtering, 2016, Oren Barkan, Noam Koenigstein, https://arxiv.org/abs/1603.04259
- Numba: a LLVM-based Python JIT compiler, 2015, Siu Kwan Lam, Antoine Pitrou, Stanley Seibert, https://doi.org/10.1145/2833157.2833162
Changelog
- 0.5.0 - 2020-06-24
  - Add weighted itemset support
  - Improve documentation and examples
  - Bug fixes
- 0.4.2 - 2020-06-10
  - Clean documentation and examples
  - Bug fixes
- 0.4.1 - 2020-05-13
  - Clean and rename, to avoid confusion
- 0.4.0 - 2020-05-04
  - Refactor to make training task explicit
  - Add supervised task
- 0.3.0 - 2020-03-26
  - Complete refactor to increase performances and reusability
- 0.2.1 - 2020-03-24
  - Allow keyboard interruption
  - Fix label count argument
  - Fix learning rate issue
  - Add optimization flags to Numba JIT
- 0.2.0 - 2019-11-08
  - Clean and refactor
  - Allow training from plain arrays
- 0.1.0 - 2019-09-13
  - Initial version