A corpus-level trie that stores a corpus efficiently and speeds up sentence search
Zen-corpora
Description
Zen-corpora provides two main functionalities:
- A memory-efficient way to store the unique sentences of a corpus.
- Beam text search with RNN model in PyTorch.
Installation
This module requires Python 3.7+. Please install it by running:
pip install zen-corpora
Why Zen-corpora?
Think about how Python stores the corpus below:
corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]
It stores each sentence separately, wasting memory by storing "I have a" four times. Zen-corpora solves this problem by storing the sentences in a corpus-level trie. For example, the corpus above is stored as
|-- I -- have -- a
                 |-- pen
                 |-- dog
                 |-- cat
                 |-- tie
In this way, we save a large amount of memory, and sentence search becomes much faster!
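The idea behind the corpus-level trie can be sketched in a few lines of plain Python. This is an illustrative toy, not zen-corpora's actual implementation; the names `TrieNode`, `insert`, `contains`, and `count_nodes` are hypothetical:

```python
# Toy corpus-level trie (illustration only, not zen-corpora's implementation):
# each node maps a token to its child node, so shared prefixes are stored once.

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.is_end = False  # True if a sentence ends at this node

def insert(root, sentence):
    node = root
    for token in sentence:
        node = node.children.setdefault(token, TrieNode())
    node.is_end = True

def contains(root, sentence):
    node = root
    for token in sentence:
        if token not in node.children:
            return False
        node = node.children[token]
    return node.is_end

def count_nodes(root):
    return sum(1 + count_nodes(c) for c in root.children.values())

root = TrieNode()
for s in [['I', 'have', 'a', 'pen'], ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'], ['I', 'have', 'a', 'tie']]:
    insert(root, s)

print(count_nodes(root))  # 7 nodes, versus 16 tokens stored as flat lists
```

The four sentences share the prefix "I have a", so the trie needs only 7 nodes where the flat list stores 16 tokens.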
Zen-corpora provides Python API to easily construct and interact with a corpus trie. See the following example:
>>> import zencorpora
>>> from zencorpora.corpustrie import CorpusTrie
>>> corpus = [['I', 'have', 'a', 'pen'],
...           ['I', 'have', 'a', 'dog'],
...           ['I', 'have', 'a', 'cat'],
...           ['I', 'have', 'a', 'tie']]
>>> trie = CorpusTrie(corpus=corpus)
>>> print(len(trie))
7
>>> print(['I', 'have', 'a', 'pen'] in trie)
True
>>> print(['I', 'have', 'a', 'sen'] in trie)
False
>>> trie.insert(['I', 'have', 'a', 'book'])
>>> print(['I', 'have', 'a', 'book'] in trie)
True
>>> print(trie.remove(['I', 'have', 'a', 'book']))
1
>>> print(['I', 'have', 'a', 'book'] in trie)
False
>>> print(trie.remove(['I', 'have', 'a', 'caw']))
-1
>>> print(trie.make_list())
[['i', 'have', 'a', 'pen'], ['i', 'have', 'a', 'dog'], ['i', 'have', 'a', 'cat'], ['i', 'have', 'a', 'tie']]
Left-to-Right Beam Text Search
As shown in the SmartReply paper by Kannan et al. (2016), a corpus trie can be used to perform left-to-right beam search with an RNN model. The model encodes the input text, then computes the probability of each pre-defined sentence in the search space given the encoded input. This process is exhaustive: if the search space contains 1 million sentences, the RNN model has to score all 1 million of them. The authors therefore used a corpus trie to perform a beam search over their pre-defined sentences. The idea is simple: search starts from the root of the trie, and at each level only the beam-width most probable partial sentences are retained.
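The pruning step described above can be sketched as follows. This is a self-contained toy with a made-up scoring function standing in for the RNN decoder (zen-corpora's `SearchSpace` wires in real encoder/decoder modules instead); `build_trie`, `beam_search`, and the `score` lambda are all hypothetical names for exposition:

```python
# Toy left-to-right beam search over a corpus trie. A fake scoring function
# replaces the RNN decoder; the pruning logic is the part being illustrated.

END = object()  # sentinel marking the end of a sentence in the trie

def build_trie(corpus):
    root = {}
    for sentence in corpus:
        node = root
        for token in sentence:
            node = node.setdefault(token, {})
        node[END] = True
    return root

def beam_search(root, score_fn, beam_width):
    beams = [(0.0, [], root)]  # (cumulative log-prob, tokens so far, trie node)
    finished = []
    while beams:
        candidates = []
        for logp, tokens, node in beams:
            for token, child in node.items():
                if token is END:
                    finished.append((logp, tokens))
                else:
                    candidates.append((logp + score_fn(tokens, token),
                                       tokens + [token], child))
        # Keep only the beam_width most probable partial sentences per level
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda f: f[0], reverse=True)
    return finished

corpus = [['I', 'have', 'a', 'pen'], ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'], ['I', 'have', 'a', 'tie']]
trie = build_trie(corpus)
# Fake log-probabilities: prefer alphabetically earlier tokens
score = lambda prefix, token: -ord(token[0].lower()) / 100
results = beam_search(trie, score, beam_width=2)
print([' '.join(t) for _, t in results])  # ['I have a cat', 'I have a dog']
```

Because pruning happens at each trie level, the model only ever scores the children of the surviving beams, instead of every sentence in the search space.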
Zen-corpora provides a class to enable beam search. See the example below.
>>> import torch.nn as nn
>>> import torch
>>> import os
>>> from zencorpora import SearchSpace
>>> corpus_path = os.path.join('data', 'search_space.csv')
>>> data = ... # assume data contains torchtext Field, encoder and decoder
>>> space = SearchSpace(
... src_field=data.input_field,
... trg_field=data.output_field,
... encoder=data.model.encoder,
... decoder=data.model.decoder,
... corpus_path=corpus_path,
... hide_progress=False,
... score_function=nn.functional.log_softmax,
... device=torch.device('cpu'),
... ) # set hide_progress=True to hide the progress bar
Construct Corpus Trie: 100%|...| 34105/34105 [00:01<00:00, 21732.69 sentence/s]
>>> src = ['this', 'is', 'test']
>>> result = space.beam_search(src, 2)
>>> print(len(result))
2
>>> print(result)
[('is this test?', 1.0), ('this is test!', 1.0)]
>>> result = space.beam_search(src, 100)
>>> print(len(result))
100
License
This project is licensed under Apache 2.0.
File details
Details for the file zen-corpora-0.1.2.tar.gz.

File metadata
- Download URL: zen-corpora-0.1.2.tar.gz
- Size: 14.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes
Algorithm | Hash digest
---|---
SHA256 | 7bc0826834742ee620c8d9831ef24a2f5ed7c9736bb8dd046417c273722c9d36
MD5 | 103e68ea11bd90a12301e3bd41652c74
BLAKE2b-256 | d0a87f3eeaac6eb3d60647a6c2db18519deff3bf0f5a979e445e3a87616ae8f5