A corpus-level trie that stores a corpus efficiently and speeds up sentence search

Zen-corpora

Description

Zen-corpora provides two main functionalities:

  • A memory-efficient way to store the unique sentences in a corpus.
  • Beam search over text with an RNN model in PyTorch.

Installation

This module requires Python 3.7+. Please install it by running:

pip install zen-corpora

Why Zen-corpora?

Think about how Python stores the corpus below:

corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]

Python stores each sentence as a separate list, so the shared prefix "I have a" is stored four times, which wastes memory.

Zen-corpora solves this problem by storing sentences in a corpus-level trie. For example, the corpus above will be stored as

|-- I -- have -- a
                 |-- pen
                 |-- dog
                 |-- cat
                 |-- tie

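Because the shared prefix is stored only once, the four sentences above take 7 trie nodes instead of 16 separate tokens. Looking up a sentence only requires walking one path from the root, so sentence search can be much faster too.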
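To make the idea concrete, here is a minimal sketch of a token-level trie. It is not Zen-corpora's implementation: the names `TrieNode` and `TokenTrie` are hypothetical and exist only for this illustration. It shows how inserting sentences that share a prefix reuses the existing nodes, and how a membership test walks a single path from the root.

```python
class TrieNode:
    """One token in the trie; children are keyed by the next token."""

    def __init__(self):
        self.children = {}
        self.is_end = False  # True if some sentence ends at this node


class TokenTrie:
    """Hypothetical minimal trie over token lists (not Zen-corpora's CorpusTrie)."""

    def __init__(self, corpus=()):
        self.root = TrieNode()
        self.num_nodes = 0  # token nodes created so far, excluding the root
        for sentence in corpus:
            self.insert(sentence)

    def insert(self, sentence):
        node = self.root
        for token in sentence:
            if token not in node.children:
                # Only tokens that extend a new path allocate a node.
                node.children[token] = TrieNode()
                self.num_nodes += 1
            node = node.children[token]
        node.is_end = True

    def __contains__(self, sentence):
        node = self.root
        for token in sentence:
            node = node.children.get(token)
            if node is None:
                return False
        return node.is_end

    def __len__(self):
        return self.num_nodes


corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]
trie = TokenTrie(corpus)
print(len(trie))                          # 7 nodes for 16 tokens
print(['I', 'have', 'a', 'pen'] in trie)  # True
print(['I', 'have', 'a'] in trie)         # False: a prefix, not a stored sentence
```

The `is_end` flag is a design choice of this sketch: it distinguishes complete sentences from mere prefixes of longer ones.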

Zen-corpora provides a Python API to easily construct and interact with a corpus trie. See the following example:

>>> import zencorpora
>>> from zencorpora.corpustrie import CorpusTrie
>>> corpus = [['I', 'have', 'a', 'pen'],
...           ['I', 'have', 'a', 'dog'],
...           ['I', 'have', 'a', 'cat'],
...           ['I', 'have', 'a', 'tie']]
>>> trie = CorpusTrie(corpus=corpus)
>>> print(len(trie))
7
>>> print(['I', 'have', 'a', 'pen'] in trie)
True
>>> print(['I', 'have', 'a', 'sen'] in trie)
False
>>> trie.insert(['I', 'have', 'a', 'book'])
>>> print(['I', 'have', 'a', 'book'] in trie)
True
>>> print(trie.remove(['I', 'have', 'a', 'book']))
1
>>> print(['I', 'have', 'a', 'book'] in trie)
False
>>> print(trie.remove(['I', 'have', 'a', 'caw']))
-1
>>> print(trie.make_list())
[['i', 'have', 'a', 'pen'], ['i', 'have', 'a', 'dog'], ['i', 'have', 'a', 'cat'], ['i', 'have', 'a', 'tie']]

Left-to-Right Beam Text Search

As shown in the SmartReply paper by Kannan et al. (2016), a corpus trie can be used to perform left-to-right beam search with an RNN model. The model encodes the input text, then computes the probability of each predefined sentence in the search space given the encoded input. Scoring every sentence is exhaustive, however: with 1 million sentences in the search space, the RNN model has to process all 1 million of them. The authors therefore used the corpus trie to run a beam search over their predefined sentences. The idea is simple: the search starts at the root of the trie and, at each level, retains only the beam-width most probable partial sentences.
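The procedure described above can be sketched without any neural network. The code below is an illustration under stated assumptions, not Zen-corpora's or SmartReply's implementation: `build_trie`, `beam_search`, `toy_score` and `TOY_PROBS` are hypothetical names, and the toy scorer stands in for the log-probability that an RNN decoder would assign to the next token given the encoded input.

```python
import math


def build_trie(corpus):
    """Nested-dict trie; the key None marks the end of a sentence."""
    root = {}
    for sentence in corpus:
        node = root
        for token in sentence:
            node = node.setdefault(token, {})
        node[None] = {}
    return root


def beam_search(trie, score_next, beam_width):
    """Left-to-right search that keeps `beam_width` prefixes per trie level."""
    beams = [((), 0.0, trie)]  # (prefix tokens, cumulative log-prob, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, logp, node in beams:
            for token, child in node.items():
                if token is None:
                    # A stored sentence ends here.
                    finished.append((list(prefix), logp))
                else:
                    new_logp = logp + score_next(prefix, token)
                    candidates.append((prefix + (token,), new_logp, child))
        # Prune: only the most probable prefixes survive to the next level.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:beam_width]


# Assumption: fixed toy probabilities replace a real decoder's output.
TOY_PROBS = {'pen': 0.5, 'dog': 0.3, 'cat': 0.15, 'tie': 0.05}


def toy_score(prefix, token):
    """Log-probability of `token` given `prefix` (the prefix is ignored here)."""
    return math.log(TOY_PROBS.get(token, 1.0))


corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]
results = beam_search(build_trie(corpus), toy_score, beam_width=2)
for sentence, logp in results:
    print(' '.join(sentence), round(math.exp(logp), 2))
```

With a beam width of 2, the scorer is called once per child of each surviving prefix, and the "cat" and "tie" branches are dropped as soon as they fall outside the beam. On a large trie, this pruning is what avoids scoring every sentence in full.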

Zen-corpora provides a class to enable beam search. See the example below.

>>> import torch.nn as nn
>>> import torch
>>> import os
>>> from zencorpora import SearchSpace
>>> corpus_path = os.path.join('data', 'search_space.csv')
>>> data = ... # assume data contains torchtext Field, encoder and decoder
>>> space = SearchSpace(
...    src_field=data.input_field,
...    trg_field=data.output_field,
...    encoder=data.model.encoder,
...    decoder=data.model.decoder,
...    corpus_path=corpus_path,
...    hide_progress=False,
...    score_function=nn.functional.log_softmax,
...    device=torch.device('cpu'),
... ) # set hide_progress=True to hide the progress bar
Construct Corpus Trie: 100%|...| 34105/34105 [00:01<00:00, 21732.69 sentence/s]
>>> src = ['this', 'is', 'test']
>>> result = space.beam_search(src, 2)
>>> print(len(result))
2
>>> print(result)
[('is this test?', 1.0), ('this is test!', 1.0)]
>>> result = space.beam_search(src, 100)
>>> print(len(result))
100

License

This project is licensed under Apache 2.0.


Files for zen-corpora, version 0.1.2: zen-corpora-0.1.2.tar.gz (source distribution, 14.6 kB)
