corpus-level trie to store corpus efficiently and speed up sentence search

Zen-corpora

Description

Zen-corpora provides two main functionalities:

  • A memory-efficient way to store unique sentences in a corpus.
  • Beam text search with an RNN model in PyTorch.

Installation

This module requires Python 3.7+. Please install it by running:

pip install zen-corpora

Why Zen-corpora?

Think about how Python stores the corpus below:

corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]

Python stores each sentence separately, which wastes memory by storing "I have a" four times.

Zen-corpora solves this problem by storing sentences in a corpus-level trie. For example, the corpus above will be stored as

|-- I -- have -- a
                 |-- pen
                 |-- dog
                 |-- cat
                 |-- tie

Because the shared prefix is stored only once, this saves memory and can make sentence search faster.
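To make the saving concrete, here is a minimal sketch of a token-level trie in plain Python. It is not Zen-corpora's implementation; the names `TrieNode`, `build_trie`, `count_nodes`, and `contains` are hypothetical and exist only for this illustration. It compares the number of tokens stored in the flat list with the number of nodes in the trie.

```python
class TrieNode:
    """One token in the trie (hypothetical helper, not part of zencorpora)."""

    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.is_end = False  # True if a sentence ends at this node


def build_trie(corpus):
    """Insert every tokenized sentence, sharing common prefixes."""
    root = TrieNode()
    for sentence in corpus:
        node = root
        for token in sentence:
            node = node.children.setdefault(token, TrieNode())
        node.is_end = True
    return root


def count_nodes(node):
    """Count all token nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def contains(root, sentence):
    """Return True only if `sentence` was inserted as a complete sentence."""
    node = root
    for token in sentence:
        node = node.children.get(token)
        if node is None:
            return False
    return node.is_end


corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]

root = build_trie(corpus)
n_tokens_flat = sum(len(sentence) for sentence in corpus)  # tokens in the list of lists
n_nodes = count_nodes(root)                                # tokens stored in the trie
print(n_tokens_flat, n_nodes)
```

For this corpus the flat list holds 16 tokens while the trie holds 7 nodes. A lookup walks one node per token, so its cost depends on the sentence length rather than on the number of sentences stored.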

Zen-corpora provides a Python API to easily construct and interact with a corpus trie. See the following example:

>>> import zencorpora
>>> from zencorpora.corpustrie import CorpusTrie
>>> corpus = [['I', 'have', 'a', 'pen'],
...           ['I', 'have', 'a', 'dog'],
...           ['I', 'have', 'a', 'cat'],
...           ['I', 'have', 'a', 'tie']]
>>> trie = CorpusTrie(corpus=corpus)
>>> print(len(trie))
7
>>> print(['I', 'have', 'a', 'pen'] in trie)
True
>>> print(['I', 'have', 'a', 'sen'] in trie)
False
>>> trie.insert(['I', 'have', 'a', 'book'])
>>> print(['I', 'have', 'a', 'book'] in trie)
True
>>> print(trie.remove(['I', 'have', 'a', 'book']))
1
>>> print(['I', 'have', 'a', 'book'] in trie)
False
>>> print(trie.remove(['I', 'have', 'a', 'caw']))
-1
>>> print(trie.make_list())
[['i', 'have', 'a', 'pen'], ['i', 'have', 'a', 'dog'], ['i', 'have', 'a', 'cat'], ['i', 'have', 'a', 'tie']]

Left-to-Right Beam Text Search

As shown in the SmartReply paper by Kannan et al. (2016), a corpus trie can be used to perform left-to-right beam search with an RNN model. The model encodes the input text, then computes the probability of each predefined sentence in the search space given the encoded input. Scoring every sentence is exhaustive, however: with 1 million sentences in the search space, the RNN would have to process all 1 million of them. The authors therefore used a corpus trie to run a beam search over their predefined sentences. The idea is simple: the search starts from the root of the trie and, at each level, retains only the beam-width most probable partial sentences.
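The level-by-level pruning described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used by Zen-corpora or by the SmartReply authors: the trie is a nested dict, `END` is a made-up sentinel key, and `toy_log_prob` is a hypothetical stand-in for the RNN decoder that simply looks up a fixed probability per token.

```python
import math

END = None  # sentinel key marking the end of a sentence (assumption of this sketch)


def build_trie(corpus):
    """Build a nested-dict trie: token -> sub-trie, with END marking sentence ends."""
    root = {}
    for sentence in corpus:
        node = root
        for token in sentence:
            node = node.setdefault(token, {})
        node[END] = True
    return root


def beam_search(root, token_log_prob, beam_width):
    """Walk the trie left to right, keeping `beam_width` partial sentences per level."""
    beam = [(0.0, [], root)]  # (cumulative log-probability, prefix, trie node)
    completed = []            # (cumulative log-probability, full sentence)
    while beam:
        candidates = []
        for score, prefix, node in beam:
            for token, child in node.items():
                if token is END:
                    continue
                new_score = score + token_log_prob(prefix, token)
                new_prefix = prefix + [token]
                if END in child:
                    completed.append((new_score, new_prefix))
                # Only prefixes that can still be extended compete for a beam slot.
                if any(key is not END for key in child):
                    candidates.append((new_score, new_prefix, child))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]  # prune this level
    completed.sort(key=lambda c: c[0], reverse=True)
    return completed[:beam_width]


# Hypothetical per-token probabilities standing in for a trained decoder.
TOY_PROBS = {'I': 1.0, 'have': 1.0, 'a': 1.0,
             'pen': 0.5, 'dog': 0.3, 'cat': 0.15, 'tie': 0.05}


def toy_log_prob(prefix, token):
    """Ignore the prefix and return a fixed log-probability for the token."""
    return math.log(TOY_PROBS[token])


corpus = [['I', 'have', 'a', 'pen'],
          ['I', 'have', 'a', 'dog'],
          ['I', 'have', 'a', 'cat'],
          ['I', 'have', 'a', 'tie']]

result = beam_search(build_trie(corpus), toy_log_prob, beam_width=2)
for score, sentence in result:
    print(' '.join(sentence), round(score, 3))
```

With a real model, `token_log_prob` would run one decoder step conditioned on the encoded input and the prefix. The pruning step is what keeps the number of decoder calls per level bounded by the beam width times the branching factor, instead of growing with the size of the search space.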

Zen-corpora provides a class to enable beam search. See the example below.

>>> import torch.nn as nn
>>> import torch
>>> import os
>>> from zencorpora import SearchSpace
>>> corpus_path = os.path.join('data', 'search_space.csv')
>>> data = ... # assume data contains torchtext Field, encoder and decoder
>>> space = SearchSpace(
...    src_field=data.input_field,
...    trg_field=data.output_field,
...    encoder=data.model.encoder,
...    decoder=data.model.decoder,
...    corpus_path=corpus_path,
...    hide_progress=False,
...    score_function=nn.functional.log_softmax,
...    device=torch.device('cpu'),
... ) # you can hide the progress bar by setting hide_progress=True
Construct Corpus Trie: 100%|...| 34105/34105 [00:01<00:00, 21732.69 sentence/s]
>>> src = ['this', 'is', 'test']
>>> result = space.beam_search(src, 2)
>>> print(len(result))
2
>>> print(result)
[('is this test?', 1.0), ('this is test!', 1.0)]
>>> result = space.beam_search(src, 100)
>>> print(len(result))
100

License

This project is licensed under Apache 2.0.
