
Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)


shellinford


Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.

It is based on the shellinford C++ library.

Installation

$ pip install shellinford

Usage

Create a new FM-index instance

>>> import shellinford
>>> fm = shellinford.FMIndex()
  • shellinford.FMIndex([use_wavelet_tree=True, filename=None])
    • When given a filename, Shellinford loads FM-index data from that file

Build FM-index

>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
  • build([docs, filename])
    • When given a filename, Shellinford stores FM-index data to the file

Search the FM-index for a word

>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
...
doc_id: 0
count: 1
text: Milky Holmes
doc_id: 2
count: 1
text: Milky

>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
...
doc_id: 0
count: 1
text: Milky Holmes
  • search(query, [_or=False, ignores=[]])
    • If _or=True, an “OR” search is performed; otherwise an “AND” search is performed
    • If ignores is given, a “NOT” search is also performed for those words
    • NOTE: search() is available only after the FM-index has been built or loaded
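
The _or and ignores options described above can be wrapped in small helpers. The following is a minimal sketch, not part of the library: the function names search_any, search_excluding and summarize are hypothetical, fm is assumed to be an FMIndex that has already been built or loaded, and only the search() signature and the doc_id, count and text attributes shown above are used.

```python
def search_any(fm, words):
    """Return documents containing at least one of `words` ("OR" search)."""
    # Assumption: `fm` is an FMIndex that has already been built or loaded.
    return fm.search(list(words), _or=True)


def search_excluding(fm, words, unwanted):
    """Run an "AND" search for `words` combined with a "NOT" search for `unwanted`."""
    return fm.search(list(words), ignores=list(unwanted))


def summarize(docs):
    """Turn search results into (doc_id, count, text) tuples for display."""
    return [(doc.doc_id, doc.count, doc.text) for doc in docs]
```

For example, summarize(search_any(fm, ['Milky', 'Sherlock'])) would list every document matching either word, and search_excluding(fm, ['Milky'], ['Holmes']) would pass 'Holmes' as the ignores list.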

Add a document

>>> fm.push_back('Baritsu')
  • push_back(doc)
    • NOTE: A document added by this method cannot be searched until build() is called
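
Because pushed documents only become searchable after a build, it can help to pair the two steps. This is a rough sketch under two assumptions: that build() may be called with no arguments (as the optional brackets in build([docs, filename]) suggest), and that fm is an FMIndex instance. The helper name add_and_rebuild is hypothetical.

```python
def add_and_rebuild(fm, new_docs, path=None):
    """Append documents with push_back(), then build so they become searchable."""
    for doc in new_docs:
        fm.push_back(doc)
    # Assumption: build() accepts no arguments and indexes the pushed documents.
    fm.build()
    if path is not None:
        # Optionally persist the index using the documented write(path).
        fm.write(path)
    return fm
```

Calling add_and_rebuild(fm, ['Baritsu']) would then make the new document eligible for fm.search('Baritsu').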

Read FM-index from a binary file

>>> fm.read('milky_holmes.fm')
  • read(path)

Write FM-index binary to a file

>>> fm.write('milky_holmes.fm')
  • write(path)
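
Since build() can store the index to a file and read() can load it back, one possible pattern is to build only when no saved index exists yet. The sketch below is illustrative rather than part of the library: load_or_build is a hypothetical helper, fm is assumed to be a fresh FMIndex, and build(docs, path) is called positionally as in the build example above.

```python
import os


def load_or_build(fm, docs, path):
    """Load the index from `path` if the file exists; otherwise build and save it."""
    if os.path.exists(path):
        fm.read(path)
        return 'loaded'
    # build(docs, filename) stores the FM-index data to the file after building.
    fm.build(docs, path)
    return 'built'
```

The return value only reports which branch ran; after either branch the index should be ready for search().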

License

  • Wrapper code is licensed under the New BSD License.
  • Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.

CHANGES

0.3.4 (2016-10-28)

  • FMIndex.search() returns a list

0.3 (2014-11-24)

  • “OR” search and “NOT” search are available in FMIndex.search().
  • FMIndex.size and FMIndex.docsize are available as properties

0.2 (2014-03-28)

  • “AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()

0.1 (2014-03-11)

First release.

Download files

Filename: shellinford-0.3.4.tar.gz (61.1 kB)
File type: Source
Python version: None
Upload date: Oct 29, 2016
