Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)

shellinford

Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.

It is based on the shellinford C++ library.

NOTE: This module requires a C++11 compiler.
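
For readers new to the term, the core operation of an FM-index is backward search over the Burrows-Wheeler transform (BWT) of the text. The sketch below is a naive pure-Python illustration of that idea, not shellinford's implementation. The names build_bwt and count_occurrences are hypothetical, the BWT is built by sorting rotations, and rank is a linear scan where a wavelet matrix/tree would answer the same query in time logarithmic in the alphabet size.

```python
from collections import Counter


def build_bwt(text, sentinel="\0"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    s = text + sentinel  # the sentinel must sort before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the original text via backward search."""
    # smaller[c] = number of characters in the text strictly smaller than c
    freq = Counter(bwt)
    smaller = {}
    total = 0
    for c in sorted(freq):
        smaller[c] = total
        total += freq[c]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; linear here, sublinear with a wavelet tree
        return bwt.count(c, 0, i)

    lo, hi = 0, len(bwt)  # half-open range of sorted rotations that match so far
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + rank(c, lo)
        hi = smaller[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("abracadabra")
print(count_occurrences(bwt, "abra"))  # prints 2
```

The pattern is consumed from its last character to its first, and each step narrows the range of sorted rotations that begin with the suffix matched so far. The width of the final range is the occurrence count.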

Installation

$ pip install shellinford

Usage

Create a new FM-index instance

>>> import shellinford
>>> fm = shellinford.FMIndex()
  • shellinford.FMIndex([use_wavelet_tree=True, filename=None])
    • When given a filename, FMIndex loads FM-index data from the file

Build FM-index

>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
  • build([docs, filename])
    • When given a filename, Shellinford stores FM-index data to the file

Search for a word in the FM-index

>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
doc_id: 2
count: [1]
text: Milky

>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
  • search(query, [_or=False, ignores=[]])
    • If _or is True, an “OR” search is executed; otherwise an “AND” search is executed
    • When ignores is given, a “NOT” search is also executed: documents matching those words are excluded
    • NOTE: The search function is available only after the FM-index has been built or loaded
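
The query semantics above can be modeled with plain substring checks. The following is a hypothetical reference model, not shellinford's code: search_model is an invented name, it assumes one count per query term, and it assumes a document is excluded when it contains any of the ignored words.

```python
def search_model(docs, query, _or=False, ignores=()):
    """Naive model of AND / OR / NOT document search over a list of strings."""
    # Accept either a single word or a sequence of words, as search() does
    terms = [query] if isinstance(query, str) else list(query)
    results = []
    for doc_id, text in enumerate(docs):
        # NOT: skip documents containing an ignored word (assumed semantics)
        if any(word in text for word in ignores):
            continue
        # str.count counts non-overlapping occurrences, enough for this sketch
        counts = [text.count(term) for term in terms]
        hits = [n > 0 for n in counts]
        # OR needs at least one matching term, AND needs all of them
        if any(hits) if _or else all(hits):
            results.append((doc_id, counts, text))
    return results


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']

for doc_id, counts, text in search_model(docs, ['Holmes', 'Sherlock'], _or=True):
    print(doc_id, counts, text)
```

A model like this is handy as a test oracle: on a small corpus, the document ids it returns can be compared with those from fm.search() for the same query.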

Count a word in the FM-index

>>> fm.count('Milky')
2

>>> fm.count(['Milky', 'Holmes'])
1
  • count(query, [_or=False])
    • If _or is True, an “OR” search is executed; otherwise an “AND” search is executed
    • NOTE: The count function is available only after the FM-index has been built or loaded
    • This function is slightly faster than the search function

Add a document

>>> fm.push_back('Baritsu')
  • push_back(doc)
    • NOTE: A document added by this method cannot be searched until build() is called

Read FM-index from a binary file

>>> fm.read('milky_holmes.fm')
  • read(path)

Write FM-index binary to a file

>>> fm.write('milky_holmes.fm')
  • write(path)

Check whether the FM-index contains a string

>>> 'baritsu' in fm

License

  • Wrapper code is licensed under the New BSD License.
  • Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.

CHANGES

0.4.1 (2019-02-08)

  • Make “in” operator faster

0.4.0 (2018-09-30)

  • FMIndex.count() is added
  • Python 2.6 is no longer supported
  • bug fix

0.3.5 (2018-09-05)

  • FMIndex.build() and FMIndex.push_back() ignore empty strings
  • FMIndex supports the “in” operator (e.g., 'a' in fm)
  • Support Python 3.5, 3.6 and 3.7

0.3.4 (2016-10-28)

  • FMIndex.search() returns a list

0.3 (2014-11-24)

  • “OR” search and “NOT” search are available in FMIndex.search().
  • FMIndex.size and FMIndex.docsize are available as properties

0.2 (2014-03-28)

  • “AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()

0.1 (2014-03-11)

  • First release
