Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)

shellinford

Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.

It is based on the shellinford C++ library.
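
To give a feel for what an FM-index does, here is a toy sketch in pure Python. It is not shellinford's implementation: the names build_bwt and count_occurrences are made up for illustration, the suffix sort is naive, and rank is answered by a linear scan. A succinct structure such as a wavelet tree or wavelet matrix is what makes the rank step fast in a real index.

```python
def build_bwt(text):
    """Return the Burrows-Wheeler transform of text, using '\\0' as sentinel."""
    text += "\0"
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern via backward search over the BWT."""
    # c_table[ch] = number of characters in the text strictly smaller than ch.
    c_table = {}
    for idx, ch in enumerate(sorted(bwt)):
        c_table.setdefault(ch, idx)

    def rank(ch, pos):
        # Occurrences of ch in bwt[:pos]; linear here, fast with a wavelet tree.
        return bwt.count(ch, 0, pos)

    lo, hi = 0, len(bwt)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + rank(ch, lo)
        hi = c_table[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("banana")
print(count_occurrences(bwt, "ana"))
```

The pattern is consumed from its last character to its first, and each step narrows the range of suffixes that start with the part of the pattern seen so far.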

Installation

$ pip install shellinford

Usage

Create a new FM-index instance

>>> import shellinford
>>> fm = shellinford.FMIndex()
  • shellinford.FMIndex([use_wavelet_tree=True, filename=None])

    • When given a filename, Shellinford loads FM-index data from the file

Build FM-index

>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
  • build([docs, filename])

    • When given a filename, Shellinford stores FM-index data to the file

Search for a word in the FM-index

>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: 1
text: Milky Holmes
doc_id: 2
count: 1
text: Milky

>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: 1
text: Milky Holmes
  • search(query, [_or=False, ignores=[]])

    • If _or=True, an “OR” search is executed; otherwise an “AND” search is executed

    • When ignores is given, a “NOT” search is also executed

    • NOTE: The search function is available after FM-index is built or loaded
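
The AND/OR/NOT options can be pictured with a naive reference in plain Python. This is only a sketch of the assumed semantics, not the library's code: naive_search is a hypothetical helper, it scans every document with the in operator, and it assumes that a document containing any string in ignores is dropped from the results. The parameter names _or and ignores mirror the signature above.

```python
def naive_search(docs, query, _or=False, ignores=()):
    """Linear-scan illustration of AND / OR / NOT matching over docs."""
    # A single string is treated as a one-term query.
    terms = [query] if isinstance(query, str) else list(query)
    # OR: any term may match. AND (default): every term must match.
    combine = any if _or else all
    hits = []
    for doc_id, text in enumerate(docs):
        # NOT (assumed): skip documents containing any ignored string.
        if any(word in text for word in ignores):
            continue
        if combine(term in text for term in terms):
            hits.append((doc_id, text))
    return hits


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']
print(naive_search(docs, ['Milky', 'Holmes']))
print(naive_search(docs, ['Holmes', 'Sherlock'], _or=True))
print(naive_search(docs, 'Milky', ignores=['Holmes']))
```

An FM-index answers the same questions without scanning each document, which is the reason to build one.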

Add a document

>>> fm.push_back('Baritsu')
  • push_back(doc)

    • NOTE: A document added by this method cannot be searched until build is called
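
A small helper can make the "push, then build" order explicit. This is a sketch under assumptions: index_documents is a hypothetical name, fm is expected to be a freshly created FM-index object, and calling build() with no arguments is assumed to be valid because both parameters are shown as optional above.

```python
def index_documents(fm, docs):
    """Queue non-empty documents on fm, then build once so they are searchable."""
    queued = 0
    for doc in docs:
        if doc:  # empty strings are skipped (push_back ignores them since 0.3.5)
            fm.push_back(doc)
            queued += 1
    if queued:
        # Assumption: build() accepts no arguments and indexes the queued docs.
        fm.build()
    return queued
```

With the real library this might be called as index_documents(shellinford.FMIndex(), ['Baritsu', 'Milky Holmes']).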

Read FM-index from a binary file

>>> fm.read('milky_holmes.fm')
  • read(path)

Write FM-index binary to a file

>>> fm.write('milky_holmes.fm')
  • write(path)
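
Since an index can be written to and read from a file, a common pattern is to reuse a saved index when one exists. The helper below is a hedged sketch: load_or_build is a hypothetical name, and it relies only on read(path) and build(docs, filename) as described in this document.

```python
import os


def load_or_build(fm, docs, path):
    """Load the index from path if it exists; otherwise build it and save it."""
    if os.path.exists(path):
        fm.read(path)          # reuse the previously written index
    else:
        fm.build(docs, path)   # build from docs and store the result at path
    return fm
```

Note that a stale file is reused as-is, so the file should be deleted whenever the document set changes.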

Check whether the FM-index contains a string

>>> 'baritsu' in fm

License

  • Wrapper code is licensed under the New BSD License.

  • Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.

CHANGES

0.3.5 (2018-09-05)

  • FMIndex.build() and FMIndex.push_back() ignore empty strings

  • FMIndex supports the “in” operator (e.g., 'a' in fm)

  • Support Python 3.5, 3.6 and 3.7

0.3.4 (2016-10-28)

  • FMIndex.search() returns list

0.3 (2014-11-24)

  • “OR” search and “NOT” search are available in FMIndex.search().

  • FMIndex.size and FMIndex.docsize are available as property

0.2 (2014-03-28)

  • “AND” search is available by giving a Sequence (list, tuple, etc.) to FMIndex.search()

0.1 (2014-03-11)

First release.
