Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)

shellinford

Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.

It is based on the shellinford C++ library.
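
For intuition about what an FM-index does when it counts matches, here is a minimal pure-Python sketch of backward search over a Burrows-Wheeler transform. It is not the shellinford implementation: the names build_fm_index and count_occurrences are hypothetical, the suffix array is built naively, and the occurrence table is a plain array where a succinct library would typically answer the same rank queries with a wavelet tree or wavelet matrix.

```python
def build_fm_index(text, sentinel="\0"):
    """Build a toy FM-index: C table and occurrence counts over the BWT."""
    s = text + sentinel  # the sentinel must sort before every other character
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps around).
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # c_table[ch] = number of characters in s strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += s.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return c_table, occ, len(s)


def count_occurrences(index, pattern):
    """Count occurrences of pattern by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows that still match
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

With this sketch, count_occurrences(build_fm_index("Milky Holmes and Milky"), "Milky") evaluates to 2. Each pattern character costs two table lookups, independent of the text length.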

NOTE: This module requires a C++11 compiler.

Installation

$ pip install shellinford

Usage

Create a new FM-index instance

>>> import shellinford
>>> fm = shellinford.FMIndex()
  • shellinford.FMIndex([use_wavelet_tree=True, filename=None])

    • When given a filename, Shellinford loads FM-index data from the file

Build FM-index

>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
  • build([docs, filename])

    • When given a filename, Shellinford stores FM-index data to the file

Search for a word in the FM-index

>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
doc_id: 2
count: [1]
text: Milky

>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
  • search(query, [_or=False, ignores=[]])

    • If _or = True, an “OR” search is executed; otherwise an “AND” search is executed

    • If ignores is given, a “NOT” search is also executed: documents containing any of the ignored strings are excluded

    • NOTE: search() is available only after the FM-index has been built or loaded
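
The AND, OR and NOT rules above can be modelled with plain substring matching. The following is only a reference sketch of the described semantics, not the library's code: naive_search is a hypothetical helper, it scans every document linearly, and it returns (doc_id, text) pairs rather than the result objects that FMIndex.search() yields.

```python
def naive_search(docs, query, _or=False, ignores=()):
    """Model of AND / OR / NOT search using substring tests (illustrative only)."""
    # A single string is treated as a one-term query.
    terms = [query] if isinstance(query, str) else list(query)
    results = []
    for doc_id, text in enumerate(docs):
        hits = [term in text for term in terms]
        # OR: at least one term matches. AND: every term matches.
        matched = any(hits) if _or else all(hits)
        if not matched:
            continue
        # NOT: drop documents that contain any ignored string.
        if any(word in text for word in ignores):
            continue
        results.append((doc_id, text))
    return results
```

For the three example documents, naive_search(docs, ['Milky', 'Holmes']) keeps only 'Milky Holmes', while passing _or=True also keeps 'Milky', and naive_search(docs, 'Milky', ignores=['Holmes']) keeps only 'Milky'.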

Count matches for a word in the FM-index

>>> fm.count('Milky')
2

>>> fm.count(['Milky', 'Holmes'])
1
  • count(query, [_or=False])

    • If _or = True, an “OR” search is executed; otherwise an “AND” search is executed

    • NOTE: count() is available only after the FM-index has been built or loaded

    • This function is slightly faster than the search function

Add a document

>>> fm.push_back('Baritsu')
  • push_back(doc)

    • NOTE: A document added by this method cannot be searched until build() is called again
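
Because pushed documents only become searchable after a rebuild, it can help to wrap both steps in one helper. This is a small sketch: add_documents is a hypothetical name, and it assumes that build() may be called with no arguments to rebuild the index over the queued documents, which the optional parameters in build([docs, filename]) suggest but which should be checked against the installed version.

```python
def add_documents(fm, docs):
    """Queue documents with push_back() and rebuild once so they are searchable.

    `fm` is expected to be an FMIndex-like object exposing push_back() and build().
    Returns the number of documents that were queued.
    """
    added = 0
    for doc in docs:
        if doc:  # skip empty strings rather than queueing them
            fm.push_back(doc)
            added += 1
    if added:
        # Assumption: build() without arguments rebuilds over queued documents.
        fm.build()
    return added
```

Typical use would be add_documents(fm, ['Baritsu']) on an existing FMIndex instance. Rebuilding once per batch rather than once per document avoids repeating the most expensive step.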

Read FM-index from a binary file

>>> fm.read('milky_holmes.fm')
  • read(path)

Write FM-index binary to a file

>>> fm.write('milky_holmes.fm')
  • write(path)

Check whether the FM-index contains a string

>>> 'baritsu' in fm

License

  • Wrapper code is licensed under the New BSD License.

  • Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.

CHANGES

0.4.0 (2018-09-30)

  • FMIndex.count() is added

  • Python 2.6 is no longer supported

  • Bug fix

0.3.5 (2018-09-05)

  • FMIndex.build() and FMIndex.push_back() ignore empty strings

  • FMIndex supports the “in” operator (e.g., 'a' in fm)

  • Support Python 3.5, 3.6 and 3.7

0.3.4 (2016-10-28)

  • FMIndex.search() returns list

0.3 (2014-11-24)

  • “OR” search and “NOT” search are available in FMIndex.search().

  • FMIndex.size and FMIndex.docsize are available as property

0.2 (2014-03-28)

  • “AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()

0.1 (2014-03-11)

  • First release


