Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)
shellinford
Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.
It is based on the shellinford C++ library.
NOTE: This module requires a C++11 compiler.
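As background, the core operation of an FM-index is backward search over the Burrows-Wheeler transform of the text. The toy sketch below shows that idea in pure Python. It is not how shellinford itself is implemented: the names `bwt_and_c` and `count_occurrences` are made up for illustration, the suffix array is built by naive sorting, and rank queries use `str.count` where a real implementation would use a wavelet matrix or tree.

```python
def bwt_and_c(text):
    """Return the Burrows-Wheeler transform of text and its C table."""
    text += "\0"  # unique sentinel, smaller than every other character
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    return bwt, c_table


def count_occurrences(bwt, c_table, pattern):
    """Count occurrences of pattern in the text using backward search."""
    lo, hi = 0, len(bwt)  # half-open range of suffixes that still match
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # rank(ch, k) = occurrences of ch in bwt[:k], computed naively here.
        lo = c_table[ch] + bwt.count(ch, 0, lo)
        hi = c_table[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_table = bwt_and_c("Milky Holmes Milky")
print(count_occurrences(bwt, c_table, "Milky"))   # prints 2
print(count_occurrences(bwt, c_table, "Holmes"))  # prints 1
```

The pattern is consumed from its last character to its first, and each step narrows the range of matching suffixes. The cost per step is one pair of rank queries, which is the operation a wavelet matrix or tree answers efficiently.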
Installation
$ pip install shellinford
Usage
Create a new FM-index instance
>>> import shellinford
>>> fm = shellinford.FMIndex()
shellinford.FMIndex([use_wavelet_tree=True, filename=None])
When given a filename, Shellinford loads FM-index data from the file
Build FM-index
>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
build([docs, filename])
When given a filename, Shellinford stores FM-index data to the file
Search for a word in the FM-index
>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
doc_id: 2
count: [1]
text: Milky
>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
search(query, [_or=False, ignores=[]])
If _or is True, an “OR” search is performed; otherwise an “AND” search is performed
If ignores is given, a “NOT” search is also performed, excluding documents that match the ignored words
NOTE: The search function is available only after the FM-index has been built or loaded
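The query semantics described above can be modelled with plain substring tests. The sketch below is a reference model only, not the library's implementation: `naive_search` is a hypothetical helper, it scans every document instead of using an index, and it assumes that a document is dropped from the results when it contains any of the `ignores` words.

```python
def naive_search(docs, query, _or=False, ignores=()):
    """Return the ids of documents that match query (reference model)."""
    # A single string is treated as a one-word query.
    words = [query] if isinstance(query, str) else list(query)
    # "OR" needs any word to match, "AND" needs all of them.
    combine = any if _or else all
    hits = []
    for doc_id, text in enumerate(docs):
        # "NOT": skip documents containing any ignored word (assumption).
        if any(word in text for word in ignores):
            continue
        if combine(word in text for word in words):
            hits.append(doc_id)
    return hits


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']
print(naive_search(docs, 'Milky'))                           # prints [0, 2]
print(naive_search(docs, ['Milky', 'Holmes']))               # prints [0]
print(naive_search(docs, ['Holmes', 'Sherlock'], _or=True))  # prints [0, 1]
print(naive_search(docs, 'Milky', ignores=['Holmes']))       # prints [2]
```

A model like this can be handy as a cross-check in tests: for a small corpus, the document ids returned by the index should agree with the ids returned by the linear scan, provided the assumptions above hold.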
Count a word in the FM-index
>>> fm.count('Milky')
2
>>> fm.count(['Milky', 'Holmes'])
1
count(query, [_or=False])
If _or is True, an “OR” search is performed; otherwise an “AND” search is performed
NOTE: The count function is available only after the FM-index has been built or loaded
This function is slightly faster than the search function
Add a document
>>> fm.push_back('Baritsu')
push_back(doc)
NOTE: A document added by this method cannot be searched until the FM-index is built again
Read FM-index from a binary file
>>> fm.read('milky_holmes.fm')
read(path)
Write FM-index binary to a file
>>> fm.write('milky_holmes.fm')
write(path)
Check whether the FM-index contains a string
>>> 'baritsu' in fm
License
Wrapper code is licensed under the New BSD License.
Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.
CHANGES
0.4.0 (2018-09-30)
FMIndex.count() is added
Python 2.6 is no longer supported
Bug fix
0.3.5 (2018-09-05)
FMIndex.build() and FMIndex.push_back() ignore empty strings
FMIndex supports the “in” operator (e.g., 'a' in fm)
Support Python 3.5, 3.6 and 3.7
0.3.4 (2016-10-28)
FMIndex.search() returns list
0.3 (2014-11-24)
“OR” search and “NOT” search are available in FMIndex.search().
FMIndex.size and FMIndex.docsize are available as property
0.2 (2014-03-28)
“AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()
0.1 (2014-03-11)
First release.