Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)
shellinford
Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.
It is based on the shellinford C++ library.
NOTE: This module requires a C++11 compiler.
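For readers new to the idea, the following is a minimal pure-Python sketch of how an FM-index counts pattern occurrences by backward search. It is not the shellinford implementation: the names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is a plain list where a real implementation would answer the same rank query with a wavelet matrix or wavelet tree.

```python
def build_fm_index(text):
    """Build a naive FM-index (C table and occurrence table) for `text`."""
    text += "\0"  # unique sentinel, assumed absent from the input text
    # Suffix array by plain sorting: fine for a demo, too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch] = number of characters in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = occurrences of ch in bwt[:i] (the "rank" query).
    occ = {ch: [0] for ch in c_table}
    for ch_bwt in bwt:
        for ch, counts in occ.items():
            counts.append(counts[-1] + (ch == ch_bwt))
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("Milky Holmes Milky")
milky_hits = count_occurrences(index, "Milky")
```

The pattern is consumed from its last character to its first, and each step narrows the range of suffixes that start with the part of the pattern seen so far.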
Installation
$ pip install shellinford
Usage
Create a new FM-index instance
>>> import shellinford
>>> fm = shellinford.FMIndex()
- shellinford.FMIndex([use_wavelet_tree=True, filename=None])
- When given a filename, Shellinford loads FM-index data from the file
Build FM-index
>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
- build([docs, filename])
- When given a filename, Shellinford stores FM-index data to the file
Search for a word in the FM-index
>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
...
doc_id: 0
count: [1]
text: Milky Holmes
doc_id: 2
count: [1]
text: Milky
>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
...
doc_id: 0
count: [1]
text: Milky Holmes
- search(query, [_or=False, ignores=[]])
- If _or=True, an “OR” search is executed; otherwise an “AND” search is executed
- When ignores is given, a “NOT” search is also applied
- NOTE: The search function is available only after the FM-index is built or loaded
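The “AND”, “OR” and “NOT” behaviour described above can be modelled in plain Python. This is only a sketch of the semantics, not the library's code: `search_model` is a hypothetical name, matching is done with the `in` operator on strings, and it is assumed that a document is excluded when it contains any of the words in `ignores`.

```python
def search_model(docs, query, _or=False, ignores=()):
    """Return (doc_id, text) pairs that match `query` under AND/OR/NOT rules."""
    # A single string is treated as a one-term query.
    terms = [query] if isinstance(query, str) else list(query)
    combine = any if _or else all  # OR needs one term, AND needs every term
    hits = []
    for doc_id, text in enumerate(docs):
        # NOT: skip documents containing an ignored word (assumed behaviour).
        if any(word in text for word in ignores):
            continue
        if combine(term in text for term in terms):
            hits.append((doc_id, text))
    return hits


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']
and_hits = search_model(docs, ['Milky', 'Holmes'])
or_hits = search_model(docs, ['Holmes', 'Sherlock'], _or=True)
not_hits = search_model(docs, 'Milky', ignores=['Holmes'])
```

With the three example documents, the “AND” query matches only document 0, the “OR” query matches documents 0 and 1, and the “NOT” query leaves only document 2.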
Count matches for a word in the FM-index
>>> fm.count('Milky')
2
>>> fm.count(['Milky', 'Holmes'])
1
- count(query, [_or=False])
- If _or=True, an “OR” search is executed; otherwise an “AND” search is executed
- NOTE: The count function is available only after the FM-index is built or loaded
- This function is slightly faster than the search function
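As a rough model of what count returns, the sketch below reads the example outputs (2 for 'Milky', 1 for ['Milky', 'Holmes']) as the number of matching documents. That reading is an assumption, and `count_model` is a hypothetical helper rather than part of the library.

```python
def count_model(docs, query, _or=False):
    """Count documents matching `query`, using AND by default and OR if `_or`."""
    terms = [query] if isinstance(query, str) else list(query)
    combine = any if _or else all
    # Only a tally is kept; no per-document result objects are built.
    return sum(1 for text in docs if combine(term in text for term in terms))


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']
single = count_model(docs, 'Milky')
both = count_model(docs, ['Milky', 'Holmes'])
```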
Add a document
>>> fm.push_back('Baritsu')
- push_back(doc)
- NOTE: A document added by this method is not searchable until build() is called
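The staging behaviour in the note above can be pictured with a toy class. `StagedIndex` is a hypothetical stand-in written for illustration, not the shellinford implementation; it assumes that build() makes every pending document searchable and, following the 0.3.5 changelog entry, that empty strings are ignored.

```python
class StagedIndex:
    """Toy model: documents are staged by push_back and indexed by build."""

    def __init__(self):
        self._pending = []  # added but not yet searchable
        self._built = []    # searchable after build()

    def push_back(self, doc):
        if doc:  # empty strings are ignored
            self._pending.append(doc)

    def build(self):
        # Move every staged document into the searchable set.
        self._built.extend(self._pending)
        self._pending.clear()

    def __contains__(self, word):
        return any(word in doc for doc in self._built)


idx = StagedIndex()
idx.push_back('Baritsu')
before_build = 'Baritsu' in idx
idx.build()
after_build = 'Baritsu' in idx
```

Here `before_build` is False and `after_build` is True, which mirrors the rule that added documents become searchable only once the index is rebuilt.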
Read FM-index from a binary file
>>> fm.read('milky_holmes.fm')
- read(path)
Write FM-index binary to a file
>>> fm.write('milky_holmes.fm')
- write(path)
Check whether the FM-index contains a string
>>> 'baritsu' in fm
License
- Wrapper code is licensed under the New BSD License.
- Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.
CHANGES
0.4.1 (2019-02-08)
- Make “in” operator faster
0.4.0 (2018-09-30)
- FMIndex.count() is added
- No longer support Python 2.6
- bug fix
0.3.5 (2018-09-05)
- FMIndex.build() and FMIndex.push_back() ignore empty strings
- FMIndex supports “in” operator. (e.g., ‘a’ in fm)
- Support Python 3.5, 3.6 and 3.7
0.3.4 (2016-10-28)
- FMIndex.search() returns list
0.3 (2014-11-24)
- “OR” search and “NOT” search are available in FMIndex.search().
- FMIndex.size and FMIndex.docsize are available as property
0.2 (2014-03-28)
- “AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()
0.1 (2014-03-11)
- First release.