A module for creating n-grams and searching a document for multiple phrases using an inverted index
inverted-index-search
inverted-index-search is a Python library for looking up keywords or subwords in a corpus of text using an inverted index.
Installation
Use the package manager pip to install.
pip install inverted-index-search
Usage
The library is imported into a Python script like any other package. The make_doc_ngrams function breaks the document down into n-grams and builds a lookup table of their positions, and the search_doc function finds the substrings of the document that match your phrases. For both functions you can specify the n-gram sizes and the n-gram level (word or character).
Here is an example of how to use the library:
from inverted_index_search import make_doc_ngrams, search_doc
# This breaks the document down into n-grams to be used for searching
document_ngrams = make_doc_ngrams("this is big document with multiple words and sentences", "word", [1,2], verbose=True)
>> DOCUMENT N GRAMS => [1, 2]
>> Removing these ngrams :
>> DOCUMENT NGRAM LOOKUP TABLE => {'this': [(0, 4)], 'is': [(2, 4), (5, 7)], 'big': [(8, 11)], 'document': [(12, 20)], 'with': [(21, 25)], 'multiple': [(26, 34)], 'words': [(35, 40)], 'and': [(41, 44)], 'sentences': [(45, 54)], 'this is': [(0, 7)], 'is big': [(5, 11)], 'big document': [(8, 20)], 'document with': [(12, 25)], 'with multiple': [(21, 34)], 'multiple words': [(26, 40)], 'words and': [(35, 44)], 'and sentences': [(41, 54)]}
# This breaks the phrases down into n-grams and does the actual matching
search_doc(document_ngrams, ['document', 'multiple words'], [1], 'word', verbose=True)
>> Phrase N GRAMS => [1]
Checking for phrase ngram : document
Checking for phrase ngram : multiple
Checking for phrase ngram : words
>> {'document': {'document': {'count': 1, 'occured': [(12, 20)]}}, 'multiple words': {'multiple': {'count': 1, 'occured': [(26, 34)]}, 'words': {'count': 1, 'occured': [(35, 40)]}}}
print(search_doc.__doc__)
>> """ This function creates ngrams out of the phrases you have
entered and finds the matching substrings in the document. You can specify what ngram for using phrase_ngrams paramter and
. Simply pass phrase_ngrams=[1,7,2] to create ngrams of size 1,7 and 2. There are two level ngram either words or chaarcter which
you can change by changing the n_gram_level to either 'char' or 'word'. To turn on logging setting verbose to True"""
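The result printed by search_doc in the example above is a nested dictionary: phrase, then n-gram, then a count and a list of (start, end) spans under the key 'occured' (spelled as in the library's output). A minimal sketch of walking that structure follows. The result literal is copied from the example output rather than produced by calling the library, and matched_substrings is a hypothetical helper name, not part of inverted-index-search.

```python
document = "this is big document with multiple words and sentences"

# Copied from the search_doc example output above (not recomputed here).
result = {
    "document": {"document": {"count": 1, "occured": [(12, 20)]}},
    "multiple words": {
        "multiple": {"count": 1, "occured": [(26, 34)]},
        "words": {"count": 1, "occured": [(35, 40)]},
    },
}


def matched_substrings(document, result):
    """Yield (phrase, ngram, start, end, text) for every recorded span.

    Hypothetical helper for illustration only.
    """
    for phrase, ngrams in result.items():
        for ngram, info in ngrams.items():
            for start, end in info["occured"]:
                # Spans are half-open, so they can be used directly as slices.
                yield phrase, ngram, start, end, document[start:end]


for row in matched_substrings(document, result):
    print(row)
```

Because each span is a half-open (start, end) pair, slicing the original document with it returns the matched n-gram itself.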
Features
Efficient inverted index search for large text data sets
Customizable n-gram size and level (word or character)
Simple and easy-to-use API
Built-in logging for debugging and testing purposes
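To make the idea of an n-gram inverted index concrete, here is a minimal pure-Python sketch that maps each n-gram to the character spans where it occurs. It is an illustration under stated assumptions, not the library's implementation: build_ngram_index and lookup are hypothetical names, and the sketch splits on whitespace, so it may report different spans than make_doc_ngrams does (for example, it does not find 'is' inside 'this').

```python
import re
from collections import defaultdict


def build_ngram_index(text, level="word", sizes=(1, 2)):
    """Map every n-gram in `text` to the (start, end) spans where it occurs.

    Hypothetical helper for illustration; not part of inverted-index-search.
    """
    if level == "word":
        # One unit per run of non-whitespace characters.
        units = [m.span() for m in re.finditer(r"\S+", text)]
    elif level == "char":
        # One unit per character.
        units = [(i, i + 1) for i in range(len(text))]
    else:
        raise ValueError("level must be 'word' or 'char'")

    index = defaultdict(list)
    for n in sizes:
        for i in range(len(units) - n + 1):
            # An n-gram spans from the first unit's start to the last unit's end.
            start, end = units[i][0], units[i + n - 1][1]
            index[text[start:end]].append((start, end))
    return dict(index)


def lookup(index, phrase):
    """Return the spans recorded for `phrase`, or an empty list."""
    return index.get(phrase, [])


doc = "this is big document with multiple words and sentences"
index = build_ngram_index(doc, "word", [1, 2])
print(lookup(index, "multiple words"))  # [(26, 40)]
print(lookup(index, "is"))              # [(5, 7)]
```

The index is built once per document, after which each phrase lookup is a single dictionary access, which is the property that makes inverted indexes attractive for repeated searches over the same text.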
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
GitHub
Download files
File details
Details for the file inverted_index_search-1.3.4.tar.gz.
File metadata
- Download URL: inverted_index_search-1.3.4.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a08c2e1e44af82814286083e20f8b75d369fd6cf88848d67eb070668ddea2fe |
| MD5 | 8fc74cc952c9185c797e93d81337c496 |
| BLAKE2b-256 | c8f2602d07d1141f31bcc347a0dcc00c051e2bced2c65730314382054ec2cacd |
File details
Details for the file inverted_index_search-1.3.4-py3-none-any.whl.
File metadata
- Download URL: inverted_index_search-1.3.4-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 46e483e21fcb489bb45f773d16b90631c4f08589f8f28ce7ae69a5b8672d04b6 |
| MD5 | da1512a3826b810b2cc613ed861267ac |
| BLAKE2b-256 | aba718f7ba299c616892f2e33e129475a9727480c2e2c16c9d4baedacb661295 |