A module for creating n-grams and searching a document for multiple phrases using an inverted index

inverted-index-search

inverted-index-search is a Python library for searching for keywords or sub-words in a corpus of text using an inverted index lookup.

Installation

Use the package manager pip to install.

pip install inverted-index-search

Usage

The library's usage is straightforward, and it can be easily imported into your Python script. The make_doc_ngrams function breaks down the document into n-grams, and the search_doc function finds matching substrings in the document. You can specify the n-gram size and the type of n-gram level (word or character) for the search.

Here is an example of how to use the library:

from inverted_index_search import make_doc_ngrams, search_doc


# Break the document down into n-grams to be used for searching
document_ngrams = make_doc_ngrams("this is big document with multiple words and sentences", "word", [1,2], verbose=True)
>> DOCUMENT N GRAMS => [1, 2]
>> Removing these ngrams :  
>> DOCUMENT NGRAM LOOKUP TABLE => {'this': [(0, 4)], 'is': [(2, 4), (5, 7)], 'big': [(8, 11)], 'document': [(12, 20)], 'with': [(21, 25)], 'multiple': [(26, 34)], 'words': [(35, 40)], 'and': [(41, 44)], 'sentences': [(45, 54)], 'this is': [(0, 7)], 'is big': [(5, 11)], 'big document': [(8, 20)], 'document with': [(12, 25)], 'with multiple': [(21, 34)], 'multiple words': [(26, 40)], 'words and': [(35, 44)], 'and sentences': [(41, 54)]}



# Break the phrases down into n-grams and match them against the document
search_doc(document_ngrams, ['document', 'multiple words'], [1], 'word', verbose=True)
>> Phrase N GRAMS => [1]
Checking for phrase ngram : document
Checking for phrase ngram : multiple
Checking for phrase ngram : words
>> {'document': {'document': {'count': 1, 'occured': [(12, 20)]}}, 'multiple words': {'multiple': {'count': 1, 'occured': [(26, 34)]}, 'words': {'count': 1, 'occured': [(35, 40)]}}}


print(search_doc.__doc__)
>>  """ This function creates ngrams out of the phrases you have
    entered and finds the matching substrings in the document. You can specify which ngram sizes to use with the phrase_ngrams parameter.
    Simply pass phrase_ngrams=[1,7,2] to create ngrams of size 1, 7 and 2. There are two ngram levels, word or character, which
    you can select by setting n_gram_level to either 'char' or 'word'. To turn on logging, set verbose to True"""
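
To make the lookup table above easier to read, here is a minimal pure-Python sketch of the same idea: index every word n-gram by its character span, then split each phrase into n-grams and look them up. It is not the library's implementation. The names build_ngram_index and lookup_phrases and the "spans" key are hypothetical and chosen only for illustration. The sketch records token-aligned spans only, so its table can differ from the one printed by make_doc_ngrams.

```python
import re
from collections import defaultdict


def build_ngram_index(text, sizes):
    """Map each word n-gram to the (start, end) character spans where it occurs."""
    # Keep each whitespace-separated token together with its character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    index = defaultdict(list)
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            ngram = " ".join(word for word, _, _ in window)
            # The span runs from the first token's start to the last token's end.
            index[ngram].append((window[0][1], window[-1][2]))
    return dict(index)


def lookup_phrases(index, phrases, sizes):
    """Split each phrase into word n-grams and look each one up in the index."""
    results = {}
    for phrase in phrases:
        words = phrase.split()
        hits = {}
        for n in sizes:
            for i in range(len(words) - n + 1):
                ngram = " ".join(words[i:i + n])
                spans = index.get(ngram, [])
                if spans:
                    hits[ngram] = {"count": len(spans), "spans": spans}
        results[phrase] = hits
    return results


text = "this is big document with multiple words and sentences"
index = build_ngram_index(text, [1, 2])
matches = lookup_phrases(index, ["document", "multiple words"], [1])
print(matches)
```

Because the index is built once and each phrase n-gram is a single dictionary lookup, searching many phrases does not require rescanning the document.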

Features

Efficient inverted index search for large text data sets
Customizable n-gram size and level (word or character)
Simple and easy-to-use API
Built-in logging for debugging and testing purposes
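
The usage example only shows word-level n-grams, so here is a small sketch of what a character-level index could look like. This is an assumption about the general technique, not the library's code, and build_char_ngram_index is a hypothetical name used only for illustration.

```python
from collections import defaultdict


def build_char_ngram_index(text, sizes):
    """Map each character n-gram to the (start, end) spans where it occurs."""
    index = defaultdict(list)
    for n in sizes:
        # Slide a window of n characters across the text.
        for start in range(len(text) - n + 1):
            index[text[start:start + n]].append((start, start + n))
    return dict(index)


char_index = build_char_ngram_index("banana", [2, 3])
print(char_index["an"])   # every span where "an" occurs
print(char_index["ana"])  # overlapping occurrences are recorded too
```

Character-level n-grams match sub-words as well as whole words, at the cost of a larger index.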

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Github

Affan
