Boolean text search in Python

Boolean text search using Eldar

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installing

You can install the package with pip:

pip install eldar

Basic usage

from eldar import Query


# build list
documents = [
    "Gandalf is a fictional character in Tolkien's The Lord of the Rings",
    "Frodo is the main character in The Lord of the Rings",
    "Ian McKellen interpreted Gandalf in Peter Jackson's movies",
    "Elijah Wood was cast as Frodo Baggins in Jackson's adaptation",
    "The Lord of the Rings is an epic fantasy novel by J. R. R. Tolkien"]

eldar = Query('("gandalf" OR "frodo") AND NOT ("movie" OR "adaptation")')

# use `filter` to get a list of matches:
print(eldar.filter(documents))
# >>> ["Gandalf is a fictional character in Tolkien's The Lord of the Rings",
#     'Frodo is the main character in The Lord of the Rings',
#     "Ian McKellen interpreted Gandalf in Peter Jackson's movies"]

# call to see if the text matches the query:
print(eldar(documents[0]))
# >>> True

# by default, whole words must match, so "movie" does not match "movies"
# and this document is not excluded:
print(eldar(documents[2]))
# >>> True

You can also use it to mask pandas DataFrames:

from eldar import Query
import pandas as pd


# build dataframe
df = pd.DataFrame([
    "Gandalf is a fictional character in Tolkien's The Lord of the Rings",
    "Frodo is the main character in The Lord of the Rings",
    "Ian McKellen interpreted Gandalf in Peter Jackson's movies",
    "Elijah Wood was cast as Frodo Baggins in Jackson's adaptation",
    "The Lord of the Rings is an epic fantasy novel by J. R. R. Tolkien"],
    columns=['content'])

# build query object
eldar = Query('("gandalf" OR "frodo") AND NOT ("movie" OR "adaptation")')

# eldar's call returns True if the text matches the query.
# You can filter a dataframe using pandas mask syntax:
df = df[df.content.apply(eldar)]
print(df)

Parameters

There are three parameters that you can adjust in the query builder. By default:

Query(..., ignore_case=True, ignore_accent=True, match_word=True)

Let the query be query = '"movie"':

  • If ignore_case is True, the documents "Movie" and "movie" will both be matched. If False, only "movie" will be matched.
  • If ignore_accent is True, the document "mövie" will also be matched.
  • If match_word is True, the document is tokenized and each query term must match a whole word exactly, so "movies" is not matched. If False, terms may also match inside longer words, so both "movies" and "movie" are matched. Setting this option to True may slow down the query.
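To make the effect of the three options concrete, here is a minimal standard-library sketch of the kind of normalization they describe. It is an illustration only, not Eldar's own code: the normalize and matches helpers are hypothetical names, and accent stripping is done here with unicodedata as an assumption about how such a step could work.

```python
import re
import unicodedata


def normalize(text, ignore_case=True, ignore_accent=True, match_word=True):
    """Return the form of `text` that a query term is compared against.

    Hypothetical helper for illustration; not part of the eldar package.
    """
    if ignore_case:
        text = text.lower()
    if ignore_accent:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if match_word:
        # Whole-word matching: compare against a set of tokens.
        return set(re.findall(r"\w+", text))
    # Otherwise compare against the raw string (substring matching).
    return text


def matches(term, text, **options):
    """Check a single term against `text` under the given options."""
    return term in normalize(text, **options)


print(matches("movie", "Movie"))                      # True
print(matches("movie", "Movie", ignore_case=False))   # False
print(matches("movie", "mövie"))                      # True
print(matches("movie", "movies"))                     # False
print(matches("movie", "movies", match_word=False))   # True
```

With match_word=True the membership test runs against a set of tokens, while with match_word=False it runs against the whole string, which is why "movies" only matches in the second case.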

Wildcards

Queries also support * as a wildcard character. A wildcard matches any number (including none) of alphanumeric characters.

from eldar import Query


# sample document and query with multiple wildcards:
document = "Gandalf is a fictional character in Tolkien's The Lord of the Rings"
eldar = Query('"g*dal*"')

# call to see if the text matches the query:
print(eldar(document))
# >>> True
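One way to picture wildcard matching is as a translation of each * into a regular expression fragment. The sketch below uses only the standard library and assumes that a wildcard term is matched against whole words; wildcard_to_regex and matches_any_word are hypothetical helpers, not functions provided by Eldar.

```python
import re


def wildcard_to_regex(term):
    """Compile a term in which `*` stands for zero or more word characters.

    Note: the regex class \\w also accepts underscores, which is slightly
    broader than "alphanumeric".
    """
    parts = (re.escape(part) for part in term.split("*"))
    return re.compile(r"\w*".join(parts))


def matches_any_word(term, text):
    """Return True if any whole word in `text` matches the wildcard term."""
    pattern = wildcard_to_regex(term.lower())
    words = re.findall(r"\w+", text.lower())
    return any(pattern.fullmatch(word) for word in words)


document = "Gandalf is a fictional character in Tolkien's The Lord of the Rings"
print(matches_any_word("g*dal*", document))  # True
print(matches_any_word("fro*", document))    # False
```

Because the wildcard may match nothing at all, a term such as "ring*" would match both "ring" and "rings".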

Building an index for faster queries

Searching in a large corpus using the Query object is slow, as each document has to be checked. For (much) faster queries, create an Index object, and build it using a list of documents.

from eldar import Index
from eldar.trie import Trie

documents = [
    "Gandalf is a fictional character in Tolkien's The Lord of the Rings",
    "Frodo is the main character in The Lord of the Rings",
    "Ian McKellen interpreted Gandalf in Peter Jackson's movies",
    "Elijah Wood was cast as Frodo Baggins in Jackson's adaptation",
    "The Lord of the Rings is an epic fantasy novel by J. R. R. Tolkien",
    "Frodo Baggins is a hobbit"
]

index = Index(ignore_case=True, ignore_accent=True)
index.build(documents)  # only needs to be done once

# persist and retrieve index from disk
index.save("index.p")  # note: the documents are saved to disk as well
index = Index.load("index.p")

print(index.search('"frodo b*" AND NOT hobbit'))  # wildcards are supported
print(index.count('"frodo b*" AND NOT hobbit'))  # shows only the count
# to only return document ids, set `return_ids` to True:
print(index.search('"frodo b*" AND NOT hobbit', return_ids=True))

The index works like a typical search engine: it keeps a dictionary that maps each word to the ids of the documents containing it. The boolean query is turned into an operation tree, in which sets of document ids are joined (OR) or intersected (AND) to return the desired matches.
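The idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Eldar's actual implementation: build_index and evaluate are hypothetical helpers, the query tree is written by hand as nested tuples instead of being parsed from a string, and phrases and wildcards are left out.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def evaluate(node, index, all_ids):
    """Evaluate a query tree against the index and return matching ids.

    A node is either a word (str) or a tuple:
    ("AND", a, b), ("OR", a, b) or ("NOT", a).
    """
    if isinstance(node, str):
        return index.get(node, set())
    op, *args = node
    if op == "NOT":
        # Complement relative to the set of all document ids.
        return all_ids - evaluate(args[0], index, all_ids)
    left, right = (evaluate(arg, index, all_ids) for arg in args)
    if op == "AND":
        return left & right  # intersect the id sets
    if op == "OR":
        return left | right  # join the id sets
    raise ValueError(f"unknown operator: {op}")


documents = [
    "Gandalf is a fictional character in Tolkien's The Lord of the Rings",
    "Frodo is the main character in The Lord of the Rings",
    "Ian McKellen interpreted Gandalf in Peter Jackson's movies",
    "Elijah Wood was cast as Frodo Baggins in Jackson's adaptation",
    "The Lord of the Rings is an epic fantasy novel by J. R. R. Tolkien",
    "Frodo Baggins is a hobbit",
]
index = build_index(documents)
all_ids = set(range(len(documents)))

# ("gandalf" OR "frodo") AND NOT "hobbit"
tree = ("AND", ("OR", "gandalf", "frodo"), ("NOT", "hobbit"))
print(sorted(evaluate(tree, index, all_ids)))  # [0, 1, 2, 3]
```

Each lookup touches only the id sets of the words in the query, so the cost of a search depends on how many documents contain those words and not on the size of the whole corpus.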

License

This package is MIT licensed.

Download files

Source Distribution

eldar-0.0.8.tar.gz (9.3 kB)

Uploaded Source

Built Distribution

eldar-0.0.8-py3-none-any.whl (8.9 kB)

Uploaded Python 3

File details

Details for the file eldar-0.0.8.tar.gz.

File metadata

  • Download URL: eldar-0.0.8.tar.gz
  • Upload date:
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.9.1

File hashes

Hashes for eldar-0.0.8.tar.gz
Algorithm Hash digest
SHA256 350e796e1f1bd74c6e3d05c7a050029cb498846c180d0c59b6b8db18b946ab9a
MD5 b982f1ec5c0948366deda7351e5f8a1d
BLAKE2b-256 7362d42ca071332ca95b369c47d9aab43b64216646eb5925885f996e8dfddbf6

File details

Details for the file eldar-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: eldar-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 8.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.9.1

File hashes

Hashes for eldar-0.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 38da84dd270b8a1a826fe33cd1959ce16c51d9122b0d86eabdec28b0208f057e
MD5 40777141de3d413384ea06eda81e871d
BLAKE2b-256 ca684842ed21884f8499dbd565454534f1a259612df3db15125ac95529a55543
