Tiny one-phase search engine

TinySearch

TinySearch is a tiny one-phase search engine. It is extremely easy to use and works well with simple lists where the query may not match the document text exactly.

This is a minimal search engine. You don't need to run a separate, heavyweight search engine instance when your use case is a few hundred or a few thousand small documents.

Example

Input documents:

"Goldilocks and the Three Bears"
"Fuzzy Wuzzy"
"The Bear Went Over The Mountain"
"We're Going on a Bear Hunt"
"Brown Bear, Brown Bear, What Do You See?"

Search query:

bear

Results (ordered by best match):

"Brown Bear, Brown Bear, What Do You See?"
"Goldilocks and the Three Bears"
"The Bear Went Over The Mountain"
"We're Going on a Bear Hunt"

How to use

from tinysearch.search import Search

docs = [
    "Goldilocks and the Three Bears",
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "We're Going on a Bear Hunt",
    "Brown Bear, Brown Bear, What Do You See?",
]
query = "bear"

s = Search(docs, query)

# How many results?
print(s.results.count)

# What is the top result?
print(s.results.matches[0].doc)

# Print all matches. Best results are at the top.
for m in s.results.matches:
    print(m.doc)

Pass your own analyzer

When tinysearch.analyzer.SimpleEnglishAnalyzer does not satisfy your needs, you can write your own analyzer and pass it to the Search object.

An analyzer inherits from tinysearch.analyzer.base.Analyzer. It only needs to implement the analyze method. The analyze method accepts a string representing the document as input and returns a list of strings representing tokens (terms). Any processing you need, such as normalization or filtering, can be implemented there. See the docstring of the Analyzer base class.
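As an illustration, a custom analyzer could look like the sketch below. To keep the example runnable on its own, it defines a stand-in Analyzer base class; in real use you would import the base class from tinysearch.analyzer.base instead. The MyOwnAnalyzer name, its stop-word list and the tokenization rule are assumptions made for this example, not part of TinySearch.

```python
import re
from typing import List


class Analyzer:
    """Stand-in for tinysearch.analyzer.base.Analyzer (assumption: the real
    base class only requires subclasses to implement analyze)."""

    def analyze(self, doc: str) -> List[str]:
        raise NotImplementedError


class MyOwnAnalyzer(Analyzer):
    """Hypothetical analyzer: lowercase, split on non-alphanumerics, drop stop words."""

    STOP_WORDS = frozenset({"a", "an", "and", "the", "on", "of"})

    def analyze(self, doc: str) -> List[str]:
        # Lowercase first so matching is case-insensitive.
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        # Remove very common words that carry little meaning for search.
        return [t for t in tokens if t not in self.STOP_WORDS]


tokens = MyOwnAnalyzer().analyze("We're Going on a Bear Hunt")
print(tokens)
```

The same analyze method is applied to both documents and queries, so whatever normalization it performs is automatically consistent on both sides.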

You can then pass your analyzer to Search:

my_analyzer = MyOwnAnalyzer()

s = Search(docs, query, analyzer=my_analyzer)
print(s.results.count)

Under the hood

When you pass documents to the Search object, each document is tokenized and transformed for easier search. The same process is applied to the query.

Each document is then scored against the query using the TF-IDF algorithm, and the matches are returned to the user sorted by score, with the best match at the top.
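The scoring step can be sketched roughly as follows. This is not TinySearch's actual implementation: the tokenize and tf_idf_rank functions are hypothetical, the smoothed IDF formula is one common variant chosen for the example, and the tokenizer does no stemming, so "Bears" does not match "bear" here as it does in the example results above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_rank(docs, query):
    """Return (doc, score) pairs for matching docs, best score first."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    scored = []
    for doc, counts in zip(docs, doc_counts):
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_counts if term in c)
            # Smoothed inverse document frequency (an assumption of this sketch).
            idf = math.log((1 + n_docs) / (1 + df)) + 1
            # Term frequency is the raw count of the term in this document.
            score += counts[term] * idf
        if score > 0:
            scored.append((doc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    "Goldilocks and the Three Bears",
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "We're Going on a Bear Hunt",
    "Brown Bear, Brown Bear, What Do You See?",
]
ranked = tf_idf_rank(docs, "bear")
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

The document that mentions the term twice scores highest, which is why "Brown Bear, Brown Bear, What Do You See?" leads the results.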

Performance

Performance matters because a search engine typically responds to interactive user queries, so it should return results within a few seconds at most. Anything longer appears as a significant delay.

The numbers below depend on the machine they were measured on, so treat them as indicative only.

gantt
title Search time for different dataset sizes [s]
dateFormat X
axisFormat %s

section 100
0.0, terms=1 : 0, 0.0s
0.0, terms=2 : 0, 0.0s
0.0, terms=3 : 0, 0.0s

section 1000
0.3, terms=1 : 0, 0.3s
0.2, terms=2 : 0, 0.2s
0.3, terms=3 : 0, 0.3s

section 10000
2.7, terms=1 : 0, 2.7s
2.7, terms=2 : 0, 2.7s
2.7, terms=3 : 0, 2.7s

section 52478
15.1, terms=1 : 0, 15.6s
15.4, terms=2 : 0, 15.1s
15.6, terms=3 : 0, 15.2s

Datasets of around 1000 entries give search times of roughly 0.2 to 0.3 seconds, which is reasonable and matches the intended use case for TinySearch. Still, there is probably room for improvement.

Can we make it faster?

Most of the time is spent in the analyzer, so improving performance means reducing the analyzer's processing time. The default SimpleEnglishAnalyzer has already been highly optimized.

The next step to consider is to split the search into two phases: indexing and searching. Since the analyzer needs to process every document, indexing can happen earlier in the program's execution, and searching can happen when the user requests it. This has the additional benefit of indexing once and searching multiple times.

from tinysearch.index import Index
from tinysearch.search import Search

i = Index(docs)

# ...later...
s = Search(i, query)
print(s.results.matches[0])
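To make the index-once, search-many idea concrete, here is a minimal standalone sketch. It does not use or describe TinySearch's own Index class: SketchIndex and the analyze function are hypothetical names, and the sketch returns only the documents that contain every query term, without any scoring.

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SketchIndex:
    """Hypothetical index: analyzes each document once, stores term -> doc ids."""

    def __init__(self, docs, analyze):
        self.docs = list(docs)
        self.analyze = analyze
        self.postings = defaultdict(set)
        # Indexing phase: the expensive analysis runs once per document.
        for doc_id, doc in enumerate(self.docs):
            for term in analyze(doc):
                self.postings[term].add(doc_id)

    def search(self, query):
        """Return documents containing every query term, in original order."""
        terms = self.analyze(query)
        if not terms:
            return []
        # Search phase: only the query is analyzed, then sets are intersected.
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]


docs = [
    "Goldilocks and the Three Bears",
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "We're Going on a Bear Hunt",
    "Brown Bear, Brown Bear, What Do You See?",
]
index = SketchIndex(docs, analyze)  # built once
print(index.search("bear"))         # queried many times
print(index.search("bear hunt"))
```

With this split, the per-query cost no longer depends on analyzing every document, only on analyzing the query and looking up its terms.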

License

See LICENSE.
