skifts

Search for the most relevant documents containing words from the query.

query = ['A', 'B']

documents = [
    ['N', 'A', 'M'],  # matching features: 'A'
    ['C', 'B', 'A'],  # matching features: 'A', 'B'  
    ['X', 'Y']  # no matching features
]

The search will return ['C', 'B', 'A'] and ['N', 'A', 'M'], in that order.

It's not necessarily about text. Words can be any str instances, and documents are treated as unordered collections of such strings. Documents are ranked by the frequency of the matching words, their rarity, and match accuracy.

Install

pip3 install git+https://github.com/rtmigo/skifts_py#egg=skifts

Use for full-text search

Finding documents that contain words from the query.

from skifts import SkiFts

# three documents, one per row
documents = [
    ["wait", "mister", "postman"],
    ["please", "mister", "postman", "look", "and", "see"],
    ["oh", "yes", "wait", "a", "minute", "mister", "postman"]
]

fts = SkiFts(documents)

# find and print the most relevant documents:
for doc_index in fts.search(['postman', 'wait']):
    print(documents[doc_index])

Words inside the documents list are considered ready-made feature identifiers. If your text needs preprocessing or stemming, this should be done separately.
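As an illustration of the separate preprocessing step, here is a minimal sketch that turns raw text into word lists. The tokenize helper is hypothetical and not part of skifts; it only lowercases the text and keeps runs of letters and digits, with no stemming.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Hypothetical helper: lowercase the text and split it into words."""
    # Keep runs of ASCII letters and digits; punctuation and spaces are dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


raw_texts = [
    "Wait, Mister Postman!",
    "Please, Mister Postman, look and see.",
]

# One list of words per document, ready to be passed on as `documents`.
documents = [tokenize(text) for text in raw_texts]
print(documents)
```

Real text will usually need more than this (stemming, stop words, non-ASCII letters), which is why that step is left to the caller.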

The ranking takes into account the frequency of words in the document and the rarity of words in the corpus. The word order in the document and the distance between words do not matter.

Implementation details

The search is built on the scikit-learn library: documents are ranked using tf-idf weights and cosine similarity.
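To make the idea concrete, here is a pure-Python sketch of ranking by tf-idf and cosine similarity. It is not the skifts code and does not call scikit-learn; the function names and the smoothed idf formula are assumptions chosen for illustration, so the exact scores may differ from what the library computes.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return ({word: weight} per document, idf table). Illustrative only."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in documents for word in set(doc))
    # Assumed smoothed idf variant: rarer words get larger weights.
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()}
               for doc in documents]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(documents, query):
    """Return indexes of matching documents, most similar first."""
    vectors, idf = tfidf_vectors(documents)
    # Query words that never occur in the corpus are ignored.
    query_vec = {w: tf * idf[w] for w, tf in Counter(query).items() if w in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: -pair[0])
    return [i for score, i in scored if score > 0]


documents = [
    ['N', 'A', 'M'],
    ['C', 'B', 'A'],
    ['X', 'Y'],
]
print(search(documents, ['A', 'B']))  # → [1, 0]
```

Because each document is reduced to word counts before scoring, shuffling the words inside a document does not change its rank.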

See also

The gifts package implements the same search, but in pure Python with no binary dependencies.
