transient-in-memory-semantic-search-engine

A transient, in-memory semantic search engine for small document collections, powered by sentence embeddings from an OpenAI Embeddings-compatible API. That includes any provider that exposes such an API with compatible models.

Features

  • Simple in-memory construction from mapping or iterable of key-value pairs.
    • Values must be of type Text.
  • Uses cosine similarity for nearest neighbor ranking.
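
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine-similarity ranking. It is an illustration, not the package's actual implementation: the helper names cosine_similarity and rank_by_cosine are hypothetical, and the toy 3-dimensional vectors stand in for real embeddings returned by the API.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vector, keyed_vectors):
    """Return (similarity, key) pairs sorted by similarity, highest first."""
    scored = [
        (cosine_similarity(query_vector, vector), key)
        for key, vector in keyed_vectors
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors standing in for document embeddings (hypothetical data).
toy_vectors = [
    ('a', [1.0, 0.0, 0.0]),
    ('b', [0.8, 0.6, 0.0]),
    ('c', [0.0, 0.0, 1.0]),
]
ranking = rank_by_cosine([1.0, 0.0, 0.0], toy_vectors)
for score, key in ranking:
    print('Key: %s, Similarity: %.4f' % (key, score))
```

The (similarity, key) ordering of each pair mirrors the result format described in the Example section.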

Installation

pip install transient-in-memory-semantic-search-engine

Example

# coding=utf-8
from __future__ import print_function, unicode_literals
from transient_in_memory_semantic_search_engine import TransientInMemorySemanticSearchEngine

# For this example, api_key and base_url are ignored by the dummy backend.
engine = TransientInMemorySemanticSearchEngine(
    api_key='OPENAI_API_KEY',
    base_url='https://api.openai.com/v1',
    model='text-embedding-ada-002',
    key_value_mapping_or_key_value_pairs=[
        ('a', 'The quick brown fox jumps over the lazy dog.'),
        ('b', 'A fast, dark-colored fox leaped above a sleeping canine.'),
        ('c', 'Unrelated sentence about software engineering.'),
    ]
)

# Search with a query:
query = 'A fox jumping over a dog'
results = engine(query)

# Results are returned as a list of (similarity, key) pairs, sorted by similarity descending.
print('Results:')
for score, key in results:
    print('Key: %s, Similarity: %.4f' % (key, score))

Output:

Results:
Key: b, Similarity: 0.9236
Key: a, Similarity: 0.9130
Key: c, Similarity: 0.7434

Notes

  • Values must be of type Text.
  • All document embeddings are precomputed at construction, which is fast for moderately sized collections.
  • Suitable for prototyping and small-scale applications.

Command-line Usage

You can also run semantic search from the command line:

python -m transient_in_memory_semantic_search_engine \
    --api-key YOUR_API_KEY \
    --base-url https://api.openai.com/v1 \
    --model text-embedding-ada-002 \
    --key-value-json path/to/your_documents.json

Here, your_documents.json is a file that maps keys to text values, for example:

{
    "doc1": "The quick brown fox jumps over the lazy dog.",
    "doc2": "A fast, dark-colored fox leaped above a sleeping canine.",
    "doc3": "Unrelated sentence about software engineering."
}
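
If your documents already live in Python, a file in this shape can be produced with the standard json module. The sketch below assumes Python 3, where Text corresponds to str. The helper name write_documents is hypothetical, and the temporary directory is used only so the example does not touch your working directory.

```python
import io
import json
import os
import tempfile

documents = {
    'doc1': 'The quick brown fox jumps over the lazy dog.',
    'doc2': 'A fast, dark-colored fox leaped above a sleeping canine.',
    'doc3': 'Unrelated sentence about software engineering.',
}


def write_documents(path, mapping):
    """Write a key-to-text mapping as UTF-8 JSON, rejecting non-text values."""
    for key, value in mapping.items():
        if not isinstance(value, str):
            raise TypeError('Value for key %r is not text' % (key,))
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(json.dumps(mapping, ensure_ascii=False, indent=4))


# Write to a temporary location and read the file back as a sanity check.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'your_documents.json')
write_documents(path, documents)
with io.open(path, encoding='utf-8') as handle:
    loaded = json.load(handle)
```

The resulting path can then be passed to --key-value-json.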

Once started, you will be prompted:

Enter a query:

Type your search query. The engine will print a ranked list of matches:

score,key
0.9236,"doc2"
0.9130,"doc1"
0.7434,"doc3"
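
If you capture this output in a script, it can be read back with the standard csv module. This is a sketch under the assumption that the ranked list is plain CSV with the header row shown above. The helper name parse_ranked_output is hypothetical, and any prompt text mixed into the captured output would need to be stripped first.

```python
import csv
import io

# Captured output in the format shown above (sample data from this README).
cli_output = u'score,key\n0.9236,"doc2"\n0.9130,"doc1"\n0.7434,"doc3"\n'


def parse_ranked_output(text):
    """Parse 'score,key' CSV text into a list of (score, key) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row['score']), row['key']) for row in reader]


parsed = parse_ranked_output(cli_output)
best_score, best_key = parsed[0]
print('Best match: %s (%.4f)' % (best_key, best_score))
```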

Contributing

Contributions are welcome! Please submit pull requests or open issues on the GitHub repository.

License

This project is licensed under the MIT License.
