Easy-to-use hybrid index for semantic + keyword search

These details have not been verified by PyPI

Project links

Homepage

Project description

🌌 HybridIndex

HybridIndex is an easy-to-use Python library that lets you add a retrieval engine to your AI apps.

It uses hybrid search (semantic + keyword), which outperforms semantic search and keyword search used individually.

With HybridIndex, you can build LLM-powered apps without the headaches of maintaining your own retrieval engine, while outperforming the current alternatives and being free to use.

📥 Installation

pip install hybrid-index

📙 Usage

To use the index, first create a new instance and provide it with a name and an OpenAI API key (used to embed the documents).

from hybrid_index import HybridIndex

index = HybridIndex('name', 'your-openai-key')

Then, we can start adding documents to the index. Each document in the list is a tuple containing an id (can be whatever you want, but keep it unique) and the document body:

# `await` must run inside an async function (or a REPL/notebook with top-level await)
await index.add([
    ('a1', 'The sun rises in the east and sets in the west.'),
    ('a2', 'Elephants are the largest land animals on Earth.'),
    ('a3', 'Water boils at 100 degrees Celsius.'),
    ('a4', 'The capital of France is Paris.'),
    ('a5', 'The Great Wall of China is a famous landmark.'),
])

Finally, we can search the index specifying the query and the number of results to retrieve (in our case 3):

results = await index.query('What are the largest land animals on Earth?', 3)

The above code returns a list of the top 3 ids ordered by similarity:

['a2', 'a4', 'a1']
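
The snippets above use `await`, which only works inside a coroutine or in an environment that supports top-level await. Below is a minimal sketch of driving the same calls from a plain script, assuming `add` and `query` are coroutines, as the `await` usage suggests. The helper names `build_and_query` and `run_build_and_query` are hypothetical and not part of the library.

```python
import asyncio


async def build_and_query(index, documents, question, k):
    """Add documents to an index, then return the top-k ids for a question.

    `index` is assumed to expose awaitable `add(documents)` and
    `query(question, k)` methods, as in the usage shown above.
    """
    await index.add(documents)
    return await index.query(question, k)


def run_build_and_query(index, documents, question, k):
    """Synchronous entry point for scripts that have no running event loop."""
    return asyncio.run(build_and_query(index, documents, question, k))
```

With a real index this might be called as `run_build_and_query(HybridIndex('name', 'your-openai-key'), documents, 'What are the largest land animals on Earth?', 3)`.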

The index can be serialized and deserialized (so that it can be saved) using pickle:

pickled_index = index.serialize()

# After closing the session and saving the pickled index, we can load it back and rebuild the index

index2 = HybridIndex.deserialize(pickled_index, 'your-openai-key')

For security, your OpenAI key is not pickled alongside the index, so it must be supplied again at deserialization time.
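
To keep the index between sessions, the serialized value has to be written somewhere. The following is a small sketch that assumes `serialize()` returns a `bytes` object, which the description implies by mentioning pickle but does not state outright. The helper names `save_index_bytes` and `load_index_bytes` are hypothetical.

```python
from pathlib import Path


def save_index_bytes(pickled_index: bytes, path: str) -> None:
    """Write an already-serialized index to disk (assumes a bytes payload)."""
    Path(path).write_bytes(pickled_index)


def load_index_bytes(path: str) -> bytes:
    """Read a serialized index back, ready to pass to a deserializer."""
    return Path(path).read_bytes()
```

Under that assumption, saving would look like `save_index_bytes(index.serialize(), 'index.pkl')` and loading like `HybridIndex.deserialize(load_index_bytes('index.pkl'), 'your-openai-key')`. Because the payload is a pickle, only load files that come from a source you trust.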

⚙️ How it works

This index uses a technique called hybrid search, which aims to address the shortcomings of semantic search (which is unable to find relevant keywords) and keyword search (which does not understand the meaning of the documents and the query).

To achieve this goal, the HybridIndex stores two indices:

A faiss index for semantic search with cosine similarity
A modified BM25O+ index that measures keyword similarity
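
For intuition on the semantic side, cosine similarity between two embeddings is the inner product of the vectors after each is scaled to unit length. The sketch below shows that computation in pure Python. It is an illustration only: it is not the faiss code path, and how HybridIndex configures faiss is not described here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity as the inner product of the two normalized vectors."""
    a_unit = l2_normalize(a)
    b_unit = l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))
```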

The BM25O+ index is a modified version of BM25+ that uses $\dfrac{\text{corpus size}}{\text{frequency}^2}$ to calculate the IDF. This formula has the effect of putting an emphasis on rare keywords while the semantic search takes care of the rest.
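
The effect of squaring the frequency can be seen in a small sketch of the IDF term alone. It assumes that "frequency" means the number of documents containing the term and that documents are already tokenized; neither detail is spelled out above. The function names are hypothetical, and the rest of the BM25O+ scoring is not reproduced.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each term, how many documents contain it at least once."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))
    return counts


def squared_frequency_idf(tokenized_docs):
    """IDF = corpus size / frequency**2 (frequency assumed to be document frequency)."""
    corpus_size = len(tokenized_docs)
    frequencies = document_frequencies(tokenized_docs)
    return {term: corpus_size / (freq ** 2) for term, freq in frequencies.items()}
```

In a four-document corpus, a term found in one document gets an IDF of 4.0, while a term found in three documents gets 4/9, so rare terms dominate the keyword score.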

Both indices are queried, and the two scores for each document are then combined using $\alpha \cdot \text{cosine score} + (1 - \alpha) \cdot \text{BM25O+ score}$.

The weight $\alpha$ defaults to 0.7 (chosen after a bit of experimentation), but it can be set when initializing the index:

index = HybridIndex('name', 'your-openai-key', a=0.6)

Finally, the top n results are returned to the user.
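
The weighting and top-n selection described in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's code: both score sets are assumed to be on comparable scales, a document missing from one index is scored 0 there, and ties are broken by id. The function name `hybrid_top_n` is hypothetical.

```python
def hybrid_top_n(cosine_scores, keyword_scores, n, alpha=0.7):
    """Rank document ids by alpha * cosine score + (1 - alpha) * keyword score.

    Both arguments map document id -> score. A document absent from one
    mapping is treated as scoring 0.0 there (an assumption of this sketch).
    """
    doc_ids = set(cosine_scores) | set(keyword_scores)
    combined = {
        doc_id: alpha * cosine_scores.get(doc_id, 0.0)
        + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; ties resolved by id for a stable order.
    ranked = sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))
    return ranked[:n]
```

Lowering `alpha` shifts weight toward exact keyword matches, while raising it favors semantic closeness.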

🪪 License

HybridIndex is licensed under the MIT License. See the LICENSE file for more details.

Release history

0.1.0 (Jun 2, 2023)
0.0.11 (Jun 2, 2023)
0.0.8 (Jun 3, 2023)
0.0.7 (Jun 3, 2023)
0.0.6 (Jun 3, 2023), this version
0.0.5 (Jun 2, 2023)
0.0.4 (Jun 2, 2023)
0.0.3 (Jun 2, 2023)

Download files

Source Distribution

hybrid_index-0.0.6.tar.gz (5.5 kB), uploaded Jun 3, 2023

Built Distribution

hybrid_index-0.0.6-py3-none-any.whl (5.7 kB), uploaded Jun 3, 2023, Python 3

Hashes for hybrid_index-0.0.6.tar.gz
Algorithm	Hash digest
SHA256	`27c147967b66fc44c0a0a01021d06deeb867027a25e201d1f74a950a8a331bc8`
MD5	`2a24c03586cc979c79427f2f13bfcae2`
BLAKE2b-256	`945064dea71432612b2d40215967c99463d189d2d842367f48f3069d26375004`

Hashes for hybrid_index-0.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5309a5d20f6ab2e1efd1f17d632f4863181b9246a9f494840bffaff9326fb4fe`
MD5	`747eae814fb7a99395ca3e02a3af8d5d`
BLAKE2b-256	`0b7ca00362ef556a10d0a65c026bcddf1423f66ac2edebdd1051f00fc6a53e18`