Hybrid search with OpenSearch and LangChain
Project description
hybrid search instructions
Created by following this article: https://opensearch.org/blog/hybrid-search/
Installation:
Set up the environment with conda or micromamba:
micromamba create -f environment.yaml
micromamba activate hybrid_search
For OpenSearch itself, there are several installation options.
From docker-compose
This repository ships with a two-node test OpenSearch cluster together with a dashboard.
Optional: adjust OPENSEARCH_JAVA_OPTS=-Xms2512m -Xmx2512m to the RAM you have available; it is usually recommended to keep the two values equal in size. Then start docker-compose:
docker compose up
Open http://localhost:5601/ to explore the dashboard; by default, "admin" is used as both the user name and the password.
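The heap setting mentioned above can be generated instead of typed by hand. This is a minimal sketch, assuming a hypothetical helper named `java_opts` that is not part of this repository; it simply keeps the initial heap (-Xms) and the maximum heap (-Xmx) at the same value:

```python
def java_opts(heap_mb: int) -> str:
    """Build an OPENSEARCH_JAVA_OPTS value with equal initial and maximum heap.

    Hypothetical helper for illustration; not part of this repository.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    # -Xms is the initial heap size and -Xmx the maximum heap size.
    # Keeping them equal avoids heap resizing while the node is running.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    # Print a line that can be pasted into the docker-compose environment.
    print(f"OPENSEARCH_JAVA_OPTS={java_opts(2512)}")
```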
Manual installation
- Go to https://opensearch.org/downloads.html, choose the installation variant you like, and download OpenSearch. OpenSearch Dashboards is a convenient tool but is not mandatory.
- Install the latest version of Java.
- For Windows, unpack the archive. In opensearch_folder/config/opensearch.yml, make sure that plugins.security.ssl.http.enabled is set to true: OpenSearch works correctly only with SSL enabled, although some functionality is still available over plain HTTP. Then launch opensearch-windows-install.bat; despite its name, it is not an installer but the main launcher.
- For Linux, use Docker or follow the instructions in the documentation.
Usage:
- Launch OpenSearch, either with docker-compose or with Java.
- Activate the environment:
micromamba activate hybrid_search # activate the environment
pip install -e . # [optional] install the current package locally
- Launch index.py to index the test dataset. It creates an index and a pipeline for hybrid search.
- Launch search.py to perform a test search.
python index.py # index the test dataset
python search.py # search, using the default query
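To give an idea of what "an index and pipeline for hybrid search" involves, here is a sketch of the two request bodies described in the OpenSearch hybrid search article: a search pipeline with a normalization-processor, and a hybrid query that combines a keyword sub-query with a k-NN sub-query. The field names `text` and `vector_field`, the weights, and the function names are assumptions made for illustration; index.py may use different values.

```python
import json


def normalization_pipeline(weights: tuple[float, float] = (0.3, 0.7)) -> dict:
    """Body for PUT /_search/pipeline/<pipeline-name>.

    The processor rescales the scores of each sub-query (min_max) and then
    combines them with a weighted arithmetic mean. The default weights are
    illustrative assumptions, not values taken from this repository.
    """
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
    }


def hybrid_query(text: str, vector: list[float], k: int = 5) -> dict:
    """Search body with one keyword (match) and one vector (knn) sub-query.

    "text" and "vector_field" are assumed field names.
    """
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"text": {"query": text}}},
                    {"knn": {"vector_field": {"vector": vector, "k": k}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(normalization_pipeline(), indent=2))
    print(json.dumps(hybrid_query("rs123456789", [0.1, 0.2, 0.3]), indent=2))
```

The hybrid query only takes effect when the search request names the pipeline, for example with the `search_pipeline` query parameter. The order of the weights follows the order of the sub-queries.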
You can also tune index.py parameters. For example:
python index.py main --url https://agingkills.eu:9200 --user admin --password admin --index_name index-bge-test_rsids_10k --embedding BAAI/bge-base-en-v1.5
If you want to use another embedding, for example specter2, try:
python index.py specter2
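The --url, --user and --password options shown above describe how to reach the cluster. As a rough sketch, assuming the opensearch-py client is used (index.py may connect differently, for example through LangChain), those three values could be turned into client arguments like this. `client_kwargs` is a hypothetical helper, not part of this repository:

```python
from urllib.parse import urlparse


def client_kwargs(url: str, user: str, password: str, verify_certs: bool = True) -> dict:
    """Translate a cluster URL and credentials into keyword arguments.

    The result is intended for opensearchpy.OpenSearch(**kwargs); this
    function itself uses only the standard library.
    """
    parsed = urlparse(url)
    use_ssl = parsed.scheme == "https"
    return {
        # Fall back to 9200, the default OpenSearch HTTP port.
        "hosts": [{"host": parsed.hostname, "port": parsed.port or 9200}],
        "http_auth": (user, password),
        "use_ssl": use_ssl,
        # Certificate verification only makes sense when SSL is on.
        "verify_certs": use_ssl and verify_certs,
    }


if __name__ == "__main__":
    print(client_kwargs("https://localhost:9200", "admin", "admin", verify_certs=False))
```

With the self-signed demo certificates of a local test cluster, verify_certs usually has to be False; for a publicly hosted cluster it should stay True.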
Tests
RSID test
Text fragments are deliberately inserted into the tacutu papers data (/data/tacutopapers_test_rsids_10k). In particular, rs123456789 and rs123456788, as well as similar but misspelled rsids, are added to the documents:
- 10.txt contains both rsids twice
- 11.txt contains both rsids once
- 12.txt and 13.txt contain only one rsid each
- 20.txt contains both wrong rsids twice
- 21.txt contains both wrong rsids once
- 22.txt and 23.txt contain only one wrong rsid each
You can test them by:
python search.py test_rsids
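One property such a test can check is that documents with the exact rsid rank above documents with a misspelled one. The sketch below is an assumption about what a useful check could look like, not a description of what search.py test_rsids actually does; the function names are hypothetical, and the file lists come from the description above.

```python
import re

# Files from the test dataset description.
EXACT_FILES = {"10.txt", "11.txt", "12.txt", "13.txt"}  # correct rsids
WRONG_FILES = {"20.txt", "21.txt", "22.txt", "23.txt"}  # misspelled rsids


def count_rsid(text: str, rsid: str) -> int:
    """Count whole-token occurrences of an rsid in a text.

    Word boundaries keep a longer identifier such as rs1234567890 from
    being counted as a hit for rs123456789.
    """
    return len(re.findall(rf"\b{re.escape(rsid)}\b", text))


def exact_before_wrong(ranked_files: list[str]) -> bool:
    """True if every exact-match file ranks above every misspelled file.

    ranked_files is the list of file names in the order the search
    returned them, best hit first.
    """
    exact_pos = [i for i, name in enumerate(ranked_files) if name in EXACT_FILES]
    wrong_pos = [i for i, name in enumerate(ranked_files) if name in WRONG_FILES]
    if not exact_pos or not wrong_pos:
        # Without any exact hit the search failed; without any wrong hit
        # there is nothing that could outrank the exact ones.
        return bool(exact_pos)
    return max(exact_pos) < min(wrong_pos)
```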
Comics superheroes test
There is also a similar test for "Comics superheroes", which tests the embeddings:
- Only document 114 contains text about superheroes, and that text does not contain the words 'comics' or 'superheroes'
You can test them by:
python search.py test_heroes
Right now testing is not automated, so you have to run the tests from the CLI.
Troubleshooting
If something is not working with OpenSearch, read the log messages carefully. For example, if you are low on disk space, OpenSearch can block writes (the disk watermark issue), and the failure then surfaces with a different, unrelated-looking final error message.
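A quick way to rule out the watermark issue is to compare disk usage with the OpenSearch disk watermarks. The sketch below assumes the default thresholds (85% low, 90% high, 95% flood stage); these can be changed through the cluster.routing.allocation.disk.watermark.* settings, so a given cluster may use other values. The function names are hypothetical.

```python
import shutil

# Default OpenSearch disk watermarks as fractions of used disk space.
# Assumption: the cluster has not overridden these settings.
LOW, HIGH, FLOOD = 0.85, 0.90, 0.95


def watermark_status(used_fraction: float) -> str:
    """Classify disk usage against the default watermarks."""
    if used_fraction >= FLOOD:
        return "flood_stage"  # indices on the node are blocked for writes
    if used_fraction >= HIGH:
        return "high"  # OpenSearch tries to move shards off the node
    if used_fraction >= LOW:
        return "low"  # no new shards are allocated to the node
    return "ok"


def disk_used_fraction(path: str = ".") -> float:
    """Fraction of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = disk_used_fraction(".")
    print(f"{fraction:.1%} used: {watermark_status(fraction)}")
```

Point the path at the filesystem that holds the OpenSearch data directory (or the Docker volume) rather than at the current directory if they differ.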
File details
Details for the file hybrid_search-0.0.15.tar.gz.
File metadata
- Download URL: hybrid_search-0.0.15.tar.gz
- Upload date:
- Size: 11.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 487c2329f730a475cff8c498f2108f5475788bfbf4c162802880209cf80833fd
MD5 | 602afe5537f7d043034a8a2395b98179
BLAKE2b-256 | 3bb0bf9a4107637c92bfa9b5fbc1f111b8c1b9dcf60a16587106ca759de49d1f
File details
Details for the file hybrid_search-0.0.15-py2.py3-none-any.whl.
File metadata
- Download URL: hybrid_search-0.0.15-py2.py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | b356a7b59829e0781bf39196a91868063b6c777144954a5c975b575e70d98f7b
MD5 | 6240179d79818f3cb8cda224d1617e80
BLAKE2b-256 | f78fd67556c03836b6443e8e2b1f2a418a3b721fb93df0218e1fe8e88e02f7d0