
Hybrid search with OpenSearch and Langchain

Project description

hybrid search instructions

Created following this article https://opensearch.org/blog/hybrid-search/

Installation:

Set up the environment with conda or micromamba:

micromamba create -f environment.yaml
micromamba activate hybrid_search

For the OpenSearch itself there are several installation options.

From docker-compose

This repository ships with a two-node OpenSearch test cluster together with a dashboard.

Optional: change OPENSEARCH_JAVA_OPTS=-Xms2512m -Xmx2512m according to your available RAM. It is usually recommended to keep the two values equal in size. Start docker-compose:

docker compose up

Open http://localhost:5601/ to explore the dashboard. By default, "admin" is used as both the user name and the password.
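Before indexing, it can help to confirm that the cluster itself is reachable, not only the dashboard. The following is a minimal sketch using only the Python standard library. It assumes the REST API listens on https://localhost:9200 (the usual OpenSearch port, not stated above for the docker-compose setup), that the default admin/admin credentials are in place, and that the test cluster uses a self-signed certificate. The function names are illustrative and are not part of this package.

```python
import base64
import json
import ssl
import urllib.request


def build_health_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the cluster health endpoint with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{url.rstrip('/')}/_cluster/health")
    request.add_header("Authorization", f"Basic {token}")
    return request


def cluster_status(url: str = "https://localhost:9200",
                   user: str = "admin",
                   password: str = "admin") -> str:
    """Return the cluster status string reported by OpenSearch."""
    # Assumption: the test cluster uses a self-signed certificate, so
    # certificate verification is switched off. Do not do this in production.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_health_request(url, user, password)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)["status"]
```

Calling cluster_status() against a running cluster should return "green", "yellow" or "red"; a connection error usually means the containers are still starting.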

Manual installation

  • Go to https://opensearch.org/downloads.html, download OpenSearch, and choose the installation variant you like. OpenSearch Dashboards is a convenient tool but not mandatory.
  • Install the latest Java.
  • For Windows, unpack the archive. In opensearch_folder/config/opensearch.yml, make sure that plugins.security.ssl.http.enabled is set to true: OpenSearch works correctly only with SSL on, even though some functionality is still available over plain HTTP. Launch opensearch-windows-install.bat. Despite the name, it is not an installer but the main launcher.
  • For Linux use docker or follow instructions in the documentation.
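The SSL setting mentioned above can be verified with a few lines of Python before launching. This is only a sketch: it assumes the setting is written as a flat dotted key on a single line, as in the default opensearch.yml, and it does not handle nested YAML. The function name is hypothetical.

```python
def http_ssl_enabled(config_text: str) -> bool:
    """Return True if plugins.security.ssl.http.enabled is set to true.

    Assumption: the key appears in flat dotted form on one line.
    """
    key = "plugins.security.ssl.http.enabled"
    for raw_line in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line.startswith(key):
            continue
        name, _, value = line.partition(":")
        if name.strip() == key:
            return value.strip().strip("'\"").lower() == "true"
    return False
```

To use it, read opensearch_folder/config/opensearch.yml into a string and pass it to http_ssl_enabled.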

Usage:

  • Launch OpenSearch, either with docker-compose or with Java.
  • Activate the environment and, optionally, install the current package locally:
micromamba activate hybrid_search # activate the environment
pip install -e . # [optional] install the current package locally
  • Run index.py to index the test dataset. It creates an index and a pipeline for hybrid search. Then run search.py to perform a test search:
python index.py # index the test dataset
python search.py # search, uses the default query
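To illustrate what "an index and pipeline for hybrid search" involves, here is a sketch of the two request bodies described in the OpenSearch hybrid search article: a search pipeline with a normalization processor, and a hybrid query that combines a lexical match clause with a k-NN clause. This is not necessarily what index.py and search.py send. The field names text and vector_field, the weights, and the function names are assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def normalization_pipeline(lexical_weight: float = 0.3,
                           semantic_weight: float = 0.7) -> Dict[str, Any]:
    """Body for PUT /_search/pipeline/<pipeline_name>.

    Scores of each sub-query are min-max normalized and then combined
    with a weighted arithmetic mean. The weights follow the order of
    the sub-queries in the hybrid query.
    """
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [lexical_weight, semantic_weight]},
                    },
                }
            }
        ],
    }


def hybrid_query(text: str,
                 vector: List[float],
                 k: int = 5,
                 text_field: str = "text",              # hypothetical field name
                 vector_field: str = "vector_field",    # hypothetical field name
                 ) -> Dict[str, Any]:
    """Body for GET /<index>/_search?search_pipeline=<pipeline_name>."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical (BM25) sub-query.
                    {"match": {text_field: {"query": text}}},
                    # Semantic sub-query over precomputed embeddings.
                    {"knn": {vector_field: {"vector": vector, "k": k}}},
                ]
            }
        },
    }
```

The query vector has to come from the same embedding model that was used at indexing time, otherwise the k-NN scores are meaningless.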

You can also tune index.py parameters. For example:

python index.py main --url https://agingkills.eu:9200 --user admin --password admin --index_name index-bge-test_rsids_10k --embedding BAAI/bge-base-en-v1.5

If you want to use another embedding, for example specter2, try:

python index.py specter2
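When indexing with several embeddings in a row, the command line above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the example; it makes no claims about other options of index.py. The helper name is hypothetical, and the URL in the usage note is a placeholder.

```python
import shlex
from typing import List


def index_command(url: str, user: str, password: str,
                  index_name: str, embedding: str) -> List[str]:
    """Build the argument list for `python index.py main` with the flags shown above."""
    return [
        "python", "index.py", "main",
        "--url", url,
        "--user", user,
        "--password", password,
        "--index_name", index_name,
        "--embedding", embedding,
    ]


def as_shell(command: List[str]) -> str:
    """Render the argument list as a safely quoted shell command."""
    return shlex.join(command)
```

The resulting list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting problems with unusual passwords or index names.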

Tests

RSID test

Text pieces are deliberately incorporated into the tacutu papers data (/data/tacutopapers_test_rsids_10k). In particular, rs123456789 and rs123456788, as well as similar but misspelled rsids, are added to the documents:

  • 10.txt contains both rsids twice
  • 11.txt contains both rsids once
  • 12.txt and 13.txt each contain only one rsid
  • 20.txt contains both misspelled rsids twice
  • 21.txt contains both misspelled rsids once
  • 22.txt and 23.txt each contain only one misspelled rsid

You can test them by:

python search.py test_rsids
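To check independently how often each rsid occurs in a given document, the text can be scanned for exact, whole-token matches. The sketch below is not part of search.py; the misspelled rsid in the usage note is a made-up example, since the actual misspellings used in the dataset are not listed here.

```python
import re
from typing import Dict, Iterable


def count_rsids(text: str, rsids: Iterable[str]) -> Dict[str, int]:
    """Count whole-token occurrences of each rsid in the text.

    The lookarounds reject matches embedded in a longer token, so a
    longer misspelled id does not count as a hit for the correct one.
    """
    return {
        rsid: len(re.findall(rf"(?<!\w){re.escape(rsid)}(?!\w)", text))
        for rsid in rsids
    }
```

For example, for the text "rs123456789 is linked to rs123456788. Again: rs123456789, rs123456788. Typo: rs1234567890", count_rsids reports two occurrences for each of the two correct rsids and ignores the hypothetical misspelling rs1234567890.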

Comics superheroes test

There is also a similar test for the query "Comics superheroes", which tests the embeddings:

  • Only document 114 has text about superheroes, and that text does not contain the words 'comics' or 'superheroes'

You can test them by:

python search.py test_heroes

Right now testing is not automated, so you have to call the CLI to run the tests.
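One possible way to automate such checks is to compare the ranked list of returned documents against the document that is expected to appear near the top. The sketch below assumes the search results have already been reduced to an ordered list of document names. The helper name and the document names in the usage note are hypothetical.

```python
from typing import Sequence


def in_top_k(ranked_docs: Sequence[str], expected: str, k: int = 3) -> bool:
    """Return True if the expected document is among the first k results."""
    return expected in list(ranked_docs)[:k]
```

For the superheroes test, a check could look like in_top_k(results, "114.txt", k=3), where results is the list of document names returned for the query "Comics superheroes".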

Troubleshooting

If something is not working with OpenSearch, read the log messages carefully. For example, if you are low on disk space, OpenSearch can block writes (the disk watermark issue), and the failure then surfaces with a different, unrelated-looking final error message.
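The watermark issue can be anticipated by checking disk usage before indexing. The sketch below compares the used fraction of a disk with the default OpenSearch disk watermarks (85% low, 90% high, 95% flood stage). These defaults are an assumption about your cluster, since they can be overridden in the cluster settings, and the function names are illustrative.

```python
import shutil

# Default disk watermarks; a cluster may override them in its settings.
LOW_WATERMARK = 0.85
HIGH_WATERMARK = 0.90
FLOOD_STAGE = 0.95


def disk_used_fraction(path: str = ".") -> float:
    """Return the used fraction (0.0 to 1.0) of the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def watermark_level(used_fraction: float) -> str:
    """Classify disk usage against the default watermarks."""
    if used_fraction >= FLOOD_STAGE:
        return "flood_stage"  # indexes on the node become read-only
    if used_fraction >= HIGH_WATERMARK:
        return "high"         # shards are relocated away from the node
    if used_fraction >= LOW_WATERMARK:
        return "low"          # no new shards are allocated to the node
    return "ok"
```

Run watermark_level(disk_used_fraction(path)) with the path of the OpenSearch data directory (or of the Docker volume); any result other than "ok" is a likely cause of failed writes.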
