Hybrid search with OpenSearch and Langchain

hybrid search instructions

Created following this article https://opensearch.org/blog/hybrid-search/

Installation:

Set up the environment with conda or micromamba:


micromamba create -f environment.yaml

micromamba activate hybrid_search

For OpenSearch itself there are several installation options.

From docker-compose

This repository ships with a test two-node OpenSearch cluster together with a dashboard.

Optional: change OPENSEARCH_JAVA_OPTS=-Xms2512m -Xmx2512m according to your available RAM. It is usually recommended to keep the two values (-Xms and -Xmx) equal in size.
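As a minimal sketch of the sizing advice above, the helper below builds an OPENSEARCH_JAVA_OPTS value with identical minimum and maximum heap sizes. The function name java_opts and the choice to pass the heap size in megabytes are illustrative assumptions, not part of this repository.

```python
def java_opts(heap_mb: int) -> str:
    """Return an OPENSEARCH_JAVA_OPTS value with equal -Xms and -Xmx (hypothetical helper)."""
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    # The same number is used for the initial (-Xms) and maximum (-Xmx) heap size.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    # Prints the line in the form used by the docker-compose environment section.
    print("OPENSEARCH_JAVA_OPTS=" + java_opts(2512))
```

The printed line can be pasted into the environment section of the docker-compose file in place of the default value.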

Start docker-compose:

docker compose up

Open http://localhost:5601/ to explore the dashboard. By default, "admin" is used as both the user name and the password.
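Besides the dashboard, it can be useful to check that the cluster itself answers. The sketch below prepares a request to the _cluster/health REST endpoint using HTTP basic authentication. It assumes that the REST API is reachable at https://localhost:9200 and accepts the same default "admin" credentials; both are assumptions about the docker-compose setup, and the function names are hypothetical.

```python
import base64
import json
import ssl
import urllib.request


def build_health_request(base_url, user="admin", password="admin"):
    """Prepare (but do not send) a GET request for the cluster health endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/_cluster/health")
    request.add_header("Authorization", "Basic " + token)
    return request


def fetch_health(base_url, user="admin", password="admin"):
    """Send the request; certificate checks are disabled, so use this for local test clusters only."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_health_request(base_url, user, password)
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the cluster running, fetch_health("https://localhost:9200") should return a dictionary that includes a "status" field. Certificate verification is switched off here only because a local test cluster typically uses a self-signed certificate.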

Manual installation

  • Go to https://opensearch.org/downloads.html, download OpenSearch, and choose the installation variant you like. OpenSearch Dashboards is a convenient tool but not mandatory.

  • Install the latest Java

  • For Windows, unpack the archive. In opensearch_folder/config/opensearch.yml make sure that plugins.security.ssl.http.enabled: true is set, because OpenSearch works correctly only with SSL on, even though some functionality is still available over plain HTTP. Launch opensearch-windows-install.bat; despite its name, it is not an installer but the main launcher.

  • For Linux use docker or follow instructions in the documentation.

Usage:

  • Launch OpenSearch either with docker-compose or with Java.

  • Activate the environment:

micromamba activate hybrid_search #to activate environment

pip install -e . #[optional] install current package locally

  • Launch index.py to index the test dataset. It creates an index and a pipeline for hybrid search:

python index.py #to index

  • Launch search.py to perform a test search:

python search.py # to search, uses default query
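To give an idea of what "an index and a pipeline for hybrid search" involves, here is a minimal sketch of the two JSON bodies that OpenSearch hybrid search relies on: a search pipeline with a normalization-processor, and a hybrid query that combines a lexical match clause with a k-NN clause. The field names text and embedding, the weights, and the function names are assumptions made for illustration; what index.py and search.py actually send may differ.

```python
import json


def build_pipeline_body(lexical_weight=0.3, semantic_weight=0.7):
    """Body of a search pipeline that normalizes and combines sub-query scores."""
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # One weight per sub-query, in the same order as in the hybrid query.
                        "parameters": {"weights": [lexical_weight, semantic_weight]},
                    },
                }
            }
        ],
    }


def build_hybrid_query(text, vector, text_field="text", vector_field="embedding", k=5):
    """Body of a hybrid query: lexical match first, k-NN vector search second."""
    return {
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {text_field: {"query": text}}},
                    {"knn": {vector_field: {"vector": list(vector), "k": k}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline_body(), indent=2))
    # The three-element vector is a placeholder; a real one comes from the embedding model.
    print(json.dumps(build_hybrid_query("comics superheroes", [0.1, 0.2, 0.3]), indent=2))
```

In the OpenSearch REST API the first body is sent with PUT /_search/pipeline/<pipeline-name>, and the query is run against an index with the search_pipeline=<pipeline-name> request parameter.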

You can also tune index.py parameters. For example:


python index.py main --url https://agingkills.eu:9200 --user admin --password admin --index_name index-bge-test_rsids_10k --embedding BAAI/bge-base-en-v1.5



If you want to use another embedding, for example specter2, try:

python index.py specter2

Tests

RSID test

Some text fragments have been deliberately inserted into the tacutu papers data (/data/tacutopapers_test_rsids_10k).

In particular, the rsids rs123456789 and rs123456788, as well as similar but misspelled rsids, have been added to the documents:

  • 10.txt contains both rsids twice

  • 11.txt contains both rsids once

  • 12.txt and 13.txt each contain only one rsid

  • 20.txt contains both wrong rsids twice

  • 21.txt contains both wrong rsids once

  • 22.txt and 23.txt each contain only one wrong rsid

You can test them by:


python search.py test_rsids
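The point of this test is that an exact identifier must not be confused with a near-identical one. The sketch below shows the kind of exact-token counting that separates a correct rsid from a misspelled variant. The sample text and the misspelling rs12345678 are invented for illustration (the misspelled rsids actually used in the dataset are not listed here), and this is not how search.py scores documents.

```python
import re


def count_rsid(text, rsid):
    """Count exact occurrences of an rsid, ignoring longer or shorter look-alikes."""
    pattern = r"\b" + re.escape(rsid) + r"\b"
    return len(re.findall(pattern, text))


# Hypothetical sample text; rs12345678 stands in for a misspelled rsid.
SAMPLE = (
    "Variant rs123456789 was reported; rs123456789 and rs123456788 co-occur. "
    "Typo: rs12345678."
)

if __name__ == "__main__":
    for rsid in ("rs123456789", "rs123456788", "rs12345678"):
        print(rsid, count_rsid(SAMPLE, rsid))
```

The word boundaries in the pattern keep the misspelled rs12345678 from matching inside rs123456789, which is the behaviour a lexical search component is expected to provide for identifiers.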

Comics superheroes test

There is also a similar test for "Comics superheroes" that exercises the embeddings:

  • Only document 114 contains text about superheroes, and that text does not contain the words 'comics' or 'superheroes'

You can test them by:


python search.py test_heroes
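The reason a document without the query words can still rank first is that hybrid search combines a lexical score with a semantic score. The sketch below works through one common configuration, min-max normalization followed by a weighted arithmetic mean, on toy numbers. The scores and weights are invented for illustration, not measured on this dataset, and the code simplifies what OpenSearch does (every document is assumed to have a score from both sub-queries).

```python
def min_max(scores):
    """Rescale scores to the range [0, 1]."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores equal: returning 1.0 is an arbitrary choice of this sketch.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def combine(lexical, semantic, weights=(0.3, 0.7)):
    """Weighted arithmetic mean of the normalized lexical and semantic scores."""
    w_lex, w_sem = weights
    lex_n, sem_n = min_max(lexical), min_max(semantic)
    return [(w_lex * l + w_sem * s) / (w_lex + w_sem) for l, s in zip(lex_n, sem_n)]


# Toy scores for three documents; the third shares no keywords with the query
# (lexical score 0.0) but is semantically close to it.
LEXICAL = [4.0, 2.0, 0.0]
SEMANTIC = [0.2, 0.1, 0.9]

if __name__ == "__main__":
    combined = combine(LEXICAL, SEMANTIC)
    best = max(range(len(combined)), key=combined.__getitem__)
    print("combined scores:", combined)
    print("top document index:", best)
```

With these toy numbers the third document ends up on top even though its lexical score is zero, which is the behaviour the superheroes test is meant to check.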

Testing is not automated yet, so you have to run the tests from the CLI.

Troubleshooting

If something is not working with OpenSearch, read the log messages carefully. For example, if you are low on disk space, OpenSearch can block writes (the disk watermark issue), and the failure then shows up with a different final error message.
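A quick way to rule out the watermark issue is to compare disk usage against the OpenSearch disk watermarks. The sketch below assumes the default thresholds (85% low, 90% high, 95% flood stage) have not been overridden in the cluster settings; the function names and the path argument are hypothetical, and the way OpenSearch itself measures disk usage may differ slightly from shutil.

```python
import shutil

# Default OpenSearch disk watermarks, as fractions of used disk space (assumed not overridden).
LOW, HIGH, FLOOD = 0.85, 0.90, 0.95


def watermark_status(used_fraction):
    """Classify a used-disk fraction against the watermark thresholds."""
    if used_fraction >= FLOOD:
        return "flood_stage"  # indices on the node get a read-only block
    if used_fraction >= HIGH:
        return "high"  # shards are relocated away from the node
    if used_fraction >= LOW:
        return "low"  # no new shards are allocated to the node
    return "ok"


def disk_used_fraction(path="."):
    """Fraction of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = disk_used_fraction(".")
    print(f"{fraction:.1%} used -> {watermark_status(fraction)}")
```

When OpenSearch runs in docker, point the path at the volume that actually holds the OpenSearch data, since that is the disk the watermarks apply to.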
