
indexpaper

A project devoted to indexing papers in vector databases.

It was originally part of getpaper but now has no dependencies on it.

We provide features to index papers with OpenAI or llama embeddings and save them in a ChromaDB vector store. For OpenAI embeddings to work you have to create a .env file and specify your OpenAI key there; see .env.template as an example.

For example, if you have your papers inside the data/output/test/papers folder and you want to build a ChromaDB index at data/output/test/index, you can do it with:

indexpaper/index.py index_papers --papers data/output/test/papers --folder data/output/test/index --collection mypapers --chunk_size 6000
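For illustration, here is a minimal sketch of doing the same from Python, assuming langchain, python-dotenv and plain-text papers (a hypothetical example, not necessarily how index.py works internally):

from pathlib import Path
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

load_dotenv()  # picks up OPENAI_API_KEY from .env

# hypothetical: papers stored as plain-text files
texts = [p.read_text() for p in Path("data/output/test/papers").glob("*.txt")]
chunks = RecursiveCharacterTextSplitter(chunk_size=6000, chunk_overlap=0).create_documents(texts)

db = Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(),
    collection_name="mypapers",
    persist_directory="data/output/test/index",
)
db.persist()  # write the index to disk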

It is possible to use both Chroma and Qdrant. To use Qdrant, we provide a docker-compose file to set it up:

cd services
docker compose -f docker-compose.yaml up

Then you can run paper indexing with Qdrant:

indexpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection mypapers --chunk_size 6000 --database Qdrant

You can also check whether documents were added to the collection in the Qdrant web UI at http://localhost:6333/dashboard
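If you prefer to check from code, here is a minimal sketch using the qdrant-client library (assuming the collection name used above):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())                    # all collections on the server
print(client.count(collection_name="mypapers"))    # number of points indexed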

Indexing with Llama-2 embeddings

You can also use llama-2 embeddings if you install llama-cpp-python and pass a path to the model, for example for the https://huggingface.co/TheBloke/Llama-2-13B-GGML model:

indexpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection papers_llama2_2000 --chunk_size 2000 --database Qdrant --embeddings llama --model /home/antonkulaga/sources/indexpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin

Instead of explicitly passing the model path, you can also set LLAMA_MODEL in the .env file:

LLAMA_MODEL="/home/antonkulaga/sources/indexpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin"
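As an illustration of what such embeddings look like from code, here is a minimal sketch assuming langchain and python-dotenv (hypothetical, not the exact code path of index.py):

import os
from dotenv import load_dotenv
from langchain.embeddings import LlamaCppEmbeddings

load_dotenv()  # expects LLAMA_MODEL=... in .env
embeddings = LlamaCppEmbeddings(model_path=os.environ["LLAMA_MODEL"])
vector = embeddings.embed_query("caloric restriction and longevity")
print(len(vector))  # dimensionality of the llama embedding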

Note: if you want to use Qdrant Cloud you do not need docker-compose, but you need to provide a key and check your Qdrant Cloud settings for the URL to use.

indexpaper/index.py index_papers --papers data/output/test/papers --url https://5bea7502-97d4-4876-98af-0cdf8af4bd18.us-east-1-0.aws.cloud.qdrant.io:6333 --key put_your_key_here --collection mypapers --chunk_size 6000 --database Qdrant
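From code, connecting to Qdrant Cloud only differs by the api_key; a minimal qdrant-client sketch (the cluster URL below is a placeholder, take yours from the cloud settings):

from qdrant_client import QdrantClient

client = QdrantClient(
    url="https://YOUR-CLUSTER.cloud.qdrant.io:6333",  # from Qdrant Cloud settings
    api_key="put_your_key_here",
)
print(client.get_collections())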

Note: there are temporary issues with llama embeddings.

Examples

You can run example.py to see usage examples and also to evaluate embeddings.

For example, if you want to evaluate how fast embeddings are computed on Robi Tacutu's papers, you can run:

python example.py preload

to download the dataset and model, and then:

python example.py evaluate --model intfloat/multilingual-e5-large --dataset longevity-genie/tacutu_papers

To measure time:

python example.py measure --model intfloat/multilingual-e5-large --dataset longevity-genie/tacutu_papers
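If you want a quick ad-hoc timing outside of example.py, here is a minimal sketch with langchain's HuggingFaceEmbeddings (the batch of texts below is hypothetical; e5 models expect a "passage: " prefix):

import time
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
texts = ["passage: some paper paragraph"] * 32  # hypothetical batch of chunks

start = time.perf_counter()
embeddings.embed_documents(texts)
print(f"{time.perf_counter() - start:.2f}s for {len(texts)} chunks")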

Indexing a dataset

To index a dataset you can use either the index.py dataset subcommand or look at the papers.ipynb example notebook to see how to do it in code. For example, suppose we want to index the "longevity-genie/tacutu_papers" Hugging Face dataset using the "michiyasunaga/BioLinkBERT-large" Hugging Face embedding model, with "cuda" as the device and 10 papers per slice, and write it to a local Qdrant instance at http://localhost:6333 (see services for the docker-compose file):

python indexpaper/index.py dataset --collection biolinkbert_512_tacutu_papers --dataset "longevity-genie/tacutu_papers" --url http://localhost:6333 --model michiyasunaga/BioLinkBERT-large --slice 10 --chunk_size 512 --device cuda
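A minimal sketch of the same idea in code, assuming langchain and the datasets library (the "text" column name is hypothetical, and the 512-token chunking is omitted for brevity):

from datasets import load_dataset
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

dataset = load_dataset("longevity-genie/tacutu_papers", split="train")
texts = dataset["text"][:10]  # hypothetical column name; one slice of 10 papers

embeddings = HuggingFaceEmbeddings(
    model_name="michiyasunaga/BioLinkBERT-large",
    model_kwargs={"device": "cuda"},
)
Qdrant.from_texts(
    texts,
    embeddings,
    url="http://localhost:6333",
    collection_name="biolinkbert_512_tacutu_papers",
)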

Another example: if we want to index the "longevity-genie/moskalev_papers" Hugging Face dataset using the "michiyasunaga/BioLinkBERT-large" Hugging Face embedding model, with "cuda" as the device and 10 papers per slice, and use our Qdrant Cloud key (fill in QDRANT_KEY or set it as an environment variable):

python indexpaper/index.py dataset --collection biolinkbert_512_moskalev_papers --dataset "longevity-genie/moskalev_papers" --url https://5bea7502-97d4-4876-98af-0cdf8af4bd18.us-east-1-0.aws.cloud.qdrant.io:6333 --key QDRANT_KEY --model michiyasunaga/BioLinkBERT-large --slice 10 --chunk_size 512 --device cuda

Another example: Robi Tacutu's papers on CPU, using QDRANT_KEY, your cluster URL (put yours) and the BioLORD embeddings model:

python indexpaper/index.py dataset --url https://5bea7502-97d4-4876-98af-0cdf8af4bd18.us-east-1-0.aws.cloud.qdrant.io --collection biolord_512_tacutu_papers --dataset "longevity-genie/tacutu_papers" --key QDRANT_KEY --model FremyCompany/BioLORD-STAMB2-v1 --slice 10 --chunk_size 512 --device cpu

Running local Qdrant

We provide a docker-compose configuration to run a local Qdrant instance (you can also use Qdrant Cloud instead). To run local Qdrant, install docker compose (sometimes this needs sudo) and run:

cd services
docker compose up

Then you should be able to open http://localhost:6333/dashboard

Additional requirements

index.py has local dependencies on other modules; for this reason, if you are running it inside the indexpaper project folder, consider installing the package locally:

pip install -e .
