Scientific Document Insights Q/A

---
title: Scientific Document Insights Q/A
emoji: 📝
colorFrom: yellow
colorTo: pink
sdk: streamlit
sdk_version: 1.27.2
app_file: streamlit_app.py
pinned: false
license: apache-2.0
---

DocumentIQA: Scientific Document Insights Q/A

Work in progress :construction_worker:

Introduction

Question answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta. This Streamlit application demonstrates the implementation of RAG (Retrieval Augmented Generation) on scientific documents, which we are developing at NIMS (National Institute for Materials Science) in Tsukuba, Japan. Unlike most similar projects, we focus on scientific articles. We extract the full text using Grobid, which provides cleaner results than raw PDF-to-text conversion (the approach used by most other solutions).

Additionally, this frontend visualises named entities found in the LLM responses: physical quantities and measurements (with grobid-quantities) and material mentions (with grobid-superconductors).

The conversation is backed by a sliding-window memory (the 4 most recent messages), which helps the model refer to information previously discussed in the chat.
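The sliding-window behaviour can be sketched in a few lines of plain Python (the class below is illustrative, not the application's actual implementation):

```python
from collections import deque

class SlidingWindowMemory:
    """Keeps only the most recent messages of a conversation."""

    def __init__(self, window_size=4):
        # A deque with maxlen silently discards the oldest entry when full.
        self.messages = deque(maxlen=window_size)

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def context(self):
        # Messages passed to the LLM alongside the new question.
        return list(self.messages)

memory = SlidingWindowMemory(window_size=4)
for i in range(6):
    memory.add("user", f"message {i}")

# Only the 4 most recent messages remain
print([m["content"] for m in memory.context()])
# → ['message 2', 'message 3', 'message 4', 'message 5']
```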

Demos:

Getting started

  • Select the model + embedding combination you want to use.
  • Enter your API key (OpenAI or Hugging Face).
  • Upload a scientific article as a PDF document. A spinner or loading indicator is shown while processing is in progress.
  • Once the spinner stops, you can start asking questions.

[Screenshot: screenshot2.png]

Documentation

Context size

Allows changing the number of blocks from the original document that are considered when composing a response. The default size of each block is 250 tokens (and can be changed before uploading the first document). With the default settings (4 blocks of 250 tokens), each question uses around 1000 tokens.

NOTE: if the chat answers something like "the information is not provided in the given context", increasing the context size will likely help.
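The token budget above is simply the number of retrieved blocks times the block size; a minimal sketch (the function name is illustrative):

```python
def context_token_budget(num_blocks, block_size=250):
    """Approximate prompt tokens consumed by the retrieved context."""
    return num_blocks * block_size

# With default settings (4 blocks of 250 tokens) each question sends
# roughly 1000 context tokens to the model.
print(context_token_budget(4))   # 1000
# Enlarging the context to 8 blocks doubles that budget.
print(context_token_budget(8))   # 2000
```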

Chunk size

When uploaded, each document is split into blocks of a fixed size (250 tokens by default). This setting allows users to modify the size of these blocks. Smaller blocks produce a smaller, more precise context drawn from narrower sections of the document; larger blocks produce a broader context that is less tightly constrained around the question.
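The splitting step can be approximated with whitespace tokenisation (the real application may use a model-specific tokenizer, so treat this as an illustrative sketch):

```python
def split_into_blocks(text, block_size=250):
    """Split a document into blocks of roughly `block_size` tokens.

    Tokens are approximated here by whitespace-separated words.
    """
    words = text.split()
    return [
        " ".join(words[i:i + block_size])
        for i in range(0, len(words), block_size)
    ]

doc = "word " * 600            # a toy document of 600 words
blocks = split_into_blocks(doc, block_size=250)
print(len(blocks))             # 3 blocks: 250 + 250 + 100 words
```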

Query mode

Indicates whether a question is sent to the LLM (Large Language Model) or to the vector storage.

  • LLM (default): enables question answering about the document content.
  • Embeddings: the response consists of the raw document text most related to the question (based on the embeddings). This mode helps diagnose why some answers are unsatisfying or incomplete.
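Embeddings mode is essentially nearest-neighbour retrieval over block embeddings. A toy sketch with hand-made vectors (the texts and vectors below are made up; the application uses the selected embedding model and a vector store):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy document blocks with pretend embeddings.
blocks = {
    "The critical temperature of MgB2 is 39 K.": [0.9, 0.1, 0.0],
    "Funding was provided by the institute.":    [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's question

# Embeddings mode returns the raw block text ranked by similarity,
# without sending anything to the LLM.
best = max(blocks, key=lambda text: cosine(blocks[text], question_vec))
print(best)  # → The critical temperature of MgB2 is 39 K.
```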

NER (Named Entities Recognition)

This feature is specifically crafted for people working with scientific documents in materials science. It runs NER on the LLM response to identify material mentions and properties (quantities, measurements). It leverages the external grobid-quantities and grobid-superconductors services.

Development notes

To release a new version:

  • bump-my-version bump patch
  • git push --tags

To use docker:

  • docker run lfoppiano/document-insights-qa:latest
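Streamlit serves on port 8501 by default, so you will likely want to publish that port when running the container (the port mapping below is an assumption based on Streamlit's default, not taken from the project's documentation):

```shell
# Publish Streamlit's default port so the UI is reachable at http://localhost:8501
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest
```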

To install the library from PyPI:

  • pip install document-qa-engine

Acknowledgements

This project is developed at the National Institute for Materials Science (NIMS) in Japan in collaboration with the Lambard-ML-Team.

Project details

Source distribution: document-qa-engine-0.3.0.tar.gz (452.5 kB)

Built distribution: document_qa_engine-0.3.0-py3-none-any.whl (19.7 kB)

File details: document-qa-engine-0.3.0.tar.gz

  • Size: 452.5 kB
  • Uploaded via: twine/4.0.2 (CPython/3.10.12)
  • Trusted Publishing: no
  • SHA256: 7b00b377363c02a00a31ca4240ca98dd9448a3672dac756f8f1e54e5346be116
  • MD5: e548d5d951b38952149c3638a9553201
  • BLAKE2b-256: 599be6bbd0be3b0c91c6142fd25daa94281063a0969c87617aa02f6655810fbc

File details: document_qa_engine-0.3.0-py3-none-any.whl

  • Size: 19.7 kB
  • SHA256: 205fa062a68b2b3a9b273818c2d2814b130098a6e119bfae086c87be08271856
  • MD5: 92c9d48271f0308de5554fdcb34deff2
  • BLAKE2b-256: 5503d17a801b8992db71a47af92d3d17b81fa12406eaa74e93f559b00bd9a469
