Scientific Document Insight Q/A

These details have not been verified by PyPI

Project links

Project description

title: Scientific Document Insights Q/A emoji: 📝 colorFrom: yellow colorTo: pink sdk: streamlit sdk_version: 1.37.1 app_file: streamlit_app.py pinned: false license: apache-2.0 app_port: 8501

DocumentIQA: Scientific Document Insights Q/A

Work in progress :construction_worker:

https://lfoppiano-document-qa.hf.space/

NOTE: The LLM API is kindly provided by Modal.com which offers 30$/month for computing. When these are done, the app will stop answering. 😅

Introduction

Question/Answering on scientific documents using LLMs. The tool can be customized to use different types of LLM APIs. The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents. Different from most of the projects, we focus on scientific articles and extract text from a structured document. We target only the full text using Grobid which provides cleaner results than the raw PDF2Text converter (which is comparable with most of the other solutions).

Additionally, this frontend provides the visualisation of named entities on LLM responses to extract physical quantities, measurements (with grobid-quantities) and materials mentions (with grobid-superconductors).

(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)

Getting started

Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
Once the spinner disappears, you can proceed to ask your questions

Documentation

For full technical documentation of the document-qa-engine library docs/README.md.

To deploy the LLM and embedding endpoints on Modal.com, see document_qa/deployment/README.md.

Embedding selection

In the latest version, there is the possibility to select both embedding functions and LLMs. There are some limitations, OpenAI embeddings cannot be used with open source models, and vice-versa.

Context size

Allow to change the number of blocks from the original document that are considered for responding. The default size of each block is 250 tokens (which can be changed before uploading the first document). With default settings, each question uses around 1000 tokens.

NOTE: if the chat answers something like "the information is not provided in the given context", changing the context size will likely help.

Chunks size

When uploaded, each document is split into blocks of a determined size (250 tokens by default). This setting allows users to modify the size of such blocks. Smaller blocks will result in a smaller context, yielding more precise sections of the document. Larger blocks will result in a larger context less constrained around the question.

Query mode

Indicates whether sending a question to the LLM (Language Model) or the vector storage.

LLM (default) enables question/answering related to the document content.
Embeddings: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
Question coefficient (experimental): provide a coefficient that indicates how the question has been far or closed to the retrieved context

NER (Named Entities Recognition)

This feature is specifically crafted for people working with scientific documents in materials science. It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities, measurements). This feature leverages both grobid-quantities and grobid-superconductors external services.

Troubleshooting

Error: streamlit: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0. Here is the solution on Linux. For more information, see the details on the Chroma website.

Disclaimer on Data, Security, and Privacy ⚠️

Please read carefully:

Avoid uploading sensitive data. We temporarily store text from the uploaded PDF documents only for processing your request, and we disclaim any responsibility for subsequent use or handling of the submitted data by third-party LLMs.
The public demo serves open models (Phi-4-mini-instruct, Qwen3) self-hosted on Modal.com under a limited monthly compute budget, so there is no guarantee that all requests will go through. Use at your own risk.
We do not assume responsibility for how the data is utilized by the LLM end-points API.

Development notes

To release a new version:

bump-my-version bump patch
git push --tags

To use docker:

docker run lfoppiano/document-insights-qa:{latest_version}
docker run lfoppiano/document-insights-qa:latest-develop for the latest development version

To install the library with Pypi:

pip install document-qa-engine

Acknowledgement

The project was initiated at the National Institute for Materials Science (NIMS) in Japan. Currently, the development is possible thanks to ScienciLAB. This project was contributed by Guillaume Lambard and the Lambard-ML-Team, Pedro Ortiz Suarez, and Tomoya Mato. Thanks also to Patrice Lopez, the author of Grobid.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.6.0

Jun 7, 2026

0.5.1

May 23, 2025

0.4.5

Feb 10, 2025

0.4.2

Aug 23, 2024

0.4.1

Aug 23, 2024

0.4.0

Jun 24, 2024

0.3.4

Dec 26, 2023

0.3.3

Dec 14, 2023

0.3.2

Nov 29, 2023

0.3.1

Nov 22, 2023

0.3.0

Nov 18, 2023

0.2.1

Nov 1, 2023

0.2.0

Oct 31, 2023

0.1.3

Oct 30, 2023

0.1.2

Oct 27, 2023

0.1.1

Oct 26, 2023

0.1.0

Oct 26, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

document_qa_engine-0.6.0.tar.gz (561.3 kB view details)

Uploaded Jun 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

document_qa_engine-0.6.0-py3-none-any.whl (40.1 kB view details)

Uploaded Jun 7, 2026 Python 3

File details

Details for the file document_qa_engine-0.6.0.tar.gz.

File metadata

Download URL: document_qa_engine-0.6.0.tar.gz
Upload date: Jun 7, 2026
Size: 561.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for document_qa_engine-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`3d4a6223a9ca0fb82fd809af3fb82f7fdee3f4450a7ecafe8303401edf41863c`
MD5	`ba5321aef89c9c00855a555a85bfd2ed`
BLAKE2b-256	`04b4887dc0557879e4c938a89b5a0dfe775aee31c4a5363d22edccbe64411211`

See more details on using hashes here.

File details

Details for the file document_qa_engine-0.6.0-py3-none-any.whl.

File metadata

Download URL: document_qa_engine-0.6.0-py3-none-any.whl
Upload date: Jun 7, 2026
Size: 40.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for document_qa_engine-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`40271195be66a2a299be05ddd751976f7533284427c1ef69865b045138f022b8`
MD5	`62e9247dd5e7d09ff1807d2fa7e15a98`
BLAKE2b-256	`25b0866726733bb259e4c3f24f715326d14a67b569cabccf5dc1c1dc522821f2`

See more details on using hashes here.

document-qa-engine 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

title: Scientific Document Insights Q/A emoji: 📝 colorFrom: yellow colorTo: pink sdk: streamlit sdk_version: 1.37.1 app_file: streamlit_app.py pinned: false license: apache-2.0 app_port: 8501

DocumentIQA: Scientific Document Insights Q/A

Introduction

Getting started

Documentation

Embedding selection

Context size

Chunks size

Query mode

NER (Named Entities Recognition)

Troubleshooting

Disclaimer on Data, Security, and Privacy ⚠️

Development notes

Acknowledgement

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes