Skip to main content

LLM Chain for answering questions from docs

Project description

Paper QA

GitHub tests PyPI version MIT license

This is a simple and incomplete package for doing question and answering from documents. It uses gpt-index to embed and search documents and langchain to generate answers.

It uses this process

embed docs into vectors -> embed query into vector -> search for top k passages in docs

create summary of each passage relevant to query -> put summaries into prompt -> generate answer

Install

Install from github with pip:

pip install git+https://github.com/whitead/paper-qa.git

Usage

Make sure you have set your OPENAI_API_KEY environment variable to your openai api key

To use the package, you need to have a list of paths (valid extensions include: .pdf, .txt, .jpg, .pptx, .docx, .csv, .epub, .md, .mp4, .mp3) and a list of citations (strings) that correspond to the paths. You can then use the Docs class to add the documents and then query them.

from paperqa import Docs

# get a list of paths, citations

docs = Docs()
for d, c in zip(my_docs, my_citations):
    docs.add(d, c)

# takes ~ 1 min
answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
print(answer.formatted_answer)

The answer object has the following attributes: formatted_answer, answer (answer alone), questions, context (the summaries of passages found for answer), refernces (the docs from which the passages came).

How is this different from gpt-index?

gpt-index does generate answers, but in a somewhat opinionated way. It doesn't have a great way to track where text comes from and it's not easy to force it to pull from multiple documents. I don't know which way is better, but for writing scholarly text I found it to work better to pull from multiple relevant documents and then generate an answer. I would like to PR to do this to gpt-index but it looks pretty involved right now.

Where do the documents come from?

I use some of my own code to pull papers from Google Scholar. This code is not included because it may enable people to violate Google's terms of service and publisher's terms of service.

Saving/loading

The Docs class can be pickled and unpickled. This is useful if you want to save the embeddings of the documents and then load them later.

import pickle

with open("my_docs.pkl", "wb") as f:
    pickle.dump(docs, f)

with open("my_docs.pkl", "rb") as f:
    docs = pickle.load(f)

Project details


Release history Release notifications | RSS feed

This version

0.0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paper-qa-0.0.1.tar.gz (6.2 kB view hashes)

Uploaded Source

Built Distribution

paper_qa-0.0.1-py3-none-any.whl (6.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page