
Neural Question Answering & Semantic Search at Scale. Use modern transformer-based models like BERT to find answers in large document collections.


Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ...

... ask questions in natural language and find granular answers in your own documents.
... do semantic document search and retrieve more relevant documents for your search queries.
... search at scale through millions of documents.
... use off-the-shelf models or fine-tune them to your own domain.
... evaluate, benchmark and continuously improve your models via user feedback.
... improve chat bots by leveraging existing knowledge bases for the long tail of queries.
... automate processes by automatically applying a list of questions to new documents and using the extracted answers.

  • Docs: Usage, Guides, API documentation ...
  • Installation: How to install
  • Key components: Overview of core concepts
  • Quick Tour: Basic explanation of concepts, options and usage
  • Tutorials: Jupyter/Colab Notebooks & Scripts
  • Benchmarks: Speed & Accuracy of Retriever, Readers and DocumentStores
  • Roadmap: Public roadmap of Haystack
  • Contributing: We welcome all contributions!

Core Features

  • Latest models: Use the latest transformer-based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
  • Modular: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
  • Open: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers).
  • Scalable: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a FastAPI REST API.
  • End-to-End: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
  • Developer friendly: Easy to debug, extend and modify.
  • Customizable: Fine-tune models to your own domain or implement your custom DocumentStore.
  • Continuous Learning: Collect new training data via user feedback in production & improve your models continuously

Installation

PyPI:

pip install farm-haystack

Master branch (if you want to try the latest features):

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .

To update your installation, run git pull. Because of the --editable flag, the pulled changes take effect immediately without reinstalling.

On Windows you might need:

pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html

Key Components


  1. FileConverter: Extracts pure text from files (pdf, docx, pptx, html and many more).
  2. PreProcessor: Cleans and splits texts into smaller chunks.
  3. DocumentStore: Database storing the documents, metadata and vectors for our search. We recommend Elasticsearch or FAISS, but also offer more lightweight options for fast prototyping (SQL or In-Memory).
  4. Retriever: Fast algorithms that identify candidate documents for a given query from a large collection of documents. Retrievers narrow down the search space significantly and are therefore key for scalable QA. Haystack supports sparse methods (TF-IDF, BM25, custom Elasticsearch queries) and state-of-the-art dense methods (e.g. sentence-transformers and Dense Passage Retrieval).
  5. Reader: Neural network (e.g. BERT or RoBERTa) that reads through texts in detail to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via FARM or Transformers on SQuAD-like tasks. You can load a pretrained model from Hugging Face's model hub or fine-tune it on your own domain data.
  6. Generator: Neural network (e.g. RAG) that generates an answer for a given question conditioned on the retrieved documents from the retriever.
  7. Finder: Glues together a Retriever + Reader/Generator as a pipeline to provide an easy-to-use question answering interface.
  8. REST API: Exposes a simple API based on FastAPI for running QA search, uploading files and collecting user feedback for continuous learning.
  9. Haystack Annotate: Create custom QA labels to improve performance of your domain-specific models. Hosted version or Docker images.
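
To make the Retriever + Reader idea from the list above more concrete, here is a toy sketch in plain Python. It is not Haystack code: the function names (`retrieve`, `read`, `find_answers`) and the word-overlap scoring are assumptions chosen purely for illustration. A real Retriever would use BM25 or dense embeddings, and a real Reader would be a neural network.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def read(query, documents, top_n=1):
    """Toy reader: return the sentences sharing the most words with the query."""
    q = tokenize(query)
    sentences = [
        s.strip()
        for d in documents
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    ranked = sorted(sentences, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:top_n]

def find_answers(query, documents, top_k_retriever=2, top_n_reader=1):
    """Glue step: the cheap retriever narrows the search space for the reader."""
    candidates = retrieve(query, documents, top_k=top_k_retriever)
    return read(query, candidates, top_n=top_n_reader)

docs = [
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany. It lies on the Spree.",
    "Bananas are rich in potassium.",
]
answers = find_answers("What is the capital of Germany?", docs)
print(answers)
```

The point of the two-stage design is that the expensive step (the reader) only sees the few candidates that the cheap step (the retriever) lets through.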

Usage


Tutorials

Quick Tour

File Conversion | Preprocessing | DocumentStores | Retrievers | Readers | REST API | Labeling Tool

1) File Conversion

What
Different converters to extract text from your original files (pdf, docx, txt, html). While it's almost impossible to cover all types, layouts and special cases (especially in PDFs), we cover the most common formats (incl. multi-column) and extract meta information (e.g. page splits). The converters are easily extendable, so that you can customize them for your files if needed.

Available options

  • Txt
  • PDF
  • Docx
  • Apache Tika (Supports > 340 file formats)

Example

# PDF
from haystack.file_converter.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
# `file` is the path to the PDF you want to convert
doc = converter.convert(file_path=file, meta=None)
# => {"text": "text first page \f text second page ...", "meta": None}

# DOCX
from haystack.file_converter.docx import DocxToTextConverter
converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
# `file` is the path to the .docx file you want to convert
doc = converter.convert(file_path=file, meta=None)
# => {"text": "some text", "meta": None}
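
The PDF example output above shows pages separated by a form feed character ("\f"). As a minimal sketch, assuming that separator, the text can be split back into pages as follows. The helper name `split_pages` and the sample dict are illustrative assumptions, not part of Haystack, and the dict is hand-written rather than real converter output.

```python
def split_pages(doc):
    """Split a converted document dict into a list of page strings."""
    # Pages are assumed to be separated by form feed characters ("\f"),
    # as in the example converter output above.
    return [page.strip() for page in doc["text"].split("\f")]

# Hand-written stand-in for a converter result (not real converter output)
doc = {"text": "text first page \f text second page", "meta": None}
pages = split_pages(doc)
print(len(pages))
```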

2) Preprocessing

What
Cleaning and splitting of your texts are crucial steps that will directly impact the speed and accuracy of your search. The splitting of larger texts is especially important for achieving fast query speed. The longer the texts that the retriever passes to the reader, the slower your queries.

Available Options
We provide a basic PreProcessor class that lets you:

  • clean whitespace, headers, footers and empty lines
  • split by words, sentences or passages
  • create "overlapping" splits
  • never split within a sentence

You can easily extend this class to your own custom requirements.
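
To illustrate what splitting by words with "overlapping" splits means, here is a minimal sketch in plain Python. It is not the PreProcessor implementation: the function name `split_by_words` and its `overlap` parameter are assumptions for illustration, and the sketch ignores sentence boundaries and cleaning.

```python
def split_by_words(text, split_length=200, overlap=0):
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `overlap` words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= overlap < split_length:
        raise ValueError("overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_by_words("one two three four five six seven", split_length=4, overlap=2)
print(chunks)
```

With overlap, an answer that straddles a chunk boundary still appears in full in at least one chunk, at the cost of indexing some words twice.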

Example

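from haystack.file_converter.pdf import PDFToTextConverter
from haystack.preprocessor.preprocessor import PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])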

processor = PreProcessor(clean_empty_lines=True,
                         clean_whitespace=True,
                         clean_header_footer=True,
                         split_by="word",
                         split_length=200,
                         split_respect_sentence_boundary=True)
docs = []
for f_name, f_path in zip(filenames, filepaths):
    # Optional: Supply any meta data here
    # the "name" field will be used by DPR if embed_title=True, rest is custom and can be named arbitrarily
    cur_meta = {"name": f_name, "category": "a"}  # ... plus any further custom fields

    # Run the conversion on each file (PDF -> 1x doc)
    d = converter.convert(f_path, meta=cur_meta)

    # clean and split each dict (1x doc -> multiple docs)
    d = processor.process(d)
    docs.extend(d)

# at this point docs will be [{"text": "some", "meta":{"name": "myfilename", "category":"a"}},...]
document_store.write_documents(docs)

3) DocumentStores

What

  • Store your texts, meta data and optionally embeddings
  • Documents should be chunked into smaller units (e.g. paragraphs) before indexing to make the results returned by the Retriever more granular and accurate.

Available Options

  • Elasticsearch
  • FAISS
  • SQL
  • InMemory

Example

# Run elasticsearch, e.g. via docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

# Connect
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

# Get all documents
document_store.get_all_documents()

# Query
document_store.query(query="What is the meaning of life?", filters=None, top_k=5)
document_store.query_by_embedding(query_emb, filters=None, top_k=5)

-> See docs for details
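
As a rough mental model of what a DocumentStore does, here is a toy in-memory store in plain Python. It is only a sketch: the class name `ToyDocumentStore` and the filter format (a dict mapping a meta field to a list of allowed values) are assumptions for illustration, and the real stores offer far more (indexing, embeddings, search).

```python
class ToyDocumentStore:
    """Keeps documents as dicts of the form {"text": ..., "meta": {...}}."""

    def __init__(self):
        self._docs = []

    def write_documents(self, docs):
        """Store a copy of each document's text and meta data."""
        for doc in docs:
            self._docs.append({"text": doc["text"], "meta": dict(doc.get("meta") or {})})

    def get_all_documents(self, filters=None):
        """Return all documents, optionally restricted by meta data.

        `filters` maps a meta field to a list of allowed values (assumed format).
        """
        if not filters:
            return list(self._docs)
        return [
            d for d in self._docs
            if all(d["meta"].get(field) in allowed for field, allowed in filters.items())
        ]

store = ToyDocumentStore()
store.write_documents([
    {"text": "first chunk", "meta": {"name": "report.pdf", "category": "a"}},
    {"text": "second chunk", "meta": {"name": "notes.txt", "category": "b"}},
])
print(len(store.get_all_documents()))
print(store.get_all_documents(filters={"category": ["a"]}))
```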

4) Retrievers

What
The Retriever is a fast "filter" that can quickly go through the full document store and pass a set of candidate documents to the Reader. It is a tool for sifting out the obvious negative cases, saving the Reader from doing more work than it needs to and speeding up the querying process. There are two fundamentally different categories of retrievers: sparse (e.g. TF-IDF, BM25) and dense (e.g. DPR, sentence-transformers).

Available Options

  • DensePassageRetriever
  • ElasticsearchRetriever
  • EmbeddingRetriever
  • TfidfRetriever

Example

from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  batch_size=16,
                                  embed_title=True)
retriever.retrieve(query="Why did the revenue increase?")
# returns: [Document, Document]

-> See docs for details
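
To show what a sparse retriever computes, here is a small TF-IDF ranking sketch in plain Python. It is not the TfidfRetriever implementation: the function name `tfidf_rank`, the smoothed IDF formula and the sample documents are assumptions for illustration. Dense retrievers work differently, comparing learned embedding vectors of the query and the passages instead of counting terms.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return re.findall(r"\w+", text.lower())

def tfidf_rank(query, documents, top_k=2):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; ties keep the original document order
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[idx] for score, idx in scores[:top_k] if score > 0]

docs = [
    "Revenue increased because of strong cloud sales.",
    "The office moved to a new building.",
    "Cloud revenue and hardware revenue both grew.",
]
results = tfidf_rank("revenue cloud", docs)
print(results)
```

Note that this sketch only matches exact tokens, so "increase" would not match "increased". That limitation is one motivation for dense methods.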

5) Readers

What
Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. Both readers can load either a local model or any public model from Hugging Face's model hub.

Available Options

  • FARMReader: Reader based on FARM incl. extensive configuration options and speed optimizations
  • TransformersReader: Reader based on the pipeline class of HuggingFace's Transformers.
    Both Readers can load models directly from HuggingFace's model hub.

Example

from haystack.reader.farm import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                use_gpu=False, no_ans_boost=-10, context_window_size=500,
                top_k_per_candidate=3, top_k_per_sample=1,
                num_processes=8, max_seq_len=256, doc_stride=128)

# Optional: Training & eval
reader.train(...)
reader.eval(...)

# Predict
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

-> See docs for details

6) REST API

What
A simple REST API based on FastAPI is provided to:

  • search answers in texts (extractive QA)
  • search answers by comparing user question to existing questions (FAQ-style QA)
  • collect & export user feedback on answers to gain domain-specific training data (feedback)
  • allow basic monitoring of requests (currently via APM in Kibana)

Example
To serve the API, adjust the values in rest_api/config.py and run:

gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300

You will find the Swagger API documentation at http://127.0.0.1:8000/docs

7) Labeling Tool

  • Use the hosted version (Beta) or deploy it yourself with the Docker Images.
  • Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
  • Structure your work via organizations, projects, users
  • Upload your documents or import labels from an existing SQuAD-style dataset
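
For readers unfamiliar with SQuAD-style data, here is a minimal sketch of one label in the SQuAD v1.1 JSON layout, built in plain Python. The title, id, question and context are made-up illustrative values, and whether the labeling tool imports or exports exactly these fields is an assumption here.

```python
import json

context = "Haystack is an end-to-end framework for question answering."
answer_text = "end-to-end framework"

squad_style = {
    "data": [{
        "title": "example-document",  # illustrative value
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-1",  # illustrative value
                "question": "What is Haystack?",
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }]
}

print(json.dumps(squad_style, indent=2))
```

The `answer_start` offset is what ties each label to an exact span of the passage, which is what an extractive Reader is trained to predict.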


Contributing

We are very open to contributions from the community, be it the fix of a small typo or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To avoid any extra work on either side, please check our Contributor Guidelines first.

Tests will automatically run for every commit you push to your PR. You can also run them locally by executing pytest in your terminal from the root folder of this repository:

All tests:

cd test
pytest

You can also run only a subset of tests by specifying a marker and the optional "not" keyword:

cd test
pytest -m "not elasticsearch"
pytest -m elasticsearch
pytest -m generator
pytest -m tika
pytest -m "not slow"
...


