
Neural Question Answering at Scale. Use modern transformer-based models like BERT to find answers in large document collections.

Introduction

The performance of modern Question Answering models (BERT, ALBERT …) has improved drastically within the last year, opening up many new opportunities for accessing information more efficiently. However, those models are designed to find answers within rather small text passages. Haystack lets you scale QA models to large collections of documents! While QA is the main use case for Haystack, we will soon support additional options to boost search (re-ranking, most-similar search …).

Haystack is designed in a modular way and lets you use any models trained with FARM or Transformers.

Core Features

  • Powerful ML models: Use the latest transformer-based models (BERT, ALBERT, RoBERTa …)

  • Modular & future-proof: Easily switch to newer models once they get published.

  • Developer friendly: Easy to debug, extend and modify.

  • Scalable: Production-ready deployments via Elasticsearch backend & REST API

  • Customizable: Fine-tune models to your own domain & improve them continuously via user feedback

Components

  1. DocumentStore: Database storing the documents for our search. We recommend Elasticsearch, but also offer more lightweight options for fast prototyping (SQL or In-Memory).

  2. Retriever: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever narrows down the scope for the Reader to smaller units of text where a given question could be answered.

  3. Reader: Powerful neural model that reads through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face’s model hub or fine-tune it on your own domain data.

  4. Finder: Glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.

  5. REST API: Exposes a simple API for running QA search, collecting feedback and monitoring requests

  6. Labeling Tool: Hosted version (Beta), Docker images (coming soon)
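
How the Retriever, Reader and Finder fit together can be sketched in plain Python. This is a toy illustration only, not Haystack's implementation: the function names `retrieve`, `read` and `find_answers` are hypothetical, the retriever is simple word overlap, and the "reader" picks the sentence sharing the most words with the question instead of running a neural model.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, top_k=2):
    """Toy retriever: rank documents by the number of distinct shared words."""
    q = set(tokenize(question))
    scored = [(len(q & set(tokenize(doc))), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def read(question, passages, top_n=1):
    """Toy reader: score every sentence by its word overlap with the question."""
    q = set(tokenize(question))
    candidates = []
    for passage in passages:
        for sentence in re.split(r"(?<=[.!?])\s+", passage):
            words = set(tokenize(sentence))
            if words:
                score = len(q & words) / max(len(q), 1)
                candidates.append({"answer": sentence, "score": score})
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:top_n]


def find_answers(question, documents, top_k_retriever=2, top_n_reader=1):
    """Toy finder: the cheap retriever narrows the scope, the reader does the detailed pass."""
    passages = retrieve(question, documents, top_k=top_k_retriever)
    return read(question, passages, top_n=top_n_reader)


documents = [
    "Revenue increased in 2019. The increase was driven by new customers.",
    "The office moved to Berlin. The new building has a cafeteria.",
    "Costs were stable in 2019.",
]
answers = find_answers("Why did the revenue increase?", documents)
print(answers)
```

The point of the split is cost: the retriever touches every document but does very little work per document, while the expensive reader only sees the few passages that survive retrieval.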

Resources

Quick Start

Installation

Recommended (because of active development):

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .

To update your installation, just do a git pull. The --editable flag will update changes immediately.

From PyPI:

pip install farm-haystack

Usage

https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/code_snippet_usage.png

Quick Tour

1) DocumentStores

Haystack has an extensible DocumentStore layer, which stores the documents for our search. We recommend Elasticsearch, but also offer more lightweight options for fast prototyping.

SQL / InMemory (Alternative)

haystack.database.sql.SQLDocumentStore & haystack.database.memory.InMemoryDocumentStore

These DocumentStores are mainly intended to simplify the first development steps or to test a prototype on an existing SQL database containing your texts. By default, the SQLDocumentStore initializes a local file-based SQLite database. However, you can easily configure it for PostgreSQL or MySQL since our implementation is based on SQLAlchemy. Limitations: Retrieval (e.g. via TfidfRetriever) happens in-memory here and will therefore only work efficiently on small datasets.
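
To make the limitation concrete, here is a minimal sketch using only Python's built-in sqlite3 module. It does not use SQLDocumentStore; the table name `document`, its columns, and the helper functions are assumptions made up for this illustration. The database only stores the texts, so every query has to load all rows and score them in Python.

```python
import sqlite3

# In-memory SQLite database standing in for a small document store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO document (name, text) VALUES (?, ?)",
    [
        ("report_2019", "Revenue increased because of new customers."),
        ("office_news", "The office moved to a new building."),
    ],
)
conn.commit()


def load_all_texts(connection):
    """Fetch every stored document as (id, text) rows."""
    return connection.execute("SELECT id, text FROM document ORDER BY id").fetchall()


def keyword_search(connection, keyword):
    """Scan all rows in Python; the cost grows linearly with the number of documents."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in load_all_texts(connection) if needle in text.lower()]


print(keyword_search(conn, "revenue"))
```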

2) Retrievers

ElasticsearchRetriever

Scoring text similarity via sparse bag-of-words representations is a strong and well-established baseline in Information Retrieval. The default ElasticsearchRetriever uses Elasticsearch’s native scoring (BM25), but can easily be extended with custom queries or filtering.

Example:

retriever = ElasticsearchRetriever(document_store=document_store, custom_query=None)
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "company": ["Q1", "Q2"]})
# returns: [Document, Document]
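
For intuition about what BM25 scoring rewards, here is a minimal pure-Python sketch. It is not how Elasticsearch computes scores internally: the function name `bm25_scores` is made up for illustration, tokenization is plain whitespace splitting, and `k1=1.2`, `b=0.75` are commonly used defaults assumed here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (assumes a non-empty list of docs)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "revenue increased because of new customers",
    "the office moved to a new building",
    "revenue revenue revenue",
]
scores = bm25_scores("revenue increase", docs)
print(scores)
```

Note that "increase" does not match "increased" here because the sketch does no stemming; a real search engine typically applies an analyzer before scoring.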

EmbeddingRetriever

Using dense embeddings (i.e. vector representations) of texts is a powerful alternative for scoring text similarity. This retriever lets you transform your query into an embedding using a model (e.g. Sentence-BERT) and find similar texts via cosine similarity.

Example:

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence-bert",
                               model_format="farm")
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "company": ["Q1", "Q2"]})
# returns: [Document, Document]

We are working on extending this category of retrievers, as there is a lot of exciting research indicating substantial performance improvements (e.g. DPR, REALM).
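
The cosine-similarity ranking step can be sketched as follows. This is an illustration under stated assumptions, not the EmbeddingRetriever's code: the hand-written 3-dimensional vectors stand in for real model output, and `cosine_similarity` and `rank_by_embedding` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_embedding(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Made-up embeddings; a real setup would obtain these from an embedding model.
doc_vecs = {
    "doc_revenue": [0.9, 0.1, 0.0],
    "doc_office": [0.0, 0.2, 0.9],
    "doc_costs": [0.6, 0.6, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
ranking = rank_by_embedding(query_vec, doc_vecs)
print(ranking)
```

Unlike bag-of-words scoring, two texts can be ranked as similar here without sharing a single word, as long as the model maps them to nearby vectors.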

TfidfRetriever

Basic in-memory retriever that fetches texts from the DocumentStore, builds TF-IDF representations in memory and lets you query them.
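
The idea behind an in-memory TF-IDF retriever can be sketched with the standard library alone. This is not the TfidfRetriever's implementation: `build_tfidf_index` and `query_tfidf` are hypothetical names, and the particular IDF formula (log(N / df) + 1) is one common variant chosen for the example.

```python
import math
from collections import Counter


def build_tfidf_index(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for all documents in memory."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_tfidf(query, idf, vectors, top_k=1):
    """Return indices of the top_k documents with a non-zero similarity to the query."""
    q_tf = Counter(t for t in query.lower().split() if t in idf)
    q_vec = {t: f * idf[t] for t, f in q_tf.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


docs = [
    "the cat sat on the mat",
    "dogs chase the ball",
    "stock prices rose sharply",
]
idf, vectors = build_tfidf_index(docs)
print(query_tfidf("stock prices", idf, vectors))
```

Because the whole index lives in process memory and every query is compared against every document vector, this approach only stays practical for small collections.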

3) Readers

Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. Both readers can load either a local model or any public model from Hugging Face’s model hub.

FARMReader

Implementing various QA models via the FARM Framework. Example:

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                use_gpu=False, no_ans_boost=-10, context_window_size=500,
                top_k_per_candidate=3, top_k_per_sample=1,
                num_processes=8, max_seq_len=256, doc_stride=128)

# Optional: Training & eval
reader.train(...)
reader.eval(...)

# Predict
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

This Reader comes with:

  • many configuration options

  • multiple processes for preprocessing

  • the option to train

  • the option to evaluate
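
The `max_seq_len` and `doc_stride` arguments in the example above relate to how long texts are cut into overlapping chunks before a model with a limited input size reads them. The sketch below illustrates that general sliding-window idea; it is not FARM's code. The function name `split_into_windows` is hypothetical, it assumes `doc_stride` is the step between window starts, and it ignores the room a real model reserves for the question and special tokens.

```python
def split_into_windows(tokens, max_seq_len=256, doc_stride=128):
    """Split a token list into overlapping windows of at most max_seq_len tokens.

    Returns (start_offset, window_tokens) pairs. Consecutive windows start
    doc_stride tokens apart, so they overlap when doc_stride < max_seq_len.
    """
    if max_seq_len <= 0 or doc_stride <= 0:
        raise ValueError("max_seq_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_seq_len]))
        if start + max_seq_len >= len(tokens):
            break  # this window already reaches the end of the text
        start += doc_stride
    return windows


tokens = [f"tok{i}" for i in range(600)]
windows = split_into_windows(tokens)
print([(start, len(chunk)) for start, chunk in windows])
```

The overlap matters because an answer that straddles the boundary of one window appears whole in a neighboring window.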

TransformersReader

Implementing various QA models via the pipeline class of the Transformers framework.

Example:

reader = TransformersReader(model="distilbert-base-uncased-distilled-squad",
                            tokenizer="distilbert-base-uncased",
                            context_window_size=500,
                            use_gpu=-1)

reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

5) REST API

A simple REST API based on FastAPI is provided to:

  • search answers in texts (extractive QA)

  • search answers by comparing user question to existing questions (FAQ-style QA)

  • collect & export user feedback on answers to gain domain-specific training data (feedback)

  • allow basic monitoring of requests (currently via APM in Kibana)

To serve the API, run:

gunicorn haystack.api.application:app -b 0.0.0.0:80 -k uvicorn.workers.UvicornWorker

You will find the Swagger API documentation at http://127.0.0.1:80/docs

6) Labeling Tool

  • Use the hosted version (Beta) or deploy it yourself via Docker images (coming soon)

  • Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).

  • Structure your work via organizations, projects, users

  • Upload your documents or import labels from an existing SQuAD-style dataset

  • Coming soon: more file formats for document upload, metrics for label quality …

https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/annotation_tool.png

