
Neural Question Answering at Scale. Use modern transformer-based models like BERT to find answers in large document collections.


Project description


Introduction

The performance of modern Question Answering models (BERT, ALBERT …) has seen drastic improvements within the last year, enabling many new opportunities for accessing information more efficiently. However, those models are designed to find answers within rather small text passages. Haystack lets you scale QA models to large collections of documents! While QA is the primary use case for Haystack, we will address further options around neural search in the future (re-ranking, most-similar search …).

Haystack is designed in a modular way and lets you use any models trained with FARM or Transformers.

Core Features

  • Powerful ML models: Utilize the latest transformer-based models (BERT, ALBERT, RoBERTa …).

  • Modular & future-proof: Easily switch to newer models once they are published.

  • Developer friendly: Easy to debug, extend and modify.

  • Scalable: Production-ready deployments via an Elasticsearch backend & a REST API.

  • Customizable: Fine-tune models to your own domain & improve them continuously via user feedback.

Components

https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/sketched_concepts_white.png
  1. DocumentStore: Database storing the documents for our search. We recommend Elasticsearch, but we also offer more lightweight options for fast prototyping (SQL or In-Memory).

  2. Retriever: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever helps to narrow down the scope for the Reader to smaller units of text where a given question could be answered.

  3. Reader: Powerful neural model that reads through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD like tasks. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. You can just load a pretrained model from Hugging Face’s model hub or fine-tune it to your own domain data.

  4. Finder: Glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.

  5. REST API: Exposes a simple API for running QA search, collecting feedback and monitoring requests.

  6. Haystack Annotate: Create custom QA labels; a hosted version (beta) is available, with Docker images coming soon.

Resources

Documentation: https://haystack.deepset.ai

Tutorials

Quick Start

Installation

PyPI:

pip install farm-haystack

Master branch (if you want to try the latest features):

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .

To update your installation, just do a git pull. The --editable flag makes changes take effect immediately.

Note: On Windows, you might need to run pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html to install PyTorch correctly.

Usage

https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/code_snippet_usage.png

Quick Tour

1) DocumentStores

Haystack offers different options for storing your documents for search. We recommend Elasticsearch, but we also offer lighter-weight options for fast prototyping, and we will soon add DocumentStores optimized for embeddings (FAISS & Co).

SQL / InMemory (Alternative)

haystack.database.sql.SQLDocumentStore & haystack.database.memory.InMemoryDocumentStore

These DocumentStores are mainly intended to simplify the first development steps or to test a prototype on an existing SQL database containing your texts. By default, the SQLDocumentStore initializes a local file-based SQLite database. However, you can easily configure it for PostgreSQL or MySQL, since our implementation is based on SQLAlchemy. Limitation: retrieval (e.g. via TfidfRetriever) happens in-memory here and will therefore only work efficiently on small datasets.

2) Retrievers

DensePassageRetriever

Using dense embeddings (i.e. vector representations) of texts is a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models: one to embed your query, one to embed your passages. It's based on the work of Karpukhin et al. and is an especially powerful alternative if there's no direct overlap between tokens in your queries and your texts.

Example

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="Why did the revenue increase?")
# returns: [Document, Document]

ElasticsearchRetriever

Scoring text similarity via sparse bag-of-words representations is a strong and well-established baseline in Information Retrieval. The default ElasticsearchRetriever uses Elasticsearch's native scoring (BM25), but can easily be extended with custom queries or filtering.

Example

retriever = ElasticsearchRetriever(document_store=document_store, custom_query=None)
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
# returns: [Document, Document]

EmbeddingRetriever

This retriever uses a single model to embed your query and passage (e.g. Sentence-BERT) and finds similar texts by using cosine similarity. This works well if your query and passage are a similar type of text, e.g. you want to find the most similar question in your FAQ given a user question.

Example

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert",
                               model_format="farm")
retriever.retrieve(query="Why did the revenue increase?", filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
# returns: [Document, Document]

TfidfRetriever

A basic in-memory retriever that fetches texts from the DocumentStore, creates TF-IDF representations on the fly and lets you query them. A simple baseline for quick prototypes; not recommended for production.

3) Readers

Neural networks (mostly Transformer-based) that read through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like datasets. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. Both Readers can load either a local model or any public model from Hugging Face's model hub.

FARMReader

Implements various QA models via the FARM framework.

Example

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    use_gpu=False, no_ans_boost=-10, context_window_size=500,
                    top_k_per_candidate=3, top_k_per_sample=1,
                    num_processes=8, max_seq_len=256, doc_stride=128)

# Optional: Training & eval
reader.train(...)
reader.eval(...)

# Predict
reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

This Reader comes with:

  • extensive configuration options (no answer boost, aggregation options …)

  • multiprocessing to speed-up preprocessing

  • option to train

  • option to evaluate

  • option to load all QA models directly from HuggingFace’s model hub

TransformersReader

Implements various QA models via the pipeline class of the Transformers framework.

Example

reader = TransformersReader(model="distilbert-base-uncased-distilled-squad",
                            tokenizer="distilbert-base-uncased",
                            context_window_size=500,
                            use_gpu=-1)

reader.predict(question="Who is the father of Arya Stark?", documents=documents, top_k=3)

5) REST API

A simple REST API based on FastAPI is provided to:

  • search answers in texts (extractive QA)

  • search answers by comparing user question to existing questions (FAQ-style QA)

  • collect & export user feedback on answers to gain domain-specific training data (feedback)

  • allow basic monitoring of requests (currently via APM in Kibana)

To serve the API, adjust the values in rest_api/config.py and run:

gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300

You will find the Swagger API documentation at http://127.0.0.1:8000/docs

6) Labeling Tool

  • Use the hosted version (Beta) or deploy it yourself via Docker images (coming soon)

  • Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).

  • Structure your work via organizations, projects, users

  • Upload your documents or import labels from an existing SQuAD-style dataset

  • Coming soon: more file formats for document upload, metrics for label quality …

https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/annotation_tool.png

7) Indexing PDF / Docx files

Haystack has basic converters to extract text from PDF and Docx files. While it's almost impossible to cover all types, layouts and special cases in PDFs, the implementation covers the most common formats and provides basic cleaning functions to remove headers, footers, and tables. Multi-column text layouts are also supported. The converters are easily extendable, so that you can customize them for your files if needed.

Example:

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de", "en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)

Advanced document conversion is enabled by leveraging Apache Tika, a mature text-extraction library mostly written in Java. Although it's possible to call the Tika API from Python, the current TikaConverter only supports RESTful calls to a Tika server running at localhost. You can either run Tika as a REST service at port 9998 (the default) or start a Docker container for Tika. The latter is recommended, as it's easily scalable, requires no Java runtime environment setup, and makes future updates straightforward. Either way, TikaConverter makes RESTful calls to convert any document format supported by Tika. Example code can be found in the tika_convert_files_to_dicts function in indexing/file_converters/utils.py.

TikaConverter supports 341 file formats, including

  • most common text file formats, e.g. HTML, XML, Microsoft Office OLE2/XML/OOXML, OpenOffice ODF, iWorks, PDF, ePub, RTF, TXT, RSS, CHM…

  • text embedded in media files, e.g. WAV, MP3, Vorbis, Flac, PNG, GIF, JPG, BMP, TIF, PSD, WebP, WMF, EMF, MP4, Quicktime, 3GPP, Ogg, FLV…

  • mail and database files, e.g. Unix mailboxes, Outlook PST/MSG/TNEF, SQLite3, Microsoft Access, dBase…

  • and many more formats…

  • and all of those file formats inside archive files, e.g. TAR, ZIP, BZip2, GZip, 7Zip, RAR!

Check out the complete list of file formats supported by the most recent Apache Tika 1.24.1. If you feel adventurous, Tika even supports some image OCR with Tesseract and object recognition for image and video files (not yet integrated into Haystack).

TikaConverter also makes a document's metadata available, including typical fields like the file name and file dates, plus format-specific fields (e.g. author and keywords for PDFs, if they're available in the files), which may save you some time in data labeling or other downstream tasks.

converter = TikaConverter(remove_header_footer=True)
# extract text only
pages = converter.extract_pages(file_path=path)
# extract text plus document metadata
pages, meta = converter.extract_pages(file_path=path, return_meta=True)

Contributing

We are very open to contributions from the community - be it the fix of a small typo or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To avoid extra work on either side, please check our Contributor Guidelines first.

Tests will automatically run for every commit you push to your PR. You can also run them locally by executing pytest in your terminal from the root folder of this repository:

pytest test/
