Project description

txtai: AI-powered search engine


txtai builds an AI-powered index over sections of text. It supports building text indices for similarity search and for creating extractive question-answering systems.

NeuML uses txtai and the concepts behind it to power all of its Natural Language Processing (NLP) applications. Example applications:

  • cord19q - COVID-19 literature analysis
  • paperai - AI-powered literature discovery and review engine for medical/scientific papers
  • neuspo - a fact-driven, real-time sports event and news site
  • codequestion - Ask coding questions directly from the terminal

txtai is built on the following stack:

  • sentence-transformers
  • transformers
  • faiss (with Annoy and hnswlib as alternate index backends)
  • Python 3.6+

Installation

The easiest way to install is via pip and PyPI:

pip install txtai

You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtai

Python 3.6+ is supported.

Troubleshooting

This project has dependencies that require compiling native code. Windows and macOS systems require the following additional steps. Most Linux environments will install without any additional steps.

Windows

  • Install C++ Build Tools - https://visualstudio.microsoft.com/visual-cpp-build-tools/

macOS

  • Run the following before installing

    brew install libomp
    


Examples

The examples directory has a series of examples and notebooks giving an overview of txtai. See the list of notebooks below.

Notebooks

  • Introducing txtai - Overview of the functionality provided by txtai
  • Extractive QA with txtai - Extractive question-answering with txtai
  • Build an Embeddings index from a data source - Embeddings index from a data source backed by word embeddings
  • Extractive QA with Elasticsearch - Extractive question-answering with Elasticsearch

Configuration

The following section goes over available settings for Embeddings and Extractor instances.

Embeddings

Embeddings methods are set through the constructor. Examples below.

# Transformers embeddings model
Embeddings({"method": "transformers",
            "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model
Embeddings({"path": vectors,
            "storevectors": True,
            "scoring": "bm25",
            "pca": 3,
            "quantize": True})

method

method: transformers|words

Sets the sentence embeddings method to use. When set to transformers, the embeddings object builds sentence embeddings using the sentence-transformers library. Otherwise, a word embeddings model is used. Defaults to words.

path

path: string

Required field that sets the path to a vectors model. When method is set to transformers, this must be a path to a Hugging Face transformers model. Otherwise, it must be a path to a local word embeddings model.

storevectors

storevectors: boolean

Enables copying of the vectors model set in path into the embeddings model's output directory on save. This option creates a fully encapsulated index with no external file dependencies.

scoring

scoring: bm25|tfidf|sif

For word embedding models, a scoring model allows building weighted averages of word vectors for a given sentence. Supports BM25, tf-idf and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
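To make the idea concrete, the sketch below shows a weighted average of word vectors using SIF-style weights a / (a + p(t)), where p(t) is a token's corpus frequency. This is a pure-Python illustration, not txtai's implementation; the function and variable names are hypothetical.

```python
# Illustrative sketch (not txtai's code): SIF-weighted mean of word vectors.
# Rare tokens get weights near 1, frequent tokens get weights near 0.
from collections import Counter

def sif_sentence_vector(tokens, vectors, frequencies, total, a=1e-3):
    """Weighted mean of word vectors; returns a zero vector if nothing matches."""
    dims = len(next(iter(vectors.values())))
    embedding = [0.0] * dims
    weight = 0.0
    for token in tokens:
        if token in vectors:
            w = a / (a + frequencies.get(token, 0) / total)
            embedding = [e + w * v for e, v in zip(embedding, vectors[token])]
            weight += w
    return [e / weight for e in embedding] if weight else embedding

# Toy 2d vectors and corpus counts out of 100 total tokens
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
frequencies = Counter({"the": 90, "cat": 10})
sentence = sif_sentence_vector(["the", "cat"], vectors, frequencies, 100)
```

In this toy example the rare token "cat" dominates the resulting sentence vector, which is the effect scoring models aim for.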

pca

pca: int

Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. This step is applied after vector pooling produces a single sentence embedding.
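The sketch below illustrates the idea of removing a dominant principal component from a set of embeddings, using power iteration rather than the TruncatedSVD model txtai actually builds; all names are hypothetical.

```python
# Illustrative sketch (not txtai's TruncatedSVD code): estimate the dominant
# principal direction of mean-centered embeddings, then subtract each
# embedding's projection onto it.

def dominant_component(rows, iterations=100):
    """Estimate the top principal direction via power iteration."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    v = [1.0] * dims
    for _ in range(iterations):
        # Multiply v by X^T X without forming the covariance matrix
        projections = [sum(c[d] * v[d] for d in range(dims)) for c in centered]
        v = [sum(p * c[d] for p, c in zip(projections, centered)) for d in range(dims)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def remove_component(rows, v):
    """Subtract each row's projection onto the unit direction v."""
    return [[r[d] - sum(r[k] * v[k] for k in range(len(v))) * v[d]
             for d in range(len(v))] for r in rows]

embeddings = [[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]]
v = dominant_component(embeddings)
reduced = remove_component(embeddings, v)
```

After removal, every embedding is orthogonal to the estimated component, which strips out shared variance that can swamp similarity comparisons.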

backend

backend: annoy|faiss|hnsw

Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss on Linux/macOS and Annoy on Windows. Faiss is currently not supported on Windows.

quantize

quantize: boolean

Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision instead of 32-bit. Only Faiss currently supports quantization.
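To show what 8-bit storage trades away, here is a minimal sketch of symmetric scalar quantization, not Faiss's quantizer; the function names are hypothetical.

```python
# Illustrative sketch (not Faiss's quantizer): store float components as
# int8 codes in [-127, 127] plus a single scale factor per vector.

def quantize(vector):
    """Map floats to int8 codes and a scale factor."""
    scale = max(abs(x) for x in vector) / 127 or 1.0
    return [round(x / scale) for x in vector], scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

vector = [0.12, -0.98, 0.45, 0.0]
codes, scale = quantize(vector)
approx = dequantize(codes, scale)
```

Each component is recovered to within half a quantization step, cutting storage to a quarter at a small cost in precision.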

Extractor

Extractor methods are set as constructor arguments. Examples below.

Extractor(embeddings, path, quantize)

embeddings

embeddings: Embeddings object instance

Embeddings object instance. Used to query the index and find candidate text snippets to run the question-answering model against.
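The candidate-retrieval step can be pictured with the toy ranker below, which scores snippets against a query by cosine similarity over bag-of-words counts. This is a stand-in for illustration only; in txtai the Embeddings instance plays this role with real sentence embeddings.

```python
# Illustrative sketch (not the Extractor's internals): rank text snippets
# against a query by cosine similarity over token counts.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def candidates(query, snippets, limit=1):
    """Return the snippets most similar to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(s.lower().split())), s) for s in snippets]
    return [s for _, s in sorted(scored, reverse=True)[:limit]]

snippets = ["The cat sat on the mat",
            "Stocks fell sharply on Monday"]
results = candidates("where did the cat sit", snippets)
```

Only the top-ranked snippets are passed to the question-answering model, which keeps inference fast.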

path

path: string

Required path to a Hugging Face SQuAD fine-tuned model. Used to answer questions.

quantize

quantize: boolean

Enables dynamic quantization of the Hugging Face model. This is a runtime setting and doesn't save space. It is used to improve the inference time performance of the QA model.

Download files

Download the file for your platform.

Source Distribution

txtai-1.2.1.tar.gz (17.4 kB)

Uploaded Source

Built Distribution

txtai-1.2.1-py3-none-any.whl (21.0 kB)

Uploaded Python 3

File details

Details for the file txtai-1.2.1.tar.gz.

File metadata

  • Download URL: txtai-1.2.1.tar.gz
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.9

File hashes

Hashes for txtai-1.2.1.tar.gz

  • SHA256: 9dcb2d8d64d1a592df2842ed3e53691ab78c499947c2aa502d7c32474e2f2661
  • MD5: 85eb777170920c0ab9f61f85237d7c35
  • BLAKE2b-256: 581dc5524ab1fbd436532e0ff37bd7c3b0a6e002ea433d7fc42feb4dc6ff3409


File details

Details for the file txtai-1.2.1-py3-none-any.whl.

File metadata

  • Download URL: txtai-1.2.1-py3-none-any.whl
  • Size: 21.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.9

File hashes

Hashes for txtai-1.2.1-py3-none-any.whl

  • SHA256: b6a704fe1f8549df1ee8decd5e9c6e0cbbc371b8f2389a35c5b9c5f113ebdae6
  • MD5: cbb77cae35c07eb33fc64a4be01fe37b
  • BLAKE2b-256: c361d4211c3f77a9ca58037ee908cbf5fc5cfddbfcdad4101ba16e79d1a24cca

