txtai

AI-powered search engine

txtai builds an AI-powered index over sections of text. These indices support similarity search and extractive question-answering systems. txtai also provides zero-shot classification.

NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:

paperai - AI-powered literature discovery and review engine for medical/scientific papers
tldrstory - AI-powered understanding of headlines and story text
neuspo - Fact-driven, real-time sports event and news site
codequestion - Ask coding questions directly from the terminal

txtai is built on the following stack:

Installation

The easiest way to install is via pip and PyPI.

pip install txtai

You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtai

Python 3.6+ is supported

Troubleshooting

This project has dependencies that require compiling native code. Windows and macOS systems require the following additional steps. Most Linux environments will install without any additional steps.

Windows

Install C++ Build Tools - https://visualstudio.microsoft.com/visual-cpp-build-tools/
PyTorch now has Windows binaries on PyPI and should work with the standard install. But if issues arise, try running the install directly from PyTorch.
```
pip install txtai -f https://download.pytorch.org/whl/torch_stable.html
```
See pytorch.org for more information.

macOS

Run the following before installing
```
brew install libomp
```
See this link for more information.

See this GitHub workflow file for an example of environment-dependent installation procedures.

Examples

The examples directory has a series of examples and notebooks giving an overview of txtai. See the list of notebooks below.

Notebooks

Notebook	Description
Introducing txtai	Overview of the functionality provided by txtai
Build an Embeddings index with Hugging Face Datasets	Index and search Hugging Face Datasets
Build an Embeddings index from a data source	Index and search a data source with word embeddings
Add semantic search to Elasticsearch	Add semantic search to existing search systems
Extractive QA with txtai	Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch	Run extractive question-answering queries with Elasticsearch
Apply labels with zero shot classification	Use zero shot learning for labeling, classification and topic modeling
API Gallery	Using txtai in JavaScript, Java, Rust and Go

Configuration

The following sections cover available settings for each txtai component. See the example notebooks for detailed examples on how to use each txtai component.

Embeddings

An Embeddings instance is the engine that provides similarity search. Embeddings can be used to run ad-hoc similarity comparisons or to build and search large indices.
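As a rough illustration of what a similarity search does conceptually, the sketch below ranks toy vectors by cosine similarity to a query vector. It is an exact brute-force comparison written for this README, not txtai's implementation; the `cosine` and `search` helpers and the three-dimensional vectors are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, index, limit=3):
    """Rank (id, vector) pairs by similarity to the query vector."""
    scored = [(uid, cosine(query, vector)) for uid, vector in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy "sentence embeddings"; real ones would come from a vectors model
index = [
    ("doc-a", [0.9, 0.1, 0.0]),
    ("doc-b", [0.0, 1.0, 0.2]),
    ("doc-c", [0.7, 0.3, 0.1]),
]

results = search([1.0, 0.0, 0.0], index, limit=2)
print(results)
```

An approximate nearest neighbor backend trades some of this exactness for speed on large indices.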

Embeddings parameters are set through the constructor. Examples below.

from txtai.embeddings import Embeddings

# Transformers embeddings model
Embeddings({"method": "transformers",
            "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model; vectors holds the path to a local word embeddings model
Embeddings({"path": vectors,
            "storevectors": True,
            "scoring": "bm25",
            "pca": 3,
            "quantize": True})

method

method: transformers|words

Sets the sentence embeddings method to use. When set to transformers, the embeddings object builds sentence embeddings using sentence-transformers. Otherwise, a word embeddings model is used. Defaults to words.

path

path: string

Required field that sets the path for a vectors model. When method is set to transformers, this must be a path to a Hugging Face transformers model. Otherwise, it must be a path to a local word embeddings model.

storevectors

storevectors: boolean

Enables copying of the vectors model set in path into the embeddings model's output directory on save. This option enables a fully encapsulated index with no external file dependencies.

scoring

scoring: bm25|tfidf|sif

For word embedding models, a scoring model allows building weighted averages of word vectors for a given sentence. Supports BM25, tf-idf and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
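The difference between a plain mean and a weighted average can be sketched with toy data. The example below uses a simplified BM25-style inverse document frequency as the token weight; this is an assumption made for illustration, and txtai's scoring models may compute weights differently (full BM25 also accounts for term frequency and document length). The helper names `idf_weights` and `sentence_embedding` are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Simplified BM25-style inverse document frequency per token."""
    total = len(documents)
    counts = Counter(token for tokens in documents for token in set(tokens))
    return {token: math.log(1 + (total - n + 0.5) / (n + 0.5))
            for token, n in counts.items()}

def sentence_embedding(tokens, vectors, weights=None):
    """Average word vectors; with weights, build a weighted average instead."""
    known = [t for t in tokens if t in vectors]
    factors = [weights.get(t, 0.0) if weights else 1.0 for t in known]
    total = sum(factors)
    dims = len(next(iter(vectors.values())))
    if not total:
        return [0.0] * dims
    return [sum(f * vectors[t][d] for f, t in zip(factors, known)) / total
            for d in range(dims)]

# Toy two-dimensional word vectors and a tiny tokenized corpus
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.0, -1.0]}
documents = [["the", "cat"], ["the", "dog"], ["the", "cat", "dog"]]

mean = sentence_embedding(["the", "cat"], vectors)
weighted = sentence_embedding(["the", "cat"], vectors, idf_weights(documents))
print(mean, weighted)
```

Because "the" appears in every document, its weight is small and the weighted embedding moves toward the vector for "cat".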

pca

pca: int

Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. This step is applied after pooling of vectors has produced a single sentence embedding.
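To make the idea of removing a principal component concrete, here is a minimal pure-Python sketch for n = 1. It estimates the dominant direction of a small set of embeddings with power iteration and subtracts each embedding's projection onto it. This is an illustration under stated assumptions, not txtai's code; the helper names are hypothetical and a real setup would rely on a library SVD.

```python
import math

def top_component(rows, iterations=100):
    """Estimate the first right singular vector of rows by power iteration."""
    dims = len(rows[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        # Multiply v by X^T X: project rows onto v, then recombine the rows
        proj = [sum(r[d] * v[d] for d in range(dims)) for r in rows]
        w = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def remove_component(rows, component):
    """Subtract each row's projection onto the unit-length component."""
    cleaned = []
    for r in rows:
        p = sum(a * b for a, b in zip(r, component))
        cleaned.append([a - p * b for a, b in zip(r, component)])
    return cleaned

# Toy embeddings that all share a large value in the first dimension
embeddings = [
    [2.0, 0.1, 0.0],
    [1.9, -0.2, 0.1],
    [2.1, 0.0, -0.1],
]

component = top_component(embeddings)
cleaned = remove_component(embeddings, component)
print(cleaned)
```

After removal, the shared direction no longer dominates, so the remaining dimensions contribute more to similarity comparisons.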

backend

backend: annoy|faiss|hnsw

Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss for Linux/macOS and Annoy for Windows. Faiss currently is not supported on Windows.

Backend-specific settings are set with a corresponding configuration object having the same name as the backend (i.e. annoy, faiss, or hnsw). None of these are required and are set to defaults if omitted.

annoy

annoy:
  ntrees: number of trees (int) - defaults to 10
  searchk: search_k search setting (int) - defaults to -1

See Annoy documentation for more information on these parameters.

faiss

faiss:
  components: Comma separated list of components - defaults to None
  nprobe: search probe setting (int) - defaults to 6

See Faiss documentation on the index factory and search for more information on these parameters.

hnsw

hnsw:
  efconstruction: ef_construction param for init_index (int) - defaults to 200
  m: M param for init_index (int) - defaults to 16
  randomseed: random-seed param for init_index (int) - defaults to 100
  efsearch: ef search param (int) - defaults to None and not set

See Hnswlib documentation for more information on these parameters.
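The backend settings above can be combined into one configuration dictionary of the kind passed to the Embeddings constructor. The sketch below only builds and inspects plain dictionaries; `BACKEND_DEFAULTS` and `backend_settings` are hypothetical helpers written for this example (they are not part of txtai) that apply the documented defaults for any omitted setting.

```python
# Defaults as listed in the annoy, faiss and hnsw sections
BACKEND_DEFAULTS = {
    "annoy": {"ntrees": 10, "searchk": -1},
    "faiss": {"components": None, "nprobe": 6},
    "hnsw": {"efconstruction": 200, "m": 16, "randomseed": 100, "efsearch": None},
}

def backend_settings(config):
    """Merge backend-specific settings over the documented defaults."""
    backend = config["backend"]
    settings = dict(BACKEND_DEFAULTS[backend])
    settings.update(config.get(backend) or {})
    return settings

# Configuration with a backend and a same-named settings object
config = {
    "method": "transformers",
    "path": "sentence-transformers/bert-base-nli-mean-tokens",
    "backend": "hnsw",
    "hnsw": {"efconstruction": 400, "m": 32},
}

settings = backend_settings(config)
print(settings)
```

Settings that are left out, such as randomseed here, fall back to their defaults.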

quantize

quantize: boolean

Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings are stored with 8-bit precision instead of 32-bit. Only Faiss currently supports quantization.
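The storage trade-off behind 8-bit precision can be sketched with a simple scalar quantizer that maps each float to one byte. This is a conceptual example only; the quantizer used inside Faiss works differently in its details, and the `quantize` and `dequantize` helpers are hypothetical.

```python
def quantize(vector):
    """Map floats to integers in 0..255 using the vector's own min/max range."""
    low, high = min(vector), max(vector)
    scale = (high - low) / 255 or 1.0
    codes = bytes(round((x - low) / scale) for x in vector)
    return codes, low, scale

def dequantize(codes, low, scale):
    """Recover approximate floats from the 8-bit codes."""
    return [low + code * scale for code in codes]

vector = [-0.82, 0.15, 0.4, 0.97]
codes, low, scale = quantize(vector)
restored = dequantize(codes, low, scale)

# One byte per dimension instead of four, at the cost of a small rounding error
print(len(codes), restored)
```

Each restored value differs from the original by at most half a quantization step.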

Pipelines

txtai provides a light wrapper around a couple of the Hugging Face pipelines. All pipelines have the following common parameters.

path

path: string

Required path to a Hugging Face model

quantize

quantize: boolean

Enables dynamic quantization of the Hugging Face model. This is a runtime setting and doesn't save space. It is used to improve the inference time performance of models.

gpu

gpu: boolean

Enables GPU inference.

model

model: Hugging Face pipeline or txtai pipeline

Shares the underlying model of the passed in pipeline with this pipeline. This allows having variations of a pipeline without having to store multiple copies of the full model in memory.
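The effect of sharing a model can be illustrated with a small stand-in. The `Model` and `Pipeline` classes below are hypothetical and exist only for this sketch; they are not txtai's classes. The point is that the second pipeline holds a reference to the same model object rather than loading a second copy.

```python
class Model:
    """Stand-in for a large model loaded from a path."""
    def __init__(self, path):
        self.path = path

class Pipeline:
    """Hypothetical pipeline wrapper used only for illustration."""
    def __init__(self, path=None, model=None):
        if model is not None:
            # Reuse the model held by another pipeline instead of loading again
            self.model = model.model if isinstance(model, Pipeline) else model
        else:
            self.model = Model(path)

first = Pipeline(path="roberta-large-mnli")
second = Pipeline(model=first)

# Both pipelines point at the same object in memory
print(second.model is first.model)
```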

Extractor

An Extractor pipeline is a combination of an embeddings query and an Extractive QA model. Filtering the context for a QA model helps maximize performance of the model.
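The filtering step can be sketched without any model. In the example below, token overlap stands in for an embeddings similarity score (an assumption made so the code runs on its own), and only the best-matching sections are joined into the context that a QA model would then receive together with the question. The helper names are hypothetical.

```python
def tokenize(text):
    """Lowercase and split text, dropping basic punctuation."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def score(query, text):
    """Stand-in for an embeddings similarity score: Jaccard token overlap."""
    q, t = tokenize(query), tokenize(text)
    return len(q & t) / len(q | t) if q | t else 0.0

def build_context(query, sections, topn=2):
    """Keep only the sections most similar to the query."""
    ranked = sorted(sections, key=lambda s: score(query, s), reverse=True)
    return " ".join(ranked[:topn])

sections = [
    "The Giants won the game 5 to 4.",
    "Heavy rain is expected in the city tomorrow.",
    "The Giants game went to extra innings.",
]

context = build_context("Giants game score", sections)
print(context)
```

Passing a short, relevant context instead of every section keeps unrelated text away from the QA model.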

Extractor parameters are set as constructor arguments. The constructor signature is shown below.

Extractor(embeddings, path, quantize, gpu, model, tokenizer)

embeddings

embeddings: Embeddings object instance

Embeddings object instance. Used to query and find candidate text snippets to run the question-answer model against.

tokenizer

tokenizer: Tokenizer function

Optional custom tokenizer function to parse input queries

Labels

A Labels pipeline uses a zero shot classification model to apply labels to input text.

Labels parameters are set as constructor arguments. Examples below.

Labels()
Labels("roberta-large-mnli")

Similarity

A Similarity pipeline is also a zero shot classifier model where the labels are the queries. The results are transposed to get scores per query/label vs scores per input text.
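The transposition described above can be sketched with made-up scores. A zero-shot classifier yields one set of label scores per input text; the hypothetical `transpose` helper below regroups them into one ranked list of (text index, score) pairs per query. The score values are invented for the example and do not come from a model.

```python
def transpose(scores_per_text, queries):
    """Turn per-text label scores into per-query lists of (text index, score)."""
    results = []
    for query in queries:
        ranked = [(i, scores[query]) for i, scores in enumerate(scores_per_text)]
        results.append(sorted(ranked, key=lambda pair: pair[1], reverse=True))
    return results

# Made-up classifier output: one {label: score} mapping per input text
queries = ["sports", "weather"]
scores_per_text = [
    {"sports": 0.91, "weather": 0.05},
    {"sports": 0.10, "weather": 0.88},
    {"sports": 0.40, "weather": 0.30},
]

per_query = transpose(scores_per_text, queries)
print(per_query)
```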

Similarity parameters are set as constructor arguments. Examples below.

Similarity()
Similarity("roberta-large-mnli")

API

txtai has a full-featured API that can optionally be enabled for any txtai process. All functionality found in txtai can be accessed via the API. The following is an example configuration and startup script for the API.

Note that this configuration file enables all functionality (embeddings, extractor, labels, similarity). It is suggested that separate processes are used for each instance of a txtai component.

# Index file path
path: /tmp/index

# Allow indexing of documents
writable: True

# Embeddings settings
embeddings:
  method: transformers
  path: sentence-transformers/bert-base-nli-mean-tokens

# Extractor settings
extractor:
  path: distilbert-base-cased-distilled-squad

# Labels settings
labels:

# Similarity settings
similarity:

Assuming this YAML content is stored in a file named index.yml, the following command starts the API process.

CONFIG=index.yml uvicorn "txtai.api:app"

Supported language bindings

The following programming languages have txtai bindings: JavaScript, Java, Rust and Go.

For additional language bindings, please add an issue!
