txtai: AI-powered search engine
txtai builds an AI-powered index over sections of text. It supports building text indices to run similarity searches and to create extractive question-answering systems.
NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:
- cord19q - COVID-19 literature analysis
- paperai - AI-powered literature discovery and review engine for medical/scientific papers
- neuspo - a fact-driven, real-time sports event and news site
- codequestion - Ask coding questions directly from the terminal
txtai is built on the following stack:
- sentence-transformers
- transformers
- faiss
- Python 3.6+
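The similarity-search idea at the heart of txtai can be illustrated with a minimal, self-contained sketch. The 3-dimensional "embeddings" below are invented for demonstration; a real index uses model-generated vectors and an ANN backend such as Faiss.

```python
# Toy illustration of similarity search: sentences are mapped to vectors,
# and a query returns the closest stored vectors by cosine score.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend each sentence was embedded by a model (vectors are made up)
index = {
    "US tops 5 million confirmed virus cases": [0.9, 0.1, 0.0],
    "Maine man wins $1M from $25 lottery ticket": [0.1, 0.8, 0.2],
}

def search(query_vector, limit=1):
    scored = sorted(index.items(), key=lambda kv: cosine(query_vector, kv[1]),
                    reverse=True)
    return [text for text, _ in scored[:limit]]

# A query vector close to the first sentence returns that sentence first
print(search([0.85, 0.15, 0.05]))
```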
Installation
The easiest way to install is via pip and PyPI:
pip install txtai
You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.
pip install git+https://github.com/neuml/txtai
Python 3.6+ is supported
Troubleshooting
This project has dependencies that require compiling native code. Most Linux environments will install without any additional steps, but Windows and macOS systems require the steps below.
Windows
- Install C++ Build Tools - https://visualstudio.microsoft.com/visual-cpp-build-tools/
- PyTorch Windows binaries are not on PyPI; the following URL must be added when installing:

pip install txtai -f https://download.pytorch.org/whl/torch_stable.html

See pytorch.org for more information.
macOS
- Run the following before installing:

brew install libomp

See this link for more information.
Examples
The examples directory has a series of examples and notebooks giving an overview of txtai. See the list of notebooks below.
Notebooks
| Notebook | Description |
|---|---|
| Introducing txtai | Overview of the functionality provided by txtai |
| Extractive QA with txtai | Extractive question-answering with txtai |
| Build an Embeddings index from a data source | Embeddings index from a data source backed by word embeddings |
| Extractive QA with Elasticsearch | Extractive question-answering with Elasticsearch |
Configuration
The following section goes over available settings for Embeddings and Extractor instances.
Embeddings
Embeddings settings are passed through the constructor. Examples below.
from txtai.embeddings import Embeddings

# Transformers embeddings model
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model
embeddings = Embeddings({"path": vectors,
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})
method
method: transformers|words
Sets the sentence embeddings method to use. When set to transformers, the embeddings object builds sentence embeddings using the sentence-transformers library. Otherwise, a word embeddings model is used. Defaults to words.
path
path: string
Required field that sets the path for a vectors model. When method is set to transformers, this must be a path to a Hugging Face transformers model. Otherwise, it must be a path to a local word embeddings model.
storevectors
storevectors: boolean
Enables copying of the vectors model set in path into the embeddings model's output directory on save. This option enables a fully encapsulated index with no external file dependencies.
scoring
scoring: bm25|tfidf|sif
For word embedding models, a scoring model allows building weighted averages of word vectors for a given sentence. Supports BM25, tf-idf and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
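As an illustration of how a scoring model changes pooling, here is a hedged sketch contrasting a plain mean of word vectors with SIF-style weighting, where each word gets weight a / (a + p(word)). The word probabilities and 2-dimensional vectors are invented for demonstration only.

```python
# SIF-style weighted pooling vs a plain mean of word vectors.
A = 0.001  # SIF smoothing constant

freq = {"the": 0.05, "virus": 0.0001, "spreads": 0.0002}   # made-up probabilities
vectors = {"the": [0.1, 0.1], "virus": [0.9, 0.2], "spreads": [0.3, 0.8]}

def mean_pool(words):
    # Every word contributes equally to the sentence embedding
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(2)]

def sif_pool(words):
    # Frequent words like "the" get a weight near zero
    weights = [A / (A + freq[w]) for w in words]
    total = sum(weights)
    return [sum(wt * vectors[w][i] for wt, w in zip(weights, words)) / total
            for i in range(2)]

sentence = ["the", "virus", "spreads"]
print(mean_pool(sentence))
print(sif_pool(sentence))
```

In the weighted result, the content words "virus" and "spreads" dominate; in the plain mean, "the" pulls the embedding toward its own vector.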
pca
pca: int
Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. This step is applied after pooling of word vectors creates a single sentence embedding.
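The component-removal idea can be sketched with plain numpy (txtai builds a TruncatedSVD model; numpy's SVD is used here only to illustrate what removing the top principal component does to a batch of embeddings):

```python
# Remove the dominant principal component from a set of sentence embeddings.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 5))   # 10 sentence embeddings, 5 dimensions

# The first right singular vector is the dominant principal component
_, _, vt = np.linalg.svd(embeddings, full_matrices=False)
component = vt[0]

# Subtract each embedding's projection onto that component
cleaned = embeddings - np.outer(embeddings @ component, component)

# The cleaned embeddings have (near) zero energy along the removed direction
print(float(np.abs(cleaned @ component).max()))
```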
backend
backend: annoy|faiss|hnsw
Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss for Linux/macOS and Annoy for Windows. Faiss currently is not supported on Windows.
quantize
quantize: boolean
Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision vs 32-bit. Only Faiss currently supports quantization.
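The core idea of 8-bit storage can be shown with a minimal scalar-quantization sketch (Faiss uses more elaborate quantizers; this only demonstrates the precision-for-space trade-off, assuming values in [-1, 1]):

```python
# Map float values in [-1, 1] to int8-range codes and back.
def quantize(vector, levels=127):
    return [round(x * levels) for x in vector]   # codes fit in a signed byte

def dequantize(codes, levels=127):
    return [c / levels for c in codes]

vector = [0.25, -0.5, 0.875]
codes = quantize(vector)
restored = dequantize(codes)

# Reconstruction error is bounded by half a quantization step
error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, round(error, 4))
```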
Extractor
Extractor methods are set as constructor arguments. Examples below.
Extractor(embeddings, path, quantize)
embeddings
embeddings: Embeddings object instance
Embeddings object instance. Used to query and find candidate text snippets to run the question-answer model against.
path
path: string
Required path to a Hugging Face SQuAD fine-tuned model. Used to answer questions.
quantize
quantize: boolean
Enables dynamic quantization of the Hugging Face model. This is a runtime setting and doesn't save space. It is used to improve the inference time performance of the QA model.
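The control flow behind Extractor is a retrieve-then-read pattern: the embeddings index narrows the text to a few candidate sections, and only those are passed to the question-answering model. The sketch below fakes retrieval with word overlap and the QA step with a trivial string lookup, purely to show the shape of the pipeline; both stand-ins are invented for illustration.

```python
# Retrieve-then-read: cheap retrieval first, expensive "reading" on top hits.
def retrieve(question, sections, limit=2):
    # Stand-in for an embeddings similarity query
    qwords = set(question.lower().split())
    scored = sorted(sections,
                    key=lambda s: len(qwords & set(s.lower().split())),
                    reverse=True)
    return scored[:limit]

def answer(question, section):
    # Stand-in for span extraction by a SQuAD fine-tuned model
    return section.split(",")[-1].strip()

sections = [
    "Giants hit 3 HRs to down Dodgers, final score 8-3",
    "Flyers 4 Lightning 1, final in OT",
]

question = "What was the score of the Giants game?"
candidates = retrieve(question, sections, limit=1)
print(answer(question, candidates[0]))
```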
Download files
Source Distribution
Built Distribution
File details
Details for the file txtai-1.2.1.tar.gz.
File metadata
- Download URL: txtai-1.2.1.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dcb2d8d64d1a592df2842ed3e53691ab78c499947c2aa502d7c32474e2f2661 |
| MD5 | 85eb777170920c0ab9f61f85237d7c35 |
| BLAKE2b-256 | 581dc5524ab1fbd436532e0ff37bd7c3b0a6e002ea433d7fc42feb4dc6ff3409 |
File details
Details for the file txtai-1.2.1-py3-none-any.whl.
File metadata
- Download URL: txtai-1.2.1-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6a704fe1f8549df1ee8decd5e9c6e0cbbc371b8f2389a35c5b9c5f113ebdae6 |
| MD5 | cbb77cae35c07eb33fc64a4be01fe37b |
| BLAKE2b-256 | c361d4211c3f77a9ca58037ee908cbf5fc5cfddbfcdad4101ba16e79d1a24cca |