A CLI tool for interacting with the Vectara platform, including advanced text processing and indexing features.
vectara-cli
vectara-cli is a Python package designed to interact with the Vectara platform, providing a command-line interface (CLI) and a set of APIs for indexing and querying documents, managing corpora, and performing advanced text analysis and processing tasks. This package is particularly useful for developers and data scientists working on search and information retrieval applications.
Features
- Indexing text and documents into Vectara corpora.
- Querying indexed documents.
- Creating and deleting corpora.
- Advanced text processing and analysis using pre-trained models (optional advanced package(s)).
Basic Installation
The basic installation includes the core functionality for interacting with the Vectara platform.
pip install vectara-cli
Advanced Installation
The advanced installation includes additional dependencies for advanced text processing and analysis features. This requires PyTorch, Transformers, and Accelerate, which can be substantial in size.
pip install vectara-cli[rebel_span]
Ensure you have an appropriate PyTorch version installed for your system, especially if you're installing on a machine with GPU support. Refer to the official PyTorch installation guide for more details.
Command Line Interface (CLI) Usage
vectara-cli provides a command line interface for interacting with the Vectara platform, enabling tasks such as document indexing, querying, corpus management, and advanced text processing directly from your terminal.
Before you start, always set your API keys with:
vectara set-api-keys <customer_id> <api_key>
Deploy Your App
- vectara create-ui: Creates a new UI for your app.
Note: this command assumes that Node.js and npm are installed on your system, as required by the npx command.
Basic Usage of Vectara CLI
The Vectara CLI provides a simple and efficient way to interact with the Vectara platform, allowing users to create corpora, index documents, and perform various other operations directly from the command line. This section covers the basic usage of the Vectara CLI for common tasks such as creating a corpus and indexing documents.
Creating a Corpus
To create a new corpus, use the create-corpus command. A corpus represents a collection of documents and serves as the primary organizational unit within Vectara.
Basic Corpus Creation
vectara create-corpus <corpus_id> <name> <description>
- <corpus_id>: The unique identifier for the corpus. Must be an integer.
- <name>: The name of the corpus. This should be a unique name that describes the corpus.
- <description>: A brief description of what the corpus is about.
Example
vectara create-corpus 123 "My Corpus" "A corpus containing documents on topic XYZ"
This command creates a basic corpus with the specified ID, name, and description.
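When the command is driven from a script, it can help to assemble the arguments as a list rather than a single string, so that names and descriptions containing spaces are passed through intact. The following is a minimal sketch; the helper name build_create_corpus_command is hypothetical and not part of vectara-cli, and only the command name and argument order come from the syntax above.

```python
import shlex


def build_create_corpus_command(corpus_id, name, description):
    """Return the argv list for `vectara create-corpus` (hypothetical helper)."""
    # The corpus ID must be an integer, so reject anything else early.
    if isinstance(corpus_id, bool) or not isinstance(corpus_id, int):
        raise ValueError("corpus_id must be an integer")
    return ["vectara", "create-corpus", str(corpus_id), name, description]


command = build_create_corpus_command(
    123, "My Corpus", "A corpus containing documents on topic XYZ"
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once vectara-cli is installed and the API keys are set.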
Indexing a Document
To index a document into a corpus, use the index-text command. This command adds a text document to the specified corpus, making it searchable within the Vectara platform.
Indexing Text
vectara index-text <corpus_id> <document_id> <text> <context> <metadata_json>
- <corpus_id>: The unique identifier for the corpus where the document will be indexed.
- <document_id>: A unique identifier for the document being indexed.
- <text>: The actual text content of the document that you want to index.
- <context>: Additional context or information about the document.
- <metadata_json>: A JSON string containing metadata about the document.
Example
vectara index-text 12345 67890 "This is the text of the document." "Summary of the document" '{"author":"John Doe", "publishDate":"2024-01-01"}'
This command indexes a document with the provided text, context, and metadata into the specified corpus.
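Hand-writing the metadata JSON on the command line is error-prone because of shell quoting. One possible approach, sketched below, is to build the metadata as a Python dictionary and serialize it with json.dumps before passing it as the last argument. The helper name build_index_text_command is an assumption made for illustration; only the command name and argument order are taken from the syntax above.

```python
import json
import shlex


def build_index_text_command(corpus_id, document_id, text, context, metadata):
    """Return the argv list for `vectara index-text` (hypothetical helper)."""
    # json.dumps produces valid JSON: double-quoted keys, escaped characters.
    metadata_json = json.dumps(metadata)
    return [
        "vectara", "index-text",
        str(corpus_id), str(document_id), text, context, metadata_json,
    ]


command = build_index_text_command(
    12345,
    67890,
    "This is the text of the document.",
    "Summary of the document",
    {"author": "John Doe", "publishDate": "2024-01-01"},
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```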
Advanced Corpus Creation
For more advanced scenarios, you might want to specify additional options such as custom dimensions, filter attributes, or privacy settings for your corpus. The create-corpus-advanced command allows for these additional configurations.
Advanced Creation with Options
vectara create-corpus-advanced <name> <description> [options]
Options include setting custom dimensions, filter attributes, public/private status, and more.
Example
vectara create-corpus-advanced "Research Papers" "Corpus for academic research papers" --custom_dimensions '{"dimension1": "value1", "dimension2": "value2"}' --filter_attributes '{"author": "John Doe"}'
This command creates a corpus with custom dimensions and filter attributes specified, allowing for more detailed organization and retrieval capabilities.
Deleting a Corpus
To remove an existing corpus from the Vectara platform, use the delete-corpus command. Deleting a corpus permanently removes the corpus and all documents contained within it. This action cannot be undone, so make sure you really want to delete the corpus before proceeding.
Basic Corpus Deletion
vectara delete-corpus <corpus_id>
- <corpus_id>: The unique identifier for the corpus you wish to delete. This must be an integer.
Example
vectara delete-corpus 12345
This command deletes the corpus with the specified ID from the Vectara platform. Upon successful deletion, you will receive a confirmation message. If the corpus cannot be found or if there is an error during the deletion process, an error message will be displayed instead.
Uploading a Document
To upload a document to a specific corpus in the Vectara platform, use the upload-document command. This allows you to add various types of documents, such as PDFs, Word documents, and plain text files, making them searchable within your corpus.
Basic Document Upload
vectara upload-document <corpus_id> <file_path> [document_id]
- <corpus_id>: The unique identifier for the corpus where the document will be uploaded. This must be an integer.
- <file_path>: The path to the document file that you want to upload.
- [document_id]: An optional parameter that specifies the document ID. If not provided, Vectara will generate a unique ID for the document.
Example
vectara upload-document 12345 "/path/to/document.pdf"
This command uploads a document from the specified file path to the corpus with the given ID. If the upload is successful, you will receive a confirmation message along with any relevant details provided by the Vectara platform.
Uploading with a Specific Document ID
If you wish to specify a document ID during the upload process, you can include it as an additional argument:
vectara upload-document 12345 "/path/to/document.pdf" "custom-document-id-123"
This allows you to assign a custom identifier to the document, which can be useful for tracking or referencing the document within your application or database.
Supported Document Formats
Vectara supports a variety of document formats for upload, including but not limited to:
- PDF (.pdf)
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Plain Text (.txt)
Ensure that your documents are in one of the supported formats before attempting to upload them to the Vectara platform.
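A quick local check before uploading can catch unsupported files early. The sketch below only tests the file extension against the four formats listed above; the constant and function names are hypothetical, and because the list is described as non-exhaustive, a False result does not necessarily mean the platform would reject the file.

```python
from pathlib import Path

# Extensions named in the list above; the platform may accept others.
LISTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def has_listed_extension(file_path):
    """Return True if the file extension is one of the listed formats."""
    # Compare case-insensitively so REPORT.PDF is treated like report.pdf.
    return Path(file_path).suffix.lower() in LISTED_EXTENSIONS


print(has_listed_extension("/path/to/document.pdf"))
print(has_listed_extension("/path/to/archive.zip"))
```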
Metadata and Context
While the basic upload command does not include options for metadata and context, it's important to note that Vectara allows for the association of metadata with documents. This can be accomplished through advanced usage of the Vectara CLI or API, enabling you to provide additional information about the documents you upload, such as author, publication date, tags, and more.
For detailed instructions on advanced document upload options, including how to include metadata and context, please refer to the Vectara documentation or the advanced usage section of the Vectara CLI help.
Querying
To perform a query in a specific corpus:
vectara query "<query_text>" <num_results> <corpus_id>
- <query_text>: The text of the query.
- <num_results>: The maximum number of results to return.
- <corpus_id>: The ID of the corpus to query against.
Configuration
Optional: Conda Virtual Environment Setup
Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. It allows you to install, run, and update packages and their dependencies. To set up this project using Conda, follow the steps below:
Prerequisites
- Ensure that you have Conda installed on your system. If you do not have Conda installed, you can download it from the official Conda website.
Creating a Conda Environment
1. Open your terminal (or Anaconda Prompt on Windows).
2. Navigate to the project directory where the environment.yml file is located.
3. Create a new Conda environment by running the following command:
conda env create -f environment.yml
Activating the Environment
Once the environment is created, you can activate it using the following command:
conda activate vectara
Deactivating the Environment
When you are done working on the project, you can deactivate the Conda environment by running:
conda deactivate
Updating the Environment
If you need to update the environment based on the environment.yml file, use the following command:
conda env update -f environment.yml --prune
This will update the environment with any new dependencies specified in the environment.yml file.
Removing the Environment
If you wish to remove the Conda environment, you can do so with the following command:
conda env remove -n vectara
By following these steps, you can manage your project's dependencies in an isolated environment using Conda.
Configuration
Setting Credentials via CLI Commands
The vectara-cli tool supports setting your Vectara customer ID and API key directly from the command line. A dedicated command stores your credentials, so you can manage your Vectara configuration without manually setting environment variables or embedding credentials in your scripts.
Using the set-api-keys Command
To set your Vectara customer ID and API key using vectara-cli, use the set-api-keys command. This command stores your credentials securely, allowing vectara-cli to automatically use them for authentication in future operations.
- Syntax: The command follows this simple syntax:
vectara set-api-keys <customer_id> <api_key>
Replace <customer_id> with your Vectara customer ID and <api_key> with your Vectara API key.
- Example:
vectara set-api-keys 123456789 abcdefghijklmnopqrstuvwxyz
After executing this command, you will see a confirmation message indicating that your API keys have been set successfully.
Windows
For Windows users, you can also set environment variables through the Command Prompt or PowerShell, or via the System Properties window.
- Command Prompt:
setx VECTARA_CUSTOMER_ID "your_customer_id"
setx VECTARA_API_KEY "your_api_key"
- PowerShell:
[System.Environment]::SetEnvironmentVariable('VECTARA_CUSTOMER_ID', 'your_customer_id', [System.EnvironmentVariableTarget]::User)
[System.Environment]::SetEnvironmentVariable('VECTARA_API_KEY', 'your_api_key', [System.EnvironmentVariableTarget]::User)
Note that changes made through the command line will only take effect in new instances of the terminal or command prompt.
Using Credentials in vectara-cli
Once you have set up your environment variables, vectara-cli will automatically use these credentials for authentication. There's no need to manually input your customer ID and API key each time you execute a command.
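The same environment variables can also be read from your own Python scripts, which keeps credentials out of source code. The following is a minimal sketch: the helper load_vectara_credentials is hypothetical and not part of vectara-cli, and it assumes the variable names VECTARA_CUSTOMER_ID and VECTARA_API_KEY shown in the Windows examples above.

```python
import os


def load_vectara_credentials(env=None):
    """Return (customer_id, api_key) from the environment, or raise."""
    # Accept any mapping so the function can be exercised without touching
    # the real process environment.
    env = os.environ if env is None else env
    missing = [
        name
        for name in ("VECTARA_CUSTOMER_ID", "VECTARA_API_KEY")
        if not env.get(name)
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["VECTARA_CUSTOMER_ID"], env["VECTARA_API_KEY"]


# Demonstration with an in-memory mapping instead of the real environment.
customer_id, api_key = load_vectara_credentials(
    {"VECTARA_CUSTOMER_ID": "your_customer_id", "VECTARA_API_KEY": "your_api_key"}
)
print(customer_id)
```

The returned pair could then be passed to the client constructor described under Programmatic Usage.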
Programmatic Usage
Setting Up a Vectara Client
First, initialize the Vectara client with your customer ID and API key. This client will be used for all subsequent operations.
from vectara_cli.core import VectaraClient
customer_id = 'your_customer_id'
api_key = 'your_api_key'
vectara_client = VectaraClient(customer_id, api_key)
Indexing a Document
To index a document, you need its corpus ID, a unique document ID, and the text you want to index. Optionally, you can include context, metadata in JSON format, and custom dimensions.
corpus_id = 'your_corpus_id'
document_id = 'unique_document_id'
text = 'This is the document text you want to index.'
context = 'Document context'
metadata_json = '{"author": "John Doe"}'
vectara_client.index_text(corpus_id, document_id, text, context, metadata_json)
Indexing Documents from a Folder
To index all documents from a specified folder into a corpus, provide the corpus ID and the folder path.
corpus_id = 'your_corpus_id'
folder_path = '/path/to/your/documents'
results = vectara_client.index_documents_from_folder(corpus_id, folder_path)
for document_id, success, extracted_text in results:
    if success:
        print(f"Successfully indexed document {document_id}.")
    else:
        print(f"Failed to index document {document_id}.")
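For larger folders, a summary is often more useful than one line per document. The sketch below groups the result tuples by outcome. It assumes only the (document_id, success, extracted_text) tuple shape used in the loop above; the function name and the sample data are made up for illustration.

```python
def summarize_index_results(results):
    """Split (document_id, success, extracted_text) tuples by outcome."""
    indexed, failed = [], []
    for document_id, success, _extracted_text in results:
        # Route each document ID to the list matching its outcome.
        (indexed if success else failed).append(document_id)
    return indexed, failed


# Stand-in data; real results would come from index_documents_from_folder.
sample_results = [
    ("doc-1", True, "first document text"),
    ("doc-2", False, ""),
    ("doc-3", True, "third document text"),
]
indexed, failed = summarize_index_results(sample_results)
print(f"{len(indexed)} indexed, {len(failed)} failed: {failed}")
```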
Querying Documents
To query documents, specify your search query, the number of results you want to return, and the corpus ID.
query_text = 'search query'
num_results = 10 # Number of results to return
corpus_id = 'your_corpus_id'
results = vectara_client.query(query_text, num_results, corpus_id)
print(results)
Deleting a Corpus
To delete a corpus, you only need to provide its ID.
corpus_id = 'your_corpus_id'
response, success = vectara_client.delete_corpus(corpus_id)
if success:
    print("Corpus deleted successfully.")
else:
    print("Failed to delete corpus:", response)
Uploading a Document
To upload and index a document, specify the corpus ID, the path to the document, and optionally, a document ID and metadata.
corpus_id = 'your_corpus_id'
file_path = '/path/to/your/document.pdf'
document_id = 'unique_document_id' # Optional
metadata = {"author": "Author Name", "title": "Document Title"} # Optional
try:
    response, status = vectara_client.upload_document(corpus_id, file_path, document_id, metadata)
    print("Upload successful:", response)
except Exception as e:
    print("Upload failed:", str(e))
Advanced Usage
To leverage the advanced text processing capabilities, ensure you have completed the advanced installation of vectara-cli. This includes the necessary dependencies for text analysis:
pip install vectara-cli[rebel_span]
Span Text Processing
To process text using the Span model:
vectara span-text "<text>" "<model_name>" "<model_type>"
- <text>: The text to process.
- <model_name>: The name of the Span model to use.
- <model_type>: The type of the Span model.
Enhanced Batch Processing with NerdSpan
To process and upload documents from a folder:
vectara nerdspan-upsert-folder "<folder_path>" "<model_name>" "<model_type>"
- <folder_path>: The path to the folder containing documents to process and upload.
- <model_name>: The name of the model to use for processing.
- <model_type>: The type of the model.
For more advanced processing and upsert operations, including using the Rebel model for complex document analysis and upload, refer to the specific command documentation provided with the CLI.
Commercial Advanced Usage
The commercial advanced features of vectara-cli enable users to leverage state-of-the-art text processing models for enriching document indexes with additional metadata. This enrichment process enhances the search and retrieval capabilities of the Vectara platform, providing more relevant and accurate results for complex queries.
Reference: Aarsen, T. (2023). SpanMarker for Named Entity Recognition. Radboud University. Supervised by Prof. Dr. Fermin Moscoso del Prado Martin (fermin.moscoso-del-prado@ru.nl) and Dr. Daniel Vila Suero (daniel@argilla.io). Second assessor: Dr. Harrie Oosterhuis (harrie.oosterhuis@ru.nl).
CLI Commands for Advanced Usage
vectara-cli includes specific commands designed to facilitate advanced text processing and enrichment tasks. Below are the key commands and their usage:
- Supported models: science and keyphrase
Upload Enriched Text
To upload text that has been enriched with additional metadata:
vectara upload-enriched-text <corpus_id> <document_id> <model_name> "<text>"
- <corpus_id>: The ID of the corpus where the document will be uploaded.
- <document_id>: A unique identifier for the document.
- <model_name>: The name of the model used for text enrichment: science or keyphrase.
- <text>: The text content to be enriched and uploaded.
Span Enhance Folder
To process and upload all documents within a folder, enhancing them using a specified model:
vectara span-enhance-folder <corpus_id_1> <corpus_id_2> <model_name> "<folder_path>"
- <corpus_id_1>: The ID for the corpus to upload plain text documents.
- <corpus_id_2>: The ID for the corpus to upload enhanced text documents.
- <model_name>: The name of the model used for document enhancement. Supported models: science and keyphrase.
- <folder_path>: The path to the folder containing the documents to be processed.
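Because span-enhance-folder takes two corpus IDs in a fixed order, it is easy to swap them or mistype the model name when scripting. The sketch below validates the model name against the two supported values and names the corpus arguments explicitly. The helper and its parameter names are hypothetical; only the command name, argument order, and supported model names come from the description above.

```python
SUPPORTED_MODELS = ("science", "keyphrase")


def build_span_enhance_folder_command(
    plain_corpus_id, enhanced_corpus_id, model_name, folder_path
):
    """Return the argv list for `vectara span-enhance-folder` (hypothetical helper)."""
    # Fail before any documents are processed if the model name is unknown.
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(
            f"model_name must be one of {SUPPORTED_MODELS}, got {model_name!r}"
        )
    return [
        "vectara", "span-enhance-folder",
        str(plain_corpus_id), str(enhanced_corpus_id), model_name, folder_path,
    ]


command = build_span_enhance_folder_command(
    111, 222, "keyphrase", "/path/to/your/documents"
)
print(command)
```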
Code Example for Advanced Usage
The following Python code demonstrates how to use the EnterpriseSpan class for advanced text processing and enrichment before uploading the processed documents to Vectara:
from vectara_cli.advanced.commercial.enterpise import EnterpriseSpan
# Initialize the EnterpriseSpan with the desired model
model_name = "keyphrase"
enterprise_span = EnterpriseSpan(model_name)
# Example text to be processed
text = "OpenAI has developed a state-of-the-art language model named GPT-4."
# Predict entities in the text
predictions = enterprise_span.predict(text)
# Format predictions for readability
formatted_predictions = enterprise_span.format_predictions(predictions)
print("Formatted Predictions:\n", formatted_predictions)
# Generate metadata from predictions
metadata = enterprise_span.generate_metadata(predictions)
# Example corpus and document IDs
corpus_id = "123456"
document_id = "doc-001"
# Upload the enriched text along with its metadata to Vectara
enterprise_span.upload_enriched_text(corpus_id, document_id, text, predictions)
print("Enriched text uploaded successfully.")
This example shows how to enrich text with additional metadata using the EnterpriseSpan class and upload it to a specified corpus in Vectara. Enriching documents this way is intended to improve the quality and relevance of search and retrieval on the Vectara platform.
Non-Commercial Advanced Usage
The advanced features allow you to enrich your indexes with additional information automatically. This should produce better results for retrieval.
Non-Commercial Advanced Usage Using Span Models
The vectara-cli package extends its functionality through the advanced usage of Span Models, enabling users to perform sophisticated text analysis and entity recognition tasks. This feature is particularly beneficial for non-commercial applications that require deep understanding and processing of textual data.
The Span class supports processing and indexing documents from a folder, enabling batch operations for efficiency. This feature allows for the automatic extraction of entities from multiple documents, which are then indexed into specified corpora with enriched metadata.
Features
- Named Entity Recognition (NER): Utilize pre-trained Span Models to identify and extract entities from text, enriching your document indexes with valuable metadata.
- Model Flexibility: Choose from a variety of pre-trained models tailored to your specific needs, including fewnerdsuperfine, multinerd, and largeontonote.
- Enhanced Document Indexing: Improve search relevance and results by indexing documents enriched with named entity information.
Usage
1. Initialize Vectara Client: Start by creating a Vectara client instance with your customer ID and API key.
from vectara_cli.core import VectaraClient
customer_id = 'your_customer_id'
api_key = 'your_api_key'
vectara_client = VectaraClient(customer_id, api_key)
2. Load and Use Span Models: The Span class facilitates the loading of pre-trained models and the analysis of text to extract entities.
from vectara_cli.advanced.nerdspan import Span
# Initialize the Span class
span = Span(customer_id, api_key)
# Load a pre-trained model
model_name = "multinerd" # Example model
model_type = "span_marker"
span.load_model(model_name, model_type)
# Analyze text to extract entities
text = "Your text here."
output_str, key_value_pairs = span.analyze_text(model_name)
print(output_str)
3. Index Enhanced Documents: After extracting entities, use the VectaraClient to index the enhanced documents into your corpus.
import json
corpus_id = 'your_corpus_id'
document_id = 'unique_document_id'
metadata_json = json.dumps({"entities": key_value_pairs})
vectara_client.index_text(corpus_id, document_id, text, metadata_json=metadata_json)
Reference: Aarsen, T. (2023). SpanMarker for Named Entity Recognition. Radboud University. Supervised by Prof. Dr. Fermin Moscoso del Prado Martin (fermin.moscoso-del-prado@ru.nl) and Dr. Daniel Vila Suero (daniel@argilla.io). Second assessor: Dr. Harrie Oosterhuis (harrie.oosterhuis@ru.nl).
Non-Commercial Advanced Rag Using Rebel
The mRebel pre-trained model is able to extract triplets for up to 400 relation types from Wikidata.
Use the Rebel class for advanced indexing. This will automatically extract named entities, key phrases, and other relevant information from your documents:
from vectara_cli.advanced.non_commercial.rebel import Rebel
folder_path = '/path/to/your/documents'
query_text = 'search query'
num_results = 10 # Number of results to return
# Initialize the Rebel instance for advanced non-commercial text processing
rebel = Rebel()
# Perform advanced indexing
corpus_id_1, corpus_id_2 = rebel.advanced_upsert_folder(vectara_client, corpus_id_1, corpus_id_2, folder_path)
# Vanilla Retrieval
plain_results = vectara_client.query(query_text, num_results, corpus_id_1)
# Enhanced Retrieval
enhanced_results = vectara_client.query(query_text, num_results, corpus_id_2)
# Print Results
print("=== Plain Results ===")
for result in plain_results:
    print(f"Document ID: {result['documentIndex']}, Score: {result['score']}, Text: {result['text'][:100]}...")
print("\n=== Enhanced Results ===")
for result in enhanced_results:
    print(f"Document ID: {result['documentIndex']}, Score: {result['score']}, Text: {result['text'][:100]}...")
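To compare the two retrieval runs beyond reading the printed lines, the result lists can be reduced to a simple aggregate. The sketch below assumes only that each result is a dictionary with the documentIndex, score, and text keys used in the loops above; the sample values are invented, and the mean score is just one possible comparison, not a metric prescribed by vectara-cli.

```python
def mean_score(results):
    """Average the 'score' field over a list of result dictionaries."""
    if not results:
        return 0.0
    return sum(result["score"] for result in results) / len(results)


def top_result(results):
    """Return the highest-scoring result, or None for an empty list."""
    return max(results, key=lambda result: result["score"], default=None)


# Invented stand-ins for plain_results and enhanced_results.
plain_sample = [
    {"documentIndex": 0, "score": 0.5, "text": "plain match A"},
    {"documentIndex": 1, "score": 0.25, "text": "plain match B"},
]
enhanced_sample = [
    {"documentIndex": 0, "score": 0.75, "text": "enhanced match A"},
    {"documentIndex": 2, "score": 0.5, "text": "enhanced match C"},
]

print("Plain mean score:", mean_score(plain_sample))
print("Enhanced mean score:", mean_score(enhanced_sample))
print("Best enhanced document:", top_result(enhanced_sample)["documentIndex"])
```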
Contributing
Contributing Guidelines for vectara-cli
Thank you for your interest in contributing to vectara-cli! As an open-source project, we welcome contributions from developers of all skill levels. This guide will provide you with information on how to contribute effectively and make a valuable impact on the project.
Prerequisites
Before you begin, ensure you have the following installed:
- Python (preferably the latest Python 3 version)
- Conda (for managing environments)
- Git (for version control)
Identify An Issue
Browse the Issues to find tasks to work on. You can start with issues labeled as "good first issue".
- If you have an idea or a bug fix that is not listed, feel free to open a new issue to discuss it with other contributors.
Setting Up for Contribution
1. Fork the Repository: Visit vectara-cli on GitLab and fork the project to your account.
2. Create a New Branch: Before you start making changes, switch to devbranch and create a new branch for your feature or fix. We encourage naming your branch in a way that reflects the issue or feature you're working on.
git checkout devbranch
git checkout -b feature/your-feature-name
Or, if you're working on a specific issue:
git checkout devbranch
git checkout -b issue/ISSUE_NUMBER-short-description
This naming convention (feature/your-feature-name or issue/ISSUE_NUMBER-short-description) helps identify each branch's purpose, making collaboration and review more efficient.
- The easiest way to make a correctly named branch is to use the GitLab GUI directly inside the issue that you are responding to.
3. Create and Activate Conda Environment:
conda env create -f environment.yml
conda activate vectara-cli
4. Install the Project in Editable Mode:
pip install --editable .
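If you create branches from the command line rather than the GitLab GUI, the issue/ISSUE_NUMBER-short-description convention from step 2 can be generated from an issue title. This is a small sketch only; the function name is hypothetical and the slug rules (lowercase, hyphen-separated) are an assumption rather than a documented project requirement.

```python
import re


def issue_branch_name(issue_number, description):
    """Build an issue/ISSUE_NUMBER-short-description branch name."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"issue/{issue_number}-{slug}"


print(issue_branch_name(42, "Fix query output formatting"))
```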
Develop
1. Add Functionality: Write your code and add it to the appropriate directory:
- For new functionalities, add your code in ./vectara_cli/commands.
- Add command line functionality in main.py.
- Create or modify data objects in ./vectara_cli/data.
2. Add Help Text: Update help texts in ./vectara_cli/help_texts/help_text.py to reflect your changes or new commands.
Write Tests
- Add tests for your new functionalities in the tests/ directory.
- Ensure all tests pass by running them locally.
Document Your Changes
Update any documentation relevant to your changes, including inline comments and README if necessary.
Submitting Your Contributions
1. Commit Your Changes: After making your changes, commit them to your branch. Use descriptive commit messages that explain the "why" and "what" of your changes. This practice helps reviewers understand your reasoning and the context of your contributions.
git add .
git commit -m "A descriptive message explaining the change"
2. Push Your Changes: Once you're ready, push your changes to your forked repository on GitLab.
git push origin feature/your-feature-name
Or, if you're working on an issue:
git push origin issue/ISSUE_NUMBER-short-description
3. Create a Merge Request
- Go to the Merge Requests page.
- Create a new merge request, comparing your feature branch to the main repository's devbranch.
- Fill in a detailed description of your changes and link to any relevant issues.
Review Process
Once your merge request is submitted:
- The project maintainers will review your code and may request changes.
- Collaborate on modifications and push updates to your branch accordingly.
- Once approved, a maintainer will merge your changes into the main codebase.
Post-merge
After your changes have been merged:
- Sync your fork with the original repository.
- Consider deleting your branch to keep your fork clean:
git branch -d your-feature-branch
git push origin --delete your-feature-branch
Thank you for contributing to vectara-cli! For any questions or further discussions, please reach out on the issues page or on Discord.
License
vectara-cli is MIT licensed. See the LICENSE file for more details.
@misc{Vectara Cli,
author = { isayahc , Josephrp, p3nGu1nZz},
title = {Vectara Cli is a Python package for Vectara platform interaction, ideal for search and information retrieval tasks.},
year = {2024},
publisher = {TeamTonic},
journal = {Tonic-AI repository},
howpublished = {\url{https://git.tonic-ai.com/releases/vectara-cli}}
}