Standalone CLI for TOS Vector operations with Volcengine Ark embeddings

Volcengine TOS Vectors Embed CLI

Volcengine TOS Vectors Embed CLI is a standalone command-line tool that simplifies working with vector embeddings in TOS Vectors. You can create vector embeddings for your data using Volcengine Ark, then store and query them in your TOS vector index, each with a single command.

Supported Commands

tos-vectors-embed put: Embed text, file content, or TOS objects and store them as vectors in a TOS vector index with a single command. You specify the input to embed, a Volcengine Ark embeddings model ID, your TOS vector bucket name, and your TOS vector index name. The command accepts several input formats: a text string, a local text or image file, a TOS text or image object, or a TOS prefix. It generates embeddings using the dimensions configured in your TOS vector index properties. When you ingest several objects from a TOS prefix or a local file path, it automatically uses batch processing to maximize throughput.

Note: Each file is processed as a single embedding. Document chunking is not currently supported.

tos-vectors-embed query: Embed a query input and search for similar vectors in a TOS vector index with a single command. You specify your query input, a Volcengine Ark embeddings model ID, the vector bucket name, and the vector index name. The command accepts several types of query input, such as a text string, an image file, or a single TOS text or image object. It generates an embedding for your query using the specified embeddings model and then performs a similarity search to find the most relevant matches. You can control the number of results returned, apply metadata filters to narrow your search, and choose whether to include similarity distances in the results.

Supported Input Types

Note: The CLI uses a single --ark-inference-params parameter for all model-specific parameters. In addition, the query command accepts the following separate input parameters:

  • --text-value: Direct text query string (preferred for text queries)
  • --text: Text file path (local file or TOS URI)
  • --image: Image file path (local file or TOS URI)
  • --video: Video file path (local file or TOS URI)

Installation and Configuration

Prerequisites

  • Python 3.9 or higher
  • Volcengine credentials configured in your environment
  • A Volcengine account with permissions to use Volcengine Ark and TOS Vectors
  • Access to a Volcengine Ark embedding model
  • A Volcengine TOS vector bucket and vector index to store your embeddings

Quick Install (Recommended)

pip install tos-vectors-embed-cli

Development Install

# Clone the repository
git clone <repository-url>
cd tos-vectors-embed-cli

# Install in development mode
pip install -e .

Note: All dependencies are automatically installed when you install the package via pip.

Quick Start

Configure credentials

  1. Set the Ark API key as an environment variable:
export ARK_API_KEY="YOUR_ARK_API_KEY"
  2. Set the TOS credentials as environment variables:
export TOS_ACCESS_KEY="YOUR_TOS_ACCESS_KEY"
export TOS_SECRET_KEY="YOUR_TOS_SECRET_KEY"
export TOS_VECTOR_ENDPOINT="tosvectors-cn-beijing.volces.com" # Optional, defaults to cn-beijing
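
Before running the CLI, it can help to confirm that the variables above are actually visible to the process. The following is a minimal sketch, not part of the CLI: `missing_credentials` is a hypothetical helper, and it only checks the three variables this README names as required.

```python
import os

# Variable names taken from this README; TOS_VECTOR_ENDPOINT is optional.
REQUIRED_VARS = ("ARK_API_KEY", "TOS_ACCESS_KEY", "TOS_SECRET_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required credentials are set.")
```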

Put Examples

  1. Embed text and store it as a vector in your TOS vector index:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Hello, world!"
  2. Process a local text file:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/sample.txt"
  3. Process files from a local file path using wildcard characters:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/*.txt"
  4. Process files from a TOS bucket using wildcard characters:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "tos://bucket/path/*"
  5. Process a single image object from a TOS bucket:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "tos://bucket/images/photo.jpg"
  6. Process a local image file:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "./images/photo.jpg"
  7. Process image files from a local path using wildcard characters:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "./images/*.jpg"
  8. Process image files from a TOS bucket using wildcard characters:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "tos://bucket/images/*"
  9. Add metadata to your vectors:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Sample text" \
  --metadata '{"category": "documentation", "version": "1.0"}'
  10. Use a custom vector key:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Sample text" \
  --key "doc-001"
  11. Use the filename as the vector key:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/report.txt" \
  --filename-as-key
  12. Use a key prefix with auto-generated UUIDs:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Sample text" \
  --key-prefix "temp/"
  13. Use a key prefix with a custom key:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Sample text" \
  --key "doc-001" \
  --key-prefix "project-a/"
  14. Use a key prefix with the filename:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/report.txt" \
  --filename-as-key \
  --key-prefix "docs/"
  15. Process a local video file:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --video "./videos/sample.mp4"
  16. Process a video file from a TOS bucket:
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --video "tos://bucket/videos/sample.mp4"
  17. Multimodal input (Text + Image): Note: Multimodal input currently only supports one image and one text pair.
tos-vectors-embed \
  --account-id 12345678 \
  put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "A beautiful sunset over the mountains" \
  --image "./images/sunset.jpg"
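
When the put command is driven from a script, assembling the argument list in code avoids shell-quoting mistakes, particularly for the JSON passed to --metadata. The sketch below is illustrative only: `build_put_command` is a hypothetical helper, it uses only flags shown in the examples above, and it does not run the CLI.

```python
import json
import shlex


def build_put_command(account_id, bucket, index, model_id, *,
                      text_value=None, text=None, image=None,
                      metadata=None, key=None, key_prefix=None):
    """Assemble an argv list for `tos-vectors-embed put` (hypothetical helper)."""
    if not any((text_value, text, image)):
        raise ValueError("provide at least one of text_value, text, or image")
    argv = [
        "tos-vectors-embed", "--account-id", str(account_id), "put",
        "--vector-bucket-name", bucket,
        "--index-name", index,
        "--model-id", model_id,
    ]
    optional = [
        ("--text-value", text_value), ("--text", text), ("--image", image),
        ("--key", key), ("--key-prefix", key_prefix),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, value]
    if metadata is not None:
        # Serialize the dict so the CLI receives a single JSON string.
        argv += ["--metadata", json.dumps(metadata)]
    return argv


argv = build_put_command(
    "12345678", "my-bucket", "my-index", "doubao-embedding-vision-250615",
    text_value="Sample text", metadata={"category": "documentation"},
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

With the CLI installed, the same list can be handed to `subprocess.run(argv, check=True)`.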

Query Examples

  1. Query with a direct text string:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "query text" \
  --k 20
  2. Query using a local text file:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/query.txt" \
  --k 20 \
  --output table
  3. Query using a TOS text file:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "tos://my-bucket/query.txt" \
  --k 20 
  4. Query using a local image file:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "./documents/image.jpg" \
  --k 20 
  5. Query using a TOS image file:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --image "tos://my-bucket/image.jpg" \
  --k 20 
  6. Query with a metadata filter (exact match):
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "query text" \
  --filter '{"category": {"$eq": "documentation"}}'
  7. Query with multiple filters (AND):
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "query text" \
  --filter '{"$and": [{"category": "tech"}, {"version": {"$gte": "1.0"}}]}'
  8. Query using a local video file:
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --video "./videos/query.mp4"
  9. Multimodal query (Text + Image): Note: Multimodal query currently only supports one image and one text pair.
tos-vectors-embed \
  --account-id 12345678 \
  query \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "search query" \
  --image "./images/query.jpg"

Wildcard Character Support

The CLI supports wildcard characters in the input path so that multiple files can be processed with one command:

Local Filesystem Patterns

  • Basic wildcards: ./data/*.txt - all .txt files in data directory
  • Home directory: ~/documents/*.md - all .md files in user's documents
  • Recursive patterns: ./docs/**/*.txt - all .txt files recursively
  • Multiple extensions: ./files/*.{txt,md,json} - multiple file types
  • Question mark: ./file?.txt - single character wildcard

TOS URI Patterns

Important: TOS wildcards work with prefixes, not file extensions. Use tos://bucket/path/* not tos://bucket/path/*.ext.

Examples:

# Process all files under a TOS prefix
tos-vectors-embed put --vector-bucket-name bucket --index-name idx \
  --model-id doubao-embedding-vision-250615 --text "tos://bucket/path1/*"

Important Differences: Local vs TOS Wildcards

Local Filesystem Wildcards:

  • ✅ Support file extensions: ./data/*.txt, ./docs/*.json
  • ✅ Support complex patterns: ./files/*.{txt,md}, ./doc?.txt
  • ✅ Support recursive patterns: ./docs/**/*.md

TOS Wildcards:

  • ✅ Support prefix patterns: tos://bucket/docs/*, tos://bucket/2024/reports/*
  • ❌ Do NOT support extension filtering: tos://bucket/path/*.json won't work
  • ❌ Do NOT support complex patterns: Use prefix-based organization instead

Best Practices:

  • For TOS: Organize files by prefix/path structure: tos://bucket/json-files/*
  • For Local: Use full wildcard capabilities: ./data/*.{json,txt}
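
The prefix-only rule for TOS paths can be checked before a command is submitted. The sketch below assumes that a pattern such as tos://bucket/path/* is interpreted as bucket "bucket" with prefix "path/"; `split_tos_pattern` is a hypothetical helper and is not the CLI's own parser.

```python
def split_tos_pattern(uri):
    """Split 'tos://bucket/prefix/*' into (bucket, prefix). Hypothetical helper."""
    scheme = "tos://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a TOS URI: {uri}")
    bucket, _, path = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri}")
    # Patterns like '*.json' do not end in '*' and are rejected here.
    if not path.endswith("*"):
        raise ValueError("only a trailing '*' is supported, e.g. tos://bucket/path/*")
    prefix = path[:-1]
    if any(ch in prefix for ch in "*?{}[]"):
        raise ValueError("wildcards are only allowed as a trailing '*'")
    return bucket, prefix


print(split_tos_pattern("tos://bucket/2024/reports/*"))
```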

Global Options

  • --debug: Enable debug mode with detailed logging for troubleshooting
  • --account-id: Volcengine account ID
  • --vectors-region: Region of the TOS vector bucket
  • --vectors-endpoint: Endpoint domain name used to access the TOS vector bucket

Put Command Parameters

Required:

  • --vector-bucket-name: Name of the TOS vector bucket
  • --index-name: Name of the vector index in your vector bucket in which to store the vector embeddings
  • --model-id: Ark model ID to use for generating embeddings

Input Options (one required):

  • --text-value: Direct text input to embed
  • --text: Text input - supports multiple input types:
    • Local file: ./document.txt
    • Local files with wildcard characters: ./data/*.txt
    • TOS object: tos://bucket/path/file.txt
    • TOS path with wildcard characters: tos://bucket/path/*
  • --image: Image input - supports multiple input types:
    • Local file: ./document.jpg
    • Local wildcard: ./data/*.jpg
    • TOS object: tos://bucket/path/file.jpg
    • TOS path with wildcard characters: tos://bucket/path/*
  • --video: Video input (local file; the Put Examples also show a TOS object, tos://bucket/videos/sample.mp4)

Optional:

  • --region: TOS region name (effective in TOS path mode)
  • --key: Uniquely identifies each vector in the vector index (default: auto-generated UUID)
  • --key-prefix: Prefix to prepend to all vector keys
  • --filename-as-key: Use filename as vector key (mutually exclusive with --key)
  • --metadata: Additional metadata associated with the vector; provided as JSON string
  • --ark-inference-params: Model-specific parameters passed to Ark (JSON format)
  • --max-workers: Maximum parallel workers for batch processing (default: 4)
  • --batch-size: Number of vectors per TOS Vector put_vectors call (1-500, default: 500)
  • --output: Output format (json or table, default: json)

Query Command Parameters

Core Required Parameters:

  • --vector-bucket-name: Name of the TOS vector bucket
  • --index-name: Name of the vector index
  • --model-id: Ark model ID to use for generating embeddings

Query Input Parameters (One Required):

  • --text-value: Direct text query string
  • --text: Text file path (local file or TOS URI)
  • --image: Image file path (local file or TOS URI)
  • --video: Video file path (local file)

Optional Parameters:

  • --region: TOS region name
  • --k: Number of results to return (default: 30)
  • --filter: Filter expression for metadata-based filtering (JSON format)
  • --ark-inference-params: Model-specific parameters passed to Ark (JSON format)
  • --return-metadata: Include metadata in results (default: true)
  • --return-distance: Include similarity distance scores
  • --output: Output format (table or json, default: json)

Metadata Filtering

Supported Operators

Comparison Operators

  • $eq: Equal to
  • $ne: Not equal to
  • $gt: Greater than
  • $gte: Greater than or equal to
  • $lt: Less than
  • $lte: Less than or equal to
  • $in: Value in array
  • $nin: Value not in array

Logical Operators

  • $and: Logical AND (all conditions must be true)
  • $or: Logical OR (at least one condition must be true)
  • $not: Logical NOT (condition must be false)

Filter Examples

Single Condition Filters

# Exact match
--filter '{"category": {"$eq": "documentation"}}'

# Not equal
--filter '{"status": {"$ne": "archived"}}'
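
Filters with several conditions are easier to get right when built as Python dictionaries and serialized once. This is a minimal sketch using the operators listed above; `cond` and `all_of` are hypothetical helpers, and the `$and` shape follows the "multiple filters" query example in this README.

```python
import json

# Comparison operators listed in this README.
COMPARISON_OPS = {"$eq", "$ne", "$gt", "$gte", "$lt", "$lte", "$in", "$nin"}


def cond(field, op, value):
    """Build a single-field condition such as {"status": {"$ne": "archived"}}."""
    if op not in COMPARISON_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return {field: {op: value}}


def all_of(*conditions):
    """Combine conditions with $and (every condition must match)."""
    return {"$and": list(conditions)}


flt = all_of(
    cond("category", "$eq", "documentation"),
    cond("status", "$ne", "archived"),
)
# The resulting string is what would be passed as the --filter value.
filter_arg = json.dumps(flt)
print(filter_arg)
```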

Vector Key Management

The CLI provides flexible options for managing vector keys:

  • Auto-Generated UUID (Default): If no key is provided, a random UUID is generated.
  • Custom Key (--key): Specify a unique identifier for each vector.
  • Object-Based Key (--filename-as-key): Use the filename (for local files) or object key (for TOS objects) as the vector key.
  • Key Prefix (--key-prefix): Prepend a string to all generated or provided keys.
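
The way these options combine can be sketched as follows. This is an illustration of the rules described above, not the CLI's implementation: `resolve_vector_key` is a hypothetical function, and it assumes the prefix is concatenated directly in front of the key without any added separator.

```python
import os
import uuid


def resolve_vector_key(key=None, key_prefix="", source=None, filename_as_key=False):
    """Pick a vector key following the rules above (hypothetical sketch)."""
    if key is not None and filename_as_key:
        raise ValueError("--key and --filename-as-key are mutually exclusive")
    if key is not None:
        base = key
    elif filename_as_key:
        if source is None:
            raise ValueError("--filename-as-key needs a file or object input")
        if source.startswith("tos://"):
            # For TOS objects, use the object key (everything after the bucket).
            base = source[len("tos://"):].partition("/")[2]
        else:
            # For local files, use the filename without its directory.
            base = os.path.basename(source)
    else:
        base = str(uuid.uuid4())
    # Assumption: the prefix is prepended as-is.
    return f"{key_prefix}{base}"


print(resolve_vector_key(key="doc-001", key_prefix="project-a/"))
print(resolve_vector_key(source="./documents/report.txt", filename_as_key=True))
```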

Metadata

The Volcengine TOS Vectors Embed CLI automatically adds standard metadata fields to help track and manage your vector embeddings. Understanding these fields is important for filtering and troubleshooting your vector data.

Standard Metadata Fields

The CLI automatically adds the following metadata fields to every vector:

TOS-VECTORS-EMBED-SRC-CONTENT

  • Purpose: Stores the original text content.
  • Behavior:
    • Direct text input (--text-value): Contains the actual text content
    • Text files: Contains the full text content of the file
    • Image files: N/A (images don't have textual content to store)

Examples:

# Direct text - stores the actual text
--text-value "Hello world" 
# Metadata: {"TOS-VECTORS-EMBED-SRC-CONTENT": "Hello world"}

# Text file - stores file content
--text document.txt
# Metadata: {"TOS-VECTORS-EMBED-SRC-CONTENT": "Contents of document.txt..."}

TOS-VECTORS-EMBED-SRC-LOCATION

  • Purpose: Tracks the original file location
  • Behavior:
    • Text files: Contains the file path or TOS URI
    • Image files: Contains the file path or TOS URI
    • Direct text: Not added (no file involved)

Examples:

# Local text file
--text /path/to/document.txt
# Metadata: {
#   "TOS-VECTORS-EMBED-SRC-CONTENT": "File contents...",
#   "TOS-VECTORS-EMBED-SRC-LOCATION": "file:///path/to/document.txt"
# }

# TOS text file
--text tos://my-bucket/docs/file.txt
# Metadata: {
#   "TOS-VECTORS-EMBED-SRC-CONTENT": "File contents...",
#   "TOS-VECTORS-EMBED-SRC-LOCATION": "tos://my-bucket/docs/file.txt"
# }

TOS-VECTORS-EMBED-SRC-CONTENT-TYPE

  • Purpose: Indicates the type of content (TEXT, IMAGE, VIDEO).

Additional Metadata

You can add your own metadata using the --metadata parameter with JSON format:

tos-vectors-embed put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "Sample text" \
  --metadata '{"category": "documentation", "version": "1.0", "author": "team-a"}'

Result: Your metadata is merged with the standard metadata fields:

{
  "TOS-VECTORS-EMBED-SRC-CONTENT": "Sample text",
  "TOS-VECTORS-EMBED-SRC-CONTENT-TYPE": "TEXT",
  "category": "documentation",
  "version": "1.0", 
  "author": "team-a"
}
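
The merge shown above can be approximated in a few lines. The sketch below is an assumption about the behavior rather than the CLI's code: `build_metadata` is a hypothetical function, and giving the standard fields precedence on a key collision is a guess that this README does not confirm.

```python
def build_metadata(content_type, content=None, location=None, user_metadata=None):
    """Merge user metadata with the standard fields (hypothetical sketch)."""
    standard = {"TOS-VECTORS-EMBED-SRC-CONTENT-TYPE": content_type}
    if content is not None:
        standard["TOS-VECTORS-EMBED-SRC-CONTENT"] = content
    if location is not None:
        standard["TOS-VECTORS-EMBED-SRC-LOCATION"] = location
    # Assumption: standard fields win if a user key has the same name.
    return {**(user_metadata or {}), **standard}


merged = build_metadata(
    "TEXT",
    content="Sample text",
    user_metadata={"category": "documentation", "version": "1.0", "author": "team-a"},
)
print(merged)
```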

Batch Processing

The CLI supports efficient batch processing for multiple files using local and TOS wildcard paths.

Batch Processing Features

  • Automatic batching: Large datasets are automatically split into batches of up to 500 vectors by default (configurable with --batch-size)
  • Parallel processing: Configurable workers for concurrent processing
  • Error resilience: Individual file failures don't stop batch processing
  • Performance optimization: Efficient memory usage and API call batching
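
The general pattern behind these features can be sketched with the standard library. This is not the CLI's orchestrator: `chunked` and `embed_all` are hypothetical names, `embed_fn` stands in for whatever produces an embedding for one file, and the defaults mirror --batch-size and --max-workers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    if not 1 <= size <= 500:
        raise ValueError("batch size must be between 1 and 500")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(paths, embed_fn, max_workers=4):
    """Run embed_fn over paths in parallel; collect failures instead of aborting."""
    def safe(path):
        try:
            return path, embed_fn(path), None
        except Exception as exc:  # one bad file must not stop the batch
            return path, None, exc

    vectors, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of paths.
        for path, vector, error in pool.map(safe, paths):
            if error is None:
                vectors.append((path, vector))
            else:
                failures.append((path, error))
    return vectors, failures


batch_sizes = [len(batch) for batch in chunked(list(range(1200)))]
print(batch_sizes)
```

Each batch produced by `chunked` would correspond to one storage call, while `embed_all` shows how a per-file failure can be recorded without interrupting the rest of the run.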

Processing Strategy by Content Type

The CLI automatically selects the optimal processing strategy based on content type:

Content Type | Processing Mode | API Used | Batch Strategy         | Output
Text         | Sync            | Ark API  | Parallel batch storage | Single vector per file
Image        | Sync            | Ark API  | Parallel batch storage | Single vector per file
Video        | Sync            | Ark API  | Per-file storage       | Multiple vectors per file

Batch Examples

  1. Process local files with a custom number of parallel workers:
tos-vectors-embed --account-id 12345678 put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/*.txt" \
  --max-workers 8
  2. Process files with a custom batch size for TOS storage:
tos-vectors-embed --account-id 12345678 put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "tos://bucket/path/*" \
  --batch-size 100

Batch Processing Output

Text/Image Batch Output:

tos-vectors-embed --account-id 12345678 put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text "./documents/*.txt"

Output:

{
  "type": "streaming_batch",
  "bucket": "my-bucket",
  "index": "my-index",
  "model": "doubao-embedding-vision-250615",
  "contentType": "text",
  "totalFiles": 94,
  "processedFiles": 94,
  "failedFiles": 0,
  "totalVectors": 94,
  "vectorKeys": [
    "abc-123...",
    "def-456..."
  ]
}
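
A script that wraps the CLI can inspect this summary to decide whether a run needs attention. The sketch below relies only on the field names shown in the sample output above; `summarize_batch` is a hypothetical helper, and the sample string is an abbreviated copy of that output.

```python
import json


def summarize_batch(raw_output):
    """Return (ok, summary) for a batch result printed as JSON."""
    result = json.loads(raw_output)
    total = result["totalFiles"]
    processed = result["processedFiles"]
    failed = result["failedFiles"]
    summary = f"{processed}/{total} files processed, {failed} failed"
    return failed == 0, summary


sample = (
    '{"type": "streaming_batch", "totalFiles": 94, '
    '"processedFiles": 94, "failedFiles": 0, "totalVectors": 94}'
)
ok, summary = summarize_batch(sample)
print(ok, summary)
```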

Troubleshooting

Use Debug Mode

For detailed information about API calls and performance, use the --debug flag:

tos-vectors-embed --debug --account-id 12345678 put \
  --vector-bucket-name my-bucket \
  --index-name my-index \
  --model-id doubao-embedding-vision-250615 \
  --text-value "test"

Common Issues

  1. Credentials Not Found: Ensure ARK_API_KEY, TOS_ACCESS_KEY, and TOS_SECRET_KEY are set in your environment.
  2. Invalid Vector Dimension: The CLI automatically fetches the index dimension. Ensure your Ark model supports the dimension configured in your TOS index.
  3. Account ID Format: The --account-id must be a numeric string.
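
The account ID rule is simple to check up front. A minimal sketch follows, assuming "numeric string" means ASCII digits only; `is_valid_account_id` is a hypothetical helper.

```python
def is_valid_account_id(value):
    """True if value is a non-empty string of ASCII digits."""
    return isinstance(value, str) and value.isascii() and value.isdigit()


for candidate in ("12345678", "12a45678", ""):
    print(repr(candidate), is_valid_account_id(candidate))
```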

Model Compatibility

Model                          | Type                      | Use Case
doubao-embedding-vision-250615 | Multimodal (Text + Image) | Modern text and image embedding
doubao-embedding-vision-251215 | Multimodal (Text + Image) | Advanced text and image embedding

Repository Structure

tos-vectors-embed-cli/
├── tos_vectors/                       # Main package directory
│   ├── cli.py                        # Main CLI entry point
│   ├── commands/                     # Command implementations
│   │   ├── embed_put.py              # Vector embedding and storage
│   │   └── embed_query.py            # Vector similarity search
│   ├── core/                         # Core functionality
│   │   ├── unified_processor.py      # Unified processing logic
│   │   ├── services.py               # Ark and TOS Vector services
│   │   └── streaming_batch_orchestrator.py  # Batch processing
│   └── utils/                        # Utility functions
│       ├── config.py                 # Configuration management
│       ├── models.py                 # Model definitions and capabilities
│       └── multimodal_helpers.py     # Multimodal processing helpers
├── setup.py                          # Package installation configuration
├── pyproject.toml                    # Modern Python packaging configuration
├── requirements.txt                  # Python dependencies
├── LICENSE                           # Apache 2.0 license
└── NOTICE                            # Attribution notices

Acknowledgement

This project is derived from the s3-vectors-embed-cli project, which is licensed under the Apache License 2.0. We thank the original authors for their contributions to the open-source community.
