Standalone CLI for TOS Vector operations with Volcengine Ark embeddings
Volcengine TOS Vectors Embed CLI
Volcengine TOS Vectors Embed CLI is a standalone command-line tool that simplifies the process of working with vector embeddings in TOS Vectors. You can create vector embeddings for your data using Volcengine Ark and store and query them in your TOS vector index using single commands.
Supported Commands
tos-vectors-embed put: Embed text, file content, or TOS objects and store them as vectors in a TOS vector index with a single put command. You specify the input to embed, a Volcengine Ark embeddings model ID, your TOS vector bucket name, and your TOS vector index name. The command accepts text data, a local text or image file, a TOS text or image object, or a TOS prefix. It generates embeddings using the dimensions configured in your TOS vector index properties. When you ingest embeddings for several objects under a TOS prefix or local file path, it automatically uses batch processing to maximize throughput.
Note: Each file is processed as a single embedding. Document chunking is not currently supported.
tos-vectors-embed query: Embed a query input and search for similar vectors in a TOS vector index with a single query command. You specify your query input, a Volcengine Ark embeddings model ID, the vector bucket name, and the vector index name. The command accepts a text string, an image file, or a single TOS text or image object as the query input. It generates an embedding for your query using the specified embeddings model and then performs a similarity search to find the most relevant matches. You can control the number of results returned, apply metadata filters to narrow your search, and choose whether to include similarity distance in the results.
Supported Input Types
Note:
This CLI has introduced a unified --ark-inference-params parameter for all model-specific parameters.
Additionally, the query command uses the following separate parameters:
- --text-value: Direct text query string (preferred for text queries)
- --text: Text file path (local file or TOS URI)
- --image: Image file path (local file or TOS URI)
- --video: Video file path (local file or TOS URI)
Installation and Configuration
Prerequisites
- Python 3.9 or higher
- Volcengine credentials configured in your environment
- A Volcengine account with permissions to use Volcengine Ark and TOS Vectors
- Access to a Volcengine Ark embedding model
- A Volcengine TOS vector bucket and vector index to store your embeddings
Quick Install (Recommended)
pip install tos-vectors-embed-cli
Development Install
# Clone the repository
git clone <repository-url>
cd tos-vectors-embed-cli
# Install in development mode
pip install -e .
Note: All dependencies are automatically installed when you install the package via pip.
Quick Start
Configure credentials
- Set the Ark API key as an environment variable:
export ARK_API_KEY="YOUR_ARK_API_KEY"
- Set the TOS credentials as environment variables:
export TOS_ACCESS_KEY="YOUR_TOS_ACCESS_KEY"
export TOS_SECRET_KEY="YOUR_TOS_SECRET_KEY"
export TOS_VECTOR_ENDPOINT="tosvectors-cn-beijing.volces.com" # Optional, defaults to cn-beijing
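When the CLI is driven from a script, it can help to confirm that the variables above are present before invoking it. The following is a minimal Python sketch, not part of the CLI: the helper name missing_credentials is hypothetical, and it checks only the three required variables named in this section.

```python
import os

# Variable names taken from this README; TOS_VECTOR_ENDPOINT is optional.
REQUIRED_VARS = ("ARK_API_KEY", "TOS_ACCESS_KEY", "TOS_SECRET_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; the CLI performs its own checks.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required credentials are set.")
```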
Put Examples
- Embed text and store it as a vector in your TOS vector index:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Hello, world'
- Process local text files:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/sample.txt"
- Process files from a local file path using wildcard characters:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/*.txt"
- Process files from a TOS bucket using wildcard characters:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "tos://bucket/path/*"
- Process a single file from a TOS bucket:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "tos://bucket/images/photo.jpg"
- Process a local image file:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "./images/photo.jpg"
- Process image files from a local path using wildcard characters:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "./images/*.jpg"
- Process image files from a TOS bucket using wildcard characters:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "tos://bucket/images/*"
- Add metadata to your vectors:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Sample text' \
--metadata '{"category": "documentation", "version": "1.0"}'
- Use a custom vector key:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Sample text' \
--key "doc-001"
- Use filename as vector key:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/report.txt" \
--filename-as-key
- Use key prefix with auto-generated UUIDs:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Sample text' \
--key-prefix "temp/"
- Use key prefix with custom key:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Sample text' \
--key "doc-001" \
--key-prefix "project-a/"
- Use key prefix with filename:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/report.txt" \
--filename-as-key \
--key-prefix "docs/"
- Process a local video file:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--video "./videos/sample.mp4"
- Process a video file from a TOS bucket:
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--video "tos://bucket/videos/sample.mp4"
- Multimodal input (Text + Image). Note: multimodal input currently supports only a single text and image pair.
tos-vectors-embed \
--account-id 12345678 \
put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'A beautiful sunset over the mountains' \
--image "./images/sunset.jpg"
Query Examples
- Direct text query:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'query text' \
--k 20
- Query using a local text file:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/query.txt" \
--k 20 \
--output table
- Query using a TOS text file:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "tos://my-bucket/query.txt" \
--k 20
- Image query:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "./documents/image.jpg" \
--k 20
- Query using a TOS image file:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--image "tos://my-bucket/image.jpg" \
--k 20
- Query with metadata filter (Exact match):
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'query text' \
--filter '{"category": {"$eq": "documentation"}}'
- Query with multiple filters (AND):
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'query text' \
--filter '{"$and": [{"category": "tech"}, {"version": {"$gte": "1.0"}}]}'
- Video query:
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--video "./videos/query.mp4"
- Multimodal query (Text + Image). Note: multimodal query currently supports only a single text and image pair.
tos-vectors-embed \
--account-id 12345678 \
query \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'search query' \
--image "./images/query.jpg"
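The query commands above can also be assembled programmatically. The sketch below builds the argument list for a direct text query in Python, using only flags that appear in this README. The function name build_query_argv is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def build_query_argv(account_id, bucket, index, model_id, text_value, k=None):
    """Assemble the argument list for a text query (hypothetical helper)."""
    argv = [
        "tos-vectors-embed",
        "--account-id", str(account_id),  # global option, before the subcommand
        "query",
        "--vector-bucket-name", bucket,
        "--index-name", index,
        "--model-id", model_id,
        "--text-value", text_value,
    ]
    if k is not None:
        argv += ["--k", str(k)]
    return argv


argv = build_query_argv(
    "12345678", "my-bucket", "my-index",
    "doubao-embedding-vision-250615", "query text", k=20,
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

To execute the command instead of printing it, the list can be passed to subprocess.run(argv, check=True); passing a list avoids shell quoting problems in the query text.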
Wildcard Character Support
The CLI supports wildcard characters in the input path so that multiple files can be processed with one command:
Local Filesystem Patterns
- Basic wildcards: ./data/*.txt (all .txt files in the data directory)
- Home directory: ~/documents/*.md (all .md files in the user's documents directory)
- Recursive patterns: ./docs/**/*.txt (all .txt files, recursively)
- Multiple extensions: ./files/*.{txt,md,json} (multiple file types)
- Question mark: ./file?.txt (single-character wildcard)
TOS URI Patterns
Important: TOS wildcards work with prefixes, not file extensions. Use tos://bucket/path/* not tos://bucket/path/*.ext.
Examples:
# Process all files under a TOS prefix
tos-vectors-embed put --vector-bucket-name bucket --index-name idx \
--model-id doubao-embedding-vision-250615 --text "tos://bucket/path1/*"
Important Differences: Local vs TOS Wildcards
Local Filesystem Wildcards:
- ✅ Support file extensions: ./data/*.txt, ./docs/*.json
- ✅ Support complex patterns: ./files/*.{txt,md}, ./doc?.txt
- ✅ Support recursive patterns: ./docs/**/*.md
TOS Wildcards:
- ✅ Support prefix patterns: tos://bucket/docs/*, tos://bucket/2024/reports/*
- ❌ Do NOT support extension filtering: tos://bucket/path/*.json won't work
- ❌ Do NOT support complex patterns: Use prefix-based organization instead
Best Practices:
- For TOS: Organize files by prefix/path structure: tos://bucket/json-files/*
- For Local: Use full wildcard capabilities: ./data/*.{json,txt}
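To make the prefix-only rule concrete, the following Python sketch splits a TOS wildcard URI into a bucket name and a prefix, and rejects any pattern other than a single trailing *. The helper split_tos_pattern is hypothetical and is not the CLI's own parser. Raising an error for an unsupported pattern is an assumption made for illustration, since this README only states that such patterns won't work.

```python
def split_tos_pattern(uri):
    """Split 'tos://bucket/prefix/*' into (bucket, prefix).

    Hypothetical helper: accepts only a trailing '*', mirroring the
    prefix-only behaviour described in this README.
    """
    scheme = "tos://"
    if not uri.startswith(scheme):
        raise ValueError("not a TOS URI: " + uri)
    bucket, _, path = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name: " + uri)
    # Drop a single trailing '*'; what remains is the listing prefix.
    prefix = path[:-1] if path.endswith("*") else path
    if "*" in prefix or "?" in prefix:
        # Assumption: the sketch rejects patterns such as '*.json' outright.
        raise ValueError("only a trailing '*' is supported: " + uri)
    return bucket, prefix


print(split_tos_pattern("tos://bucket/2024/reports/*"))
```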
Global Options
- --debug: Enable debug mode with detailed logging for troubleshooting
- --account-id: Volcengine account id
- --vectors-region: TOS vectors bucket region name
- --vectors-endpoint: The domain names that other services can use to access TOS vectors bucket
Put Command Parameters
Required:
- --vector-bucket-name: Name of the TOS vector bucket
- --index-name: Name of the vector index in your vector bucket where the embeddings are stored
- --model-id: Ark model ID to use for generating embeddings
Input Options (one required):
- --text-value: Direct text input to embed
- --text: Text input, which supports multiple input types:
  - Local file: ./document.txt
  - Local files with wildcard characters: ./data/*.txt
  - TOS object: tos://bucket/path/file.txt
  - TOS path with wildcard characters: tos://bucket/path/*
- --image: Image input, which supports multiple input types:
  - Local file: ./document.jpg
  - Local files with wildcard characters: ./data/*.jpg
  - TOS object: tos://bucket/path/file.jpg
  - TOS path with wildcard characters: tos://bucket/path/*
- --video: Video input (Local file)
Optional:
- --region: TOS region name (effective in TOS path mode)
- --key: Uniquely identifies each vector in the vector index (default: auto-generated UUID)
- --key-prefix: Prefix to prepend to all vector keys
- --filename-as-key: Use filename as vector key (mutually exclusive with --key)
- --metadata: Additional metadata associated with the vector; provided as JSON string
- --ark-inference-params: Model-specific parameters passed to Ark (JSON format)
- --max-workers: Maximum parallel workers for batch processing (default: 4)
- --batch-size: Number of vectors per TOS Vector put_vectors call (1-500, default: 500)
- --output: Output format (json or table, default: json)
Query Command Parameters
Core Required Parameters:
- --vector-bucket-name: Name of the TOS vector bucket
- --index-name: Name of the vector index
- --model-id: Ark model ID to use for generating embeddings
Query Input Parameters (One Required):
- --text-value: Direct text query string
- --text: Text file path (local file or TOS URI)
- --image: Image file path (local file or TOS URI)
- --video: Video file path (local file)
Optional Parameters:
- --region: TOS region name
- --k: Number of results to return (default: 30)
- --filter: Filter expression for metadata-based filtering (JSON format)
- --ark-inference-params: Model-specific parameters passed to Ark (JSON format)
- --return-metadata: Include metadata in results (default: true)
- --return-distance: Include similarity distance scores
- --output: Output format (table or json, default: json)
Metadata Filtering
Supported Operators
Comparison Operators
- $eq: Equal to
- $ne: Not equal to
- $gt: Greater than
- $gte: Greater than or equal to
- $lt: Less than
- $lte: Less than or equal to
- $in: Value in array
- $nin: Value not in array
Logical Operators
- $and: Logical AND (all conditions must be true)
- $or: Logical OR (at least one condition must be true)
- $not: Logical NOT (condition must be false)
Filter Examples
Single Condition Filters
# Exact match
--filter '{"category": {"$eq": "documentation"}}'
# Not equal
--filter '{"status": {"$ne": "archived"}}'
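Hand-writing nested JSON inside shell quotes is error-prone, so one option is to build the filter as a Python dictionary and serialize it. The sketch below uses only operators listed above. The helpers eq and all_of are hypothetical names, and the exact way the service evaluates the filter is not shown here.

```python
import json
import shlex


def eq(field, value):
    """Build an exact-match condition (hypothetical helper)."""
    return {field: {"$eq": value}}


def all_of(*conditions):
    """Combine conditions with $and (hypothetical helper)."""
    return {"$and": list(conditions)}


flt = all_of(eq("category", "documentation"), {"version": {"$gte": "1.0"}})
filter_arg = json.dumps(flt)

# shlex.quote wraps the JSON in single quotes for safe use in a shell.
print("--filter " + shlex.quote(filter_arg))
```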
Vector Key Management
The CLI provides flexible options for managing vector keys:
- Auto-Generated UUID (Default): If no key is provided, a random UUID is generated.
- Custom Key (--key): Specify a unique identifier for each vector.
- Object-Based Key (--filename-as-key): Use the filename (for local files) or object key (for TOS objects) as the vector key.
- Key Prefix (--key-prefix): Prepend a string to all generated or provided keys.
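The interaction between these options can be sketched in Python as follows. This is an illustration of the rules listed above, not the CLI's actual code. The function resolve_key is hypothetical, and two details are assumptions: that the prefix is concatenated directly in front of the key, and that a local file contributes only its base name.

```python
import os
import uuid


def resolve_key(key=None, source=None, filename_as_key=False, key_prefix=""):
    """Pick a vector key following the rules in this section (sketch)."""
    if key is not None and filename_as_key:
        raise ValueError("--key and --filename-as-key are mutually exclusive")
    if key is not None:
        base = key
    elif filename_as_key:
        if source is None:
            raise ValueError("--filename-as-key needs a file or TOS object")
        if source.startswith("tos://"):
            # Object key: everything after the bucket name.
            base = source[len("tos://"):].partition("/")[2]
        else:
            # Assumption: only the base name of a local path is used.
            base = os.path.basename(source)
    else:
        base = str(uuid.uuid4())  # default: random UUID
    # Assumption: the prefix is prepended by plain concatenation.
    return key_prefix + base


print(resolve_key(key="doc-001", key_prefix="project-a/"))
```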
Metadata
The Volcengine TOS Vectors Embed CLI automatically adds standard metadata fields to help track and manage your vector embeddings. Understanding these fields is important for filtering and troubleshooting your vector data.
Standard Metadata Fields
The CLI automatically adds the following metadata fields to every vector:
TOS-VECTORS-EMBED-SRC-CONTENT
- Purpose: Stores the original text content.
- Behavior:
  - Direct text input (--text-value): Contains the actual text content
  - Text files: Contains the full text content of the file
  - Image files: N/A (images don't have textual content to store)
Examples:
# Direct text - stores the actual text
--text-value 'Hello world'
# Metadata: {"TOS-VECTORS-EMBED-SRC-CONTENT": "Hello world"}
# Text file - stores file content
--text document.txt
# Metadata: {"TOS-VECTORS-EMBED-SRC-CONTENT": "Contents of document.txt..."}
TOS-VECTORS-EMBED-SRC-LOCATION
- Purpose: Tracks the original file location
- Behavior:
  - Text files: Contains the file path or TOS URI
  - Image files: Contains the file path or TOS URI
  - Direct text: Not added (no file involved)
Examples:
# Local text file
--text /path/to/document.txt
# Metadata: {
# "TOS-VECTORS-EMBED-SRC-CONTENT": "File contents...",
# "TOS-VECTORS-EMBED-SRC-LOCATION": "file:///path/to/document.txt"
# }
# TOS text file
--text tos://my-bucket/docs/file.txt
# Metadata: {
# "TOS-VECTORS-EMBED-SRC-CONTENT": "File contents...",
# "TOS-VECTORS-EMBED-SRC-LOCATION": "tos://my-bucket/docs/file.txt"
# }
TOS-VECTORS-EMBED-SRC-CONTENT-TYPE
- Purpose: Indicates the type of content (TEXT, IMAGE, VIDEO).
Additional Metadata
You can add your own metadata using the --metadata parameter with JSON format:
tos-vectors-embed put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'Sample text' \
--metadata '{"category": "documentation", "version": "1.0", "author": "team-a"}'
Result: Your metadata is merged with the standard metadata fields:
{
"TOS-VECTORS-EMBED-SRC-CONTENT": "Sample text",
"TOS-VECTORS-EMBED-SRC-CONTENT-TYPE": "TEXT",
"category": "documentation",
"version": "1.0",
"author": "team-a"
}
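The merge shown above can be sketched in Python. The field names come from this README; everything else is an assumption made for illustration, including the helper names source_location and build_metadata and the choice to let user-supplied keys override standard fields on a name collision, which the README does not specify.

```python
import os
from urllib.parse import quote


def source_location(path):
    """Return a TOS URI unchanged, or a file:// URI for a local path."""
    if path.startswith("tos://"):
        return path
    return "file://" + quote(os.path.abspath(path))


def build_metadata(content_type, content=None, location=None, user_metadata=None):
    """Combine standard fields with user metadata (hypothetical helper)."""
    metadata = {"TOS-VECTORS-EMBED-SRC-CONTENT-TYPE": content_type}
    if content is not None:
        metadata["TOS-VECTORS-EMBED-SRC-CONTENT"] = content
    if location is not None:
        metadata["TOS-VECTORS-EMBED-SRC-LOCATION"] = source_location(location)
    # Assumption: user keys are applied last and win on collisions.
    metadata.update(user_metadata or {})
    return metadata


print(build_metadata("TEXT", content="Sample text",
                     user_metadata={"category": "documentation"}))
```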
Batch Processing
The CLI supports efficient batch processing for multiple files using local and TOS wildcard paths.
Batch Processing Features
- Automatic batching: Large datasets are automatically split into batches of up to 500 vectors (configurable with --batch-size)
- Parallel processing: Configurable workers for concurrent processing
- Error resilience: Individual file failures don't stop batch processing
- Performance optimization: Efficient memory usage and API call batching
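The batching, parallelism, and error-resilience features listed above follow a common pattern, sketched below in Python with the standard library. This is not the CLI's implementation: chunked and embed_all are hypothetical helpers, and embed_one stands in for whatever function produces an embedding for one file.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, batch_size=500):
    """Split items into consecutive batches of at most batch_size."""
    if not 1 <= batch_size <= 500:
        raise ValueError("batch_size must be between 1 and 500")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def embed_all(paths, embed_one, max_workers=4):
    """Embed files in parallel; collect failures instead of stopping."""
    def safe(path):
        try:
            return path, embed_one(path), None
        except Exception as exc:  # one bad file must not abort the batch
            return path, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe, paths))  # map preserves input order
    vectors = [(p, v) for p, v, err in results if err is None]
    failed = [(p, err) for p, v, err in results if err is not None]
    return vectors, failed


# 1200 vectors at the default batch size give batches of 500, 500 and 200.
print([len(batch) for batch in chunked(list(range(1200)))])
```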
Processing Strategy by Content Type
The CLI automatically selects the optimal processing strategy based on content type:
| Content Type | Processing Mode | API Used | Batch Strategy | Output |
|---|---|---|---|---|
| Text | Sync | Ark API | Parallel batch storage | Single vector per file |
| Image | Sync | Ark API | Parallel batch storage | Single vector per file |
| Video | Sync | Ark API | Per-file storage | Multiple vectors per file |
Batch Examples
- Process local files with custom parallel workers:
tos-vectors-embed --account-id 12345678 put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/*.txt" \
--max-workers 8
- Process files with custom batch size for TOS storage:
tos-vectors-embed --account-id 12345678 put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "tos://bucket/path/*" \
--batch-size 100
Batch Processing Output
Text/Image Batch Output:
tos-vectors-embed --account-id 12345678 put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text "./documents/*.txt"
Output:
{
"type": "streaming_batch",
"bucket": "my-bucket",
"index": "my-index",
"model": "doubao-embedding-vision-250615",
"contentType": "text",
"totalFiles": 94,
"processedFiles": 94,
"failedFiles": 0,
"totalVectors": 94,
"vectorKeys": [
"abc-123...",
"def-456..."
]
}
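When the put command runs inside a pipeline, the JSON report above can be checked automatically. The sketch below parses a report with the fields shown in the sample output. The helper summarize is hypothetical, and treating a run as successful only when failedFiles is 0 and processedFiles equals totalFiles is an assumption about how these counters relate.

```python
import json


def summarize(report_text):
    """Reduce a batch report to a pass/fail summary (hypothetical helper)."""
    report = json.loads(report_text)
    failed = report["failedFiles"]
    return {
        # Assumption: success means no failures and every file processed.
        "ok": failed == 0 and report["processedFiles"] == report["totalFiles"],
        "failed": failed,
        "vectors": report["totalVectors"],
    }


# Abbreviated copy of the sample output shown above.
sample = (
    '{"type": "streaming_batch", "totalFiles": 94, "processedFiles": 94, '
    '"failedFiles": 0, "totalVectors": 94, "vectorKeys": []}'
)
print(summarize(sample))
```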
Troubleshooting
Use Debug Mode
For detailed information about API calls and performance, use the --debug flag:
tos-vectors-embed --debug --account-id 12345678 put \
--vector-bucket-name my-bucket \
--index-name my-index \
--model-id doubao-embedding-vision-250615 \
--text-value 'test'
Common Issues
- Credentials Not Found: Ensure ARK_API_KEY, TOS_ACCESS_KEY, and TOS_SECRET_KEY are set in your environment.
- Invalid Vector Dimension: The CLI automatically fetches the index dimension. Ensure your Ark model supports the dimension configured in your TOS index.
- Account ID Format: The --account-id must be a numeric string.
Model Compatibility
| Model | Type | Use Case |
|---|---|---|
| doubao-embedding-vision-250615 | Multimodal (Text + Image) | Modern text and image embedding |
| doubao-embedding-vision-251215 | Multimodal (Text + Image) | Advanced text and image embedding |
Repository Structure
tos-vectors-embed-cli/
├── tos_vectors/ # Main package directory
│ ├── cli.py # Main CLI entry point
│ ├── commands/ # Command implementations
│ │ ├── embed_put.py # Vector embedding and storage
│ │ └── embed_query.py # Vector similarity search
│ ├── core/ # Core functionality
│ │ ├── unified_processor.py # Unified processing logic
│ │ ├── services.py # Ark and TOS Vector services
│ │ └── streaming_batch_orchestrator.py # Batch processing
│ └── utils/ # Utility functions
│ ├── config.py # Configuration management
│ ├── models.py # Model definitions and capabilities
│ └── multimodal_helpers.py # Multimodal processing helpers
├── setup.py # Package installation configuration
├── pyproject.toml # Modern Python packaging configuration
├── requirements.txt # Python dependencies
├── LICENSE # Apache 2.0 license
├── NOTICE # Attribution notices
Acknowledgement
This project is derived from the s3-vectors-embed-cli project, which is licensed under the Apache License 2.0. We thank the original authors for their contributions to the open-source community.