text-vectorify
A powerful and flexible Python tool for text vectorization with multiple embedding models and intelligent caching.
Simple Description
text-vectorify is a command-line tool that converts text data in JSONL format into vector embeddings using various state-of-the-art models including OpenAI, SentenceBERT, BGE, M3E, and HuggingFace transformers. It features intelligent caching, multi-field text combination, and seamless JSONL processing for efficient text analysis pipelines.
Quick Start
pip install text-vectorify
# Basic usage with default model
text-vectorify \
--input data.jsonl \
--input-field-main "title" \
--input-field-subtitle "content" \
--process-method "OpenAIEmbedder" \
--process-extra-data "your-openai-api-key"
# Using stdin input
cat data.jsonl | text-vectorify \
--input-field-main "title" \
--process-method "BGEEmbedder"
Features
- Multiple Embedding Models: OpenAI, SentenceBERT, BGE, M3E, HuggingFace
- Intelligent Caching: Avoid recomputing embeddings for duplicate texts
- Flexible Field Combination: Combine multiple JSON fields for embedding
- JSONL Processing: Seamless input/output in JSONL format
- Batch Processing: Efficient processing of large datasets
- Error Resilience: Continue processing even if individual records fail
- Stdin Support: Read input from pipes or stdin for flexible data processing
- Smart Defaults: Default model names for quick start without configuration
- Flexible Input: Support file input, stdin, or explicit stdin markers
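To make the field-combination feature concrete, here is a minimal sketch of how several JSON fields might be merged into one string before embedding. The README does not say how fields are joined, so the helper names `combine_fields` and `parse_field_list`, the single-space separator, and the choice to skip missing fields are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def parse_field_list(raw: str) -> List[str]:
    """Split a comma-separated CLI value such as "title,content"."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def combine_fields(
    record: Dict[str, Any],
    main_fields: List[str],
    subtitle_fields: Optional[List[str]] = None,
    separator: str = " ",
) -> str:
    """Join the selected fields of one record into a single string.

    Hypothetical helper: the real tool may join fields differently.
    Missing or empty fields are skipped rather than raising.
    """
    parts = []
    for name in list(main_fields) + list(subtitle_fields or []):
        value = record.get(name)
        if value is None or value == "":
            continue
        parts.append(str(value))
    return separator.join(parts)
```

With a record like `{"title": "Sample Article", "content": "Body"}`, calling `combine_fields(record, ["title"], ["content"])` yields one string containing both values, which would then be passed to the embedding model.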
Installation
Method 1: pip install (Recommended)
# Install core package only
pip install text-vectorify
# Install with specific embedder support
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "text-vectorify[openai]" # OpenAI support
pip install "text-vectorify[sentence-transformers]" # SentenceBERT, BGE, M3E support
pip install "text-vectorify[huggingface]" # HuggingFace support
pip install "text-vectorify[all]" # All embedding models
# Install with development dependencies
pip install "text-vectorify[dev]"
Method 2: From source
# Clone repository
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
make install # Core package only
# or
make install-dev # With development dependencies
# or
make install-all # With all optional dependencies
Method 3: Development setup
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
make test
# Format code
make format
Install additional packages based on the embedding models you plan to use:
# For OpenAI embeddings
pip install openai
# For SentenceBERT, BGE, M3E models
pip install sentence-transformers
# For HuggingFace models
pip install transformers torch
Usage
Command Line Interface
text-vectorify [OPTIONS]
Required Arguments
- --input-field-main: Main text fields (comma-separated)
- --process-method: Embedding method to use
Optional Arguments
- --input: Path to input JSONL file (use "-" for stdin, or omit to read from stdin)
- --process-model-name: Model name to use (falls back to the embedder's default if not specified)
- --input-field-subtitle: Additional text fields (comma-separated)
- --process-extra-data: Extra data such as API keys
- --output-field: Output vector field name (default: "vector")
- --output-cache-dir: Cache directory (default: "./cache")
- --output: Output file path (default: auto-generated)
Quick Start Features
The tool now supports smart defaults and flexible input methods for easier usage:
Default Models
Each embedder has a default model, so you don't need to specify --process-model-name:
- OpenAI: text-embedding-3-small
- BGE: BAAI/bge-small-en-v1.5
- SentenceBERT: paraphrase-multilingual-MiniLM-L12-v2
- M3E: moka-ai/m3e-base
- HuggingFace: sentence-transformers/all-MiniLM-L6-v2
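The fallback to a default model can be sketched as a simple lookup. The model names come from the list above and the embedder names from the Available Embedders section; the `DEFAULT_MODELS` table and `resolve_model_name` function are hypothetical and are not the package's internal implementation.

```python
from typing import Optional

# Defaults as documented in this README; the mapping itself is an
# illustration, not the package's internal table.
DEFAULT_MODELS = {
    "OpenAIEmbedder": "text-embedding-3-small",
    "BGEEmbedder": "BAAI/bge-small-en-v1.5",
    "SentenceBertEmbedder": "paraphrase-multilingual-MiniLM-L12-v2",
    "M3EEmbedder": "moka-ai/m3e-base",
    "HuggingFaceEmbedder": "sentence-transformers/all-MiniLM-L6-v2",
}


def resolve_model_name(method: str, model_name: Optional[str] = None) -> str:
    """Return the explicit model name if given, else the documented default."""
    if model_name:
        return model_name
    try:
        return DEFAULT_MODELS[method]
    except KeyError:
        raise ValueError(f"Unknown embedder: {method!r}") from None
```

An explicit `--process-model-name` always wins in this sketch; the default only applies when no name is given.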
Flexible Input Methods
- File input: --input data.jsonl
- Stdin (auto-detect): cat data.jsonl | text-vectorify ...
- Explicit stdin: --input -
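The three input modes reduce to one decision: read a file, or read stdin. The sketch below assumes that an omitted `--input` and the `-` marker both mean stdin, as described above; the function name `open_input` and its `stdin` parameter (added so the function can be tested) are hypothetical.

```python
import sys
from typing import IO, Optional


def open_input(path: Optional[str], stdin: IO[str] = sys.stdin) -> IO[str]:
    """Return a readable text stream for the given --input value.

    Assumed behaviour: None or "-" means stdin; anything else is a file path.
    """
    if path is None or path == "-":
        return stdin
    return open(path, "r", encoding="utf-8")
```

Returning a stream in both cases lets the rest of the pipeline iterate over lines without caring where they came from.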
Minimal Example
# The simplest possible usage
cat data.jsonl | text-vectorify --input-field-main "title" --process-method "BGEEmbedder"
Input Format
JSONL file with text data:
{"title": "Sample Article", "content": "This is the content...", "author": "John Doe"}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith"}
Output Format
JSONL file with added vector embeddings:
{"title": "Sample Article", "content": "This is the content...", "author": "John Doe", "vector": [0.1, 0.2, 0.3, ...]}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith", "vector": [0.4, 0.5, 0.6, ...]}
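The input-to-output transformation, together with the error-resilience behaviour listed under Features, might look roughly like the sketch below. It uses only the standard library. `vectorize_lines` and `toy_embed` are hypothetical names, the toy embedder stands in for a real model, and silently skipping bad records is an assumption about how failures are handled.

```python
import json
from typing import Callable, Iterable, Iterator, List


def toy_embed(text: str) -> List[float]:
    """Toy 2-dimensional 'embedding' used only for this example."""
    return [float(len(text)), float(text.count(" "))]


def vectorize_lines(
    lines: Iterable[str],
    embed: Callable[[str], List[float]],
    text_field: str = "title",
    output_field: str = "vector",
) -> Iterator[str]:
    """Yield one output JSON line per valid input line.

    `embed` is a stand-in for a real embedding model. Blank lines and
    records that fail to parse or embed are skipped (assumed behaviour).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
            record[output_field] = embed(record[text_field])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # skip the bad record and keep going
        yield json.dumps(record, ensure_ascii=False)
```

Every original field is preserved and the vector is added under the output field name, which matches the output format shown above.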
Supported Models
OpenAI Embeddings
- Default Model: text-embedding-3-small
- Other Models: text-embedding-3-large
- API Key: Required via --process-extra-data
- Dimensions: 1536 (small), 3072 (large)
SentenceBERT
- Default Model: paraphrase-multilingual-MiniLM-L12-v2
- Language: Multilingual support
- Dimensions: 384
BGE (BAAI General Embedding)
- Default Model: BAAI/bge-small-en-v1.5
- Other Models: BAAI/bge-base-zh-v1.5, BAAI/bge-small-zh-v1.5
- Language: Optimized for Chinese and English
- Dimensions: 384 (small-en), 512 (small-zh), 768 (base)
M3E (Moka Massive Mixed Embedding)
- Default Model: moka-ai/m3e-base
- Other Models: moka-ai/m3e-small
- Language: Chinese specialized
- Dimensions: 768 (base), 512 (small)
HuggingFace Transformers
- Default Model: sentence-transformers/all-MiniLM-L6-v2
- Flexibility: Custom model selection
- Dimensions: Model-dependent
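Because each model produces vectors of a fixed length, a quick sanity check on the output can catch a mismatched model or a stale cache. The sketch below hard-codes a few of the dimensions stated in this section; `EXPECTED_DIMENSIONS` and `check_dimension` are hypothetical names and are not part of the package.

```python
from typing import Dict, Sequence

# A subset of the dimensions documented in this section.
EXPECTED_DIMENSIONS: Dict[str, int] = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "BAAI/bge-base-zh-v1.5": 768,
    "moka-ai/m3e-base": 768,
    "moka-ai/m3e-small": 512,
}


def check_dimension(model_name: str, vector: Sequence[float]) -> bool:
    """Return True if the vector length matches the documented dimension.

    Models missing from the table are treated as "nothing to check".
    """
    expected = EXPECTED_DIMENSIONS.get(model_name)
    return expected is None or len(vector) == expected
```

A check like this could be run on the first few records of an output file before loading the whole dataset into a vector index.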
Examples
Example 1: OpenAI Embeddings (with default model)
text-vectorify \
--input articles.jsonl \
--input-field-main "title" \
--input-field-subtitle "content,summary" \
--process-method "OpenAIEmbedder" \
--process-extra-data "sk-your-openai-api-key" \
--output-field "embedding" \
--output processed_articles.jsonl
Example 2: Using stdin input with default BGE model
cat chinese_news.jsonl | text-vectorify \
--input-field-main "title,content" \
--process-method "BGEEmbedder" \
--output-cache-dir ./models_cache
Example 3: Multilingual with SentenceBERT (default model)
text-vectorify \
--input multilingual_docs.jsonl \
--input-field-main "title" \
--input-field-subtitle "description,tags" \
--process-method "SentenceBertEmbedder"
Example 4: Custom model specification
text-vectorify \
--input products.jsonl \
--input-field-main "name,brand" \
--input-field-subtitle "description,category,tags" \
--process-method "BGEEmbedder" \
--process-model-name "BAAI/bge-base-zh-v1.5" \
--output-field "product_vector"
Example 5: Explicit stdin marker
echo '{"title": "Sample", "content": "Text content"}' | text-vectorify \
--input - \
--input-field-main "title" \
--process-method "M3EEmbedder"
Example 6: Quick start with minimal arguments
# Most minimal usage - using defaults
cat data.jsonl | text-vectorify \
--input-field-main "title" \
--process-method "BGEEmbedder"
API Reference
Python API Usage
from text_vectorify import TextVectorify, EmbedderFactory

# Create embedder
embedder = EmbedderFactory.create_embedder(
    "OpenAIEmbedder",
    "text-embedding-3-small",
    api_key="your-api-key"
)

# Initialize vectorizer
vectorizer = TextVectorify(embedder)

# Process JSONL file
vectorizer.process_jsonl(
    input_path="input.jsonl",
    output_path="output.jsonl",
    input_field_main=["title"],
    input_field_subtitle=["content"],
    output_field="vector"
)
Available Embedders
from text_vectorify import EmbedderFactory
# List all available embedders
embedders = EmbedderFactory.list_embedders()
print(embedders)
# ['OpenAIEmbedder', 'SentenceBertEmbedder', 'BGEEmbedder', 'M3EEmbedder', 'HuggingFaceEmbedder']
Configuration
Cache Management
The tool automatically caches:
- Text embeddings: Avoid recomputing identical texts
- Model files: Download models once and reuse
- Cache location: Configurable via --output-cache-dir
Cache Structure
cache/
├── models/ # Downloaded models
│   ├── sentence_transformers/
│   ├── huggingface/
│   └── bge/
└── [hash].pkl # Cached embeddings
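A hash-keyed embedding cache of the kind implied by the `[hash].pkl` entries could work roughly as follows. This is only a sketch. The README does not say which hash function is used or what goes into the key, so hashing the model name plus the text with SHA-256 is an assumption, and `cache_path` and `cached_embed` are hypothetical names.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Callable, List


def cache_path(cache_dir: Path, model_name: str, text: str) -> Path:
    """Build a `[hash].pkl` path from the model name and the text."""
    digest = hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.pkl"


def cached_embed(
    text: str,
    model_name: str,
    embed: Callable[[str], List[float]],
    cache_dir: Path = Path("./cache"),
) -> List[float]:
    """Return the cached vector for `text`, computing and storing it on a miss."""
    path = cache_path(cache_dir, model_name, text)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    vector = embed(text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(vector, fh)
    return vector
```

Including the model name in the key keeps vectors from different models apart when they share a cache directory. Pickle files should only be loaded from a cache directory you trust.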
Environment Variables
export OPENAI_API_KEY="your-openai-api-key" # For OpenAI embeddings
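When calling the Python API from your own script, one way to honour both an explicit key and the `OPENAI_API_KEY` variable is sketched below. The precedence (explicit value first, then environment) is an assumption, and `resolve_api_key` is a hypothetical helper, not a function provided by the package.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    extra_data: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicit key, then fall back to OPENAI_API_KEY."""
    key = extra_data or env.get("OPENAI_API_KEY")
    if not key:
        raise ValueError(
            "No OpenAI API key: pass one explicitly or set OPENAI_API_KEY"
        )
    return key
```

The resulting key could then be passed as `api_key=` to `EmbedderFactory.create_embedder`, as in the Python API Usage example.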
Performance Tips
- Use caching: Enable caching to avoid recomputing embeddings
- Batch processing: Process large files in chunks
- Model selection: Choose appropriate model for your language and use case
- Field combination: Combine relevant fields for better semantic representation
- Stdin processing: Use stdin for pipeline integration and memory efficiency
- Default models: Start with default models for quick prototyping, then customize as needed
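The batch-processing tip can be applied on the caller's side by grouping records into fixed-size chunks before handing them to an embedder. The sketch below is a generic chunking helper built on the standard library; `batched` is a local helper defined here, not part of text-vectorify.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Because it consumes the input lazily, this works equally well on a file handle or on stdin, so only one batch of records needs to be in memory at a time.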
Troubleshooting
Common Issues
Import Error: Missing dependencies
pip install sentence-transformers transformers torch openai
API Key Error: Invalid or missing OpenAI API key
export OPENAI_API_KEY="your-valid-api-key"
Memory Error: Large models on limited RAM
- Use smaller models like bge-small-zh-v1.5
- Process files in smaller batches
Cache Permission Error: Insufficient cache directory permissions
chmod 755 ./cache
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
pip install -e .
pip install -r requirements-dev.txt
Running Tests
pytest tests/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Benchmarks
| Model | Language | Dimension | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | Multi | 1536 | Fast | Excellent |
| BGE-base-zh | Chinese | 768 | Medium | Excellent |
| SentenceBERT | Multi | 384 | Fast | Good |
| M3E-base | Chinese | 768 | Medium | Excellent |
Support
- GitHub Issues: Report bugs or request features
- Documentation: Full documentation
Made with ❤️ for the text analysis community