A powerful and flexible Python tool for text vectorization with multiple embedding models
Project description
text-vectorify
A powerful and flexible Python tool for text vectorization with multiple embedding models and intelligent caching.
๐ Simple Description
text-vectorify is a command-line tool that converts text data in JSONL format into vector embeddings using various state-of-the-art models including OpenAI, SentenceBERT, BGE, M3E, and HuggingFace transformers. It features intelligent caching, multi-field text combination, and seamless JSONL processing for efficient text analysis pipelines.
๐ Quick Start
pip install text-vectorify
# Basic usage with default model
text-vectorify \
--input data.jsonl \
--input-field-main "title" \
--input-field-subtitle "content" \
--process-method "OpenAIEmbedder" \
--process-extra-data "your-openai-api-key"
# Using stdin input
cat data.jsonl | text-vectorify \
--input-field-main "title" \
--process-method "BGEEmbedder"
โจ Features
- ๐ฏ Multiple Embedding Models: OpenAI, SentenceBERT, BGE, M3E, HuggingFace, TF-IDF, Topic Models
- ๐ Multi-layer Vectorization: Combine different embedding types with 3 fusion methods
- ๐ค Chinese Text Support: Intelligent Chinese tokenization (jieba, spaCy, pkuseg)
- ๐ Topic Modeling: LDA and BERTopic support for semantic topic analysis
- ๐ Intelligent Caching: Avoid recomputing embeddings for duplicate texts
- ๐ Flexible Field Combination: Combine multiple JSON fields for embedding
- ๐ JSONL Processing: Seamless input/output in JSONL format
- โก Batch Processing: Efficient processing of large datasets
- ๐ก๏ธ Error Resilience: Continue processing even if individual records fail
- ๐ฅ Stdin Support: Read input from pipes or stdin for flexible data processing
- ๐๏ธ Smart Defaults: Default model names for quick start without configuration
- ๐ง Flexible Input: Support file input, stdin, or explicit stdin markers
๐ Table of Contents
- Installation
- Usage
- Supported Models
- Examples
- Library Usage
- API Reference
- Configuration
- Contributing
- License
๐ง Installation
Method 1: pip install (Recommended)
# Install core package (includes TF-IDF, Topic, and Multi-layer embedders)
pip install text-vectorify
# Install with specific advanced embedder support
pip install text-vectorify[openai] # OpenAI support
pip install text-vectorify[transformers] # SentenceBERT, BGE, M3E support
pip install text-vectorify[advanced] # BERTopic, advanced tokenizers
pip install text-vectorify[all] # All embedding models
# Install with development dependencies
pip install text-vectorify[dev]
๐ Core Features Included by Default:
- โ TF-IDF Embedder with Chinese tokenization (spacy)
- โ Topic Embedder with LDA support
- โ Multi-layer Embedder with 3 fusion methods
- โ Smart caching and CLI tools
Method 2: From source
# Clone repository
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
# Automated development setup (recommended for contributors)
./setup.sh
# Manual setup
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
pip install -e . # Core package only
# or
pip install -e ".[dev]" # With development dependencies
# or
pip install -e ".[all,dev]" # With all optional dependencies
Method 3: Development setup
# Automated setup for developers (recommended)
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
./setup.sh
# Manual development setup
python3 -m venv venv
source venv/bin/activate
# Install in development mode
pip install -e ".[dev]"
# Run tests to verify installation
python test_runner.py --quick
Install additional packages based on the embedding models you plan to use:
# For Chinese text processing (recommended - included by default)
python -m spacy download zh_core_web_sm
# For OpenAI embeddings
pip install openai
# For SentenceBERT, BGE, M3E models
pip install sentence-transformers
# For HuggingFace models
pip install transformers torch
# For alternative Chinese tokenizers (optional)
pip install text-vectorify[advanced] # Includes jieba and pkuseg
๐ค Chinese Tokenization Setup
This package uses spaCy as the default and recommended Chinese tokenizer for superior accuracy and modern NLP features. Alternative tokenizers are available for specific use cases.
spaCy (Recommended - Default) ๐
# Install spaCy Chinese model (required for Chinese text processing)
python -m spacy download zh_core_web_sm
# Usage in CLI (default - no need to specify tokenizer)
text-vectorify --process-method "TFIDFEmbedder" --input data.jsonl
# Explicit usage (optional)
text-vectorify --process-method "TFIDFEmbedder" --chinese-tokenizer spacy --input data.jsonl
Why spaCy?
- ๐ Actively maintained with regular updates
- ๐ฏ Superior accuracy for modern Chinese text
- ๐ ๏ธ Rich linguistic features (POS tagging, NER, etc.)
- ๐ Comprehensive language model support
Alternative Tokenizers
jieba (Lightweight option)
# Install alternative tokenizers
pip install text-vectorify[advanced]
# Usage
text-vectorify --process-method "TFIDFEmbedder" --chinese-tokenizer jieba --input data.jsonl
pkuseg (Academic option)
# Already included in advanced extras
text-vectorify --process-method "TFIDFEmbedder" --chinese-tokenizer pkuseg --input data.jsonl
Automatic Selection Priority: spaCy โ jieba โ pkuseg
๐ Usage
Command Line Interface
text-vectorify [OPTIONS]
Required Arguments
--input-field-main: Main text fields (comma-separated)--process-method: Embedding method to use
Optional Arguments
--input: Path to input JSONL file (use "-" for stdin, or omit to read from stdin)--process-model-name: Model name to use (optional, will use defaults if not specified)--input-field-subtitle: Additional text fields (comma-separated)--process-extra-data: Extra data like API keys--output-field: Output embedding field name (default: "embedding")--output-cache-dir: Cache directory (default: "./cache")--output: Output file path (default: auto-generated)
Quick Start Features
The tool now supports smart defaults and flexible input methods for easier usage:
Default Models
Each embedder has intelligent default models, so you don't need to specify --process-model-name:
- OpenAI:
text-embedding-3-small - BGE:
BAAI/bge-small-en-v1.5 - SentenceBERT:
paraphrase-multilingual-MiniLM-L12-v2 - M3E:
moka-ai/m3e-base - HuggingFace:
sentence-transformers/all-MiniLM-L6-v2
Flexible Input Methods
- File input:
--input data.jsonl - Stdin (auto-detect):
cat data.jsonl | text-vectorify ... - Explicit stdin:
--input -
Minimal Example
# The simplest possible usage
cat data.jsonl | text-vectorify --input-field-main "title" --process-method "BGEEmbedder"
Input Format
JSONL file with text data:
{"title": "Sample Article", "content": "This is the content...", "author": "John Doe"}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith"}
Output Format
JSONL file with added vector embeddings:
{"title": "Sample Article", "content": "This is the content...", "author": "John Doe", "embedding": [0.1, 0.2, 0.3, ...]}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith", "embedding": [0.4, 0.5, 0.6, ...]}
๐ค Supported Models
Core Embedders (Included by Default)
TF-IDF Embedder
- Method:
TFIDFEmbedder - Default Features: 50 dimensions
- Chinese Support: Uses spaCy tokenization by default (alternative: jieba, pkuseg)
- Customizable:
max_features,min_df,max_dfparameters - Cache: Full sklearn vectorizer caching
Topic Embedder
- Method:
TopicEmbedder - Default Topics: 10 topics (LDA)
- Algorithms: LDA (default), BERTopic (with advanced install)
- Dimensions: n_topics + 6 (topic distribution + metadata)
- Custom Config:
--process-extra-data "n_topics=20,method=lda"
Multi-layer Embedder
- Method:
MultiLayerEmbedder - Fusion Methods: concatenate (default), average, weighted
- Layer Combinations: Any combination of available embedders
- Config Support: JSON config files for complex setups
- Dimensions: Varies by fusion method and layers
Advanced Embedders (Optional Install)
OpenAI Embeddings
- Default Model:
text-embedding-3-small - Other Models:
text-embedding-3-large - API Key: Required via
--process-extra-data - Dimensions: 1536 (small), 3072 (large)
- Install:
pip install text-vectorify[openai]
SentenceBERT
- Default Model:
paraphrase-multilingual-MiniLM-L12-v2 - Language: Multilingual support
- Dimensions: 384
- Install:
pip install text-vectorify[transformers]
BGE (Beijing Academy of AI)
- Default Model:
BAAI/bge-small-en-v1.5 - Other Models:
BAAI/bge-base-zh-v1.5,BAAI/bge-small-zh-v1.5 - Language: Optimized for Chinese and English
- Dimensions: 512 (small), 768 (base)
- Install:
pip install text-vectorify[transformers]
M3E (Moka Massive Mixed Embedding)
- Default Model:
moka-ai/m3e-base - Other Models:
moka-ai/m3e-small - Language: Chinese specialized
- Dimensions: 768 (base), 512 (small)
- Install:
pip install text-vectorify[transformers]
HuggingFace Transformers
- Default Model:
sentence-transformers/all-MiniLM-L6-v2 - Flexibility: Custom model selection
- Dimensions: Model-dependent
- Install:
pip install text-vectorify[transformers]
๐ง Quick Model Selection Guide
# Core models (no extra install needed)
--process-method "TFIDFEmbedder" # Fast, traditional, good baseline
--process-method "TopicEmbedder" # Semantic topics, interpretable
--process-method "MultiLayerEmbedder" # Rich, combined representations
# Advanced models (require optional dependencies)
--process-method "BGEEmbedder" # Best for Chinese/English
--process-method "SentenceBertEmbedder" # Multilingual, balanced
--process-method "OpenAIEmbedder" # High quality, requires API key
--process-method "HuggingFaceEmbedder" # Most flexible, custom models
๐ Examples
Example 1: OpenAI Embeddings (with default model)
text-vectorify \
--input articles.jsonl \
--input-field-main "title" \
--input-field-subtitle "content,summary" \
--process-method "OpenAIEmbedder" \
--process-extra-data "sk-your-openai-api-key" \
--output-field "embedding" \
--output processed_articles.jsonl
Example 2: Using stdin input with default BGE model
cat chinese_news.jsonl | text-vectorify \
--input-field-main "title,content" \
--process-method "BGEEmbedder" \
--output-cache-dir ./models_cache
Example 3: Multilingual with SentenceBERT (default model)
text-vectorify \
--input multilingual_docs.jsonl \
--input-field-main "title" \
--input-field-subtitle "description,tags" \
--process-method "SentenceBertEmbedder"
Example 4: Custom model specification
text-vectorify \
--input products.jsonl \
--input-field-main "name,brand" \
--input-field-subtitle "description,category,tags" \
--process-method "BGEEmbedder" \
--process-model-name "BAAI/bge-base-zh-v1.5" \
--output-field "product_vector"
Example 5: Explicit stdin marker
echo '{"title": "Sample", "content": "Text content"}' | text-vectorify \
--input - \
--input-field-main "title" \
--process-method "M3EEmbedder"
Example 6: Quick start with minimal arguments
# Most minimal usage - using defaults
cat data.jsonl | text-vectorify \
--input-field-main "title" \
--process-method "BGEEmbedder"
Example 7: TF-IDF Embeddings (included by default)
# TF-IDF with Chinese tokenization (spacy is the default)
text-vectorify \
--input chinese_articles.jsonl \
--input-field-main "title,content" \
--process-method "TFIDFEmbedder" \
--max-features 1500
# Use alternative tokenizer if needed (default is spacy)
text-vectorify \
--input chinese_articles.jsonl \
--input-field-main "title,content" \
--process-method "TFIDFEmbedder" \
--chinese-tokenizer jieba \
--max-features 1500
Example 8: Topic Embeddings with LDA
# Generate topic-based embeddings
text-vectorify \
--input documents.jsonl \
--input-field-main "content" \
--process-method "TopicEmbedder" \
--n-topics 20
Example 9: Multi-layer Vectorization with Config File
First, create a config file multi_config.json:
{
"fusion_method": "concatenate",
"layers": [
{
"embedder_type": "TFIDFEmbedder",
"config": {
"tokenizer": "chinese",
"max_features": 1000
}
},
{
"embedder_type": "TopicEmbedder",
"config": {
"n_topics": 10,
"method": "lda"
}
}
]
}
Then run:
# Use config file for complex multi-layer setup
text-vectorify \
--input documents.jsonl \
--input-field-main "content" \
--process-method "MultiLayerEmbedder" \
--config-file multi_config.json
Example 10: Multi-layer with Direct Parameters
# Multi-layer with weighted fusion (uses default layers)
text-vectorify \
--input mixed_docs.jsonl \
--input-field-main "title,content" \
--process-method "MultiLayerEmbedder" \
--fusion-method weighted
Example 11: Cache Management
# Show cache statistics
text-vectorify --show-cache-stats
# List cached files
text-vectorify --list-cache-files
# Use custom cache directory
text-vectorify \
--input data.jsonl \
--input-field-main "title" \
--process-method "TFIDFEmbedder" \
--cache-dir "./my_project_cache"
# Clear specific cache before processing
text-vectorify \
--input data.jsonl \
--input-field-main "title" \
--process-method "TFIDFEmbedder" \
--clear-cache
# Clear all caches (with confirmation)
text-vectorify --clear-all-caches
๐ฏ CLI-First Development Workflow
For the most efficient development experience, we recommend starting with CLI experimentation:
# Step 1: Try the interactive workflow demo
python examples/cli_first_workflow_demo.py
# Step 2: Try the comprehensive development guide
python examples/development_workflow_guide.py
# Step 3: Use your own data with the CLI-first approach
Benefits: Fast iteration โ Cache building โ Library integration โ Zero recomputation
๐ Library Usage
For programmatic integration, text-vectorify provides a powerful Python API that allows you to process data in-memory using List[Dict] format instead of file-based operations. We recommend a CLI-first development workflow for maximum efficiency and cache optimization.
๐ Recommended Development Workflow
Step 1: CLI Experimentation (Cache Building)
Start with CLI commands on small datasets to build cache and verify results:
# Quick test with small dataset
python -m text_vectorify.main \
--input small_sample.jsonl \
--input-field-main "title" \
--input-field-subtitle "content" \
--process-method "BGEEmbedder" \
--cache-dir "./my_project_cache" \
--output test_results.jsonl
# Verify results
head -3 test_results.jsonl
python -m text_vectorify.main --show-cache-stats
Step 2: Library Integration (Cache Reuse)
Switch to library usage while reusing the CLI-built cache:
from text_vectorify import EmbedderFactory
# Reuse cache from CLI experiments - no recomputation!
embedder = EmbedderFactory.create_embedder(
"BGEEmbedder",
cache_dir="./my_project_cache" # Same as CLI
)
# Process data in memory
data = [
{"title": "Python Programming", "content": "Learn Python basics"},
{"title": "Machine Learning", "content": "Introduction to ML concepts"}
]
for item in data:
text = f"{item['title']} {item['content']}"
vector = embedder.encode(text) # Uses cache - instant results!
item['embedding'] = vector
๐ Quick Start with In-Memory Processing
from text_vectorify import TextVectorify, EmbedderFactory
# Process data directly in memory
data = [
{"title": "Python Programming", "content": "Learn Python basics"},
{"title": "Machine Learning", "content": "Introduction to ML concepts"},
{"title": "Data Science", "content": "Working with data in Python"}
]
# Create embedder with custom cache location
embedder = EmbedderFactory.create_embedder(
"BGEEmbedder",
cache_dir="./my_custom_cache" # Customize cache location
)
# Initialize vectorizer
vectorizer = TextVectorify(embedder)
# Process list of dictionaries directly
for item in data:
# Combine text fields
text = f"{item['title']} {item['content']}"
# Generate embedding
vector = embedder.encode(text)
# Add to your data structure
item['embedding'] = vector
print(f"Processed {len(data)} items with embeddings")
๐ง Advanced Library Integration
Batch Processing with Custom Cache Management
from text_vectorify import TextVectorify, EmbedderFactory
from pathlib import Path
import tempfile
def process_documents_batch(documents: List[Dict],
embedder_type: str = "BGEEmbedder",
cache_dir: str = None) -> List[Dict]:
"""
Process a batch of documents with intelligent caching.
Args:
documents: List of dictionaries with text data
embedder_type: Type of embedder to use
cache_dir: Custom cache directory (optional)
Returns:
List of documents with added embeddings
"""
# Use temporary cache if not specified
if cache_dir is None:
cache_dir = tempfile.mkdtemp(prefix="text_vectorify_")
# Create embedder with custom cache
embedder = EmbedderFactory.create_embedder(
embedder_type,
cache_dir=cache_dir
)
vectorizer = TextVectorify(embedder)
# Process each document
results = []
for doc in documents:
# Extract and combine text fields
text = " ".join(str(doc[field]) for field in ['title', 'content', 'description'] if field in doc)
# Generate embedding with automatic caching
vector = embedder.encode(text)
# Create result with original data + embedding
result = doc.copy()
result['embedding'] = vector
results.append(result)
return results
# Usage example
documents = [
{"title": "AI Research", "content": "Latest developments in AI"},
{"title": "Data Analysis", "content": "Tools for data processing"},
]
# Process with custom cache location
processed = process_documents_batch(
documents,
embedder_type="BGEEmbedder",
cache_dir="./persistent_cache"
)
Multi-Model Processing Pipeline
from text_vectorify import EmbedderFactory
from typing import Dict, List, Optional
import logging
class MultiModelEmbeddingPipeline:
"""
Pipeline for processing text with multiple embedding models.
Useful for comparing embeddings or ensemble approaches.
"""
def __init__(self, cache_base_dir: str = "./embeddings_cache"):
self.cache_base_dir = Path(cache_base_dir)
self.embedders = {}
def add_embedder(self, name: str, embedder_type: str, model_name: str = None, **kwargs):
"""Add an embedder to the pipeline."""
cache_dir = self.cache_base_dir / name
cache_dir.mkdir(parents=True, exist_ok=True)
embedder = EmbedderFactory.create_embedder(
embedder_type,
model_name=model_name,
cache_dir=str(cache_dir),
**kwargs
)
self.embedders[name] = embedder
logging.info(f"Added embedder '{name}' with cache at {cache_dir}")
def process_texts(self, texts: List[str]) -> Dict[str, List[List[float]]]:
"""Process texts with all configured embedders."""
results = {}
for embedder_name, embedder in self.embedders.items():
embeddings = []
for text in texts:
vector = embedder.encode(text)
embeddings.append(vector)
results[embedder_name] = embeddings
logging.info(f"Processed {len(texts)} texts with {embedder_name}")
return results
# Usage example
pipeline = MultiModelEmbeddingPipeline("./multi_model_cache")
# Add different embedders with custom cache locations
pipeline.add_embedder("bge_small", "BGEEmbedder", "BAAI/bge-small-en-v1.5")
pipeline.add_embedder("sentence_bert", "SentenceBertEmbedder")
# For OpenAI (requires API key)
if openai_api_key:
pipeline.add_embedder("openai", "OpenAIEmbedder", api_key=openai_api_key)
# Process texts
texts = ["Machine learning concepts", "Natural language processing"]
all_embeddings = pipeline.process_texts(texts)
# Results contain embeddings from all models
for model_name, embeddings in all_embeddings.items():
print(f"{model_name}: {len(embeddings)} embeddings, dim={len(embeddings[0])}")
๐ฏ Integration Patterns
Django/Flask Web Application Integration
# views.py or similar
from text_vectorify import EmbedderFactory
from django.conf import settings
class DocumentEmbeddingService:
"""Service for generating embeddings in web applications."""
def __init__(self):
# Use application cache directory
cache_dir = getattr(settings, 'EMBEDDINGS_CACHE_DIR', './embeddings_cache')
self.embedder = EmbedderFactory.create_embedder(
"BGEEmbedder", # Fast and good quality
cache_dir=cache_dir
)
def embed_document(self, title: str, content: str) -> List[float]:
"""Generate embedding for a document."""
text = f"{title} {content}"
return self.embedder.encode(text)
def search_similar(self, query: str, document_embeddings: List[List[float]],
threshold: float = 0.7) -> List[int]:
"""Find similar documents using cosine similarity."""
query_embedding = self.embedder.encode(query)
# Simple similarity calculation (use numpy/scipy for better performance)
similar_indices = []
for i, doc_embedding in enumerate(document_embeddings):
similarity = self._cosine_similarity(query_embedding, doc_embedding)
if similarity >= threshold:
similar_indices.append(i)
return similar_indices
def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
"""Calculate cosine similarity between two vectors."""
import math
dot_product = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))
return dot_product / (norm_a * norm_b)
# Usage in views
embedding_service = DocumentEmbeddingService()
def create_document(request):
# ... extract title and content from request ...
# Generate embedding
embedding = embedding_service.embed_document(title, content)
# Save to database with embedding
Document.objects.create(
title=title,
content=content,
embedding=embedding # Store as JSON or use vector database
)
Data Pipeline Integration
import pandas as pd
from text_vectorify import EmbedderFactory
from typing import Iterator
def embedding_pipeline(data_source: Iterator[Dict],
output_path: str,
batch_size: int = 100,
cache_dir: str = "./pipeline_cache") -> None:
"""
Process large datasets in batches with persistent caching.
This approach avoids memory issues and provides resume capability.
"""
# Create embedder with persistent cache
embedder = EmbedderFactory.create_embedder(
"BGEEmbedder",
cache_dir=cache_dir
)
batch = []
processed_count = 0
for item in data_source:
batch.append(item)
if len(batch) >= batch_size:
# Process batch
processed_batch = process_batch(batch, embedder)
# Save batch results
save_batch_results(processed_batch, output_path, processed_count)
processed_count += len(batch)
batch = []
print(f"Processed {processed_count} items")
# Process remaining items
if batch:
processed_batch = process_batch(batch, embedder)
save_batch_results(processed_batch, output_path, processed_count)
def process_batch(batch: List[Dict], embedder) -> List[Dict]:
"""Process a single batch of items."""
results = []
for item in batch:
# Extract text
text = f"{item.get('title', '')} {item.get('content', '')}"
# Generate embedding (cached automatically)
embedding = embedder.encode(text)
# Add embedding to item
result = item.copy()
result['embedding'] = embedding
results.append(result)
return results
# Usage with pandas
def process_dataframe_with_embeddings(df: pd.DataFrame,
text_columns: List[str],
cache_dir: str = None) -> pd.DataFrame:
"""Add embeddings to a pandas DataFrame."""
embedder = EmbedderFactory.create_embedder(
"BGEEmbedder",
cache_dir=cache_dir or "./dataframe_cache"
)
# Combine text columns
df['combined_text'] = df[text_columns].fillna('').agg(' '.join, axis=1)
# Generate embeddings
embeddings = []
for text in df['combined_text']:
embedding = embedder.encode(text)
embeddings.append(embedding)
df['embedding'] = embeddings
return df
๐ Multi-layer Vectorization Examples
The multi-layer embedder allows you to combine different embedding techniques for richer text representations. Here are comprehensive examples:
Basic Multi-layer Usage
from text_vectorify import EmbedderFactory
# Create multi-layer embedder with concatenation fusion
multi_embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
extra_data="fusion_method=concatenate,layer1=TFIDFEmbedder,layer2=TopicEmbedder"
)
# Process text
text = "Machine learning algorithms in natural language processing applications"
fused_vector = multi_embedder.encode(text)
print(f"Fused vector dimensions: {len(fused_vector)}")
# Output: Fused vector dimensions: 66 (TF-IDF: 50 + Topic: 16)
Advanced Multi-layer with Configuration
import json
from text_vectorify import EmbedderFactory
# Create configuration for complex multi-layer setup
config = {
"fusion_method": "concatenate",
"layers": [
{
"embedder_type": "TFIDFEmbedder",
"config": {
"tokenizer": "chinese",
"max_features": 1000,
"min_df": 2,
"max_df": 0.8
}
},
{
"embedder_type": "TopicEmbedder",
"config": {
"n_topics": 20,
"method": "lda",
"random_state": 42
}
}
]
}
# Save config to file
with open("advanced_multi_config.json", "w") as f:
json.dump(config, f, indent=2)
# Create embedder using config file
multi_embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
model_name="advanced_multi_config.json"
)
# Process documents
documents = [
"ไบบๅทฅๆบ่ฝๆๆฏๅจ่ช็ถ่ฏญ่จๅค็ไธญ็ๅบ็จ",
"ๆบๅจๅญฆไน ็ฎๆณ็ๆๆฐๅๅฑ่ถๅฟ",
"ๆทฑๅบฆๅญฆไน ๆจกๅๅจๆๆฌๅๆไธญ็่กจ็ฐ"
]
embeddings = []
for doc in documents:
vector = multi_embedder.encode(doc)
embeddings.append(vector)
print(f"Document: {doc[:30]}... -> Vector dim: {len(vector)}")
Multi-layer with Different Fusion Methods
from text_vectorify import EmbedderFactory
# Test different fusion methods
fusion_methods = ["concatenate", "average", "weighted"]
text = "Advanced machine learning techniques for text analysis"
for method in fusion_methods:
embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
extra_data=f"fusion_method={method},layer1=TFIDFEmbedder,layer2=TopicEmbedder"
)
vector = embedder.encode(text)
print(f"{method.capitalize()} fusion: {len(vector)} dimensions")
# Output:
# Concatenate fusion: 66 dimensions
# Average fusion: 50 dimensions (max of layer dimensions)
# Weighted fusion: 50 dimensions (max of layer dimensions)
Multi-layer Processing Pipeline
from text_vectorify import EmbedderFactory
from typing import List, Dict
import numpy as np
class MultiLayerDocumentProcessor:
"""Advanced document processor using multi-layer embeddings."""
def __init__(self, config_path: str = None, cache_dir: str = "./multi_cache"):
if config_path:
self.embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
model_name=config_path,
cache_dir=cache_dir
)
else:
# Default configuration
self.embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
extra_data="fusion_method=concatenate,layer1=TFIDFEmbedder,layer2=TopicEmbedder",
cache_dir=cache_dir
)
def process_documents(self, documents: List[Dict]) -> List[Dict]:
"""Process documents with multi-layer embeddings."""
results = []
for doc in documents:
# Combine text fields
text_parts = []
for field in ['title', 'content', 'description', 'tags']:
if field in doc and doc[field]:
text_parts.append(str(doc[field]))
combined_text = " ".join(text_parts)
# Generate multi-layer embedding
embedding = self.embedder.encode(combined_text)
# Add to result
result = doc.copy()
result['embedding'] = embedding
result['embedding_dim'] = len(embedding)
result['text_combined'] = combined_text
results.append(result)
return results
def calculate_similarities(self, documents: List[Dict],
query_text: str) -> List[tuple]:
"""Calculate similarities between query and documents."""
query_embedding = self.embedder.encode(query_text)
similarities = []
for i, doc in enumerate(documents):
if 'embedding' in doc:
similarity = self._cosine_similarity(query_embedding, doc['embedding'])
similarities.append((i, similarity))
# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities
def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
"""Calculate cosine similarity."""
a_np = np.array(a)
b_np = np.array(b)
return np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
# Usage example
processor = MultiLayerDocumentProcessor()
documents = [
{
"title": "Machine Learning Basics",
"content": "Introduction to ML algorithms and concepts",
"tags": "AI, ML, algorithms"
},
{
"title": "Deep Learning Networks",
"content": "Neural networks and deep learning architectures",
"tags": "DL, neural networks, AI"
}
]
# Process documents
processed_docs = processor.process_documents(documents)
# Find similar documents
query = "artificial intelligence algorithms"
similarities = processor.calculate_similarities(processed_docs, query)
print("Document similarities:")
for doc_idx, similarity in similarities:
title = processed_docs[doc_idx]['title']
print(f" {title}: {similarity:.4f}")
Batch Multi-layer Processing
from text_vectorify import EmbedderFactory
import json
from pathlib import Path
def batch_multi_layer_processing(input_file: str,
config_file: str,
output_file: str,
batch_size: int = 100):
"""
Process large datasets with multi-layer embeddings in batches.
Args:
input_file: Path to input JSONL file
config_file: Path to configuration file
output_file: Path to output JSONL file
batch_size: Number of records per batch
"""
# Create multi-layer embedder
embedder = EmbedderFactory.create_embedder(
"MultiLayerEmbedder",
model_name=config_file,
cache_dir="./batch_multi_cache"
)
# Process in batches
batch = []
processed_count = 0
with open(input_file, 'r', encoding='utf-8') as infile:
with open(output_file, 'w', encoding='utf-8') as outfile:
for line in infile:
item = json.loads(line.strip())
batch.append(item)
if len(batch) >= batch_size:
# Process batch
for item in batch:
# Combine text fields
text = f"{item.get('title', '')} {item.get('content', '')}"
# Generate multi-layer embedding
embedding = embedder.encode(text)
# Add embedding to item
item['embedding'] = embedding
item['embedding_dim'] = len(embedding)
# Write to output
outfile.write(json.dumps(item, ensure_ascii=False) + '\n')
processed_count += len(batch)
batch = []
print(f"Processed {processed_count} documents")
# Process remaining items
if batch:
for item in batch:
text = f"{item.get('title', '')} {item.get('content', '')}"
embedding = embedder.encode(text)
item['embedding'] = embedding
item['embedding_dim'] = len(embedding)
outfile.write(json.dumps(item, ensure_ascii=False) + '\n')
processed_count += len(batch)
print(f"Total processed: {processed_count} documents")
print(f"Output saved to: {output_file}")
# Usage
batch_multi_layer_processing(
input_file="large_dataset.jsonl",
config_file="configs/multi_layer_simple.json",
output_file="processed_with_multilayer.jsonl",
batch_size=50
)
๐ Performance Optimization
Cache Management Best Practices
from text_vectorify import EmbedderFactory
from pathlib import Path
import tempfile
import shutil
class OptimizedEmbeddingManager:
"""
Optimized embedding manager with advanced cache control.
"""
def __init__(self, base_cache_dir: str = None,
max_cache_size_gb: float = 5.0):
self.base_cache_dir = Path(base_cache_dir or "./optimized_cache")
self.max_cache_size_gb = max_cache_size_gb
self.embedders = {}
def get_embedder(self, embedder_type: str, model_name: str = None, **kwargs):
"""Get or create embedder with optimized caching."""
# Create unique cache directory for this configuration
cache_key = f"{embedder_type}_{model_name or 'default'}"
cache_dir = self.base_cache_dir / cache_key
if cache_key not in self.embedders:
# Ensure cache directory exists
cache_dir.mkdir(parents=True, exist_ok=True)
# Create embedder
embedder = EmbedderFactory.create_embedder(
embedder_type,
model_name=model_name,
cache_dir=str(cache_dir),
**kwargs
)
self.embedders[cache_key] = embedder
# Clean old caches if needed
self._cleanup_cache_if_needed()
return self.embedders[cache_key]
def _cleanup_cache_if_needed(self):
"""Clean up cache if it exceeds size limit."""
total_size = self._get_cache_size_gb()
if total_size > self.max_cache_size_gb:
print(f"Cache size ({total_size:.2f}GB) exceeds limit ({self.max_cache_size_gb}GB)")
self._cleanup_oldest_caches()
def _get_cache_size_gb(self) -> float:
"""Calculate total cache size in GB."""
total_bytes = 0
for cache_dir in self.base_cache_dir.iterdir():
if cache_dir.is_dir():
total_bytes += sum(f.stat().st_size for f in cache_dir.rglob('*') if f.is_file())
return total_bytes / (1024 ** 3) # Convert to GB
def _cleanup_oldest_caches(self):
"""Remove oldest cache directories to free space."""
cache_dirs = [(d, d.stat().st_mtime) for d in self.base_cache_dir.iterdir() if d.is_dir()]
cache_dirs.sort(key=lambda x: x[1]) # Sort by modification time
# Remove oldest 30% of caches
to_remove = cache_dirs[:len(cache_dirs) // 3]
for cache_dir, _ in to_remove:
print(f"Removing old cache: {cache_dir}")
shutil.rmtree(cache_dir)
# Usage
manager = OptimizedEmbeddingManager(max_cache_size_gb=2.0)
# Get embedders with automatic cache management
bge_embedder = manager.get_embedder("BGEEmbedder")
openai_embedder = manager.get_embedder("OpenAIEmbedder", api_key=api_key)
๐ก Key Benefits of CLI-First + Library Workflow
- ๐ฌ Fast Development Cycle: CLI for quick iteration and testing
- ๐พ Smart Cache Reuse: No duplicate computations between CLI and library usage
- ๐ Enhanced Observability: Easy result inspection with CLI output files
- ๐งช Improved Testability: Start small with CLI, scale with library
- โก Zero Recomputation: Library leverages CLI-built cache automatically
- ๐ฏ Custom Cache Control: Configure cache location per project or use case
- ๐ Seamless Transition: Move from CLI prototyping to library integration
- ๐ก๏ธ Production Ready: Robust error handling and resource management
๐ฏ CLI-First Workflow Benefits
- Development Phase: Use CLI with small datasets (5-10 samples)
- Verification Phase: Inspect output files and tune parameters
- Integration Phase: Switch to library with instant cache hits
- Production Phase: Scale up processing with zero wasted computation
๐ Next Steps
- Try the workflow:
python examples/development_workflow_guide.py - See API Reference for detailed method documentation
- Check Examples for more CLI usage patterns
- Review Configuration for cache and performance tuning
๐ง API Reference
Python API Usage
from text_vectorify import TextVectorify, EmbedderFactory
# Create embedder
embedder = EmbedderFactory.create_embedder(
"OpenAIEmbedder",
"text-embedding-3-small",
api_key="your-api-key"
)
# Initialize vectorizer
vectorizer = TextVectorify(embedder)
# Process JSONL file
vectorizer.process_jsonl(
input_path="input.jsonl",
output_path="output.jsonl",
input_field_main=["title"],
input_field_subtitle=["content"],
output_field="embedding"
)
Available Embedders
from text_vectorify import EmbedderFactory
# List all available embedders
embedders = EmbedderFactory.list_embedders()
print(embedders)
# ['OpenAIEmbedder', 'SentenceBertEmbedder', 'BGEEmbedder', 'M3EEmbedder', 'HuggingFaceEmbedder']
โ๏ธ Configuration
๐ Pre-built Configuration Files
The configs/ directory provides ready-to-use configuration files for various scenarios:
Single Embedder Configurations
tfidf_spacy_example.json- TF-IDF with spaCy tokenization (recommended for accuracy)tfidf_jieba_example.json- TF-IDF with jieba tokenization (lightweight option)tfidf_pkuseg_example.json- TF-IDF with pkuseg tokenization (domain-specific)
Multi-Layer Configurations
multi_layer_spacy_recommended.json- Recommended multi-layer setup (TF-IDF + BGE + Topic)multi_layer_jieba_lightweight.json- Lightweight multi-layer (TF-IDF + M3E)multi_layer_simple.json- Basic multi-layer configurationmulti_layer_advanced_research.json- Advanced research configurationmulti_layer_1500_articles.json- Large-scale processing configuration
Usage with Config Files
# Use pre-built configuration
python -m text_vectorify --config configs/multi_layer_spacy_recommended.json --input data.jsonl
# List available configurations
ls configs/*.json
๐ก Tip: See configs/README.md for detailed descriptions of each configuration file.
Multi-layer Configuration Files
Multi-layer embedder supports JSON configuration files for complex setups:
Basic Configuration
{
"fusion_method": "concatenate",
"layers": [
{
"embedder_type": "TFIDFEmbedder",
"config": {
"tokenizer": "chinese",
"max_features": 1000
}
},
{
"embedder_type": "TopicEmbedder",
"config": {
"n_topics": 10,
"method": "lda"
}
}
]
}
Advanced Configuration
{
"fusion_method": "weighted",
"fusion_weights": [0.7, 0.3],
"layers": [
{
"embedder_type": "TFIDFEmbedder",
"config": {
"tokenizer": "chinese",
"max_features": 2000,
"min_df": 2,
"max_df": 0.8,
"use_idf": true,
"smooth_idf": false,
"sublinear_tf": true
}
},
{
"embedder_type": "TopicEmbedder",
"config": {
"n_topics": 20,
"method": "lda",
"random_state": 42,
"max_iter": 100,
"doc_topic_prior": 0.1,
"topic_word_prior": 0.01
}
}
]
}
Available Configuration Options
Fusion Methods:
concatenate: Join vectors end-to-end (default)average: Element-wise average of vectorsweighted: Weighted average (requiresfusion_weights)
TFIDFEmbedder Config:
chinese_tokenizer: "spacy", "jieba", "pkuseg" (default: "spacy")max_features: Maximum number of features (default: 50)min_df: Minimum document frequency (default: 1)max_df: Maximum document frequency (default: 1.0)use_idf: Use inverse document frequency (default: true)smooth_idf: Smooth IDF weights (default: true)sublinear_tf: Apply sublinear tf scaling (default: false)
TopicEmbedder Config:
n_topics: Number of topics (default: 10)method: "lda" or "bertopic" (default: "lda")random_state: Random seed for reproducibilitymax_iter: Maximum iterations for LDA (default: 20)doc_topic_prior: Document-topic prior (alpha)topic_word_prior: Topic-word prior (beta)
Cache Management
The tool automatically caches:
- Text embeddings: Avoid recomputing identical texts
- Model files: Download models once and reuse
- Layer embeddings: Cache individual layer outputs
- Fused vectors: Cache final multi-layer results
- Cache location: Configurable via
--output-cache-dir
Cache Structure
cache/
โโโ models/ # Downloaded models
โ โโโ sentence_transformers/
โ โโโ huggingface/
โ โโโ bge/
โโโ TFIDFEmbedder_*.json # TF-IDF cache
โโโ TopicEmbedder_*.json # Topic embedder cache
โโโ MultiLayerEmbedder_*.json # Multi-layer cache
โโโ layers/ # Layer-specific caches
โโโ tfidf/
โโโ topic/
โโโ multi_layer/
Cache Management Commands
# Show cache statistics
text-vectorify --show-cache-stats
# List cached files
text-vectorify --list-cache-files
# Use custom cache directory
text-vectorify \
--input data.jsonl \
--input-field-main "title" \
--process-method "MultiLayerEmbedder" \
--output-cache-dir "./my_project_cache"
Environment Variables
export OPENAI_API_KEY="your-openai-api-key" # For OpenAI embeddings
export HUGGINGFACE_HUB_TOKEN="your-hf-token" # For private HF models
export TEXT_VECTORIFY_CACHE_DIR="./custom_cache" # Default cache directory
๐ Performance Tips
- Use caching: Enable caching to avoid recomputing embeddings
- Batch processing: Process large files in chunks
- Model selection: Choose appropriate model for your language and use case
- Field combination: Combine relevant fields for better semantic representation
- Stdin processing: Use stdin for pipeline integration and memory efficiency
- Default models: Start with default models for quick prototyping, then customize as needed
๐ Troubleshooting
Common Issues
Import Error: Missing dependencies
pip install sentence-transformers transformers torch openai
API Key Error: Invalid or missing OpenAI API key
export OPENAI_API_KEY="your-valid-api-key"
Memory Error: Large models on limited RAM
- Use smaller models like
bge-small-zh-v1.5 - Process files in smaller batches
Cache Permission Error: Insufficient cache directory permissions
chmod 755 ./cache
๐ค Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Automated setup (recommended)
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
./setup.sh
# Manual setup
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
Running Tests
Using the test runner (recommended):
# Quick validation (fastest)
python test_runner.py --quick
# Core functionality tests (no external dependencies)
python test_runner.py --core
# Integration tests (requires external models)
python test_runner.py --integration
# All tests
python test_runner.py --all
# Tests with coverage report
python test_runner.py --coverage
# Check dependencies
python test_runner.py --deps
Direct pytest commands:
# Quick smoke tests
pytest -m "quick or smoke" -v
# Core functionality
pytest -m "core" -v
# Integration tests
pytest -m "integration" -v
# Cache-related tests
pytest -m "cache" -v
# Embedder-specific tests
pytest -m "embedder" -v
# All tests
pytest -v
# With coverage
pytest --cov=text_vectorify --cov-report=html -v
Development Tools
# Cache management
python tools/cache_tool.py --stats
python tools/cache_tool.py --clear-all
# Feature demonstration
python tools/demo_features.py
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
๐ Benchmarks
| Model | Language | Dimension | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | Multi | 1536 | Fast | Excellent |
| BGE-base-zh | Chinese | 768 | Medium | Excellent |
| SentenceBERT | Multi | 384 | Fast | Good |
| M3E-base | Chinese | 768 | Medium | Excellent |
๐ Related Projects
๐ Support
- GitHub Issues: Report bugs or request features
- Documentation: Full documentation
Made with โค๏ธ for the text analysis community
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file text_vectorify-1.2.0.tar.gz.
File metadata
- Download URL: text_vectorify-1.2.0.tar.gz
- Upload date:
- Size: 78.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c293ac7d749ee2355f32cc7f5d4e1b338543de3270a05845996dd05b84fe72b2
|
|
| MD5 |
8099f185f8bc06b38a889f4866e38671
|
|
| BLAKE2b-256 |
b0f3c0b1094b3eee7caadc6adcdc0fb92177bfff34b1647bd42b969e1a6284e7
|
Provenance
The following attestation bundles were made for text_vectorify-1.2.0.tar.gz:
Publisher:
python-publish.yml on changyy/py-text-vectorify
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
text_vectorify-1.2.0.tar.gz -
Subject digest:
c293ac7d749ee2355f32cc7f5d4e1b338543de3270a05845996dd05b84fe72b2 - Sigstore transparency entry: 231849414
- Sigstore integration time:
-
Permalink:
changyy/py-text-vectorify@6454a8d5db192936a8c4cf345d9c306a695abeb9 -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/changyy
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@6454a8d5db192936a8c4cf345d9c306a695abeb9 -
Trigger Event:
release
-
Statement type:
File details
Details for the file text_vectorify-1.2.0-py3-none-any.whl.
File metadata
- Download URL: text_vectorify-1.2.0-py3-none-any.whl
- Upload date:
- Size: 43.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f1916b8117d1e2d319425e4e782b065105a0564c3e5faa3e27c8eeebba6e056e
|
|
| MD5 |
0a717b7678b7fc9365cca1019c676d92
|
|
| BLAKE2b-256 |
24e86c48ce38d39816710b0831e504ac0f88c80b159d030342bd2511a2c0c535
|
Provenance
The following attestation bundles were made for text_vectorify-1.2.0-py3-none-any.whl:
Publisher:
python-publish.yml on changyy/py-text-vectorify
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
text_vectorify-1.2.0-py3-none-any.whl -
Subject digest:
f1916b8117d1e2d319425e4e782b065105a0564c3e5faa3e27c8eeebba6e056e - Sigstore transparency entry: 231849418
- Sigstore integration time:
-
Permalink:
changyy/py-text-vectorify@6454a8d5db192936a8c4cf345d9c306a695abeb9 -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/changyy
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@6454a8d5db192936a8c4cf345d9c306a695abeb9 -
Trigger Event:
release
-
Statement type: