HybridVectorizer
Unified embedding for tabular, text, and multimodal data with powerful similarity search.
HybridVectorizer automatically handles mixed data types (numerical, categorical, text, dates) and creates high-quality vector representations for similarity search, recommendation systems, and machine learning pipelines.
Late-fusion embeddings for mixed tabular + text data. One line to vectorize text + numeric + categorical into a single search space with adjustable block weights.
Quick links:
• PyPI · https://pypi.org/project/hybrid-vectorizer/
• Examples · https://github.com/hariharaprabhu/hybrid-vectorizer/tree/main/Examples
• Issues · https://github.com/hariharaprabhu/hybrid-vectorizer/issues
Quick Start
Basic Usage
import pandas as pd
from hybrid_vectorizer import HybridVectorizer
# Load financial data (S&P 500 companies)
df = pd.read_csv("sp500_companies.csv")
# Select relevant columns for analysis
df = df[['Symbol', 'Sector', 'Industry', 'Currentprice', 'Marketcap',
         'Fulltimeemployees', 'Longbusinesssummary']]
# Initialize with company symbol as index
hv = HybridVectorizer(index_column="Symbol")
vectors = hv.fit_transform(df)
print(f"Generated {vectors.shape[0]} vectors with {vectors.shape[1]} features")
Finding Similar Companies
# Find companies similar to Google (GOOGL)
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()
results = hv.similarity_search(query, ignore_exact_matches=True, top_n=5)
print(results[['Symbol', 'Sector', 'Marketcap', 'similarity']])
Weight Tuning for Different Use Cases
Focus on Business Description (Text Similarity)
# Emphasize business model similarity
results = hv.similarity_search(
    query,
    block_weights={'text': 2.0, 'numerical': 0.5, 'categorical': 0.5},
    top_n=5
)
print("Companies with similar business models:")
print(results[['Symbol', 'Sector', 'Longbusinesssummary', 'similarity']])
Focus on Financial Metrics
# Emphasize financial similarity
results = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 2.0, 'categorical': 0.5},
    top_n=5
)
print("Companies with similar financials:")
print(results[['Symbol', 'Currentprice', 'Marketcap', 'similarity']])
Focus on Industry/Sector
# Emphasize sector/industry similarity
results = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 0.5, 'categorical': 2.0},
    top_n=5
)
print("Companies in similar sectors:")
print(results[['Symbol', 'Sector', 'Industry', 'similarity']])
Custom Queries
# Search for specific characteristics
custom_query = {
    'Sector': 'Technology',
    'Longbusinesssummary': 'cloud computing artificial intelligence',
    'Marketcap': 500000000000,  # $500B market cap
    'Fulltimeemployees': 100000
}
results = hv.similarity_search(custom_query, top_n=5)
print("Large tech companies with AI/cloud focus:")
print(results[['Symbol', 'Sector', 'Marketcap', 'similarity']])
Real-World Use Cases
- Investment Research: Find companies with similar business models or financials
- Competitor Analysis: Identify direct and indirect competitors
- Portfolio Construction: Build diversified portfolios based on similarity
- Market Research: Understand sector clustering and relationships
📦 Installation
pip install hybrid-vectorizer
GPU Support (Recommended for Better Performance)
For faster text embedding with large datasets, install GPU-accelerated PyTorch:
Check Your CUDA Version
nvidia-smi
Look for "CUDA Version: X.X" in the output.
Install GPU Support
pip install hybrid-vectorizer

# For CUDA 11.8 (most common)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
Verify GPU Installation
import torch
from hybrid_vectorizer import HybridVectorizer
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
# Test with your data
hv = HybridVectorizer()
# Should show "GPU detected" instead of "Using CPU"
Performance Notes
- CPU: Works well for datasets <1000 rows
- GPU: 5-10x faster for text embedding, recommended for larger datasets
- Text columns benefit most from GPU acceleration
Requirements:
- Python 3.8+
- pandas, numpy, scikit-learn
- sentence-transformers, torch
🏗️ Architecture
HybridVectorizer uses a novel late fusion approach for multimodal similarity search:
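The late-fusion idea can be sketched in a few lines: embed each modality separately, L2-normalize each block so no modality dominates by scale, multiply by its block weight, and concatenate into one search space. This is an illustrative approximation with NumPy; the `late_fuse` helper and toy blocks are invented for the example and are not part of the library's API.

```python
import numpy as np

def late_fuse(blocks, weights):
    """Illustrative late fusion: L2-normalize each modality's block,
    scale it by its weight, then concatenate into one vector space."""
    fused = []
    for name, block in blocks.items():
        block = np.asarray(block, dtype=float)
        norms = np.linalg.norm(block, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
        fused.append(weights.get(name, 1.0) * block / norms)
    return np.hstack(fused)

# Two rows, two modalities: a 2-d numerical block and a 3-d "text" block
blocks = {
    'numerical': [[0.2, 0.9], [0.8, 0.1]],
    'text':      [[0.1, 0.5, 0.3], [0.4, 0.2, 0.7]],
}
vectors = late_fuse(blocks, {'numerical': 1.0, 'text': 2.0})
print(vectors.shape)  # (2, 5)
```

With these weights, each row's text sub-vector carries twice the norm of its numerical sub-vector, so cosine similarity on the fused vectors leans toward text similarity, which is exactly what `block_weights` controls in the searches above.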
✨ Key Features
🔄 Automatic Data Type Handling
- Numerical: Auto-normalized with MinMaxScaler
- Categorical: One-hot or frequency encoding (smart threshold)
- Text: SentenceTransformer embeddings
- Dates: Extract features or ignore
- Mixed: Handles missing values, inf, NaN gracefully
🎯 Powerful Similarity Search
- Late Fusion: Combines modalities with configurable weights
- Block-level Control: Weight text vs. numerical vs. categorical separately
- Explanation: See which features drive similarity
🛠️ Production Ready
- Memory Efficient: Optimized for large datasets
- GPU Support: Automatic GPU detection for text encoding
- Persistence: Save/load trained models
- Error Handling: Informative custom exceptions
💡 Usage Examples
Basic Usage
# Fit and transform
hv = HybridVectorizer()
vectors = hv.fit_transform(df)
# Simple query
results = hv.similarity_search({'description': 'machine learning'})
Advanced Configuration
hv = HybridVectorizer(
    column_encodings={'description': 'text', 'category': 'categorical'},
    ignore_columns=['id', 'created_at'],
    index_column='id',
    onehot_threshold=15,
    text_batch_size=64
)
Weighted Search
# Emphasize text over numerical features
results = hv.similarity_search(
    query,
    block_weights={'text': 3, 'categorical': 2, 'numerical': 1}
)
Text-Only Search
results = hv.similarity_search({'description': 'AI startup'})
🔧 Configuration Options
| Parameter | Description | Default |
|---|---|---|
| column_encodings | Manual type overrides | {} |
| ignore_columns | Skip these columns | [] |
| index_column | ID column (preserved in results) | None |
| onehot_threshold | Max categories for one-hot encoding | 10 |
| default_text_model | SentenceTransformer model | 'all-MiniLM-L6-v2' |
| text_batch_size | Batch size for text encoding | 128 |
📊 Data Type Detection
HybridVectorizer automatically detects:
- Numerical: int64, float64, etc. → MinMax normalization
- Categorical: object with ≤10 unique values → One-hot encoding
- Text: object with >10 unique values → SentenceTransformer embeddings
- Dates: datetime64 → Extract year/month/day or ignore
Override with column_encodings={'col': 'text'} if needed.
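The detection rules can be approximated like this (a toy `detect_type` helper written for illustration, not the library's actual implementation):

```python
import pandas as pd

def detect_type(s: pd.Series, onehot_threshold: int = 10) -> str:
    """Toy version of the detection rules described above."""
    if pd.api.types.is_datetime64_any_dtype(s):
        return 'date'
    if pd.api.types.is_numeric_dtype(s):
        return 'numerical'
    # object columns: few unique values -> categorical, many -> free text
    return 'categorical' if s.nunique() <= onehot_threshold else 'text'

print(detect_type(pd.Series(['A', 'B', 'A', 'C'])))  # categorical
print(detect_type(pd.Series([1.0, 2.5, 3.0])))       # numerical
```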
🎛️ Advanced Features
Model Persistence
# Save trained model
hv.save('my_vectorizer.pkl')
# Load later
hv2 = HybridVectorizer.load('my_vectorizer.pkl')
results = hv2.similarity_search(query)
Encoding Report
# See how each column was processed
report = hv.get_encoding_report()
print(report)
External Vector Database
import faiss
import numpy as np

# Use FAISS for faster search (inner-product index; FAISS expects float32)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.ascontiguousarray(vectors, dtype='float32'))
hv.set_vector_db(index)
🚨 Error Handling
from hybrid_vectorizer import HybridVectorizerError, ModelNotFittedError
try:
    results = hv.similarity_search(query)
except ModelNotFittedError:
    print("Call fit_transform() first!")
except HybridVectorizerError as e:
    print(f"HybridVectorizer error: {e}")
📈 Performance
Typical performance on modern hardware:
| Dataset Size | Fit Time | Search Time | Memory |
|---|---|---|---|
| 1K rows | <1s | <1ms | ~50MB |
| 10K rows | <10s | <10ms | ~200MB |
| 100K rows | <2min | <100ms | ~1GB |
With mixed data types including text columns
🛠️ Development
# Clone repository
git clone https://github.com/hariharaprabhu/hybrid-vectorizer
cd hybrid-vectorizer
# Install in development mode
pip install -e .
# Run tests
python tests/test_basic.py
📄 License
Apache-2.0 License - see LICENSE file for details.
📞 Support
- Issues: GitHub Issues
- Documentation: See this README and docstrings
- Questions: Open an issue for questions or feature requests
HybridVectorizer - Making multimodal similarity search simple and powerful. 🚀