Unified embedding for tabular, text, and multimodal data
tabsearch
Most similarity search breaks on real-world mixed datasets (text + numbers + categories). This package fixes that.
tabsearch makes real-world datasets searchable — by meaning, category, and numbers — all at once.
🚀 Why This Package?
- Quick to use: Install → run demo → see results in 60 seconds
- Actually works: Find similar items across text, categories, and numbers
- Intelligently automated: Automatic type detection with configurable weights
# Text-only similarity: "GOOGL" ≈ ["Nike", "Starbucks"] 🤦
# tabsearch: "GOOGL" ≈ ["MSFT", "AMZN", "META"] ✅
🔧 Quick Start
pip install tabsearch
Hello World example:
import pandas as pd
from tabsearch import HybridVectorizer
# Simple mixed dataset
df = pd.DataFrame({
"id": [1, 2, 3],
"category": ["Tech", "Retail", "Tech"],
"price": [100, 50, 200],
"description": ["AI software", "Online store", "Cloud platform"]
})
hv = HybridVectorizer(index_column="id")
hv.fit_transform(df)
results = hv.similarity_search(df.iloc[0].to_dict(), ignore_exact_matches=True)
print(results) # Finds id=3 (both Tech + high price) over id=2 (different category)
Real-world example with S&P 500:
import pandas as pd
from tabsearch import HybridVectorizer
from tabsearch.datasets import load_sp500_demo
# Load real S&P 500 dataset
df = load_sp500_demo()
# Select mixed-type columns
df = df[["Symbol", "Sector", "Industry", "Currentprice", "Marketcap",
"Fulltimeemployees", "Longbusinesssummary"]].copy()
# Automatic setup - detects column types
hv = HybridVectorizer(index_column="Symbol")
vectors = hv.fit_transform(df)
print(f"Generated {vectors.shape[0]} vectors with {vectors.shape[1]} dimensions")
# Find companies similar to Google
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()
results = hv.similarity_search(query, ignore_exact_matches=True)
print(results[['Symbol', 'Sector', 'Industry', 'similarity']].head())
# Symbol  Sector      Industry                  similarity
# MSFT    Technology  Software—Infrastructure   0.92
# AMZN    Technology  Internet Retail           0.87
# META    Technology  Internet Content & Info   0.85
# AAPL    Technology  Consumer Electronics      0.83
🔍 How It Works
[text embeddings]   [categorical encoding]   [numerical scaling]
        │                     │                       │
        └────────────── weighted fusion ──────────────┘ → similarity score
- 📝 Text → Sentence transformer embeddings (Longbusinesssummary)
- 🏷️ Categories → Automatic encoding (Sector, Industry)
- 🔢 Numbers → Normalized scaling (Currentprice, Marketcap, Fulltimeemployees)
- ⚖️ Fusion → Configurable weighted combination
The key insight: Each data type contributes equally to similarity, regardless of dimension count.
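One way to give every block equal influence regardless of its dimension count is to L2-normalize each block separately, scale it by the square root of its weight, and concatenate. The sketch below is an illustration under that assumption, not tabsearch's confirmed internals; the helper names `l2_normalize`, `fuse`, and `cosine` and the toy rows are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(blocks, weights):
    """Concatenate unit-normalized blocks, each scaled by sqrt(weight)."""
    fused = []
    for name in weights:  # fixed block order so two rows line up
        scale = math.sqrt(weights[name])
        fused.extend(scale * x for x in l2_normalize(blocks[name]))
    return fused

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumed weights; the block names mirror the block_weights keys used in this README.
weights = {"text": 1.0, "categorical": 1.0, "numerical": 1.0}

# Toy rows: a 4-dim "text" block, a 2-dim one-hot category, 2 scaled numbers.
row_a = {"text": [0.9, 0.1, 0.0, 0.2], "categorical": [1.0, 0.0], "numerical": [0.8, 0.2]}
row_b = {"text": [0.1, 0.9, 0.3, 0.0], "categorical": [1.0, 0.0], "numerical": [0.7, 0.3]}

score = cosine(fuse(row_a, weights), fuse(row_b, weights))
print(f"fused similarity: {score:.3f}")
```

With this construction the fused cosine equals the weighted mean of the per-block cosines, so a 384-dimensional text block counts no more than a 2-dimensional numeric block unless its weight says so.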
⚙️ Configuration
# Control similarity focus with real S&P 500 data
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()
# Emphasize business description similarity
text_heavy = hv.similarity_search(
query,
block_weights={'text': 1.0, 'numerical': 0.5, 'categorical': 0.5},
ignore_exact_matches=True
)
# Emphasize financial metrics
numeric_heavy = hv.similarity_search(
query,
block_weights={'text': 0.3, 'numerical': 1.0, 'categorical': 0.5},
ignore_exact_matches=True
)
# Emphasize sector/industry similarity
categorical_heavy = hv.similarity_search(
query,
block_weights={'text': 0.3, 'numerical': 0.5, 'categorical': 1.0},
ignore_exact_matches=True
)
print("Text-heavy results:", text_heavy[['Symbol', 'Sector']].head(3).values.tolist())
print("Numeric-heavy results:", numeric_heavy[['Symbol', 'Sector']].head(3).values.tolist())
print("Categorical-heavy results:", categorical_heavy[['Symbol', 'Sector']].head(3).values.tolist())
✅ Features
- Automatic type detection - No manual column specification needed
- Block weight tuning - Control text vs numerical vs categorical importance
- Model persistence - Save/load trained models
- FAISS integration - Speed up search on large datasets
- Encoding inspection - See how each column was processed
💡 Use Cases
- Investment research → Find companies by business model + financial metrics
- E-commerce → Product recommendations by description + category + price
- Customer analytics → Segment users by demographics + behavior + purchase history
- Content matching → Similar articles by topic + engagement + metadata
⚠️ Known Limitations
- First-time run is slow → The package downloads a pre-trained sentence transformer (~100 MB). Subsequent runs are cached and much faster.
- Memory scaling → Each additional 1,000 rows adds ~100 MB in memory usage. For very large datasets, use the FAISS integration.
- FAISS optional → High-speed nearest neighbor search requires installing faiss-cpu (not bundled by default, especially tricky on macOS).
- GPU acceleration → Recommended if you plan to embed text for 100K+ rows; otherwise CPU is fine for small/medium data.
- Mixed data assumption → tabsearch is designed to handle text, categorical, and numerical data together. For text-only datasets, it still works effectively (via sentence-transformers), but its real advantage comes when your data mixes different types.
🔗 Advanced Usage
Model persistence:
# Save trained model
hv.save('sp500_model.pkl')
# Load later
hv2 = HybridVectorizer.load('sp500_model.pkl')
results = hv2.similarity_search(query)
FAISS integration for large datasets:
import faiss
import numpy as np
# L2 normalize vectors for cosine similarity
def l2norm(vectors):
norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
return vectors / norms
normalized_vectors = l2norm(vectors)
# Build FAISS index
index = faiss.IndexFlatIP(normalized_vectors.shape[1])
index.add(normalized_vectors.astype('float32'))
hv.set_vector_db(index)
# Now similarity_search uses FAISS internally
results = hv.similarity_search(query, top_n=10, ignore_exact_matches=True)
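To see why the vectors are L2-normalized before going into an inner-product index: for unit-length vectors the inner product is the cosine similarity, so an exhaustive inner-product search returns cosine nearest neighbors. The dependency-free sketch below mimics that exhaustive scoring in plain Python. It is only an illustration of the idea; `top_k_inner_product` and the toy vectors are hypothetical and are not part of FAISS or tabsearch.

```python
import math

def l2norm(vec):
    """Scale a vector to unit length (epsilon guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-12
    return [x / norm for x in vec]

def top_k_inner_product(query, vectors, k):
    """Score every stored vector against the query and return the k best (index, score) pairs."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy data: four stored 2-dim vectors and one query, all normalized first.
stored = [l2norm(v) for v in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]]
query_vec = l2norm([2.0, 0.1])

neighbors = top_k_inner_product(query_vec, stored, k=2)
for idx, score in neighbors:
    print(idx, round(score, 4))
```

Because every vector has unit length, each score stays within [-1, 1] and can be read directly as a cosine similarity.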
Inspect encoding:
# See how each column was processed
report = hv.get_encoding_report()
for column, encoding_type in report.items():
print(f"{column}: {encoding_type}")
# Output example:
# Symbol: categorical
# Sector: categorical
# Industry: categorical
# Currentprice: numerical
# Marketcap: numerical
# Fulltimeemployees: numerical
# Longbusinesssummary: text
❓ FAQ
Q: Can I use a different text model instead of the default?
Yes. By default, tabsearch uses the all-MiniLM-L6-v2 model from sentence-transformers.
Advanced users can override this in two ways:
- Pass a different model name:
from tabsearch import HybridVectorizer
hv = HybridVectorizer(default_text_model="multi-qa-mpnet-base-dot-v1")
- Pass a pre-loaded model:
from tabsearch import HybridVectorizer
from sentence_transformers import SentenceTransformer
custom_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cuda")
hv = HybridVectorizer(default_text_model=custom_model)
Q: Does this scale to millions of rows?
Yes, via external vector databases.
By default, embeddings are stored in memory as NumPy arrays.
For large datasets, connect to FAISS (or pgvector, Milvus, Qdrant) using:
hv.set_vector_db(faiss_index)
Q: How do I evaluate similarity quality without labels?
We recommend:
- Compare hybrid vs text-only vs numeric-only neighbors
- Use metrics like precision@k or nDCG if you have partial labels
- For exploratory use, inspect block-level contributions with explain_similarity()
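The first two suggestions above can be sketched with a pair of small helpers: a top-k overlap score for comparing two rankings when no labels exist, and precision@k for when a few labels are available. The function names, the ticker rankings, and the "relevant" set below are all hypothetical and serve only as an illustration; in practice the rankings would come from separate `similarity_search` calls.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k

def neighbor_overlap(ranking_a, ranking_b, k):
    """Jaccard overlap between the top-k ids of two rankings."""
    set_a, set_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Hypothetical rankings for one query; the ids are illustrative only.
hybrid = ["MSFT", "AMZN", "META", "AAPL"]
text_only = ["NKE", "SBUX", "MSFT", "AMZN"]
labeled_relevant = {"MSFT", "META"}  # a small, partial label set

p_hybrid = precision_at_k(hybrid, labeled_relevant, k=3)
p_text = precision_at_k(text_only, labeled_relevant, k=3)
overlap = neighbor_overlap(hybrid, text_only, k=3)
print(p_hybrid, p_text, overlap)
```

A low overlap between the hybrid and single-block rankings suggests that the extra blocks are changing the results, which makes those queries good candidates for manual inspection.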
Q: How is this different from just using FAISS or Pinecone directly?
Vector databases index embeddings but don’t handle mixed tabular data.
tabsearch automatically:
- Detects column types
- Encodes each block appropriately
- Fuses them with configurable weights
- Provides explainable results
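To make the first step concrete, here is one simple heuristic for column type detection. It is an assumption for illustration, not tabsearch's actual detection rule: the function name `detect_column_type` and the 30-character threshold are hypothetical.

```python
def detect_column_type(values, text_min_avg_len=30):
    """Classify a column as 'numerical', 'text', or 'categorical' (heuristic sketch)."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"  # nothing to inspect; fall back to the simplest encoding
    # All ints/floats (excluding bools) -> treat as a numeric column.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numerical"
    # Otherwise decide by average string length: long values look like free text.
    strings = [str(v) for v in present]
    avg_len = sum(len(s) for s in strings) / len(strings)
    return "text" if avg_len >= text_min_avg_len else "categorical"

columns = {
    "Sector": ["Technology", "Consumer Cyclical", "Technology"],
    "Currentprice": [100.0, 50.0, 200.0],
    "Longbusinesssummary": [
        "Provides cloud infrastructure, productivity software, and developer tools worldwide.",
        "Operates an online marketplace and a chain of physical retail stores.",
    ],
}
for name, values in columns.items():
    print(f"{name}: {detect_column_type(values)}")
```

To see what the library itself decided for each column, use `hv.get_encoding_report()` as shown under Advanced Usage.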
📊 Example Results
On the real S&P 500 dataset (500 companies, 7 mixed columns):
| Method | GOOGL → Top 3 Results | Why This Matters |
|---|---|---|
| Text-only | Consumer/retail companies | Ignores financial scale and tech sector |
| Numbers-only | Any large companies | Ignores business model similarity |
| Hybrid (this) | MSFT, AMZN, META | Captures business + financial + sector similarity |
⚡ Scaling Note:
The default in-memory NumPy backend works well up to ~100k rows.
For larger datasets, use hv.set_vector_db() with FAISS or another vector DB.
See the Scaling Guide for examples.
🛠️ Installation
Basic:
pip install tabsearch
With FAISS for large datasets:
pip install tabsearch faiss-cpu
With GPU support:
pip install tabsearch
pip install torch --index-url https://download.pytorch.org/whl/cu118
🧪 Try It Now
Run this demo in Google Colab:
🗺️ Roadmap
Planned improvements and extensions:
- Hugging Face ecosystem integration → Allow users to plug in any Hugging Face text embedding model, with plans to publish a Hugging Face Space demo.
- Benchmarks on open datasets → Evaluate performance on UCI Adult, Amazon product reviews, and other public datasets to demonstrate real-world gains.
- Streamlit demo UI → Build an interactive demo app to explore mixed-data similarity search without writing code.
- LangChain retriever wrapper (future) → Provide a retriever class for hybrid similarity search inside LangChain workflows.
- Community feedback loop → Expand tutorials, examples, and features based on user input and contributions.
- Performance testing → Test this package on various datasets and record the benchmarks.
📄 License
Apache-2.0
Try it because: tabsearch is the fastest way to get meaningful similarity search on datasets that combine business descriptions, financial metrics, and categorical data.