Unified embedding for tabular, text, and multimodal data

tabsearch

Most similarity search breaks on real-world mixed datasets (text + numbers + categories). This package fixes that.

tabsearch makes real-world datasets searchable — by meaning, category, and numbers — all at once.

🚀 Why This Package?

Quick to use: Install → run demo → see results in 60 seconds
Actually works: Find similar items across text, categories, and numbers
Intelligently automated: Automatic type detection with configurable weights

# Text-only similarity: "GOOGL" ≈ ["Nike", "Starbucks"] 🤦
# tabsearch: "GOOGL" ≈ ["MSFT", "AMZN", "META"] ✅

🔧 Quick Start

pip install tabsearch

Hello World example:

import pandas as pd
from tabsearch import HybridVectorizer

# Simple mixed dataset
df = pd.DataFrame({
    "id": [1, 2, 3],
    "category": ["Tech", "Retail", "Tech"], 
    "price": [100, 50, 200],
    "description": ["AI software", "Online store", "Cloud platform"]
})

hv = HybridVectorizer(index_column="id")
hv.fit_transform(df)
results = hv.similarity_search(df.iloc[0].to_dict(), ignore_exact_matches=True)
print(results)  # Expect id=3 (same Tech category) to rank above id=2 (different category)

Real-world example with S&P 500:

import pandas as pd
from tabsearch import HybridVectorizer
from tabsearch.datasets import load_sp500_demo

# Load real S&P 500 dataset
df = load_sp500_demo()

# Select mixed-type columns
df = df[["Symbol", "Sector", "Industry", "Currentprice", "Marketcap", 
         "Fulltimeemployees", "Longbusinesssummary"]].copy()

# Automatic setup - detects column types
hv = HybridVectorizer(index_column="Symbol")
vectors = hv.fit_transform(df)
print(f"Generated {vectors.shape[0]} vectors with {vectors.shape[1]} dimensions")

# Find companies similar to Google
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()
results = hv.similarity_search(query, ignore_exact_matches=True)

print(results[['Symbol', 'Sector', 'Industry', 'similarity']].head())
#   Symbol  Sector      Industry                    similarity
#   MSFT    Technology  Software—Infrastructure     0.92
#   AMZN    Technology  Internet Retail             0.87
#   META    Technology  Internet Content & Info     0.85
#   AAPL    Technology  Consumer Electronics        0.83

🔍 How It Works

[text embeddings]  [categorical encoding]  [numerical scaling]
         │                    │                      │
         └──────────── weighted fusion ──────────────┘  → similarity score

📝 Text → Sentence transformer embeddings (Longbusinesssummary)
🏷️ Categories → Automatic encoding (Sector, Industry)
🔢 Numbers → Normalized scaling (Currentprice, Marketcap, Fulltimeemployees)
⚖️ Fusion → Configurable weighted combination
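
As a rough illustration of the numerical and categorical steps listed above, the sketch below min-max scales a numeric column and one-hot encodes a categorical one with NumPy. The helper names `minmax_scale` and `one_hot` are hypothetical and are not part of tabsearch; the encoders the package uses internally may differ. Text embedding is left out because it needs a downloaded model.

```python
import numpy as np

def minmax_scale(values):
    """Scale a 1-D numeric column into [0, 1]; a constant column maps to zeros."""
    x = np.asarray(values, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x).reshape(-1, 1)
    return ((x - x.min()) / span).reshape(-1, 1)

def one_hot(values):
    """One-hot encode a categorical column; output columns follow sorted category order."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out, categories

# Same toy values as the Hello World example
price_block = minmax_scale([100, 50, 200])
category_block, categories = one_hot(["Tech", "Retail", "Tech"])
print(price_block.ravel())
print(categories, category_block.tolist())
```

Each helper returns a 2-D block with one row per record, so the blocks can later be combined column-wise.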

The key insight: Each data type contributes equally to similarity, regardless of dimension count.
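
One way to make each block contribute equally regardless of its width is to L2-normalize every block per row and scale it by the square root of its weight before concatenating. The dot product of two fused rows then equals the weighted sum of the per-block cosine similarities. The sketch below shows this with random stand-in data; `fuse_blocks` is a hypothetical helper, and this is an assumption about how such fusion can work, not a description of tabsearch's internals.

```python
import numpy as np

def l2_normalize(block, eps=1e-12):
    """Normalize each row of a 2-D block to unit length."""
    block = np.asarray(block, dtype=float)
    norms = np.linalg.norm(block, axis=1, keepdims=True)
    return block / np.maximum(norms, eps)

def fuse_blocks(blocks, weights):
    """Concatenate row-normalized blocks, each scaled by sqrt(weight)."""
    parts = [np.sqrt(weights[name]) * l2_normalize(blocks[name]) for name in blocks]
    return np.hstack(parts)

# Random stand-in data: a wide text block and two narrow blocks
rng = np.random.default_rng(0)
blocks = {
    "text": rng.normal(size=(4, 384)),
    "categorical": rng.normal(size=(4, 5)),
    "numerical": rng.normal(size=(4, 3)),
}
weights = {"text": 1.0, "categorical": 1.0, "numerical": 1.0}
fused = fuse_blocks(blocks, weights)

# Cosine similarity between rows 0 and 1, computed separately per block
per_block_cos = {
    name: float(l2_normalize(b)[0] @ l2_normalize(b)[1]) for name, b in blocks.items()
}
# Dot product of the fused rows: matches the weighted sum of per-block cosines
fused_dot = float(fused[0] @ fused[1])
print(fused_dot, sum(weights[n] * per_block_cos[n] for n in blocks))
```

Without the per-block normalization, the 384-dimensional text block would dominate the 3-dimensional numerical block simply because it has more coordinates.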

⚙️ Configuration

# Control similarity focus with real S&P 500 data
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()

# Emphasize business description similarity
text_heavy = hv.similarity_search(
    query,
    block_weights={'text': 1.0, 'numerical': 0.5, 'categorical': 0.5},
    ignore_exact_matches=True
)

# Emphasize financial metrics
numeric_heavy = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 1.0, 'categorical': 0.5},
    ignore_exact_matches=True
)

# Emphasize sector/industry similarity
categorical_heavy = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 0.5, 'categorical': 1.0},
    ignore_exact_matches=True
)

print("Text-heavy results:", text_heavy[['Symbol', 'Sector']].head(3).values.tolist())
print("Numeric-heavy results:", numeric_heavy[['Symbol', 'Sector']].head(3).values.tolist()) 
print("Categorical-heavy results:", categorical_heavy[['Symbol', 'Sector']].head(3).values.tolist())
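
To put a number on how much a change in `block_weights` shifts the results, the top-k identifiers from two searches can be compared with a Jaccard overlap. This is a minimal sketch; `jaccard_overlap` is a hypothetical helper, and the identifier lists are placeholders rather than real tabsearch output.

```python
def jaccard_overlap(ids_a, ids_b):
    """Share of identifiers common to both lists, from 0.0 (disjoint) to 1.0 (same set)."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder identifiers standing in for two top-3 result lists
text_heavy_ids = ["A", "B", "C"]
numeric_heavy_ids = ["B", "C", "D"]
overlap = jaccard_overlap(text_heavy_ids, numeric_heavy_ids)
print(overlap)
```

With the result frames above, the inputs would be something like `text_heavy['Symbol'].head(10)` and `numeric_heavy['Symbol'].head(10)`. A low overlap suggests the weights are changing which neighbors are returned.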

✅ Features

Automatic type detection - No manual column specification needed
Block weight tuning - Control text vs numerical vs categorical importance
Model persistence - Save/load trained models
FAISS integration - Speed up search on large datasets
Encoding inspection - See how each column was processed

💡 Use Cases

Investment research → Find companies by business model + financial metrics
E-commerce → Product recommendations by description + category + price
Customer analytics → Segment users by demographics + behavior + purchase history
Content matching → Similar articles by topic + engagement + metadata

⚠️ Known Limitations

First-time run is slow → The package downloads a pre-trained sentence transformer (~100 MB). Subsequent runs are cached and much faster.
Memory scaling → Each additional 1,000 rows adds ~100 MB in memory usage. For very large datasets, use the FAISS integration.
FAISS optional → High-speed nearest neighbor search requires installing faiss-cpu (not bundled by default, especially tricky on macOS).
GPU acceleration → Recommended if you plan to embed text for 100K+ rows; otherwise CPU is fine for small/medium data.
Mixed data assumption → tabsearch is designed to handle text, categorical, and numerical data together. For text-only datasets, it still works effectively (via sentence-transformers), but its real advantage comes when your data mixes different types.

🔗 Advanced Usage

Model persistence:

# Save trained model
hv.save('sp500_model.pkl')

# Load later
hv2 = HybridVectorizer.load('sp500_model.pkl')
results = hv2.similarity_search(query)

FAISS integration for large datasets:

import faiss
import numpy as np

# L2-normalize vectors so that inner product equals cosine similarity
def l2norm(vectors):
    norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
    return vectors / norms

# FAISS expects contiguous float32 arrays
normalized_vectors = np.ascontiguousarray(l2norm(vectors), dtype='float32')

# Build an exact inner-product FAISS index
index = faiss.IndexFlatIP(normalized_vectors.shape[1])
index.add(normalized_vectors)
hv.set_vector_db(index)

# Now similarity_search uses FAISS internally
results = hv.similarity_search(query, top_n=10, ignore_exact_matches=True)
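
The normalization step matters because an inner-product index such as `IndexFlatIP` ranks by dot product, which equals cosine similarity only for unit-length vectors. The NumPy-only check below uses random stand-in vectors and needs neither FAISS nor tabsearch.

```python
import numpy as np

def l2norm(vectors):
    """Scale each row to unit length (the small epsilon avoids division by zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
    return vectors / norms

# Random stand-in vectors; row 0 doubles as the query
rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 16))
query = vectors[0]

# Cosine similarity computed from its definition
cosine = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Plain inner product on L2-normalized vectors
normalized = l2norm(vectors)
inner = normalized @ normalized[0]

top5_cosine = np.argsort(-cosine)[:5]
top5_inner = np.argsort(-inner)[:5]
print(np.allclose(cosine, inner), top5_cosine.tolist() == top5_inner.tolist())
```

Both scores agree up to floating-point error, so the two top-5 rankings come out the same.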

Inspect encoding:

# See how each column was processed
report = hv.get_encoding_report()
for column, encoding_type in report.items():
    print(f"{column}: {encoding_type}")

# Output example:
# Symbol: categorical
# Sector: categorical  
# Industry: categorical
# Currentprice: numerical
# Marketcap: numerical
# Fulltimeemployees: numerical
# Longbusinesssummary: text
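
For intuition about how a column could end up with one of these labels, here is a deliberately simple heuristic: all-numeric columns are numerical, columns whose values average several words are text, and everything else is categorical. The function `guess_column_type` and its threshold are hypothetical; the rules tabsearch actually applies are not documented here and may differ.

```python
from numbers import Number

def guess_column_type(values, text_min_avg_words=4):
    """Label a column 'numerical', 'text', or 'categorical' using simple rules."""
    present = [v for v in values if v is not None]
    # Numeric if every non-missing value is a number (bools are excluded on purpose)
    if present and all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "numerical"
    # Otherwise separate free text from short labels by average word count
    avg_words = sum(len(str(v).split()) for v in present) / max(len(present), 1)
    return "text" if avg_words >= text_min_avg_words else "categorical"

# Made-up placeholder rows, not taken from the S&P 500 demo data
columns = {
    "Sector": ["Technology", "Healthcare", "Technology"],
    "Currentprice": [100.5, 50.0, 200.25],
    "Longbusinesssummary": [
        "Builds cloud software and developer tools for enterprises.",
        "Operates hospitals and outpatient clinics across several regions.",
        "Designs consumer electronics and related online services.",
    ],
}
report = {name: guess_column_type(vals) for name, vals in columns.items()}
for column, encoding_type in report.items():
    print(f"{column}: {encoding_type}")
```

A word-count rule like this treats short identifiers as categorical, which is consistent with `Symbol` appearing as categorical in the report above.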

❓ FAQ

Q: Can I use a different text model instead of the default?
Yes. By default, tabsearch uses the all-MiniLM-L6-v2 model from sentence-transformers.
Advanced users can override this in two ways:

Pass a different model name:

from tabsearch import HybridVectorizer
hv = HybridVectorizer(default_text_model="multi-qa-mpnet-base-dot-v1")

Pass a pre-loaded model:

from tabsearch import HybridVectorizer
from sentence_transformers import SentenceTransformer

custom_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cuda")
hv = HybridVectorizer(default_text_model=custom_model)

Q: Does this scale to millions of rows?
Yes, via external vector databases.
By default, embeddings are stored in memory as NumPy arrays.
For large datasets, connect to FAISS (or pgvector, Milvus, Qdrant) using:

hv.set_vector_db(faiss_index)

Q: How do I evaluate similarity quality without labels?
We recommend:

Compare hybrid vs text-only vs numeric-only neighbors
Use metrics like precision@k or nDCG if you have partial labels
For exploratory use, inspect block-level contributions with explain_similarity()
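
When partial labels are available, precision@k and nDCG can be computed directly from a ranked list of result identifiers. The sketch below uses the standard definitions; the function names, identifiers, and relevance labels are illustrative and not part of tabsearch.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k, where relevance maps id -> graded gain and unlabeled ids count as 0."""
    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative ranking and partial labels
ranked = ["a", "b", "c", "d"]
relevance = {"a": 1.0, "c": 1.0, "e": 1.0}
p3 = precision_at_k(ranked, set(relevance), k=3)
n3 = ndcg_at_k(ranked, relevance, k=3)
print(round(p3, 3), round(n3, 3))
```

Running both metrics on the hybrid, text-only, and numeric-only rankings for the same queries gives a like-for-like comparison.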

Q: How is this different from just using FAISS or Pinecone directly?
Vector databases index embeddings but don’t handle mixed tabular data.
tabsearch automatically:

Detects column types
Encodes each block appropriately
Fuses them with configurable weights
Provides explainable results

📊 Example Results

On the real S&P 500 dataset (500 companies, 7 mixed columns):

Method	GOOGL → Top 3 Results	Why This Matters
Text-only	Consumer/retail companies	Ignores financial scale and tech sector
Numbers-only	Any large companies	Ignores business model similarity
Hybrid (this)	MSFT, AMZN, META	Captures business + financial + sector similarity

⚡ Scaling Note:
The default in-memory NumPy backend works well up to ~100k rows.
For larger datasets, use hv.set_vector_db() with FAISS or another vector DB.
See the Scaling Guide for examples.

🛠️ Installation

Basic:

pip install tabsearch

With FAISS for large datasets:

pip install tabsearch faiss-cpu

With GPU support:

pip install tabsearch
pip install torch --index-url https://download.pytorch.org/whl/cu118
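
After installing, a quick check can confirm whether PyTorch sees a GPU. This is a small sketch using `torch.cuda.is_available()`; the `pick_device` helper is hypothetical and falls back to CPU when PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Embedding device: {device}")
```

The resulting string can be passed as `device=` when constructing a `SentenceTransformer`, as in the pre-loaded model example in the FAQ.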

🧪 Try It Now

Run this demo in Google Colab:

🗺️ Roadmap

Planned improvements and extensions:

Hugging Face ecosystem integration
Allow users to plug in any Hugging Face text embedding model, with plans to publish a Hugging Face Space demo.
Benchmarks on open datasets
Evaluate performance on UCI Adult, Amazon product reviews, and other public datasets to demonstrate real-world gains.
Streamlit demo UI
Build an interactive demo app to explore mixed-data similarity search without writing code.
LangChain retriever wrapper (future)
Provide a retriever class for hybrid similarity search inside LangChain workflows.
Community feedback loop
Expand tutorials, examples, and features based on user input and contributions.
Performance testing
Test this package on various datasets and record the benchmarks.

📄 License

Apache-2.0

Try it because: tabsearch is a quick way to get meaningful similarity search on datasets that combine business descriptions, financial metrics, and categorical data.

Release history Release notifications | RSS feed

This version

0.1.3

Sep 23, 2025

0.1.2

Sep 21, 2025

Download files

Source Distribution

tabsearch-0.1.3.tar.gz (27.5 kB view details)

Uploaded Sep 23, 2025 Source

Built Distribution

tabsearch-0.1.3-py3-none-any.whl (21.5 kB view details)

Uploaded Sep 23, 2025 Python 3

File details

Details for the file tabsearch-0.1.3.tar.gz.

File metadata

Download URL: tabsearch-0.1.3.tar.gz
Upload date: Sep 23, 2025
Size: 27.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for tabsearch-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`9c84c03a1526d65690b367cfd7989a76e7a58a8d32e0d30b130196de71418ebc`
MD5	`76f90bb863a5a795bcbadbb94f92a0e3`
BLAKE2b-256	`9eb8203f05630c122df66784b1a98c3ddc1722b7c7c33e27ef49983f0bb63f78`

File details

Details for the file tabsearch-0.1.3-py3-none-any.whl.

File metadata

Download URL: tabsearch-0.1.3-py3-none-any.whl
Upload date: Sep 23, 2025
Size: 21.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for tabsearch-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e9a2e4fb0fc6d8620ecda5c0d830a3ecebf80a0ff3a863b7a6f552d233a4bb04`
MD5	`3335560c495a47eb2391d1059b4f125e`
BLAKE2b-256	`27df9a78c53da3d2e2d48baff1ad1f4163d88db4a2029be66be940c2839eeb0e`

