
Unified embedding for tabular, text, and multimodal data


tabsearch

Most similarity search breaks on real-world mixed datasets (text + numbers + categories). This package fixes that.


tabsearch makes real-world datasets searchable — by meaning, category, and numbers — all at once.


🚀 Why This Package?

  • Quick to use: Install → run demo → see results in 60 seconds
  • Actually works: Find similar items across text, categories, and numbers
  • Intelligently automated: Automatic type detection with configurable weights
# Text-only similarity: "GOOGL" ≈ ["Nike", "Starbucks"] 🤦
# tabsearch: "GOOGL" ≈ ["MSFT", "AMZN", "META"] ✅

🔧 Quick Start

pip install tabsearch

Hello World example:

import pandas as pd
from tabsearch import HybridVectorizer

# Simple mixed dataset
df = pd.DataFrame({
    "id": [1, 2, 3],
    "category": ["Tech", "Retail", "Tech"], 
    "price": [100, 50, 200],
    "description": ["AI software", "Online store", "Cloud platform"]
})

hv = HybridVectorizer(index_column="id")
hv.fit_transform(df)
results = hv.similarity_search(df.iloc[0].to_dict(), ignore_exact_matches=True)
print(results)  # Expected: id=3 (also Tech) ranks above id=2 (different category)

Real-world example with S&P 500:

import pandas as pd
from tabsearch import HybridVectorizer
from tabsearch.datasets import load_sp500_demo

# Load real S&P 500 dataset
df = load_sp500_demo()

# Select mixed-type columns
df = df[["Symbol", "Sector", "Industry", "Currentprice", "Marketcap", 
         "Fulltimeemployees", "Longbusinesssummary"]].copy()

# Automatic setup - detects column types
hv = HybridVectorizer(index_column="Symbol")
vectors = hv.fit_transform(df)
print(f"Generated {vectors.shape[0]} vectors with {vectors.shape[1]} dimensions")

# Find companies similar to Google
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()
results = hv.similarity_search(query, ignore_exact_matches=True)

print(results[['Symbol', 'Sector', 'Industry', 'similarity']].head())
#   Symbol  Sector      Industry                    similarity
#   MSFT    Technology  Software—Infrastructure     0.92
#   AMZN    Technology  Internet Retail             0.87
#   META    Technology  Internet Content & Info     0.85
#   AAPL    Technology  Consumer Electronics        0.83

🔍 How It Works

[text embeddings]  [categorical encoding]  [numerical scaling]
         │                    │                      │
         └──────────── weighted fusion ──────────────┘  → similarity score
  • 📝 Text → Sentence transformer embeddings (Longbusinesssummary)
  • 🏷️ Categories → Automatic encoding (Sector, Industry)
  • 🔢 Numbers → Normalized scaling (Currentprice, Marketcap, Fulltimeemployees)
  • ⚖️ Fusion → Configurable weighted combination

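The key insight: by default, each data type contributes equally to similarity, regardless of how many dimensions its encoding produces. A high-dimensional text embedding therefore does not drown out a handful of numeric columns.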
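
One way to get this equal-contribution behaviour can be sketched in plain Python. This is an illustration under stated assumptions, not tabsearch's actual implementation: each block is L2-normalized, scaled by the square root of its share of the total weight, and concatenated, so the dot product of two fused vectors becomes a weighted average of per-block cosine similarities. The helper names (`l2_normalize`, `fuse`, `dot`) and the toy vectors are hypothetical; only the block names mirror the `block_weights` keys used in this README.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(blocks, weights):
    """Concatenate unit-normalized blocks, each scaled by sqrt(weight share)."""
    total = sum(weights[name] for name in blocks)
    fused = []
    for name in sorted(blocks):  # fixed order so two fused vectors line up
        scale = math.sqrt(weights[name] / total)
        fused.extend(scale * x for x in l2_normalize(blocks[name]))
    return fused

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy items (made-up values): same text and numbers, different category.
item_a = {"text": [0.2, 0.4, 0.4, 0.8], "categorical": [1.0, 0.0], "numerical": [0.6, 0.8]}
item_b = {"text": [0.2, 0.4, 0.4, 0.8], "categorical": [0.0, 1.0], "numerical": [0.6, 0.8]}

equal = {"text": 1.0, "categorical": 1.0, "numerical": 1.0}
cat_heavy = {"text": 1.0, "categorical": 4.0, "numerical": 1.0}

score_equal = dot(fuse(item_a, equal), fuse(item_b, equal))
score_cat_heavy = dot(fuse(item_a, cat_heavy), fuse(item_b, cat_heavy))
print(round(score_equal, 3), round(score_cat_heavy, 3))
```

In this sketch the 4-dimensional text block and the 2-dimensional categorical block carry the same weight under `equal`, and raising the categorical weight lowers the score because the two items disagree on category.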


⚙️ Configuration

# Control similarity focus with real S&P 500 data
query = df.loc[df['Symbol']=='GOOGL'].iloc[0].to_dict()

# Emphasize business description similarity
text_heavy = hv.similarity_search(
    query,
    block_weights={'text': 1.0, 'numerical': 0.5, 'categorical': 0.5},
    ignore_exact_matches=True
)

# Emphasize financial metrics
numeric_heavy = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 1.0, 'categorical': 0.5},
    ignore_exact_matches=True
)

# Emphasize sector/industry similarity
categorical_heavy = hv.similarity_search(
    query,
    block_weights={'text': 0.3, 'numerical': 0.5, 'categorical': 1.0},
    ignore_exact_matches=True
)

print("Text-heavy results:", text_heavy[['Symbol', 'Sector']].head(3).values.tolist())
print("Numeric-heavy results:", numeric_heavy[['Symbol', 'Sector']].head(3).values.tolist()) 
print("Categorical-heavy results:", categorical_heavy[['Symbol', 'Sector']].head(3).values.tolist())
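
To judge how much a change in block_weights actually moves the results, it can help to compare the top-k lists from two weightings. The helper below is a minimal sketch; `top_k_overlap` is a hypothetical name, not part of tabsearch, and the ranked lists are made-up placeholders standing in for something like `text_heavy['Symbol'].tolist()`.

```python
def top_k_overlap(first, second, k=3):
    """Jaccard overlap between the top-k entries of two ranked lists."""
    a, b = set(first[:k]), set(second[:k])
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Placeholder rankings for illustration only (not real search output).
text_ranked = ["MSFT", "META", "AMZN", "ORCL"]
numeric_ranked = ["MSFT", "AAPL", "AMZN", "NVDA"]

overlap = top_k_overlap(text_ranked, numeric_ranked, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```

An overlap near 1.0 suggests the weighting barely matters for that query; a low overlap suggests the blocks disagree and the weights deserve tuning.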

✅ Features

  • Automatic type detection - No manual column specification needed
  • Block weight tuning - Control text vs numerical vs categorical importance
  • Model persistence - Save/load trained models
  • FAISS integration - Speed up search on large datasets
  • Encoding inspection - See how each column was processed

💡 Use Cases

  • Investment research → Find companies by business model + financial metrics
  • E-commerce → Product recommendations by description + category + price
  • Customer analytics → Segment users by demographics + behavior + purchase history
  • Content matching → Similar articles by topic + engagement + metadata

⚠️ Known Limitations

  • First-time run is slow → The package downloads a pre-trained sentence transformer (~100 MB). Subsequent runs are cached and much faster.

  • Memory scaling → Each additional 1,000 rows adds ~100 MB in memory usage. For very large datasets, use the FAISS integration.

  • FAISS optional → High-speed nearest neighbor search requires installing faiss-cpu separately. It is not bundled by default, and installation can be tricky on macOS.

  • GPU acceleration → Recommended if you plan to embed text for 100K+ rows; otherwise CPU is fine for small/medium data.

  • Mixed data assumption → tabsearch is designed to handle text, categorical, and numerical data together. For text-only datasets, it still works effectively (via sentence-transformers), but its real advantage comes when your data mixes different types.


🔗 Advanced Usage

Model persistence:

# Save trained model
hv.save('sp500_model.pkl')

# Load later
hv2 = HybridVectorizer.load('sp500_model.pkl')
results = hv2.similarity_search(query)

FAISS integration for large datasets:

import faiss
import numpy as np

# L2 normalize vectors for cosine similarity
def l2norm(vectors):
    norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
    return vectors / norms

normalized_vectors = l2norm(vectors)

# Build FAISS index
index = faiss.IndexFlatIP(normalized_vectors.shape[1])
index.add(normalized_vectors.astype('float32'))
hv.set_vector_db(index)

# Now similarity_search uses FAISS internally
results = hv.similarity_search(query, top_n=10, ignore_exact_matches=True)
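
The normalization step matters because an inner-product index such as IndexFlatIP only returns cosine similarity when every vector has unit length. The small check below illustrates this with plain Python lists rather than FAISS; the helper names and example vectors are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Divide by the L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec)) + eps
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_raw = inner_product(a, b)                                     # depends on vector length
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))  # matches cosine
cos_ab = cosine(a, b)
print(ip_raw, round(ip_normalized, 4), round(cos_ab, 4))
```

Without normalization, longer vectors would score higher simply because of their magnitude.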

Inspect encoding:

# See how each column was processed
report = hv.get_encoding_report()
for column, encoding_type in report.items():
    print(f"{column}: {encoding_type}")

# Output example:
# Symbol: categorical
# Sector: categorical  
# Industry: categorical
# Currentprice: numerical
# Marketcap: numerical
# Fulltimeemployees: numerical
# Longbusinesssummary: text
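
To give a feel for what automatic type detection involves, here is a simplified heuristic that produces a report of the same shape. It is an assumption-laden sketch, not the rules tabsearch applies: `guess_column_type` and its `text_min_words` threshold are hypothetical, and the column values are placeholders.

```python
def guess_column_type(values, text_min_words=4):
    """Classify a column as 'numerical', 'categorical', or 'text'."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "categorical"
    # bool is a subclass of int, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "numerical"
    # Long free-form strings are treated as text, short labels as categories.
    avg_words = sum(len(str(v).split()) for v in non_null) / len(non_null)
    return "text" if avg_words >= text_min_words else "categorical"

# Placeholder data for illustration only.
columns = {
    "Symbol": ["AAA", "BBB", "CCC"],
    "Sector": ["Technology", "Technology", "Health Care"],
    "Currentprice": [10.5, 20.0, 30.25],
    "Longbusinesssummary": [
        "Builds cloud software for large enterprises",
        "Sells consumer goods through online stores",
        "Develops medical devices and diagnostic tools",
    ],
}

report = {name: guess_column_type(vals) for name, vals in columns.items()}
for column, encoding_type in report.items():
    print(f"{column}: {encoding_type}")
```

A word-count threshold is used here instead of a uniqueness ratio so that an identifier-like column such as Symbol, which is unique per row, still lands in the categorical block.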

❓ FAQ

Q: Can I use a different text model instead of the default?
Yes. By default, tabsearch uses the all-MiniLM-L6-v2 model from sentence-transformers.
Advanced users can override this in two ways:

  • Pass a different model name:

    from tabsearch import HybridVectorizer
    hv = HybridVectorizer(default_text_model="multi-qa-mpnet-base-dot-v1")
    
  • Pass a pre-loaded model:

    from tabsearch import HybridVectorizer
    from sentence_transformers import SentenceTransformer
    
    custom_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cuda")
    hv = HybridVectorizer(default_text_model=custom_model)
    

Q: Does this scale to millions of rows?
Yes, via external vector databases.
By default, embeddings are stored in memory as NumPy arrays.
For large datasets, connect to FAISS (or pgvector, Milvus, Qdrant) using:

hv.set_vector_db(faiss_index)

Q: How do I evaluate similarity quality without labels?
We recommend:

  • Compare hybrid vs text-only vs numeric-only neighbors
  • Use metrics like precision@k or nDCG if you have partial labels
  • For exploratory use, inspect block-level contributions with explain_similarity()
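
If partial labels are available, precision@k and nDCG can be computed directly from a ranked list of result IDs. The functions below are a minimal, library-free sketch; the names `precision_at_k` and `ndcg_at_k` and the example ranking are hypothetical and not part of tabsearch.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k

def ndcg_at_k(ranked, gains, k):
    """Normalized discounted cumulative gain with graded relevance."""
    dcg = sum(gains.get(item, 0.0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up ranking and labels: "A" is highly relevant, "B" somewhat relevant.
ranked = ["B", "X", "A", "Y"]
gains = {"A": 2.0, "B": 1.0}

p3 = precision_at_k(ranked, set(gains), k=3)
n3 = ndcg_at_k(ranked, gains, k=3)
print(f"precision@3={p3:.3f}  nDCG@3={n3:.3f}")
```

Running the same labeled queries against hybrid, text-only, and numeric-only configurations gives a like-for-like comparison of the three.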

Q: How is this different from just using FAISS or Pinecone directly?
Vector databases index embeddings but don’t handle mixed tabular data.
tabsearch automatically:

  • Detects column types
  • Encodes each block appropriately
  • Fuses them with configurable weights
  • Provides explainable results

📊 Example Results

On the real S&P 500 dataset (500 companies, 7 mixed columns):

| Method        | GOOGL → Top 3 Results     | Why This Matters                                  |
|---------------|---------------------------|---------------------------------------------------|
| Text-only     | Consumer/retail companies | Ignores financial scale and tech sector           |
| Numbers-only  | Any large companies       | Ignores business model similarity                 |
| Hybrid (this) | MSFT, AMZN, META          | Captures business + financial + sector similarity |

Scaling Note:
The default in-memory NumPy backend works well up to ~100k rows.
For larger datasets, use hv.set_vector_db() with FAISS or another vector DB.
See the Scaling Guide for examples.

🛠️ Installation

Basic:

pip install tabsearch

With FAISS for large datasets:

pip install tabsearch faiss-cpu

With GPU support:

pip install tabsearch
pip install torch --index-url https://download.pytorch.org/whl/cu118

🧪 Try It Now

Run this demo in Google Colab.

🗺️ Roadmap

Planned improvements and extensions:

  • Hugging Face ecosystem integration
    Allow users to plug in any Hugging Face text embedding model, with plans to publish a Hugging Face Space demo.

  • Benchmarks on open datasets
    Evaluate performance on UCI Adult, Amazon product reviews, and other public datasets to demonstrate real-world gains.

  • Streamlit demo UI
    Build an interactive demo app to explore mixed-data similarity search without writing code.

  • LangChain retriever wrapper (future)
    Provide a retriever class for hybrid similarity search inside LangChain workflows.

  • Community feedback loop
    Expand tutorials, examples, and features based on user input and contributions.

  • Performance testing
    Test the package on various datasets and record the benchmarks.


📄 License

Apache-2.0


Try it because: tabsearch is the fastest way to get meaningful similarity search on datasets that combine business descriptions, financial metrics, and categorical data.

