A Polars plugin for fast lexical text embeddings

Polars Luxical

A high-performance Polars plugin for Luxical text embeddings, implemented in Rust.

Overview

This plugin provides Luxical embeddings directly within Polars expressions. Luxical combines:

  • Subword tokenization (BERT uncased)
  • N-gram feature extraction with TF-IDF weighting
  • Sparse-to-dense neural network projection via knowledge distillation
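The three stages above can be illustrated with a toy, standard-library-only sketch. It is not the plugin's implementation: whitespace splitting stands in for BERT subword tokenization, a fixed pseudo-random projection stands in for the distilled network, and the names `ngrams`, `tfidf_features`, and `project` are hypothetical.

```python
import math
import random
import zlib
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all contiguous n-grams of length 1..n_max as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_features(docs, n_max=2):
    """Build one sparse TF-IDF vector (a dict) per document."""
    # Assumption: lowercase + whitespace split replaces subword tokenization.
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    return [
        {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
         for g, tf in c.items()}
        for c in counts
    ]


def project(sparse, dim=8):
    """Map a sparse vector to a unit-length dense vector.

    Assumption: each n-gram gets a fixed pseudo-random row, standing in
    for the learned projection weights of a real model.
    """
    dense = [0.0] * dim
    for gram, weight in sparse.items():
        rng = random.Random(zlib.crc32(" ".join(gram).encode("utf-8")))
        for j in range(dim):
            dense[j] += weight * rng.gauss(0.0, 1.0)
    norm = math.sqrt(sum(x * x for x in dense)) or 1.0
    return [x / norm for x in dense]


docs = ["Polars and Rust are fast", "Polars and Rust are fast", "Hello world"]
embeddings = [project(vec) for vec in tfidf_features(docs)]
```

Because every step is a lookup or a weighted sum, there is no transformer forward pass anywhere in the pipeline, which is where the throughput advantage comes from.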

Luxical models achieve dramatically higher throughput than transformer-based embedding models while maintaining competitive quality for document-level similarity tasks like clustering, classification, and semantic deduplication.

Note that these models were not trained on queries, so they are not suitable for search. The benchmarks demonstrate this: query retrieval is fast, but the results are not useful.

Installation

pip install polars-luxical

Or build from source:

maturin develop --release

Model Download

Models are automatically downloaded from HuggingFace Hub and cached locally on first use.

Cache locations:

  • Linux: ~/.cache/polars-luxical/
  • macOS: ~/Library/Caches/polars-luxical/
  • Windows: C:\Users\<User>\AppData\Local\polars-luxical\
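To locate the cache programmatically (for example, to inspect or delete downloaded models), the locations above can be mirrored with a small helper. This is a sketch only: `default_cache_dir` is a hypothetical function, not part of the plugin, and the plugin may resolve its cache path differently (for instance by honoring environment variables).

```python
import sys
from pathlib import Path


def default_cache_dir(app_name: str = "polars-luxical") -> Path:
    """Hypothetical helper mirroring the documented per-platform locations."""
    home = Path.home()
    if sys.platform == "win32":
        base = home / "AppData" / "Local"
    elif sys.platform == "darwin":
        base = home / "Library" / "Caches"
    else:
        # Linux and other POSIX systems
        base = home / ".cache"
    return base / app_name


print(default_cache_dir())
```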

To use a local model file instead:

from polars_luxical import register_model

register_model("/path/to/your/model")

Both .safetensors and .npz formats are supported.

Usage

import polars as pl
from polars_luxical import register_model, embed_text

# Register a Luxical model (downloads and caches automatically)
register_model("DatologyAI/luxical-one")

# Create a DataFrame
df = pl.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "Hello world",
        "Machine learning is fascinating",
        "Polars and Rust are fast",
    ],
})

# Embed text
df_emb = df.with_columns(
    embed_text("text", model_id="DatologyAI/luxical-one").alias("embedding")
)
print(df_emb)

# Or use the namespace API
df_emb = df.luxical.embed(
    columns="text",
    model_name="DatologyAI/luxical-one",
    output_column="embedding",
)

# Retrieve similar documents
results = df_emb.luxical.retrieve(
    query="Tell me about speed",
    model_name="DatologyAI/luxical-one",
    embedding_column="embedding",
    k=3,
)
print(results)

Available Models

Model ID | Description | Embedding Dim
DatologyAI/luxical-one | English web documents, distilled from snowflake-arctic-embed-m-v2.0 | 192

Performance

Luxical embeddings avoid transformer inference entirely. Throughput is up to roughly 100x that of large transformer embedding models (e.g., Qwen3-0.6B) and significantly higher than that of smaller models such as MiniLM-L6-v2, particularly on CPU.

For benchmarks and methodology, see the Luxical technical report.

API Reference

Functions

register_model(model_name: str, providers: list[str] | None = None) -> None

Register/load a Luxical model into the global registry. If already loaded, this is a no-op.

  • model_name: HuggingFace model ID (e.g., "DatologyAI/luxical-one") or local path.
  • providers: Ignored (kept for API compatibility).

embed_text(expr, *, model_id: str | None = None) -> pl.Expr

Embed text using a Luxical model.

  • expr: Column expression containing text to embed.
  • model_id: Model name/ID. If None, uses the default model.

clear_registry() -> None

Clear all loaded models from the registry (frees memory).

list_models() -> list[str]

Return a list of currently loaded model names.
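The registry behavior described above (register once, list, clear) can be pictured with a minimal in-memory sketch. `ToyModelRegistry` and `fake_loader` are hypothetical names used for illustration; the plugin's registry lives in Rust and its internals are not shown here.

```python
class ToyModelRegistry:
    """Hypothetical in-memory registry keyed by model name."""

    def __init__(self, loader):
        self._loader = loader  # callable that turns a name into a model
        self._models = {}

    def register(self, model_name):
        # Loading is skipped when the name is already present (a no-op).
        if model_name not in self._models:
            self._models[model_name] = self._loader(model_name)

    def list_models(self):
        return list(self._models)

    def clear(self):
        # Dropping the references lets the models be garbage-collected.
        self._models.clear()


load_calls = []


def fake_loader(name):
    """Stand-in for downloading and parsing a model file."""
    load_calls.append(name)
    return {"name": name}


registry = ToyModelRegistry(loader=fake_loader)
registry.register("DatologyAI/luxical-one")
registry.register("DatologyAI/luxical-one")  # second call does not reload
```

The practical consequence is that calling `register_model` at the top of every script or notebook cell is cheap after the first load.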

DataFrame Namespace

df.luxical.embed(columns, model_name, output_column="embedding", join_columns=True)

Embed text from specified columns.

df.luxical.retrieve(query, model_name, embedding_column="embedding", k=None, threshold=None, similarity_metric="cosine", add_similarity_column=True)

Retrieve rows most similar to a query.
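As a rough picture of what cosine-based retrieval with `k` and `threshold` involves, here is a standard-library sketch over precomputed vectors. The helper names `cosine` and `top_k` are hypothetical, and applying the threshold before truncating to `k` is an assumption; the plugin may combine the two parameters differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, rows, k=None, threshold=None):
    """Return (similarity, row_id) pairs, most similar first.

    rows is an iterable of (row_id, vector) pairs.
    """
    scored = [(cosine(query_vec, vec), row_id) for row_id, vec in rows]
    if threshold is not None:
        # Assumption: the threshold filter runs before the k cutoff.
        scored = [s for s in scored if s[0] >= threshold]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored if k is None else scored[:k]


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
hits = top_k([1.0, 0.0], rows, k=2)
```

Here `hits` ranks row `"a"` first and row `"c"` second, since `"b"` is orthogonal to the query vector.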

License

Apache 2.0
