A Python library that embeds and indexes Arrow-based dataframes

DF Embedder is a blazing-fast Python library (with a Rust backend) that embeds and indexes your dataframes and turns them into fast vector stores (based on the [Lance format](https://github.com/lancedb/lance)) in a few lines of code. It is aimed at use cases in which you have a dataframe with textual data that you want to embed and load into a vector database in order to run vector search. It is opinionated and specifically designed for huge tables that need to be embedded fast. It is fast and efficient, but it uses its own embedding model and its own textual representation of rows. Read on for the details.
pip install dfembed
from dfembed import DfEmbedder
import polars as pl

# read a dataset using polars or pandas
df = pl.read_csv("tmdb.csv")
# turn into an arrow dataset
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")
# embed and index the dataframe to a lance table
embedder.index_table(arrow_table, table_name="films_table")
# run similarity queries
similar_movies = embedder.find_similar("adventures jungle animals", "films_table", 10)

See more usage examples in the notebook here

DfEmbedder is still an early version and a work in progress. Feedback and comments will be highly appreciated.

Main Features

  • Rust Backend: For blazing-fast, multi-threaded embedding and indexing.
  • Apache Arrow: To seamlessly work with data from libraries like Polars, Pandas (via PyArrow), etc.
  • Static Embeddings: Uses an efficient static embedding model to generate text embeddings 100X faster.
  • Lance Format: For optimized storage and fast vector similarity searches.
  • PyO3: To provide a clean and easy-to-use Python API.

How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds on a machine with 10 CPU cores. That is reading, embedding, indexing and writing more than 10K rows per second. There is still room to improve performance by further tuning its parameters.
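To make the throughput figure above easy to reproduce on your own data, here is a tiny helper that turns a row count and an elapsed time into rows per second. The function name is mine and is not part of dfembed; the numbers are the ones quoted above.

```python
def rows_per_second(n_rows: int, elapsed_seconds: float) -> float:
    """Return the average number of rows processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_rows / elapsed_seconds


# About 1.2M rows in about 100 seconds, as reported above
print(rows_per_second(1_200_000, 100))  # 12000.0
```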

How It Works

There are quite a few tabular data embedding methods (e.g. TabNet, TABBIE, etc.). However, many of them assume that the data has a specific structure and type, whereas RAG use cases involve embedding unstructured free-text queries into the same vector space. To resolve this, I tried to follow an approach similar to the one taken by Koloski et al., with a minor change that makes it agnostic to the field type. Accordingly, indexing a dataframe with DfEmbedder starts by representing each row as a string in the format: col0_name is col0_value; col1_name is col1_value (Koloski et al. were working with a known schema and thus offered a more "typed" approach). Next, all strings are embedded using a static embedding model (an embedding method that can generate embeddings on CPU at very high speed with very little loss of quality). Finally, the data is written as a table in Lance format.
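As a rough illustration of the row-to-string step described above, here is a minimal pure-Python sketch. It is not the Rust code that DfEmbedder runs; the function name is hypothetical, and how the library renders nulls, nested values or special characters is not covered here.

```python
from typing import Any, Mapping


def row_to_text(row: Mapping[str, Any]) -> str:
    """Render one row as 'col0_name is col0_value; col1_name is col1_value'."""
    return "; ".join(f"{name} is {value}" for name, value in row.items())


# A made-up row, only for illustration
row = {"title": "Jumanji", "genre": "Adventure", "year": 1995}
print(row_to_text(row))  # title is Jumanji; genre is Adventure; year is 1995
```

Because every field is rendered the same way regardless of its type, a free-text query and a table row end up as plain strings that the same embedding model can handle.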

There are several ways to search and query Lance tables created using DfEmbedder:

  1. You can use DfEmbedder's find_similar method
  2. You can use LanceDB
import lancedb
db = lancedb.connect("tmdb_db")
tbl = db.open_table("films_table")
# you need the embedder to embed the query text
query = "adventures jungle animals"
vector = embedder.embed_string(query)
# run a vector search
tbl.search(vector).limit(10).to_list()
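  3. You can use its LlamaIndex VectorStore interface
# import path assumes the llama-index >= 0.10 package layout
from llama_index.core import MockEmbedding, Settings, VectorStoreIndex
from dfembed import DfEmbedder, DfEmbedVectorStore

# DfEmbedder embeds with its own model, so LlamaIndex only gets a placeholder
Settings.embed_model = MockEmbedding(embed_dim=1024)
vector_store = DfEmbedVectorStore(
    df_embedder=embedder,   # the DfEmbedder instance used for indexing
    table_name=table_name   # name of an already indexed Lance table
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# llm is whichever LlamaIndex LLM object you have configured
query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)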

Usage

Constructor Parameters

The DfEmbedder constructor accepts the following parameters:

  • num_threads (default: CPU count): Number of parallel worker threads used for embedding. Setting this to the number of available CPU cores typically gives the best performance.
  • embedding_chunk_size (default: 500): Number of records to process in each embedding batch. Larger values may improve throughput but require more memory.
  • write_buffer_size (default: 2000): Number of embeddings to buffer before writing to storage. Increasing this reduces the number of write operations, potentially improving performance for large datasets.
  • database_name (default: "./lance_db"): Path to the Lance database directory where tables will be stored.
  • table_name (default: "embeddings"): Default name for tables created in the database. Can be overridden in index_table().
  • vector_dim (default: 1024): Dimensionality of the embedding vectors produced by the static embedder. Please keep the default value in this version.
import polars as pl # could also use Pandas or DuckDB
import pyarrow as pa # Although not directly used, good practice to import
from dfembed import DfEmbedder

# Load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()
# Configure database path, and optional performance params
embedder = DfEmbedder(
    num_threads=8,              # use 8 embedding threads (defaults to the number of available cores)
    write_buffer_size=3500,     # buffer 3500 embeddings before writing
    database_name="tmdb_db",    # path to the Lance database directory
)
table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)
# get 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)
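To give an intuition for how embedding_chunk_size and write_buffer_size interact, here is a simplified, single-threaded sketch of a "embed in chunks, write in buffered batches" loop. It is only a mental model under my own assumptions, not DfEmbedder's actual Rust pipeline; `chunked`, `embed_and_write` and the two callables are hypothetical names.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_and_write(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    write: Callable[[List[List[float]]], None],
    embedding_chunk_size: int = 500,
    write_buffer_size: int = 2000,
) -> int:
    """Embed texts chunk by chunk and flush the buffer once it is full.

    Returns the number of write calls that were made.
    """
    buffer: List[List[float]] = []
    writes = 0
    for chunk in chunked(texts, embedding_chunk_size):
        buffer.extend(embed_batch(chunk))
        if len(buffer) >= write_buffer_size:
            write(buffer)
            writes += 1
            buffer = []
    if buffer:  # flush whatever is left at the end
        write(buffer)
        writes += 1
    return writes
```

In this model a larger write buffer means fewer write calls for the same number of rows, at the cost of holding more embeddings in memory at once.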

Core Methods

  • index_table(table, table_name=None): Embeds and indexes an Arrow table.

    • table: A PyArrow Table object containing the data to index.
    • table_name: Name for the created Lance table. If None, uses the default name from the constructor.
  • find_similar(query, table_name, k): Performs semantic search for similar items.

    • query: String query to search for.
    • table_name: Name of the Lance table to search in.
    • k: Number of results to return.
    • Returns a list of the k most similar text records.
  • embed_string(text): Directly access the static embedder to encode a single string.

    • text: String to embed.
    • Returns a vector of floats (the embedding).
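Since embed_string returns a plain vector of floats, you can compare two embeddings yourself. Cosine similarity is one common choice; whether DfEmbedder's own search uses this exact metric is not stated here, so treat the following as a generic sketch. The helper name is mine.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

With an existing embedder you could call, for example, cosine_similarity(embedder.embed_string("jungle adventure"), embedder.embed_string("space opera")).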

Performance Tips

  • For large datasets, increase write_buffer_size to reduce write operations.
  • Adjust embedding_chunk_size based on your available memory and dataset characteristics.
  • The num_threads parameter should typically match your CPU core count for optimal performance.
  • For production use, consider using a fast SSD for the database storage location.
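If you want to try the tips above systematically, a small timing harness can help. The sketch below is generic: `sweep` is a hypothetical helper of mine that times any callable over a grid of integer parameters and returns the combinations sorted from fastest to slowest.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, List, Tuple


def sweep(
    run: Callable[..., None],
    grid: Dict[str, Iterable[int]],
) -> List[Tuple[Dict[str, int], float]]:
    """Time `run(**params)` for every combination in `grid`, fastest first."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        start = time.perf_counter()
        run(**params)
        results.append((params, time.perf_counter() - start))
    return sorted(results, key=lambda item: item[1])


# Hypothetical usage with DfEmbedder (not executed here; arrow_table must exist):
# def run(embedding_chunk_size, write_buffer_size):
#     embedder = DfEmbedder(
#         database_name="bench_db",
#         embedding_chunk_size=embedding_chunk_size,
#         write_buffer_size=write_buffer_size,
#     )
#     embedder.index_table(arrow_table, table_name="bench")
#
# sweep(run, {"embedding_chunk_size": [250, 500, 1000],
#             "write_buffer_size": [2000, 4000]})
```

A single timing per combination is noisy, so it is probably worth repeating each run a few times, and using a fresh database path per run, before drawing conclusions.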

License

MIT

GitHub Actions CI/CD (WIP)

Releasing a New Version

To release a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Commit and push your changes
  3. Create and push a new tag with the format v{version} (e.g., v0.1.2):
    git tag v0.1.2
    git push origin v0.1.2
    
  4. The GitHub Actions workflow will automatically build wheels and publish them to PyPI
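Before tagging, it can be handy to check that the tag and both version fields agree. The sketch below is not part of the repository; it assumes that each file contains a literal `version = "..."` line (which may not hold if the version in pyproject.toml is declared as dynamic), and the helper names are hypothetical.

```python
import re
from typing import Optional

# Matches the first line of the form: version = "x.y.z"
_VERSION_RE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def extract_version(toml_text: str) -> Optional[str]:
    """Return the first literal version string found, or None."""
    match = _VERSION_RE.search(toml_text)
    return match.group(1) if match else None


def tag_matches(tag: str, cargo_toml: str, pyproject_toml: str) -> bool:
    """True if tag v{version} agrees with the version in both files."""
    expected = tag[1:] if tag.startswith("v") else tag
    return extract_version(cargo_toml) == expected == extract_version(pyproject_toml)
```

For example, read both files with pathlib.Path("Cargo.toml").read_text() and pathlib.Path("pyproject.toml").read_text(), then call tag_matches("v0.1.2", cargo_text, pyproject_text) before running git tag.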

Manual Builds

You can also manually trigger the build workflow from the GitHub Actions tab in your repository.
