
A Python library that embeds and indexes Arrow-based dataframes

Project description

DF Embedder

DF Embedder is a high-performance Python library (with a Rust backend) that embeds and indexes your dataframes and turns them into fast vector stores (based on the Lance format) in a few lines of code. It targets use cases where you have a dataframe of textual data that you want to embed and load into a vector database in order to run vector search. It is opinionated and specifically designed for huge tables that need to be embedded quickly. It is fast and efficient, but it uses its own embedding model and its own method of representing rows as text. Read on for the details.

pip install dfembed

import polars as pl
from dfembed import DfEmbedder

# read a dataset using polars or pandas
df = pl.read_csv("tmdb.csv")
# turn into an arrow table
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")
# embed and index the dataframe to a lance table
embedder.index_table(arrow_table, table_name="films_table")
# run similarity queries
similar_movies = embedder.find_similar("adventures jungle animals", "films_table", 10)

DfEmbedder is still an early version and a work in progress. Feedback and comments will be highly appreciated.

Main Features

  • Rust Backend: For blazing-fast, multi-threaded embedding and indexing.
  • Apache Arrow: To seamlessly work with data from libraries like Polars, Pandas (via PyArrow), etc.
  • Static Embeddings: Uses an efficient static embedding model to generate text embeddings about 100X faster.
  • Lance Format: For optimized storage and fast vector similarity searches.
  • PyO3: To provide a clean and easy-to-use Python API.

How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds on a machine with 10 CPU cores. That is reading, embedding, indexing and writing more than 10K rows per second. There is still room to improve performance by further tuning its parameters.
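Since the figures above come from a single machine, a small timing wrapper can help you repeat the measurement on your own data. The sketch below is not part of dfembed: `rows_per_second` and `measure_throughput` are hypothetical helper names, and the indexing step is passed in as a plain callable so the helper makes no assumptions about the library.

```python
import time
from typing import Callable


def rows_per_second(num_rows: int, elapsed_seconds: float) -> float:
    """Convert a row count and a wall-clock duration into rows per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_rows / elapsed_seconds


def measure_throughput(index_fn: Callable[[], None], num_rows: int) -> float:
    """Time a single call of index_fn and return the observed rows per second."""
    start = time.perf_counter()
    index_fn()
    elapsed = time.perf_counter() - start
    return rows_per_second(num_rows, elapsed)


# The figure quoted above: about 1.2M rows in about 100 seconds.
print(rows_per_second(1_200_000, 100.0))  # 12000.0
```

With dfembed, this could be called as `measure_throughput(lambda: embedder.index_table(arrow_table, table_name="films_table"), arrow_table.num_rows)`.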

How It Works

There are quite a few tabular data embedding methods (e.g. TabNet, TABBIE, etc.). However, many of them assume that the data has a specific structure and type, whereas RAG use cases involve embedding unstructured free-text queries into the same vector space. To resolve this, I tried to follow an approach similar to the one taken by Koloski et al., with some minor changes in order to be agnostic to the field type. Accordingly, indexing a dataframe with DfEmbedder starts by representing each row in the dataframe as a string that follows the format: col0_name is col0_value; col1_name is col1_value (Koloski et al. were working with a known schema and thus offered a more "typed" approach). Next, all strings are embedded using a static embedding model (an embedding method that can generate embeddings on CPU very quickly with very little loss of quality). Finally, the data is written as a table in Lance format.
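The row-to-string step can be illustrated in a few lines of plain Python. This is only a sketch of the format described above, not the library's Rust implementation: `row_to_text` is a hypothetical name, the sample rows are made up for illustration, and how dfembed actually handles nulls or nested values is not covered here.

```python
from typing import Any, Mapping


def row_to_text(row: Mapping[str, Any]) -> str:
    """Render one row as 'col0_name is col0_value; col1_name is col1_value'."""
    return "; ".join(f"{name} is {value}" for name, value in row.items())


# Illustrative rows only; any column names and value types work the same way.
rows = [
    {"title": "Jumanji", "year": 1995, "genre": "Adventure"},
    {"title": "The Jungle Book", "year": 1967, "genre": "Animation"},
]
for row in rows:
    print(row_to_text(row))
# title is Jumanji; year is 1995; genre is Adventure
# title is The Jungle Book; year is 1967; genre is Animation
```

Because every value is simply formatted as text, the same function works regardless of the column type, which is the "agnostic to the field type" property mentioned above.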

There are several ways to search and query Lance tables created using DfEmbedder:

  1. You can use DfEmbedder's find_similar method
  2. You can use LanceDB
import lancedb
db = lancedb.connect("tmdb_db")
tbl = db.open_table("films_table")
# you need the embedder to embed the query text
text = "adventures jungle animals"
vector = embedder.embed_string(text)
# run a vector search and return the 10 nearest rows
tbl.search(vector).limit(10).to_list()
  3. You can use its LlamaIndex VectorStore interface
from llama_index.core import MockEmbedding, Settings, VectorStoreIndex
from dfembed import DfEmbedder, DfEmbedVectorStore

# DfEmbedder embeds with its own model, so LlamaIndex only needs a placeholder
Settings.embed_model = MockEmbedding(embed_dim=1024)
vector_store = DfEmbedVectorStore(
    df_embedder=embedder,
    table_name=table_name
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# llm is any LlamaIndex LLM instance you have configured
query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)

See more usage examples in the notebook here

Usage

Constructor Parameters

The DfEmbedder constructor accepts the following parameters:

  • num_threads (default: CPU count): Number of parallel worker threads used for embedding. Setting this to the number of available CPU cores typically gives the best performance.
  • embedding_chunk_size (default: 500): Number of records to process in each embedding batch. Larger values may improve throughput but require more memory.
  • write_buffer_size (default: 2000): Number of embeddings to buffer before writing to storage. Increasing this reduces the number of write operations, potentially improving performance for large datasets.
  • database_name (default: "./lance_db"): Path to the Lance database directory where tables will be stored.
  • table_name (default: "embeddings"): Default name for tables created in the database. Can be overridden in index_table().
  • vector_dim (default: 1024): Dimensionality of the embedding vectors produced by the static embedder. Please keep the default value in this version.
import polars as pl # could also use Pandas or DuckDB
import pyarrow as pa # not used directly, but Polars needs PyArrow installed for to_arrow()
from dfembed import DfEmbedder

# Load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()
# Configure database path, and optional performance params
embedder = DfEmbedder(
    num_threads=8,              # Use 8 threads for embedding (defaults to the CPU count)
    write_buffer_size=3500,     # Buffer 3500 embeddings before writing
    database_name="tmdb_db",    # Path to the Lance database directory
)
table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)
# get 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)

Core Methods

  • index_table(table, table_name=None): Embeds and indexes an Arrow table.

    • table: A PyArrow Table object containing the data to index.
    • table_name: Name for the created Lance table. If None, uses the default name from the constructor.
  • find_similar(query, table_name, k): Performs semantic search for similar items.

    • query: String query to search for.
    • table_name: Name of the Lance table to search in.
    • k: Number of results to return.
    • Returns a list of the k most similar text records.
  • embed_string(text): Directly access the static embedder to encode a single string.

    • text: String to embed.
    • Returns a vector of floats (the embedding).
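Because embed_string returns a plain vector of floats, two embeddings can be compared directly. The helper below is a minimal sketch in pure Python; `cosine_similarity` is a hypothetical name, and cosine similarity is used here only as a common choice, since the distance metric that find_similar applies internally is not stated in this document.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With dfembed, the inputs would be two calls to embed_string, for example `cosine_similarity(embedder.embed_string("jungle adventure"), embedder.embed_string("safari expedition"))`.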

Performance Tips

  • For large datasets, increase write_buffer_size to reduce write operations.
  • Adjust embedding_chunk_size based on your available memory and dataset characteristics.
  • The num_threads parameter should typically match your CPU core count for optimal performance.
  • For production use, consider using a fast SSD for the database storage location.
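To get a feel for how the two size parameters interact with dataset size, the following sketch does the arithmetic. It assumes one embedding batch per embedding_chunk_size rows and one buffer flush per write_buffer_size embeddings, which is a simplification of whatever the Rust backend actually does; `estimate_batches` is a hypothetical helper, not a dfembed function.

```python
import math


def estimate_batches(
    num_rows: int,
    embedding_chunk_size: int = 500,
    write_buffer_size: int = 2000,
) -> dict:
    """Estimate how many embedding batches and buffer flushes a table needs."""
    if embedding_chunk_size <= 0 or write_buffer_size <= 0:
        raise ValueError("sizes must be positive")
    return {
        "embedding_batches": math.ceil(num_rows / embedding_chunk_size),
        "buffer_flushes": math.ceil(num_rows / write_buffer_size),
    }


# Defaults from the constructor, applied to a 1.2M-row table.
print(estimate_batches(1_200_000))
# {'embedding_batches': 2400, 'buffer_flushes': 600}

# A larger write buffer reduces the number of flushes.
print(estimate_batches(1_200_000, write_buffer_size=3500))
# {'embedding_batches': 2400, 'buffer_flushes': 343}
```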

License

MIT

GitHub Actions CI/CD (WIP)

Releasing a New Version

To release a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Commit and push your changes
  3. Create and push a new tag with the format v{version} (e.g., v0.1.2):
    git tag v0.1.2
    git push origin v0.1.2
    
  4. The GitHub Actions workflow will automatically build wheels and publish them to PyPI
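Before tagging, it can be useful to confirm that both files and the tag agree on the version. The script below is a rough sketch rather than part of the project's tooling: `read_version` and `check_release` are hypothetical helpers, and the regular expression simply takes the first line starting with `version =`, which assumes the package version appears before any dependency tables and is declared statically in both files.

```python
import re

# Matches a line such as: version = "0.1.2"
VERSION_LINE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def read_version(toml_text: str) -> str:
    """Return the first top-of-line version value found in a TOML document."""
    match = VERSION_LINE.search(toml_text)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


def check_release(cargo_toml: str, pyproject_toml: str, tag: str) -> bool:
    """True when both files share a version and the tag is v{version}."""
    cargo_version = read_version(cargo_toml)
    pyproject_version = read_version(pyproject_toml)
    return cargo_version == pyproject_version and tag == f"v{cargo_version}"


# Inline sample contents; in practice, read Cargo.toml and pyproject.toml from disk.
cargo = '[package]\nname = "dfembed"\nversion = "0.1.2"\n'
pyproject = '[project]\nname = "dfembed"\nversion = "0.1.2"\n'
print(check_release(cargo, pyproject, "v0.1.2"))  # True
print(check_release(cargo, pyproject, "v0.1.3"))  # False
```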

Manual Builds

You can also manually trigger the build workflow from the GitHub Actions tab in your repository.

