A Python library that embeds and indexes Arrow-based dataframes

DF Embedder is a fast Python library with a Rust backend that embeds and indexes your dataframes and turns them into vector stores in Lance format, in a few lines of code. It is intended for cases where you have a dataframe with textual data that you want to embed and load into a vector database in order to run vector search. It is opinionated and built for huge tables that need to be embedded quickly: it is fast and efficient, but it uses its own embedding model and its own textual representation of rows. Read on for the details.

(Requires Python >= 3.10)

pip install dfembed

from dfembed import DfEmbedder
import polars as pl

# read a dataset using polars or pandas
df = pl.read_csv("tmdb.csv")
# turn it into an Arrow table
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")
# embed and index the dataframe to a lance table
embedder.index_table(arrow_table, table_name="films_table")
# run similarity queries
similar_movies = embedder.find_similar("adventures jungle animals", "films_table", 10)

See more usage examples in the notebook here

DfEmbedder is still an early version and a work in progress. Feedback and comments will be highly appreciated.

Main Features

  • Rust Backend: For blazing-fast, multi-threaded embedding and indexing.
  • Apache Arrow: To seamlessly work with data from libraries like Polars, Pandas (via PyArrow), etc.
  • Static Embeddings: Uses an efficient static embedding model to generate text embeddings 100X faster.
  • Lance Format: For optimized storage and fast vector similarity searches.
  • PyO3: To provide a clean and easy-to-use Python API.

How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds on a machine with 10 CPU cores. That's reading, embedding, indexing and writing more than 10K rows per second. There is still room to improve performance by further tuning its parameters.
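If you want to run your own measurement, a minimal sketch along these lines may help. The helper names `rows_per_second` and `time_call` are hypothetical and not part of dfembed; they only wrap the standard library timer and the arithmetic behind the figure quoted above.

```python
import time
from typing import Callable


def rows_per_second(n_rows: int, seconds: float) -> float:
    """Throughput in rows per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_rows / seconds


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds spent running fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# The figure quoted above: about 1.2M rows in about 100 seconds.
print(rows_per_second(1_200_000, 100))  # 12000.0
```

To time a real run, you could pass the indexing call as a lambda, for example `time_call(lambda: embedder.index_table(arrow_table, table_name="films_table"))`, and divide the row count of your table by the result.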

How It Works

There are quite a few tabular data embedding methods (e.g. TabNet, TABBIE). However, many of them assume that the data has a specific structure and type, whereas RAG use cases involve embedding unstructured free-text queries into the same vector space. To resolve this, I followed an approach similar to the one taken by Koloski et al., with a minor change to make it agnostic to the field type. Accordingly, indexing a dataframe with DfEmbedder starts by representing each row as a string in the format: col0_name is col0_value; col1_name is col1_value (Koloski et al. were working with a known schema and thus offered a more "typed" approach). Next, all strings are embedded using a static embedding model (an embedding method that can generate embeddings on CPU very quickly, with very little loss of quality). Finally, the data is written as a table in Lance format.
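The row format described above can be sketched in a few lines of plain Python. This is only an illustration of the string layout: `row_to_text` is a hypothetical name, the real conversion happens in the Rust backend, and how it treats nulls or nested values is not covered here.

```python
from typing import Any, Mapping


def row_to_text(row: Mapping[str, Any]) -> str:
    """Render one row as 'col0_name is col0_value; col1_name is col1_value'."""
    return "; ".join(f"{name} is {value}" for name, value in row.items())


# A made-up row, used only to show the resulting string.
row = {"title": "Jumanji", "year": 1995, "genre": "Adventure"}
print(row_to_text(row))
# title is Jumanji; year is 1995; genre is Adventure
```

Because every field is written out as plain text, the same function works for numeric, string, or date columns without needing to know the schema.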

There are several ways to search and query Lance tables created using DfEmbedder:

  1. You can use DfEmbedder's find_similar method
  2. You can use LanceDB
import lancedb
db = lancedb.connect("tmdb_db")
tbl = db.open_table("films_table")
# you need the embedder (the DfEmbedder instance created earlier) to embed a query
text = "adventures jungle animals"
vector = embedder.embed_string(text)
# run a vector search
tbl.search(vector).limit(10).to_list()
  3. You can use its LlamaIndex VectorStore interface
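from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.embeddings import MockEmbedding
from dfembed import DfEmbedder, DfEmbedVectorStore

# DfEmbedder uses its own embedding model, so LlamaIndex only needs a placeholder
Settings.embed_model = MockEmbedding(embed_dim=1024)
# embedder and table_name are the DfEmbedder instance and the table indexed earlier
vector_store = DfEmbedVectorStore(
    df_embedder=embedder,
    table_name=table_name
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# llm is whichever LlamaIndex LLM you have configured
query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)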

See more usage examples in the notebook here
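All three options come down to the same idea: embed the query, then rank the stored vectors by how close they are to it. The sketch below shows that idea as a brute-force search in plain Python. It assumes cosine similarity and tiny made-up 2-dimensional vectors purely for illustration; the distance metric and indexing that Lance actually uses may differ, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored row against the query and keep the k best."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up texts and vectors; real embeddings have many more dimensions.
rows = [
    ("jungle adventure", [1.0, 0.0]),
    ("courtroom drama", [0.0, 1.0]),
    ("animal safari", [0.8, 0.6]),
]
print([text for text, _ in top_k([1.0, 0.0], rows, k=2)])
# ['jungle adventure', 'animal safari']
```

A brute-force scan like this is linear in the number of rows, which is why a storage format with vector search support matters for large tables.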

Usage

Constructor Parameters

The DfEmbedder constructor accepts the following parameters:

  • num_threads (default: CPU count): Number of parallel worker threads used for embedding. Setting this to the number of available CPU cores typically gives the best performance.
  • embedding_chunk_size (default: 500): Number of records to process in each embedding batch. Larger values may improve throughput but require more memory.
  • write_buffer_size (default: 2000): Number of embeddings to buffer before writing to storage. Increasing this reduces the number of write operations, potentially improving performance for large datasets.
  • database_name (default: "./lance_db"): Path to the Lance database directory where tables will be stored.
  • table_name (default: "embeddings"): Default name for tables created in the database. Can be overridden in index_table().
  • vector_dim (default: 1024): Dimensionality of the embedding vectors produced by the static embedder. Please keep the default in this version.

import polars as pl # could also use Pandas or DuckDB
import pyarrow as pa # not used directly, but Polars' to_arrow() needs pyarrow installed
from dfembed import DfEmbedder

# Load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()
# Configure database path, and optional performance params
embedder = DfEmbedder(
    num_threads=8,              # Use 8 embedding threads (defaults to the CPU count)
    write_buffer_size=3500,     # Buffer 3500 embeddings before writing
    database_name="tmdb_db",    # Path to the Lance database directory
)
table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)
# get 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)

Core Methods

  • index_table(table, table_name=None): Embeds and indexes an Arrow table.

    • table: A PyArrow Table object containing the data to index.
    • table_name: Name for the created Lance table. If None, uses the default name from the constructor.
  • find_similar(query, table_name, k): Performs semantic search for similar items.

    • query: String query to search for.
    • table_name: Name of the Lance table to search in.
    • k: Number of results to return.
    • Returns a list of the k most similar text records.
  • embed_string(text): Directly access the static embedder to encode a single string.

    • text: String to embed.
    • Returns a vector of floats (the embedding).

Performance Tips

  • For large datasets, increase write_buffer_size to reduce write operations.
  • Adjust embedding_chunk_size based on your available memory and dataset characteristics.
  • The num_threads parameter should typically match your CPU core count for optimal performance.
  • For production use, consider using a fast SSD for the database storage location.
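To get a feel for what embedding_chunk_size and write_buffer_size mean for a given table, the rough arithmetic can be sketched as below. This assumes one embedding batch per embedding_chunk_size rows and one write per write_buffer_size embeddings, plus a final partial batch or write; the actual scheduling inside the Rust backend is not confirmed here, and `batch_counts` is a hypothetical helper.

```python
from typing import Tuple


def ceil_div(n: int, size: int) -> int:
    """Integer ceiling of n / size."""
    return (n + size - 1) // size


def batch_counts(
    n_rows: int,
    embedding_chunk_size: int = 500,
    write_buffer_size: int = 2000,
) -> Tuple[int, int]:
    """Return (embedding batches, write operations) under the stated assumption."""
    if n_rows < 0 or embedding_chunk_size <= 0 or write_buffer_size <= 0:
        raise ValueError("n_rows must be >= 0 and both sizes must be positive")
    return ceil_div(n_rows, embedding_chunk_size), ceil_div(n_rows, write_buffer_size)


# Defaults from the constructor, on a table of about 1.2M rows.
print(batch_counts(1_200_000))                          # (2400, 600)
# A larger write buffer means fewer writes for the same table.
print(batch_counts(1_200_000, write_buffer_size=3500))  # (2400, 343)
```

Under this assumption, raising write_buffer_size from 2000 to 3500 cuts the number of writes from 600 to 343, at the cost of holding more embeddings in memory between writes.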

License

MIT

GitHub Actions CI/CD (WIP)

Releasing a New Version

To release a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Commit and push your changes
  3. Create and push a new tag with the format v{version} (e.g., v0.1.2):
    git tag v0.1.2
    git push origin v0.1.2
    
  4. The GitHub Actions workflow will automatically build wheels and publish them to PyPI
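Since the version has to be updated in two files and match the tag, a small pre-tag check can catch mismatches. The sketch below is an assumption-laden illustration, not part of the project: it reads the first line of the form `version = "..."` from each file's text with a regular expression rather than a full TOML parser, and the sample file contents are made up.

```python
import re
from typing import Optional


def read_version(toml_text: str) -> Optional[str]:
    """Return the first `version = "..."` value found at the start of a line."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', toml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def release_is_consistent(cargo_toml: str, pyproject_toml: str, tag: str) -> bool:
    """True when both files agree on the version and the tag is v{version}."""
    cargo_version = read_version(cargo_toml)
    py_version = read_version(pyproject_toml)
    return (
        cargo_version is not None
        and cargo_version == py_version
        and tag == f"v{cargo_version}"
    )


# Made-up file contents for illustration only.
cargo = '[package]\nname = "dfembed"\nversion = "0.1.2"\n'
pyproject = '[project]\nname = "dfembed"\nversion = "0.1.2"\n'
print(release_is_consistent(cargo, pyproject, "v0.1.2"))  # True
print(release_is_consistent(cargo, pyproject, "v0.1.3"))  # False
```

In practice you would read the two files from disk and pass their contents in before running git tag.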

Manual Builds

You can also manually trigger the build workflow from the GitHub Actions tab in your repository.
