A Python library using Rust and PyO3 to index Arrow tables

DF Embedder

DF Embedder allows you to effortlessly turn your DataFrames into fast vector stores in 3 lines of code.

df = pl.read_csv("tmdb.csv")
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")

Description

DF Embedder is a high-performance Python library (with a Rust backend) for indexing and embedding Apache Arrow-compatible DataFrames (such as Polars or Pandas) into low-latency vector databases based on Lance files. It builds on:

  • Rust: For blazing-fast, multi-threaded embedding and indexing.
  • Apache Arrow: To seamlessly work with data from libraries like Polars, Pandas (via PyArrow), etc.
  • Static Embeddings: Uses an efficient static embedding model that generates text embeddings 100x faster.
  • Lance Format: For optimized storage and fast vector similarity searches.
  • PyO3: To provide a clean and easy-to-use Python API.

How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds on a machine with 10 CPU cores. That's reading, embedding, indexing, and writing more than 10K rows per second. There is still room to improve performance by further tuning its parameters.
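For readers who want to run their own measurement, here is a minimal timing sketch. The helper name rows_per_second is hypothetical and not part of dfembed; it simply wraps any indexing call with a wall-clock timer. The last line reproduces the arithmetic behind the figure quoted above.

```python
import time


def rows_per_second(index_fn, num_rows):
    """Time a single call to index_fn and return (elapsed_seconds, rows_per_second).

    index_fn is any zero-argument callable that performs the indexing work.
    """
    start = time.perf_counter()
    index_fn()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    rate = num_rows / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


# Arithmetic behind the quoted figure: about 1.2M rows in about 100 seconds.
reported = 1_200_000 / 100  # 12000.0 rows per second
```

With dfembed installed, the callable could be something like `lambda: embedder.index_table(arrow_table, table_name="tmdb_table")` and the row count `arrow_table.num_rows`. Results will vary with hardware, thread count, and buffer size.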

Usage

import polars as pl # could also use Pandas or DuckDB
import pyarrow as pa # Although not directly used, good practice to import
from dfembed import DfEmbedder

# Load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()
# Configure the database path and optional performance parameters
embedder = DfEmbedder(
    num_threads=8,              # Use 8 threads for embedding (defaults to the number of available cores)
    write_buffer_size=3500,     # Buffer 3500 embeddings before writing
    database_name="tmdb_db",    # Path to the Lance database directory
)
table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)
# get 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)

How It Works

  1. The DfEmbedder Python class acts as a user-friendly wrapper.
  2. It initializes and manages an instance of the DfEmbedderRust struct, implemented in Rust.
  3. When index_table is called with a PyArrow Table:
    • The Rust backend receives the Arrow data.
    • It uses a static embedding model (configured internally) to generate vector embeddings for the specified text data, potentially using multiple threads for speed.
    • The embeddings, along with original data, are written efficiently to a Lance dataset within the specified database directory and table name.
  4. When find_similar is called:
    • The query string is embedded using the same static model.
    • The Rust backend uses Lance's optimized search capabilities to find the k nearest neighbors to the query vector within the specified table.
    • The results (e.g., identifiers or relevant data) are returned to Python.
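The flow above can be illustrated with a small pure-Python toy. This is a sketch under loud assumptions, not the library's implementation: the per-token vectors are derived from a hash as a stand-in for a learned static lookup table, the search is brute-force cosine similarity rather than Lance's search, and every function name here is hypothetical.

```python
import hashlib
import math

DIM = 8  # toy dimensionality; chosen only to keep the example small


def token_vector(token):
    """Deterministic stand-in for a static model's per-token lookup table."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def embed(text):
    """Static embedding: average the per-token vectors (no contextual model)."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * DIM
    vectors = [token_vector(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def index_rows(rows):
    """Indexing step: store each row next to its embedding."""
    return [(row, embed(row)) for row in rows]


def find_similar(query, index, k):
    """Query step: embed the query with the same model, return the k closest rows."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [row for row, _ in ranked[:k]]
```

The point of the toy is the shape of the pipeline: the same embedding function is used at index time and at query time, and the search returns the original rows rather than the vectors.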

License

MIT

GitHub Actions CI/CD

This project uses GitHub Actions for continuous integration and deployment, automatically building wheels for multiple platforms and Python versions.

Automated Builds

The CI/CD pipeline automatically:

  1. Builds wheels for:

    • Linux (manylinux2014)
    • macOS (Intel x86_64 and Apple Silicon ARM64)
    • Windows
    • Python versions 3.8, 3.9, 3.10, 3.11, and 3.12
  2. Tests the built wheels on each platform to ensure they work correctly

  3. Publishes to PyPI when a new tag is pushed (format: v*, e.g., v0.1.2)

Workflow Files

  • .github/workflows/build.yml: Builds wheels for all platforms and Python versions
  • .github/workflows/test.yml: Tests the built wheels to ensure they work correctly

Releasing a New Version

To release a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Commit and push your changes
  3. Create and push a new tag with the format v{version} (e.g., v0.1.2):
    git tag v0.1.2
    git push origin v0.1.2
    
  4. The GitHub Actions workflow will automatically build wheels and publish them to PyPI

Manual Builds

You can also manually trigger the build workflow from the GitHub Actions tab in your repository.

Download files

Source Distribution

  • dfembed-0.1.1.tar.gz (1.2 MB): Source

Built Distributions

  • dfembed-0.1.1-cp313-cp313-win_amd64.whl (35.6 MB): CPython 3.13, Windows x86-64
  • dfembed-0.1.1-cp313-cp313-manylinux_2_35_x86_64.whl (44.8 MB): CPython 3.13, manylinux: glibc 2.35+ x86-64
  • dfembed-0.1.1-cp313-cp313-macosx_11_0_arm64.whl (37.0 MB): CPython 3.13, macOS 11.0+ ARM64
  • dfembed-0.1.1-cp313-cp313-macosx_10_12_x86_64.whl (39.2 MB): CPython 3.13, macOS 10.12+ x86-64
  • dfembed-0.1.1-cp312-cp312-win_amd64.whl (35.6 MB): CPython 3.12, Windows x86-64
  • dfembed-0.1.1-cp312-cp312-macosx_11_0_arm64.whl (37.0 MB): CPython 3.12, macOS 11.0+ ARM64
  • dfembed-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl (39.2 MB): CPython 3.12, macOS 10.12+ x86-64
  • dfembed-0.1.1-cp311-cp311-win_amd64.whl (35.6 MB): CPython 3.11, Windows x86-64
  • dfembed-0.1.1-cp311-cp311-macosx_11_0_arm64.whl (37.0 MB): CPython 3.11, macOS 11.0+ ARM64
  • dfembed-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl (39.2 MB): CPython 3.11, macOS 10.12+ x86-64
  • dfembed-0.1.1-cp310-cp310-win_amd64.whl (35.6 MB): CPython 3.10, Windows x86-64
  • dfembed-0.1.1-cp310-cp310-manylinux_2_35_x86_64.whl (44.7 MB): CPython 3.10, manylinux: glibc 2.35+ x86-64
  • dfembed-0.1.1-cp39-cp39-win_amd64.whl (35.6 MB): CPython 3.9, Windows x86-64
  • dfembed-0.1.1-cp38-cp38-win_amd64.whl (35.6 MB): CPython 3.8, Windows x86-64

File details

Details for the file dfembed-0.1.1.tar.gz.

File metadata

  • Download URL: dfembed-0.1.1.tar.gz
  • Upload date:
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1.tar.gz
Algorithm Hash digest
SHA256 86daa09e7427c269dde9d6e420f50d5b548e8fd6c3202ae713aa1cf2d28cc216
MD5 3c043d7d2e683357d45967f7099ef7fc
BLAKE2b-256 5f5c04fb922e1c56904cc911b30fd29a018f3e205a8fbe4a5e39d9313834ba59

See more details on using hashes here.
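A downloaded file can be checked against its published digest with the standard library. The sketch below streams the file through hashlib and compares the result with the SHA256 value listed above for dfembed-0.1.1.tar.gz. The function names and the local file path are illustrative assumptions.

```python
import hashlib

# SHA256 digest listed above for dfembed-0.1.1.tar.gz
EXPECTED_SDIST_SHA256 = (
    "86daa09e7427c269dde9d6e420f50d5b548e8fd6c3202ae713aa1cf2d28cc216"
)


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("dfembed-0.1.1.tar.gz", EXPECTED_SDIST_SHA256)` should return True for an intact download of the source distribution, assuming the file sits in the current directory.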

File details

Details for the file dfembed-0.1.1-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 73e3e3f737b77ff881d8f223c64470c8f777a4e987792f88d239a7e110972382
MD5 1f5af56f465a36a3992f833ff4ac066c
BLAKE2b-256 a107f9b019c9bb9475ebdb7aea37efccc479f260b3695df9d6d18591eab34500

File details

Details for the file dfembed-0.1.1-cp313-cp313-manylinux_2_35_x86_64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp313-cp313-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 4a0a0f4dbc8d7ac13c5133837ae869cec58f6954fbc7a21e92ff8f8aba4e36d0
MD5 a5fdbfe92a810d3cb73f38b4741611c0
BLAKE2b-256 50f9d2b340413009fcc2a5b9cda143b2df3367003aa3c6f96c9384e657edffbe

File details

Details for the file dfembed-0.1.1-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b81c402b20a97a971fe3cdb3afa6bdd00594f8b3a0b21cb8003eb10f20007f91
MD5 fb6829a57270cb59bf59a49af7730409
BLAKE2b-256 c2fe60f2f8a08a84078741a4b50317255ef464771e2d4c0d800c85799e9ae4b2

File details

Details for the file dfembed-0.1.1-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 881f91d784a69a239f57b0ff62fce98642111b97311b7348ee2704f5b1f5c819
MD5 e2db99586ab14093c9b396fc03259473
BLAKE2b-256 ab271c312e7c2ca4e9f8080265da8470f2fd5844263ed2e50553350151f0e1c8

File details

Details for the file dfembed-0.1.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 552b97295a6cbc775c251195f61798dfe87d54295257a6425c600cf5f4c245b3
MD5 ea7a0a3ce1567ab1a6fd1a8974f37194
BLAKE2b-256 5980e49fe91c1094cf0310c9d7f2aebe269b7a84acdf027dcff3156f503391b7

File details

Details for the file dfembed-0.1.1-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fd15a2791ce5b6173c2d0c4abe197ccb23aeabbeb54c765d13884f46249a7b78
MD5 a50fe2a16193b8234a95f4665de0c961
BLAKE2b-256 1373ff31b056e7a0b938da98e846db3685d651a00ed5e399385a78b1175a1e43

File details

Details for the file dfembed-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e80fc26ef515d5583b3aa9920ec116b92b6d8f164702d5c7b50925272ff79c59
MD5 5d27a9372966fab784d87edccc46cbb1
BLAKE2b-256 de201cd2d653d1d60d840b55bd6f0d0f45bb5f121e17f38c5916aa6e867f7f2c

File details

Details for the file dfembed-0.1.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8dc96b23d592a3b41260c021d2bec0db9eba021ea692bade59be577066816087
MD5 b07400ffb8a2c81447599663becd68e8
BLAKE2b-256 add951122168adcc1897e4341e093106bda688eef9676b14dcd13b47e69d2192

File details

Details for the file dfembed-0.1.1-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5aff8303880cdce07307f24c7826c5e4ba083ac110ec0c5c029cec67beb19d74
MD5 39e85c08c06b3b5ab618dc29586f1341
BLAKE2b-256 6f66828669dde58b75bd4873a99134afd2ebb0285a935249c30d3f86a593d4ba

File details

Details for the file dfembed-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 720ec48678e0a87b8ee7318b48693729af8a4e99b6a6dd74366a1f6ec40e6fe6
MD5 8e083b5a75083ff01c604b74bc5f1a36
BLAKE2b-256 c3dc56a496de42390b71fb3231e9db4a381f11fe8a94a1e9585de6d812d42263

File details

Details for the file dfembed-0.1.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 022e3776a9d26818dd488c06df5091cd375969596834d7fd16f33982a258d4ef
MD5 e640efcafa734679f20b2a2925aae743
BLAKE2b-256 35a3b9b7ae8f60c6f3fed0fab9c3b7446f7597ff24a35d47ee81804ffed59edd

File details

Details for the file dfembed-0.1.1-cp310-cp310-manylinux_2_35_x86_64.whl.

File metadata

File hashes

Hashes for dfembed-0.1.1-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 3b086a6a65c926bc45a4a3812643dc1d1d63f33d2f128c2ecfbc64982c624c8b
MD5 b3e564310d67e747e8cae07511076449
BLAKE2b-256 e6aaf64fab45fe35bac1f8af4a1b4fb75c7e70e3a99701991dc93b7fe931b804

File details

Details for the file dfembed-0.1.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 3e6b48dd45d0986f260f401a6b370d4af34735b32b8376db57aa0148522ff6a5
MD5 6fbd5a8fd84e586cc2fc5bf58933b646
BLAKE2b-256 79ac13255dc397d3d6324a71c8f87816655574433906bb5003f49e5fb575e251

File details

Details for the file dfembed-0.1.1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: dfembed-0.1.1-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 35.6 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for dfembed-0.1.1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 794e7f578a00890c24f15aa9cf0fcd761551f60a7dc29b0e8a6ec45bb497fba1
MD5 3609d41d72a663495affaf9c05aaf135
BLAKE2b-256 e9a0d17e45b8a72af90e7a3d066c31c72f5f8ba1ea8620f8d8e3edf65cef97d4
