A Python library that embeds and indexes Arrow-based dataframes

DF Embedder

DF Embedder is a high-performance Python library (with a Rust backend) that embeds and indexes your dataframes and turns them into fast vector stores (based on the Lance format) in a few lines of code.

pip install dfembed

import polars as pl
from dfembed import DfEmbedder

# read a dataset using polars or pandas
df = pl.read_csv("tmdb.csv")
# turn into an Arrow table
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")
# embed and index the dataframe into a Lance table
embedder.index_table(arrow_table, table_name="films_table")
# run similarity queries
similar_movies = embedder.find_similar("adventures jungle animals", "films_table", 10)

DfEmbedder is still an early, work-in-progress version. Feedback and comments are highly appreciated.

Main Features

  • Rust Backend: For blazing-fast, multi-threaded embedding and indexing.
  • Apache Arrow: To seamlessly work with data from libraries like Polars, Pandas (via PyArrow), etc.
  • Static Embeddings: Uses an efficient static embedding model to generate text embeddings about 100X faster.
  • Lance Format: For optimized storage and fast vector similarity searches.
  • PyO3: To provide a clean and easy-to-use Python API.

How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds on a machine with 10 CPU cores. That's reading, embedding, indexing and writing more than 10K rows per second. There is still room to improve performance by further tuning its parameters.
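
If you want to run your own measurement, a small helper like the following may be enough. It is only a sketch: rows_per_second and time_call are hypothetical names that are not part of dfembed, and the last line simply redoes the arithmetic on the figures quoted above.

```python
import time
from typing import Callable


def rows_per_second(num_rows: int, seconds: float) -> float:
    """Average throughput in rows per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_rows / seconds


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Figures quoted above: about 1.2M rows in about 100 seconds.
print(rows_per_second(1_200_000, 100))  # 12000.0
```

To time a real run, you could pass the indexing call as a lambda, for example time_call(lambda: embedder.index_table(arrow_table, table_name="films_table")), and divide the row count of your table by the result.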

How It Works

Indexing a dataframe with DfEmbedder starts by representing each row as a string in the format: col0_name is col0_value; col1_name is col1_value. Next, all strings are embedded using a static embedding model (an embedding method that can generate embeddings on a CPU very quickly with very little loss of quality). Finally, the data is written as a table in Lance format.
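
The row format described above can be illustrated in plain Python. This is only a sketch of the string format: the real conversion happens in the Rust backend, row_to_text is a hypothetical helper, the sample row is made up, and how the library renders nulls or nested values is not covered here.

```python
from typing import Any, Mapping


def row_to_text(row: Mapping[str, Any]) -> str:
    """Render one row as 'col0_name is col0_value; col1_name is col1_value'."""
    return "; ".join(f"{name} is {value}" for name, value in row.items())


# A made-up example row
row = {"title": "Jumanji", "year": 1995, "genre": "Adventure"}
print(row_to_text(row))
# title is Jumanji; year is 1995; genre is Adventure
```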

There are several ways to search and query Lance tables created using DfEmbedder:

  1. You can use DfEmbedder's find_similar method
  2. You can use LanceDB
import lancedb
db = lancedb.connect("tmdb_db")
tbl = db.open_table("films_table")
# you need the embedder to embed a query
text = "adventures jungle animals"
vector = embedder.embed_string(text)
# run a vector search
tbl.search(vector).limit(10).to_list()
  3. You can use its LlamaIndex VectorStore interface
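from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.embeddings import MockEmbedding
from dfembed import DfEmbedder, DfEmbedVectorStore

# use a mock embedding model because DfEmbedder embeds with its own model
Settings.embed_model = MockEmbedding(embed_dim=1024)
vector_store = DfEmbedVectorStore(
    df_embedder=embedder,
    table_name="films_table"
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# llm is any LlamaIndex LLM instance you have configured
query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)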

See more usage examples in the notebook here
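
Conceptually, all three options embed the query and return the rows whose vectors are closest to it. The brute-force sketch below shows that idea on toy 3-dimensional vectors. It assumes Euclidean (L2) distance, which this README does not confirm as the metric DfEmbedder uses, and top_k is a hypothetical helper. The real tables hold 1024-dimensional vectors and are searched by Lance, not by Python loops.

```python
import heapq
import math
from typing import List, Sequence


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k vectors closest to query by L2 distance."""
    distances = ((math.dist(query, vec), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy "table" of four embedded rows
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(top_k([1.0, 0.0, 0.0], vectors, k=2))  # [0, 1]
```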

Usage

Constructor Parameters

The DfEmbedder constructor accepts the following parameters:

  • num_threads (default: CPU count): Number of parallel worker threads used for embedding. Setting this to the number of available CPU cores typically gives the best performance.
  • embedding_chunk_size (default: 500): Number of records to process in each embedding batch. Larger values may improve throughput but require more memory.
  • write_buffer_size (default: 2000): Number of embeddings to buffer before writing to storage. Increasing this reduces the number of write operations, potentially improving performance for large datasets.
  • database_name (default: "./lance_db"): Path to the Lance database directory where tables will be stored.
  • table_name (default: "embeddings"): Default name for tables created in the database. Can be overridden in index_table().
  • vector_dim (default: 1024): Dimensionality of the embedding vectors produced by the static embedder. Keep the default value in this version.
import polars as pl  # could also use Pandas or DuckDB
import pyarrow as pa  # not used directly, but df.to_arrow() needs pyarrow installed
from dfembed import DfEmbedder

# Load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()
# Configure database path, and optional performance params
embedder = DfEmbedder(
    num_threads=8,              # Use 8 threads for embedding (defaults to the number of available cores)
    write_buffer_size=3500,     # Buffer 3500 embeddings before writing
    database_name="tmdb_db",    # Path to the Lance database directory
)
table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)
# get 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)
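
To get a feel for how embedding_chunk_size and write_buffer_size interact, here is a single-threaded model of the batching described in the parameter list. It is a sketch under assumptions: the real pipeline is multi-threaded Rust and its exact flush policy is not documented here, and chunked and count_writes are hypothetical helpers. The model flushes once the buffer holds at least write_buffer_size embeddings.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_writes(num_rows: int, embedding_chunk_size: int, write_buffer_size: int) -> int:
    """Count write operations for a simple buffer-then-flush policy."""
    buffered = 0
    writes = 0
    for chunk in chunked(range(num_rows), embedding_chunk_size):
        buffered += len(chunk)  # stand-in for embedding one chunk
        if buffered >= write_buffer_size:
            writes += 1  # flush the buffer to storage
            buffered = 0
    if buffered:
        writes += 1  # flush whatever is left at the end
    return writes


print(count_writes(10_000, 500, 2000))  # 5
print(count_writes(10_000, 500, 3500))  # 3
```

In this model, raising write_buffer_size from 2000 to 3500 cuts the writes for 10,000 rows from 5 to 3, at the cost of holding more embeddings in memory between flushes.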

Core Methods

  • index_table(table, table_name=None): Embeds and indexes an Arrow table.

    • table: A PyArrow Table object containing the data to index.
    • table_name: Name for the created Lance table. If None, uses the default name from the constructor.
  • find_similar(query, table_name, k): Performs semantic search for similar items.

    • query: String query to search for.
    • table_name: Name of the Lance table to search in.
    • k: Number of results to return.
    • Returns a list of the k most similar text records.
  • embed_string(text): Directly access the static embedder to encode a single string.

    • text: String to embed.
    • Returns a vector of floats (the embedding).

Performance Tips

  • For large datasets, increase write_buffer_size to reduce write operations.
  • Adjust embedding_chunk_size based on your available memory and dataset characteristics.
  • The num_threads parameter should typically match your CPU core count for optimal performance.
  • For production use, consider using a fast SSD for the database storage location.
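
One way to apply these tips is to build the constructor arguments explicitly and tune from there. The sketch below only assembles keyword arguments. suggested_settings is a hypothetical helper, the chunk and buffer values are simply the documented defaults, and good values for your data are something to measure, not something this snippet knows.

```python
import os
from typing import Any, Dict


def suggested_settings(database_name: str = "./lance_db") -> Dict[str, Any]:
    """Keyword arguments for DfEmbedder that follow the tips above."""
    return {
        "num_threads": os.cpu_count() or 1,  # os.cpu_count() can return None
        "embedding_chunk_size": 500,         # documented default; adjust to available memory
        "write_buffer_size": 2000,           # documented default; raise for large datasets
        "database_name": database_name,
    }


settings = suggested_settings("tmdb_db")
print(sorted(settings))
```

The result can then be passed on as DfEmbedder(**settings), after overriding any value you want to experiment with.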

License

MIT

GitHub Actions CI/CD (WIP)

Releasing a New Version

To release a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Commit and push your changes
  3. Create and push a new tag with the format v{version} (e.g., v0.1.2):
    git tag v0.1.2
    git push origin v0.1.2
    
  4. The GitHub Actions workflow will automatically build wheels and publish them to PyPI

Manual Builds

You can also manually trigger the build workflow from the GitHub Actions tab in your repository.
