Python wrapper for the Lance columnar format

Python bindings for Lance Data Format

Warning: under heavy development

Lance is a new columnar data format for data science and machine learning.

Why you should use Lance

  1. Is an order of magnitude faster than Parquet for point queries and nested data structures common to DS/ML
  2. Comes with a fast vector index that delivers sub-millisecond nearest-neighbor search performance
  3. Is automatically versioned and supports lineage and time travel for full reproducibility
  4. Is already integrated with DuckDB/pandas/Polars, and converts from/to Parquet in 2 lines of code

Quick start

Installation

pip install pylance

Make sure you have a recent version of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).
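One way to check those minimums before running the examples is sketched below. It uses only the standard library; the helper names (version_tuple, meets_minimum, check_environment) are hypothetical and not part of Lance, and the version parsing is deliberately simple (it stops at the first non-numeric component).

```python
from importlib import metadata

# Minimum versions quoted in the installation note.
MINIMUMS = {"pandas": "1.5", "pyarrow": "10.0", "duckdb": "0.7.0"}


def version_tuple(text):
    """Turn '1.5.3' into (1, 5, 3); stop at the first non-numeric piece such as '0rc1'."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, ok)} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, meets_minimum(installed, minimum))
    return report


for name, (installed, ok) in check_environment().items():
    print(name, installed, "OK" if ok else "missing or too old")
```

A package reported as missing or too old can be upgraded with pip before continuing.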

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    # Each .fvecs record is a 4-byte int32 dimension followed by ndims float32 values,
    # so view the file as rows of (ndims + 1) 4-byte words and drop the first column.
    data = np.frombuffer(buf, dtype="<f4").reshape((nvecs, ndims + 1))[:, 1:]
    # Map row id -> vector
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
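The .fvecs files in the TEXMEX corpus store each vector as a 4-byte little-endian integer dimension followed by that many float32 values. The following is a minimal standard-library sketch of that layout, round-tripping two tiny vectors through an in-memory buffer. The helper names write_fvecs and read_fvecs are hypothetical and not part of Lance.

```python
import io
import struct


def write_fvecs(fobj, vectors):
    """Write vectors in .fvecs layout: int32 dimension, then that many float32 values."""
    for vec in vectors:
        fobj.write(struct.pack("<i", len(vec)))
        fobj.write(struct.pack(f"<{len(vec)}f", *vec))


def read_fvecs(fobj):
    """Yield one list of floats per record until the stream is exhausted."""
    while True:
        header = fobj.read(4)
        if not header:
            return
        if len(header) != 4:
            raise ValueError("truncated dimension header")
        (dim,) = struct.unpack("<i", header)
        payload = fobj.read(4 * dim)
        if len(payload) != 4 * dim:
            raise ValueError("truncated vector payload")
        yield list(struct.unpack(f"<{dim}f", payload))


# Round-trip two tiny vectors through an in-memory buffer.
buf = io.BytesIO()
write_fvecs(buf, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
buf.seek(0)
vectors = list(read_fvecs(buf))
print(vectors)
```

Reading record by record like this is slower than a bulk NumPy read, but it makes the per-record dimension header explicit and catches truncated files.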

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
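To get a feel for what these two parameters imply for the SIFT1M data, the arithmetic can be sketched as follows. This assumes 8-bit PQ codes (one byte per sub-vector) and float32 input; the function ivf_pq_footprint and its bits_per_code parameter are hypothetical names used for illustration only, not Lance APIs.

```python
def ivf_pq_footprint(num_rows, ndims, num_partitions, num_sub_vectors, bits_per_code=8):
    """Back-of-the-envelope sizes for an IVF_PQ index (assumes float32 input vectors)."""
    if ndims % num_sub_vectors != 0:
        raise ValueError("ndims must be divisible by num_sub_vectors")
    raw_bytes = ndims * 4  # one float32 per dimension
    code_bytes = num_sub_vectors * bits_per_code // 8  # one PQ code per sub-vector
    return {
        "sub_vector_dims": ndims // num_sub_vectors,
        "raw_bytes_per_vector": raw_bytes,
        "pq_bytes_per_vector": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
        "avg_rows_per_partition": num_rows / num_partitions,
    }


# Parameters from the example: 1M vectors, 128 dims, 256 IVF partitions, 16 PQ sub-vectors.
stats = ivf_pq_footprint(1_000_000, 128, 256, 16)
for key, value in stats.items():
    print(key, value)
```

Under these assumptions each 128-dimensional vector is split into 16 sub-vectors of 8 dimensions, and a query that probes a single partition scans roughly 3,900 rows rather than the full million.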

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]
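Because IVF_PQ search is approximate, it can be useful to compare the returned neighbors with an exact brute-force search on a small sample. The sketch below is pure Python and assumes L2 distance; exact_knn and recall_at_k are hypothetical helpers, and how you extract row ids from the result tables is left to you.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(vectors, query, k):
    """Return the positions of the k vectors closest to query under L2."""
    scored = ((l2_squared(vec, query), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact.intersection(approx_ids)) / len(exact)


# Toy data: four 2-d vectors and one query near the origin.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
truth = exact_knn(vectors, [0.1, 0.0], k=2)

# Pretend an approximate index returned ids 0 and 2.
score = recall_at_k([0, 2], truth)
print(truth, score)
```

Brute force is quadratic in practice, so run it on a few hundred sampled queries rather than the whole dataset.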

More distance metrics, HNSW, and distributed support are on the roadmap.

Python package details

Install from PyPI: pip install pylance # >=0.3.0 is the new Rust-based implementation
Install from source: maturin develop (under the /python directory)
Run unit tests: make test
Run integration tests: make integtest

Import via: import lance

The Python integration is done via pyo3 + custom Python code:

  1. We make wrapper classes in Rust for Dataset/Scanner/RecordBatchReader that are exposed to Python.
  2. These are then used by the LanceDataset / LanceScanner implementations, which extend pyarrow Dataset/Scanner for DuckDB compatibility.
  3. Data is delivered via the Arrow C Data Interface.

Motivation

Why do we need a new format for data science and machine learning?

1. Reproducibility is a must-have

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.
It should also be efficient and not require expensive copying every time you want to create a new version.
We call this "Zero copy versioning" in Lance. It makes versioning data easy without increasing storage costs.

2. Cloud storage is now the default

Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different from those of local storage.
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster with Lance than with Parquet, especially for ML data.

3. Vectors must be a first class citizen, not a separate thing

The majority of reasonable-scale workflows should not require the added complexity and cost of a specialized database just to compute vector similarity. Lance integrates optimized vector indices into a columnar format, so no additional infrastructure is required to get low-latency top-K similarity search.

4. Open standards are a requirement

The DS/ML ecosystem is incredibly rich, and data must be easily accessible across different languages, tools, and environments. Lance makes Apache Arrow integration its primary interface, which means converting to/from Lance takes 2 lines of code, your code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute. We need open source, not fauxpen source.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

pylance-0.19.2-cp39-abi3-win_amd64.whl (28.6 MB)

Uploaded CPython 3.9+ Windows x86-64

pylance-0.19.2-cp39-abi3-manylinux_2_28_x86_64.whl (30.5 MB)

Uploaded CPython 3.9+ manylinux: glibc 2.28+ x86-64

pylance-0.19.2-cp39-abi3-manylinux_2_24_aarch64.whl (29.2 MB)

Uploaded CPython 3.9+ manylinux: glibc 2.24+ ARM64

pylance-0.19.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.6 MB)

Uploaded CPython 3.9+ manylinux: glibc 2.17+ x86-64

pylance-0.19.2-cp39-abi3-macosx_11_0_arm64.whl (26.7 MB)

Uploaded CPython 3.9+ macOS 11.0+ ARM64

pylance-0.19.2-cp39-abi3-macosx_10_15_x86_64.whl (28.8 MB)

Uploaded CPython 3.9+ macOS 10.15+ x86-64
