Distributed Dataframes for Multimodal Data

Project description

Daft dataframes can load any data, such as PDF documents, images, protobufs, CSV, Parquet, and audio files, into a tabular dataframe structure for easy querying.

Website | Docs | Installation | Daft Quickstart | Community and Support

Daft: Unified Engine for Data Analytics, Engineering & ML/AI

Daft is a distributed query engine for large-scale data processing using Python or SQL, implemented in Rust.

  • Familiar interactive API: Lazy Python Dataframe for rapid and interactive iteration, or SQL for analytical queries

  • Focus on the what: Powerful Query Optimizer that rewrites queries to be as efficient as possible

  • Data Catalog integrations: Full integration with data catalogs such as Apache Iceberg

  • Rich multimodal type-system: Supports multimodal types such as Images, URLs, Tensors and more

  • Seamless Interchange: Built on the Apache Arrow In-Memory Format

  • Built for the cloud: Record-setting I/O performance for integrations with S3 cloud storage
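
As a quick taste of the interactive DataFrame and SQL interfaces above, here is a minimal sketch that runs the same query through the lazy DataFrame API and through daft.sql; the data and column names are made up for illustration, and exact aggregation spellings may differ slightly between Daft versions:

import daft

# A small in-memory dataframe; the query plan stays lazy until show()/collect()
df = daft.from_pydict({"city": ["SF", "NYC", "SF"], "fare": [12.0, 27.5, 9.0]})

# Lazy DataFrame API: filter, then aggregate per city
per_city = df.where(df["fare"] > 10).groupby("city").agg(daft.col("fare").mean())

# The same query written in SQL against the in-scope dataframe `df`
per_city_sql = daft.sql("SELECT city, AVG(fare) AS avg_fare FROM df WHERE fare > 10 GROUP BY city")

per_city.show()
per_city_sql.show()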

About Daft

Daft was designed with the following principles in mind:

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft.

  2. Interactive Computing: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching and query optimizations accelerate your experimentation and data exploration.

  3. Distributed Computing: Some workloads can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
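
For the distributed case, switching from the default local runner to Ray is a one-line change. The sketch below assumes a reachable Ray cluster; the address shown is a placeholder, and omitting it lets Daft start or attach to a local Ray instance instead:

import daft

# Tell Daft to execute dataframes on Ray rather than the local runner.
# "ray://head-node:10001" is a placeholder cluster address.
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
df.show(3)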

Getting Started

Installation

Install Daft with pip install daft.

For more advanced installations (e.g. installing from source, or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.

Quickstart

Check out our quickstart!

In this example, we load images from URLs in an AWS S3 bucket and resize each image in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image to 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)

Dataframe code to load a folder of images from AWS S3 and create thumbnails
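
Continuing that example, the thumbnails can be re-encoded to bytes and the whole table written back out. This is only a sketch: the output location is a placeholder, and the encode/write calls are assumed to behave as in current Daft releases.

# Re-encode the 32x32 thumbnails as PNG bytes, then persist the table as
# Parquet. "s3://my-bucket/thumbnails/" is a placeholder destination.
df = df.with_column("thumbnail_png", df["resized"].image.encode("PNG"))
df.write_parquet("s3://my-bucket/thumbnails/")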

Benchmarks

Benchmark results for TPC-H at scale factor 100 (SF100)

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • Daft Quickstart - learn more about Daft's full range of capabilities, including data loading from URLs, joins, user-defined functions (UDFs), groupby, aggregations and more (a minimal UDF sketch follows this list)

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft
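
The user-defined functions mentioned above wrap ordinary Python so it can run over whole columns. A minimal sketch, assuming the batch arrives as a daft.Series and that returning a plain Python list is one accepted return form:

import daft

@daft.udf(return_dtype=daft.DataType.int64())
def text_length(texts: daft.Series):
    # Called once per batch; compute a value for every row in the column
    return [len(t) for t in texts.to_pylist()]

df = daft.from_pydict({"text": ["hello", "daft", "multimodal"]})
df = df.with_column("n_chars", text_length(df["text"]))
df.show()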

Contributing

To start contributing to Daft, please read CONTRIBUTING.md

Here’s a list of good first issues to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry

To help improve Daft, we collect non-identifiable data via our own analytics as well as Scarf (https://scarf.sh).

To disable this behavior, set the following environment variables:

  • DAFT_ANALYTICS_ENABLED=0

  • SCARF_NO_ANALYTICS=true or DO_NOT_TRACK=true
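
These can also be set from Python before Daft is imported; a minimal sketch (exporting the same variables in your shell works just as well):

import os

# Opt out of Daft's analytics and Scarf download analytics. Set these
# before `import daft`, since the telemetry session ID is generated at import.
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"
os.environ["SCARF_NO_ANALYTICS"] = "true"  # or: os.environ["DO_NOT_TRACK"] = "true"

import daft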

The data that we collect is:

  1. Non-identifiable: Events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: We do not collect any of our users’ proprietary code or data

  3. For development only: We do not buy or sell any user data

Please see our documentation for more details.

License

Daft has an Apache 2.0 license - please see the LICENSE file.

Download files

Download the file for your platform.

Source Distribution

  • getdaft-0.4.18.tar.gz (4.9 MB) - Source

Built Distributions

  • getdaft-0.4.18-cp39-abi3-win_amd64.whl (39.9 MB) - CPython 3.9+, Windows x86-64

  • getdaft-0.4.18-cp39-abi3-manylinux_2_24_x86_64.whl (43.0 MB) - CPython 3.9+, manylinux: glibc 2.24+ x86-64

  • getdaft-0.4.18-cp39-abi3-manylinux_2_24_aarch64.whl (40.9 MB) - CPython 3.9+, manylinux: glibc 2.24+ ARM64

  • getdaft-0.4.18-cp39-abi3-macosx_11_0_arm64.whl (37.5 MB) - CPython 3.9+, macOS 11.0+ ARM64

  • getdaft-0.4.18-cp39-abi3-macosx_10_12_x86_64.whl (40.7 MB) - CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file getdaft-0.4.18.tar.gz.

File metadata

  • Download URL: getdaft-0.4.18.tar.gz
  • Upload date:
  • Size: 4.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for getdaft-0.4.18.tar.gz:

  • SHA256: ff10119147a28cfaad949f1599d9e8317069d07f83098c8d40744c4f10b51398
  • MD5: 6879d904a10fd488418f04dddc25b264
  • BLAKE2b-256: 8158b9f821d25a174431622b1450f40ddeb1c79303f5d4377cbc04a578d39ae7

Provenance

The following attestation bundles were made for getdaft-0.4.18.tar.gz:

Publisher: publish-pypi.yml on Eventual-Inc/Daft

File details

Details for the file getdaft-0.4.18-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: getdaft-0.4.18-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 39.9 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for getdaft-0.4.18-cp39-abi3-win_amd64.whl:

  • SHA256: def486c781c414b241a7843eb8c54c1ba105795cb14283d5986e395bd0510abc
  • MD5: 076cfebb0cf6f177a6e385c2c58c7b43
  • BLAKE2b-256: a41e8226f8455697dda1bf6d062f8899aa057f0161dad2ed155d3480ff7e6d67

Provenance

The following attestation bundles were made for getdaft-0.4.18-cp39-abi3-win_amd64.whl:

Publisher: publish-pypi.yml on Eventual-Inc/Daft

File details

Details for the file getdaft-0.4.18-cp39-abi3-manylinux_2_24_x86_64.whl.

File hashes

Hashes for getdaft-0.4.18-cp39-abi3-manylinux_2_24_x86_64.whl:

  • SHA256: 4e5ad35750731499a39c6056620e09ef7f310e1364a7c9d6310341e5f7478ba2
  • MD5: 03b789570e5f8b52952606fcc51b3a45
  • BLAKE2b-256: f51422573e87ad2fca6489b69315a7a9e66085783ab483256dafc6b652f43c48

Provenance

The following attestation bundles were made for getdaft-0.4.18-cp39-abi3-manylinux_2_24_x86_64.whl:

Publisher: publish-pypi.yml on Eventual-Inc/Daft

File details

Details for the file getdaft-0.4.18-cp39-abi3-manylinux_2_24_aarch64.whl.

File hashes

Hashes for getdaft-0.4.18-cp39-abi3-manylinux_2_24_aarch64.whl:

  • SHA256: 626f644af9f16ea77accf80177151fa6817313d596633e46d893ca2b232b1856
  • MD5: 74164e127d6a03b18625efaf644d7ffc
  • BLAKE2b-256: 51b40ba53025cc239db32d895eae8b81a69f44d6ee729b2fb2e71aa37bf616d5

Provenance

The following attestation bundles were made for getdaft-0.4.18-cp39-abi3-manylinux_2_24_aarch64.whl:

Publisher: publish-pypi.yml on Eventual-Inc/Daft

File details

Details for the file getdaft-0.4.18-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.4.18-cp39-abi3-macosx_11_0_arm64.whl:

  • SHA256: d2ae3a7271eaa6f93e830bea80ea7ded0d2c3814b97b4b615af41e6aac91268b
  • MD5: 4108b415a59ec0c29883706d7bbb6442
  • BLAKE2b-256: 032b42192ff380fc72aa6afcdd33f057c5b7817e1ae7972b2e453759437cb1f7

Provenance

The following attestation bundles were made for getdaft-0.4.18-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: publish-pypi.yml on Eventual-Inc/Daft

File details

Details for the file getdaft-0.4.18-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for getdaft-0.4.18-cp39-abi3-macosx_10_12_x86_64.whl:

  • SHA256: ef7fd708e230ef7a80afe0b91d2ecc03d5ff3ec78826c5f39ff5a8fb15517ed2
  • MD5: 1e8319792559da19ebe7e0f87968e172
  • BLAKE2b-256: ce36b6f1f52e03045a4145cfb45c06a055601608d57d18a377c24a456e176830

Provenance

The following attestation bundles were made for getdaft-0.4.18-cp39-abi3-macosx_10_12_x86_64.whl:

Publisher: publish-pypi.yml on Eventual-Inc/Daft
