
Distributed Dataframes for Multimodal Data

Project description

Daft dataframes can load any data, such as PDF documents, images, Protobufs, CSV, Parquet, and audio files, into a tabular dataframe structure for easy querying.


Website | Docs | Installation | 10-minute tour of Daft | Community and Support

Daft: Unified Engine for Data Analytics, Engineering & ML/AI

Daft is a distributed query engine for large-scale data processing using Python or SQL, implemented in Rust.

  • Familiar interactive API: Lazy Python Dataframe for rapid and interactive iteration, or SQL for analytical queries

  • Focus on the what: Powerful Query Optimizer that rewrites queries to be as efficient as possible

  • Data Catalog integrations: Full integration with data catalogs such as Apache Iceberg

  • Rich multimodal type-system: Supports multimodal types such as Images, URLs, Tensors and more

  • Seamless Interchange: Built on the Apache Arrow In-Memory Format

  • Built for the cloud: Record-setting I/O performance for integrations with S3 cloud storage


About Daft

Daft was designed with the following principles in mind:

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft.

  2. Interactive Computing: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching and query optimizations accelerate your experimentation and data exploration.

  3. Distributed Computing: Some workloads can quickly outgrow your local laptop’s computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started

Installation

Install Daft with pip install getdaft.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
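As a sketch, extra dependencies can typically be requested at install time via pip extras; the extras names `ray` and `aws` below are assumptions, so check the Installation Guide for the authoritative list:

```shell
# Base installation
pip install getdaft

# Assumed extras for Ray-based distributed execution and AWS I/O
pip install "getdaft[ray,aws]"
```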

Quickstart

Check out our 10-minute quickstart!

In this example, we load images from an AWS S3 bucket’s URLs and resize each image in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image into 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)


Benchmarks

Benchmarks for TPC-H SF100

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • 10-minute tour of Daft - learn more about Daft’s full range of capabilities including dataloading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more.

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft

Contributing

To start contributing to Daft, please read CONTRIBUTING.md

Here’s a list of good first issues to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
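For example, the variable can be set from Python before Daft is imported (a minimal sketch using only the standard library):

```python
import os

# Opt out of Daft telemetry. This must happen before `import daft`,
# since the session ID is generated when Daft is imported.
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"
```

Setting it in your shell profile (`export DAFT_ANALYTICS_ENABLED=0`) achieves the same effect for all sessions.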

The data that we collect is:

  1. Non-identifiable: events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: we do not collect any of our users’ proprietary code or data

  3. For development only: we do not buy or sell any user data

Please see our documentation for more details.


License

Daft has an Apache 2.0 license - please see the LICENSE file.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • getdaft-0.3.7.tar.gz (3.7 MB, Source)

Built Distributions

  • getdaft-0.3.7-cp38-abi3-win_amd64.whl (26.6 MB; CPython 3.8+, Windows x86-64)
  • getdaft-0.3.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB; CPython 3.8+, manylinux: glibc 2.17+ x86-64)
  • getdaft-0.3.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (28.1 MB; CPython 3.8+, manylinux: glibc 2.17+ ARM64)
  • getdaft-0.3.7-cp38-abi3-macosx_11_0_arm64.whl (24.6 MB; CPython 3.8+, macOS 11.0+ ARM64)
  • getdaft-0.3.7-cp38-abi3-macosx_10_12_x86_64.whl (26.6 MB; CPython 3.8+, macOS 10.12+ x86-64)

File details

Details for the file getdaft-0.3.7.tar.gz.

File metadata

  • Download URL: getdaft-0.3.7.tar.gz
  • Upload date:
  • Size: 3.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for getdaft-0.3.7.tar.gz:

  • SHA256: 2c2530b6abd52db4e464476284a9f4a82efcd1ea56bfc92971b7e74cd6c27069
  • MD5: ed2b4fbed879f461a09db16c5a9091ab
  • BLAKE2b-256: 466df7ff7cae99e80796baceab0f976bb9465a96fcadb8d1145866bd7154e328

See more details on using hashes here.
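As an illustrative sketch (not an official verification workflow), a downloaded file's SHA256 digest can be checked against the published value using only Python's standard library:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading in chunks
    so that large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest published on this page, e.g.:
# expected = "2c2530b6abd52db4e464476284a9f4a82efcd1ea56bfc92971b7e74cd6c27069"
# assert sha256_of_file("getdaft-0.3.7.tar.gz") == expected
```

pip can also enforce this automatically via hash-checking mode when a requirements file pins hashes (`pip install --require-hashes -r requirements.txt`).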

File details

Details for the file getdaft-0.3.7-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: getdaft-0.3.7-cp38-abi3-win_amd64.whl
  • Upload date:
  • Size: 26.6 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for getdaft-0.3.7-cp38-abi3-win_amd64.whl:

  • SHA256: 85a684651d5b5ab1642adfe5ed74d42d2ffeab1bf048baf7a560cd233f9fbee5
  • MD5: cbdcb2ce316400851f20905630f50136
  • BLAKE2b-256: 08477b165a85fe4745b238e8644783d7140fca459d90802b72fa7ed3bbebee65


File details

Details for the file getdaft-0.3.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for getdaft-0.3.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

  • SHA256: a800ce6a513949ddeea54d3c648b130c11bb1952073b2cd364c4c582dbc50d9f
  • MD5: e6c438a6ecac991fef771b420c282dd6
  • BLAKE2b-256: 4108180ab2d05d743f51da4549cca03948f08a61f9ad1c961268818931dba8a1


File details

Details for the file getdaft-0.3.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for getdaft-0.3.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

  • SHA256: fc3e69aab2a2b516aec3343a4965bde81277ffd61e0ffae753f58944f00d4f92
  • MD5: 8ae206ea37b08d8b3aa7b4d50f96601d
  • BLAKE2b-256: 3ee27d6e9c8ee93ef5e179f7aa69d84d5ee1d02f8a5500d1214aad210246808e


File details

Details for the file getdaft-0.3.7-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for getdaft-0.3.7-cp38-abi3-macosx_11_0_arm64.whl:

  • SHA256: b3af21a05b471383a5ebf6d97b55c4aa1dc268752441728f9f6181ac2e537d6c
  • MD5: 658dd589a21d49647eea308d43f2af84
  • BLAKE2b-256: 2f5604bba5ceda33d836b9bfc4cd81bb57be9b45c32fa394e4b646e8d99460de


File details

Details for the file getdaft-0.3.7-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for getdaft-0.3.7-cp38-abi3-macosx_10_12_x86_64.whl:

  • SHA256: 898354d3360c11c91be68d1ffeeed68f86ad86289a70aa4248c62ee54baf124f
  • MD5: e58d5ceec144dec684b0f9cecc601c86
  • BLAKE2b-256: d54001b913d9dd9aa265ee1d6bf5de68478d46f5595433814f01e7b202ecd6ec

