
Distributed Dataframes for Multimodal Data

Project description

Daft dataframes can load any kind of data, such as PDF documents, images, Protobufs, CSV, Parquet, and audio files, into a tabular dataframe structure for easy querying.


Website | Docs | Installation | 10-minute tour of Daft | Community and Support

Daft: Unified Engine for Data Analytics, Engineering & ML/AI

Daft is a distributed query engine for large-scale data processing using Python or SQL, implemented in Rust.

  • Familiar interactive API: Lazy Python Dataframe for rapid and interactive iteration, or SQL for analytical queries (a short sketch of both follows this list)

  • Focus on the what: Powerful Query Optimizer that rewrites queries to be as efficient as possible

  • Data Catalog integrations: Full integration with data catalogs such as Apache Iceberg

  • Rich multimodal type-system: Supports multimodal types such as Images, URLs, Tensors and more

  • Seamless Interchange: Built on the Apache Arrow In-Memory Format

  • Built for the cloud: Record-setting I/O performance for integrations with S3 cloud storage
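As a small illustration of the lazy API and query optimizer described above, the sketch below builds a query, inspects its optimized plan, and only then executes it. The data and filter are made up for illustration, and daft.sql assumes a Daft release that ships SQL support:

import daft

# Build a lazy dataframe; nothing executes yet
df = daft.from_pydict({"name": ["a", "b", "c"], "score": [0.1, 0.9, 0.5]})

# Chain transformations lazily; the query optimizer rewrites the plan
filtered = df.where(df["score"] > 0.5).select("name")

# Inspect the logical and optimized plans without running the query
filtered.explain(show_all=True)

# Execution happens only when results are requested
filtered.show()

# The same query expressed in SQL, referencing the in-scope dataframe `df`
daft.sql("SELECT name FROM df WHERE score > 0.5").show()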


About Daft

Daft was designed with the following principles in mind:

  1. Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex or nested multimodal data such as Images, Embeddings and Python objects efficiently with its Arrow-based memory representation. Ingestion and basic transformations of multimodal data are extremely easy and performant in Daft.

  2. Interactive Computing: Daft is built for the interactive developer experience through notebooks or REPLs - intelligent caching and query optimization accelerate your experimentation and data exploration.

  3. Distributed Computing: Some workloads can quickly outgrow your local laptop's computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs (see the sketch after this list).
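For example, switching from the default local runner to a Ray cluster is a one-line change. The following is a minimal sketch, where the cluster address is a placeholder for your own Ray deployment:

import daft

# Run Daft on a Ray cluster instead of the default local runner.
# "ray://head-node:10001" is a placeholder address; omit it to start a local Ray instance.
daft.context.set_runner_ray(address="ray://head-node:10001")

# Subsequent dataframe work is scheduled on the Ray cluster
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
df.show(3)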

Getting Started

Installation

Install Daft with pip install getdaft.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.

Quickstart

Check out our 10-minute quickstart!

In this example, we load images from URLs in an AWS S3 bucket and resize each image in the dataframe:

import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image to 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)

Dataframe code to load a folder of images from AWS S3 and create thumbnails
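Because the dataframe is lazy, the downloads, decodes, and resizes above only run when results are requested. Below is a minimal sketch of materializing the result and persisting it; the output path is a placeholder, and depending on your Daft version, image columns may need to be encoded to bytes (e.g. with .image.encode) before writing:

# Materialize the lazy query: downloads, decodes, and resizes run here
df = df.collect()

# Write the resulting table out as Parquet (output path is a placeholder)
df.write_parquet("s3://my-bucket/resized-images/")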

Benchmarks

Benchmark results for TPC-H at scale factor 100 (SF100)

To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.

More Resources

  • 10-minute tour of Daft - learn more about Daft's full range of capabilities including data loading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more (a brief sketch of UDFs and aggregations follows this list).

  • User Guide - take a deep-dive into each topic within Daft

  • API Reference - API reference for public classes/functions of Daft
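As a small taste of those capabilities, the sketch below defines a user-defined function and runs a groupby aggregation over an in-memory dataframe; the column names and data are made up for illustration:

import daft

df = daft.from_pydict({"category": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# A simple UDF that doubles each value; Daft passes the column in as a Series
@daft.udf(return_dtype=daft.DataType.int64())
def double(values):
    return [v * 2 for v in values.to_pylist()]

df = df.with_column("doubled", double(df["value"]))

# Groupby + aggregation
df.groupby("category").sum("value").show()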

Contributing

To start contributing to Daft, please read CONTRIBUTING.md.

Here’s a list of good first issues to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

Telemetry

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
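For example, a minimal sketch of opting out from Python; the variable must be set before Daft is imported:

import os

# Opt out of telemetry; this must happen before daft is imported
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

import daft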

The data that we collect is:

  1. Non-identifiable: events are keyed by a session ID which is generated on import of Daft

  2. Metadata-only: we do not collect any of our users’ proprietary code or data

  3. For development only: we do not buy or sell any user data

Please see our documentation for more details.


License

Daft has an Apache 2.0 license - please see the LICENSE file.


Download files

Download the file for your platform.

Source Distribution

  • getdaft-0.3.14.tar.gz (3.9 MB): Source

Built Distributions

  • getdaft-0.3.14-cp38-abi3-win_amd64.whl (30.4 MB): CPython 3.8+, Windows x86-64
  • getdaft-0.3.14-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (33.3 MB): CPython 3.8+, manylinux glibc 2.17+, x86-64
  • getdaft-0.3.14-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (31.8 MB): CPython 3.8+, manylinux glibc 2.17+, ARM64
  • getdaft-0.3.14-cp38-abi3-macosx_11_0_arm64.whl (27.9 MB): CPython 3.8+, macOS 11.0+ ARM64
  • getdaft-0.3.14-cp38-abi3-macosx_10_12_x86_64.whl (30.1 MB): CPython 3.8+, macOS 10.12+ x86-64

File details

Details for the file getdaft-0.3.14.tar.gz.

File metadata

  • Download URL: getdaft-0.3.14.tar.gz
  • Upload date:
  • Size: 3.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for getdaft-0.3.14.tar.gz
Algorithm Hash digest
SHA256 473a9aaabcba29c98dc36377c304e1a047162478d229077995818a31c29f0c6f
MD5 697bc192873c7046fc98d45f4a6cf26b
BLAKE2b-256 62f768d2b69da8b916f173256ec2fa34401219346e7f086f5c6151b7974a35b4


File details

Details for the file getdaft-0.3.14-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: getdaft-0.3.14-cp38-abi3-win_amd64.whl
  • Upload date:
  • Size: 30.4 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.10

File hashes

Hashes for getdaft-0.3.14-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 71f2663500de6eb93108a6cc193ea0b14c1fdd607729970a5948b8a1b99fd0af
MD5 d555e40d93dc18d8e1828ec330d7ee43
BLAKE2b-256 84d1ae44722f3d1ab855636bdb25518e2c730b2e647aff8092c55eb564078e32


File details

Details for the file getdaft-0.3.14-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for getdaft-0.3.14-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 98f85a22185afd3d1230ff014c0a5bf74521445949f27ee5b65dea15ead12b8b
MD5 289bfbd1eb03ce69688d7fd6ac19db74
BLAKE2b-256 f99eecf0c8f1f6683bb756f78f79ad1d53cc7c951256fc7c9e5b7d95d0bff3b2


File details

Details for the file getdaft-0.3.14-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for getdaft-0.3.14-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 3cef1f00067adeee0ce2b1d0d8fdda2186f2e4fa2981f3aa9e910ec7d03a4a3e
MD5 9cef946f0332b974d1c2016e6f5aa107
BLAKE2b-256 2235a6b3365cb18a97739e4b67d1536ba5e9499b428e9561ff45849ff1955447


File details

Details for the file getdaft-0.3.14-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for getdaft-0.3.14-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f75e5706417e7b0b9211ce02139b9701821dd89d66e0e23e9380999198d89959
MD5 fc159d60df1437de83756d6853f1d197
BLAKE2b-256 f7ec68bf1458ff3acc8193b146af493475c1e113546f2ee1137a12cd862579f5


File details

Details for the file getdaft-0.3.14-cp38-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for getdaft-0.3.14-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3ca7900581c868954f5444ee34734d0ce906108a8dd8f43076b207fe81d6a57b
MD5 629b0056b20d1a0bfe4bc18454b8381d
BLAKE2b-256 01fc30e21adbcfa2b2ead49fc582779df8f9bf6531312525cd8d7531a18f1b2d

