teraflopai-data

A petabyte-scale data processing framework for AI models, built on Daft + Ray.

Installation

uv pip install teraflopai-data

Install specific multimodal components

# Image
uv pip install teraflopai-data[image]

# Text
uv pip install teraflopai-data[text]

# Everything
uv pip install teraflopai-data[all]

Community

Join our Discord community

Examples

Pipeline

import daft

from teraflopai_data.components.text.embedding import SentenceTransformersEmbed
from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier
from teraflopai_data.pipeline import Pipeline

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

embedder = SentenceTransformersEmbed(
    input_column="text",
    model_name="all-MiniLM-L6-v2",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

pipeline = Pipeline(
    ops=[classifier, embedder],
)

df = pipeline(df)
df.show()
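For context on what `Pipeline` is doing above, here is a minimal, purely illustrative sketch (the names `SequentialPipeline`, `upper`, and `exclaim` are not the library's API), assuming each op is a callable that maps a dataframe to a transformed dataframe and `Pipeline` applies the ops in order:

```python
# Illustrative sketch of sequential op composition; the real
# teraflopai_data.pipeline.Pipeline may also handle resource scheduling.
class SequentialPipeline:
    def __init__(self, ops):
        self.ops = ops

    def __call__(self, df):
        # Apply each op in order; every op returns a new dataframe-like value.
        for op in self.ops:
            df = op(df)
        return df

# Plain functions standing in for the classifier/embedder ops:
upper = lambda rows: [r.upper() for r in rows]
exclaim = lambda rows: [r + "!" for r in rows]

pipe = SequentialPipeline([upper, exclaim])
print(pipe(["hello"]))  # ['HELLO!']
```

This matches the example's usage pattern: both `Pipeline(...)` and individual ops like `FinewebEduClassifier(...)` are called directly on a dataframe.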

Text

import daft

from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)
df = classifier(df)
df.show()
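The `batch_size=4` argument suggests the classifier feeds the model in fixed-size chunks. A minimal sketch of that chunking (illustrative only; in the library the actual batching is handled by the Daft/Ray execution layer):

```python
# Split a column of values into batches of at most batch_size items.
def batched(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = ["My mother told me", "Someday I will buy",
         "Galleys with good oars", "Sail to distant shores", "Extra line"]
print([len(b) for b in batched(texts, 4)])  # [4, 1]
```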

Image

import daft
from daft import col

from teraflopai_data.components.image.image_hashing import ImageHasher

df = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)

hasher = ImageHasher(
    input_column="image",
    hashing_algorithm="wavelet",
    concurrency=1,
    num_cpus=6,
)

df = df.with_column("image_bytes", col("urls").url.download(on_error="null"))
df = df.with_column("image", col("image_bytes").image.decode())
df = hasher(df)
df = df.drop_duplicates("image_hash")
df.show()
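The dedup step keeps one row per distinct `image_hash` value. A plain-Python sketch of what `drop_duplicates` on a single column does (the row dicts below are illustrative, not real output):

```python
# Keep the first row seen for each distinct value of `key`.
def drop_duplicates(rows, key):
    seen, out = set(), []
    for row in rows:
        k = row[key]
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"url": "a.jpg", "image_hash": "abc"},
    {"url": "b.jpg", "image_hash": "abc"},  # same perceptual hash -> dropped
    {"url": "c.jpg", "image_hash": "def"},
]
print([r["url"] for r in drop_duplicates(rows, "image_hash")])  # ['a.jpg', 'c.jpg']
```

Because wavelet hashing is a perceptual hash, near-identical images tend to collide on `image_hash`, which is what makes this a useful dedup key.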

Citation

@misc{shippole2025petabyte,
    title   = {Distributed},
    author  = {Enrico Shippole},
    year    = {2025},
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jotun-0.1.0.tar.gz (13.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jotun-0.1.0-py3-none-any.whl (21.5 kB)

Uploaded Python 3

File details

Details for the file jotun-0.1.0.tar.gz.

File metadata

  • Download URL: jotun-0.1.0.tar.gz
  • Upload date:
  • Size: 13.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for jotun-0.1.0.tar.gz

  • SHA256: 33a7b472cc45d9cf47d25d8b9dcfa5911bd2341efd10c56bf1a866dcf772bf21
  • MD5: 45d8cba24655aeb6382a4af3d0028dbf
  • BLAKE2b-256: 846c6def3252a0a1b0556c79f444f157c38492772c17a886fe7c253acc656f83

See more details on using hashes here.
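To check a downloaded file against the published digests, a standard-library sketch (adjust the path to wherever you saved the sdist):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "33a7b472cc45d9cf47d25d8b9dcfa5911bd2341efd10c56bf1a866dcf772bf21"
# assert sha256_of("jotun-0.1.0.tar.gz") == expected
```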

File details

Details for the file jotun-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: jotun-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for jotun-0.1.0-py3-none-any.whl

  • SHA256: 8634150bcc6d42a47c9bd6d80cb707b37ca96dd7c79ec04c9314209f8fd08890
  • MD5: ce2a6de05ddb383073ada46e7c6b3e44
  • BLAKE2b-256: 02a8b895c88c6585f836a906fffabf5724e2395b6c69ad21a005035d89fec673

See more details on using hashes here.
