# teraflopai-data

A petabyte-scale data processing framework for AI models using Daft + Ray.
## Installation

```shell
uv pip install teraflopai-data
```

Install specific multimodal components:

```shell
# Image
uv pip install "teraflopai-data[image]"

# Text
uv pip install "teraflopai-data[text]"

# Everything
uv pip install "teraflopai-data[all]"
```
## Examples

### Pipeline

```python
import daft

from teraflopai_data.components.text.embedding import SentenceTransformersEmbed
from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier
from teraflopai_data.pipeline import Pipeline

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

embedder = SentenceTransformersEmbed(
    input_column="text",
    model_name="all-MiniLM-L6-v2",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

pipeline = Pipeline(
    ops=[classifier, embedder],
)

df = pipeline(df)
df.show()
```
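Conceptually, the `Pipeline` above just applies each op to the dataframe in sequence. A minimal pure-Python sketch of that composition pattern (not the actual `teraflopai_data.pipeline.Pipeline` implementation; the toy ops and the plain-dict "dataframe" here are invented for illustration):

```python
class SketchPipeline:
    """Toy illustration of sequential op composition."""

    def __init__(self, ops):
        self.ops = ops

    def __call__(self, df):
        # Each op takes a dataframe-like object and returns a new one.
        for op in self.ops:
            df = op(df)
        return df

# Hypothetical ops operating on a plain dict standing in for a dataframe.
add_score = lambda d: {**d, "score": [len(t) for t in d["text"]]}
add_upper = lambda d: {**d, "upper": [t.upper() for t in d["text"]]}

pipeline = SketchPipeline(ops=[add_score, add_upper])
result = pipeline({"text": ["My mother told me"]})
```

Because every op consumes and returns a dataframe, ops stay independently usable (as in the `Text` and `Image` examples below) while composing freely inside a pipeline.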
### Text

```python
import daft

from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

df = classifier(df)
df.show()
```
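The `batch_size` argument controls how many rows are handed to the model per call. A short sketch of that chunking behavior, assuming (hypothetically) that batches are formed as consecutive slices of the input rows:

```python
def batches(rows, batch_size):
    """Yield consecutive chunks of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

texts = [
    "My mother told me",
    "Someday I will buy",
    "Galleys with good oars",
    "Sail to distant shores",
]

# With batch_size=4, all four rows above fit in a single model call.
chunks = list(batches(texts, batch_size=4))
```

Larger batches amortize per-call overhead on the GPU, at the cost of memory; `concurrency` then controls how many such batch-processing replicas run at once.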
### Image

```python
import daft
from daft import col

from teraflopai_data.components.image.image_hashing import ImageHasher

df = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)

hasher = ImageHasher(
    input_column="image",
    hashing_algorithm="wavelet",
    concurrency=1,
    num_cpus=6,
)

df = df.with_column("image_bytes", col("urls").url.download(on_error="null"))
df = df.with_column("image", col("image_bytes").image.decode())
df = hasher(df)
df = df.drop_duplicates("image_hash")
df.show()
```
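Dropping duplicates on the hash column is what turns perceptual hashing into deduplication: images whose wavelet hashes collide are treated as near-duplicates, and only one row per hash survives. A pure-Python sketch of that keep-first-per-hash logic (the rows and hash values below are made up for illustration):

```python
def dedupe_by_hash(rows, key="image_hash"):
    """Keep the first row for each distinct hash value."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

rows = [
    {"url": "a.jpg", "image_hash": "ffe081"},
    {"url": "b.jpg", "image_hash": "ffe081"},  # near-duplicate of a.jpg
    {"url": "c.jpg", "image_hash": "07c3a1"},
]
unique = dedupe_by_hash(rows)
```

Note that exact hash equality only catches images whose hashes collide outright; catching looser near-duplicates would require comparing hashes by Hamming distance instead.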
## Citation

```bibtex
@misc{shippole2025petabyte,
    title  = {Distributed},
    author = {Enrico Shippole},
    year   = {2025},
}
```