# teraflopai-data

A petabyte-scale data processing framework for AI models using Daft + Ray.
## Installation

```shell
pip install teraflopai-data
```
## Community

## Examples
### Pipeline

```python
import daft

from teraflopai_data.components.text.embedding import SentenceTransformersEmbed
from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier
from teraflopai_data.pipeline import Pipeline

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

embedder = SentenceTransformersEmbed(
    input_column="text",
    model_name="all-MiniLM-L6-v2",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

pipeline = Pipeline(
    ops=[classifier, embedder],
)

df = pipeline(df)
df.show()
```
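Conceptually, the pipeline applies its ops in order, each op consuming the dataframe produced by the previous one. A minimal sketch of that composition pattern in plain Python (illustrative only, not the library's actual `Pipeline` implementation):

```python
from functools import reduce

def run_pipeline(ops, value):
    # Fold the ops over the initial value: each op receives the
    # previous op's output, mirroring Pipeline(ops=[...])(df)
    return reduce(lambda acc, op: op(acc), ops, value)

# Toy ops standing in for the classifier and embedder stages
double = lambda x: x * 2
increment = lambda x: x + 1

result = run_pipeline([double, increment], 5)  # (5 * 2) + 1 = 11
```

With an empty `ops` list the input passes through unchanged, which is the natural base case for this fold.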
### Text

```python
import daft

from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

df = classifier(df)
df.show()
```
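The `batch_size` argument controls how many rows each worker call receives at once. A hypothetical helper showing how the four example texts would split under different batch sizes (plain Python, not the library's internals):

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [
    "My mother told me",
    "Someday I will buy",
    "Galleys with good oars",
    "Sail to distant shores",
]

# batch_size=4 processes all four rows in a single call;
# batch_size=2 splits them into two calls of two rows each
batches_of_four = list(batched(texts, 4))
batches_of_two = list(batched(texts, 2))
```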
### Image

```python
import daft
from daft import col

from teraflopai_data.components.image.image_hashing import ImageHasher

df = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)

hasher = ImageHasher(
    input_column="image",
    hashing_algorithm="wavelet",
    concurrency=1,
    num_cpus=6,
)

df = df.with_column("image_bytes", col("urls").url.download(on_error="null"))
df = df.with_column("image", col("image_bytes").image.decode())
df = hasher(df)
df = df.drop_duplicates("image_hash")
df.show()
```
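The final `drop_duplicates("image_hash")` keeps one row per perceptual hash, so near-identical images that hash to the same value collapse to a single row. The same first-seen-wins dedup logic, sketched in plain Python (illustrative only, not Daft's implementation):

```python
def dedupe_by_hash(records, hash_key="image_hash"):
    # Keep the first record seen for each distinct hash value
    seen = set()
    unique = []
    for record in records:
        if record[hash_key] not in seen:
            seen.add(record[hash_key])
            unique.append(record)
    return unique

# Hypothetical rows: a.jpg and b.jpg are near-duplicates sharing a hash
rows = [
    {"url": "a.jpg", "image_hash": "ffe081"},
    {"url": "b.jpg", "image_hash": "ffe081"},
    {"url": "c.jpg", "image_hash": "03b4d2"},
]
deduped = dedupe_by_hash(rows)  # keeps a.jpg and c.jpg
```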
## Project details
### Source Distribution

teraflopai_data-0.1.1.tar.gz (12.5 kB)
### Built Distribution

teraflopai_data-0.1.1-py3-none-any.whl (16.7 kB)
### File details: teraflopai_data-0.1.1.tar.gz

- Size: 12.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `109f8c8a340dfec84596dc0da5da2393877741ff6f6d5cd92004e00f08c942aa` |
| MD5 | `e48af3ecd66b79b3d4cb2e655c5fff8f` |
| BLAKE2b-256 | `6a5516f485acc18601e2543bba040bb239b30ce9ef15992c28f89af47c0aad40` |
### File details: teraflopai_data-0.1.1-py3-none-any.whl

- Size: 16.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4dbfb31af7bf393c55e4e595deb56e832c287938ed8e11580593e52d540bd1bd` |
| MD5 | `936615baa2d06446ff16ebb21e639057` |
| BLAKE2b-256 | `6cb471ddcfe2edc06c2671cd002dee9bba74230e390c8b9d4e1c49523c11ef75` |