
teraflopai-data

A petabyte-scale data-processing framework for AI models, built on Daft + Ray.

Installation

pip install teraflopai-data

Community

Examples

Pipeline

import daft

from teraflopai_data.components.text.embedding import SentenceTransformersEmbed
from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier
from teraflopai_data.pipeline import Pipeline

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

embedder = SentenceTransformersEmbed(
    input_column="text",
    model_name="all-MiniLM-L6-v2",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)

pipeline = Pipeline(
    ops=[classifier, embedder],
)

df = pipeline(df)
df.show()
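Conceptually, Pipeline chains its ops: each op takes a dataframe and returns a transformed one, and the pipeline applies them in order. The sketch below illustrates that composition pattern in plain Python; it is not the teraflopai_data implementation, and the stand-in ops (add_score, add_upper) are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

class SimplePipeline:
    """Applies a sequence of ops, each a callable T -> T, in order."""

    def __init__(self, ops: Iterable[Callable[[T], T]]):
        self.ops = list(ops)

    def __call__(self, df: T) -> T:
        # Feed the output of each op into the next, like classifier -> embedder.
        for op in self.ops:
            df = op(df)
        return df

# Plain functions standing in for the classifier and embedder ops:
add_score = lambda rows: [{**r, "score": len(r["text"])} for r in rows]
add_upper = lambda rows: [{**r, "upper": r["text"].upper()} for r in rows]

pipe = SimplePipeline([add_score, add_upper])
out = pipe([{"text": "hi"}])
# out: [{"text": "hi", "score": 2, "upper": "HI"}]
```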

Text

import daft

from teraflopai_data.components.text.fineweb_edu import FinewebEduClassifier

df = daft.from_pydict(
    {
        "text": [
            "My mother told me",
            "Someday I will buy",
            "Galleys with good oars",
            "Sail to distant shores",
        ],
    }
)

classifier = FinewebEduClassifier(
    input_column="text",
    batch_size=4,
    concurrency=1,
    num_cpus=6,
    num_gpus=1,
)
df = classifier(df)
df.show()

Image

import daft
from daft import col

from teraflopai_data.components.image.image_hashing import ImageHasher

df = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)

hasher = ImageHasher(
    input_column="image",
    hashing_algorithm="wavelet",
    concurrency=1,
    num_cpus=6,
)

df = df.with_column("image_bytes", col("urls").url.download(on_error="null"))
df = df.with_column("image", col("image_bytes").image.decode())
df = hasher(df)
df = df.drop_duplicates("image_hash")
df.show()
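The dedup step works because perceptually similar images produce identical (or near-identical) hashes, so dropping duplicate hash values removes near-duplicate images. As an illustration of the idea, here is a minimal average-hash (aHash) sketch in pure Python; ImageHasher's "wavelet" algorithm is more robust, and this is not the library's implementation.

```python
def average_hash(pixels, size=8):
    """pixels: a flat size*size list of grayscale values (0-255).
    Returns a hex string where each bit is 1 if the pixel is above the mean."""
    assert len(pixels) == size * size
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    # 64 bits -> 16 hex characters for the default 8x8 grid
    return f"{bits:0{size * size // 4}x}"

# Two identical 8x8 "images" collide; an inverted one does not.
img_a = [i * 4 % 256 for i in range(64)]
img_b = list(img_a)            # exact duplicate
img_c = [255 - p for p in img_a]  # inverted image

hashes = {average_hash(img) for img in (img_a, img_b, img_c)}
# deduplicating on the hash keeps one copy of img_a/img_b plus img_c
```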
