
A flexible data ingestion library for various file formats

Data Ingestors 📊

Move your data into the tracebloc training environment: validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.

How it works

Your raw data
     │
     ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor   │────►│  Your Kubernetes cluster         │
│                  │     │                                  │
│  Validates       │     │  Validated dataset               │
│  Preprocesses    │     │  (ready for training)            │
│  Transfers       │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                               Metadata only
                                        │
                                        ▼
                         ┌──────────────────────────┐
                         │  tracebloc web app       │
                         │  (dataset management UI) │
                         └──────────────────────────┘

Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
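
To make the split concrete, here is a minimal sketch of what "metadata only" could look like for a small tabular file. The helper `summarize_csv` and the shape of its output are hypothetical, chosen for illustration; they are not the ingestor's actual sync payload.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return schema and summary statistics for CSV text, never the rows.

    Hypothetical helper for illustration; not part of tracebloc-ingestor.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {"row_count": len(rows), "columns": {}}
    for name in columns:
        values = [row[name] for row in rows]
        try:
            numbers = [float(value) for value in values]
        except (TypeError, ValueError):
            # Non-numeric column: report only the type and cardinality.
            summary["columns"][name] = {
                "type": "string",
                "distinct": len(set(values)),
            }
        else:
            summary["columns"][name] = {
                "type": "number",
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
    return summary


# Made-up sample data; only the summary below would leave the cluster.
SAMPLE = "age,label\n31,cat\n45,dog\n38,cat\n"
metadata = summarize_csv(SAMPLE)
print(metadata)
```

The returned dictionary carries column names, types, and aggregates, but none of the individual records.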

Supported data types

| Type | Templates |
| --- | --- |
| Image | image_classification, object_detection |
| Text / NLP | text_classification |
| Tabular | tabular_classification, tabular_regression |
| Time series | time_series_forecasting, time_to_event_prediction |

Each template is a runnable starting point: copy it, point it at your data, ship it.

Quickstart

1. Install

pip install tracebloc-ingestor

2. Pick a template

cp templates/image_classification/ingestor.py .

Each template builds on the same primitives (BaseIngestor, CSVIngestor, validators) and overrides the parts that vary by data type.

3. Deploy as a Kubernetes Job

The ingestor runs inside your cluster, next to a tracebloc client. The provided Dockerfile and ingestor-job.yaml are the canonical pattern:

docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml

The Job needs these environment variables (set in ingestor-job.yaml):

| Variable | What it is |
| --- | --- |
| CLIENT_ID, CLIENT_PASSWORD | Tracebloc client credentials |
| CLIENT_PVC | PVC name shared with the client (must match values.yaml) |
| MYSQL_HOST | Hostname of the client's MySQL service |
| SRC_PATH | Where your raw data is mounted in the ingestor pod |
| LABEL_FILE | Path to labels (e.g. Xy_train.csv) |
| TABLE_NAME | Destination table name in the client database |
| TITLE (optional) | Human-readable dataset name |
| LOG_LEVEL (optional) | INFO, WARNING, ERROR |
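
If you want a custom ingestor script to fail fast on a misconfigured Job, a start-up check along these lines may help. This is a sketch under assumptions: `load_settings` is a hypothetical helper, not something the package is known to export, and the defaults for TITLE and LOG_LEVEL, as well as the strict LOG_LEVEL check, are guesses based on the table above.

```python
import os

REQUIRED = (
    "CLIENT_ID",
    "CLIENT_PASSWORD",
    "CLIENT_PVC",
    "MYSQL_HOST",
    "SRC_PATH",
    "LABEL_FILE",
    "TABLE_NAME",
)
ALLOWED_LOG_LEVELS = ("INFO", "WARNING", "ERROR")


def load_settings(env=None):
    """Collect the Job's variables, failing fast when a required one is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # Optional values; these defaults are assumptions, not library defaults.
    settings["TITLE"] = env.get("TITLE", "")
    level = env.get("LOG_LEVEL", "INFO").upper()
    if level not in ALLOWED_LOG_LEVELS:
        raise RuntimeError("unsupported LOG_LEVEL: " + level)
    settings["LOG_LEVEL"] = level
    return settings


# Placeholder values for illustration only.
example_env = {
    "CLIENT_ID": "demo-client",
    "CLIENT_PASSWORD": "demo-password",
    "CLIENT_PVC": "demo-pvc",
    "MYSQL_HOST": "mysql.example.internal",
    "SRC_PATH": "/data",
    "LABEL_FILE": "/data/Xy_train.csv",
    "TABLE_NAME": "demo_table",
}
settings = load_settings(example_env)
print(settings["TABLE_NAME"], settings["LOG_LEVEL"])
```

Calling `load_settings()` with no argument reads from the pod's real environment instead of the example dictionary.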

Running under Pod Security Standards (restricted)

If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), pods created from the stock Dockerfile and ingestor-job.yaml will be rejected at admission. Two changes are needed.

Check first:

kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq .

Look for pod-security.kubernetes.io/enforce: restricted. If the label is absent, the stock files are admitted as-is and you can skip this section.
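
The same check can be scripted, for example in a deployment pipeline. The sketch below assumes you pass it the JSON printed by the kubectl command above; `enforces_restricted` is a hypothetical helper name and the two sample label sets are made up.

```python
import json

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforces_restricted(labels_json):
    """Return True when the namespace labels enforce the restricted profile."""
    # kubectl prints nothing at all when the namespace has no labels.
    labels = json.loads(labels_json) if labels_json.strip() else {}
    return labels.get(ENFORCE_LABEL) == "restricted"


# Made-up label sets standing in for real kubectl output.
hardened = json.dumps(
    {
        "kubernetes.io/metadata.name": "ml",
        "pod-security.kubernetes.io/enforce": "restricted",
    }
)
plain = json.dumps({"kubernetes.io/metadata.name": "default"})

print(enforces_restricted(hardened))
print(enforces_restricted(plain))
```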

1. Dockerfile: drop root. Insert before ENTRYPOINT:

# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001

2. ingestor-job.yaml: add a hardened securityContext at both the pod level and the container level:

spec:
  template:
    spec:
      securityContext:                    # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        # ... existing container spec ...
        securityContext:                  # container-level
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
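
Before applying the Job, you may want to sanity-check the manifest against the settings shown above. The following is a partial, illustrative check on a manifest that has already been loaded into a Python dictionary; `restricted_violations` is a hypothetical helper, and the real admission controller enforces more rules than these four.

```python
def restricted_violations(job):
    """List the ways a Job manifest (as a dict) misses the settings above."""
    pod = job["spec"]["template"]["spec"]
    pod_ctx = pod.get("securityContext", {})
    problems = []
    for container in pod.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        # Container-level values take precedence over pod-level ones.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        seccomp = ctx.get("seccompProfile", pod_ctx.get("seccompProfile", {}))
        if run_as_non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be set")
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
    return problems


# Mirrors the hardened YAML above.
hardened_job = {
    "spec": {"template": {"spec": {
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1001,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [{
            "name": "api",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    }}}
}
# A manifest with no securityContext at all, for comparison.
stock_job = {"spec": {"template": {"spec": {"containers": [{"name": "api"}]}}}}

print(restricted_violations(hardened_job))
for problem in restricted_violations(stock_job):
    print(problem)
```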

Writing a custom ingestor

For data that doesn't fit a template, subclass BaseIngestor:

from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()

The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator) and the Database / APIClient helpers. See examples/ for working scripts.
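
To show how a class-level validators list and a transform hook can fit together, here is a self-contained sketch that does not import the package. `SuffixValidator`, `SketchIngestor`, and the `ingest(records)` signature are stand-ins invented for this example; the real BaseIngestor and FileTypeValidator may behave differently.

```python
from pathlib import PurePath


class SuffixValidator:
    """Stand-in for a file-type validator: accepts only listed suffixes."""

    def __init__(self, allowed):
        self.allowed = {suffix.lower() for suffix in allowed}

    def validate(self, path):
        return PurePath(path).suffix.lower() in self.allowed


class SketchIngestor:
    """Stand-in base class: run every validator, then transform the record."""

    validators = []

    def transform(self, record):
        return record

    def ingest(self, records):
        accepted = []
        for record in records:
            if all(v.validate(record["path"]) for v in self.validators):
                accepted.append(self.transform(record))
        return accepted


class ParquetOnly(SketchIngestor):
    validators = [SuffixValidator(allowed=[".parquet"])]

    def transform(self, record):
        # Example preprocessing: normalise the label to lower case.
        return {**record, "label": record["label"].lower()}


# Made-up records; the .csv entry is rejected by the validator.
records = [
    {"path": "a.parquet", "label": "Cat"},
    {"path": "b.csv", "label": "Dog"},
]
result = ParquetOnly().ingest(records)
print(result)
```

Keeping validation in small reusable objects and preprocessing in a single overridable method means a subclass only declares what differs for its data type.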

Prerequisites

Links

Platform · Docs · Data preparation guide · Discord

License

Apache 2.0. See LICENSE.

Questions? support@tracebloc.io or open an issue.

