A flexible data ingestion library for various file formats

Data Ingestors 📊

Move your data into the tracebloc training environment: validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.

How it works

Your raw data
     │
     ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor   │────►│  Your Kubernetes cluster         │
│                  │     │                                  │
│  Validates       │     │  Validated dataset               │
│  Preprocesses    │     │  (ready for training)            │
│  Transfers       │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                               Metadata only
                                        │
                                        ▼
                         ┌──────────────────────────┐
                         │  tracebloc web app       │
                         │  (dataset management UI) │
                         └──────────────────────────┘

Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.

Supported data types

Type          Categories
Image         image_classification, object_detection, keypoint_detection, semantic_segmentation
Text / NLP    text_classification, masked_language_modeling
Tabular       tabular_classification, tabular_regression
Time series   time_series_forecasting, time_to_event_prediction

Each template ships a sample dataset and an example ingest.yaml you can copy as a starting point.
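As a small illustration, the table above can be written down as a lookup that tells you which data type a given category string belongs to. This is a sketch for your own tooling, not part of the package: the names CATEGORIES_BY_TYPE and data_type_for are hypothetical.

```python
# Data types and categories exactly as listed in the table above.
CATEGORIES_BY_TYPE = {
    "Image": [
        "image_classification",
        "object_detection",
        "keypoint_detection",
        "semantic_segmentation",
    ],
    "Text / NLP": ["text_classification", "masked_language_modeling"],
    "Tabular": ["tabular_classification", "tabular_regression"],
    "Time series": ["time_series_forecasting", "time_to_event_prediction"],
}


def data_type_for(category: str) -> str:
    """Return the data type a category belongs to (hypothetical helper)."""
    for data_type, categories in CATEGORIES_BY_TYPE.items():
        if category in categories:
            return data_type
    raise ValueError(f"unsupported category: {category!r}")


print(data_type_for("tabular_regression"))  # Tabular
```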

Quickstart: declarative YAML (recommended)

Describe your dataset in ~8 lines of YAML, then helm install. The official ingestor image (this package, signed + SBOM-attested, published as ghcr.io/tracebloc/ingestor) runs it. No Dockerfile, no Python script.

1. One-time: add the chart repo on your workstation.

helm repo add tracebloc https://tracebloc.github.io/client
helm repo update

The tracebloc/client parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The tracebloc/ingestor subchart submits per-dataset ingestion runs against it.

Already installed the client via the one-liner (bash <(curl -fsSL https://tracebloc.io/i.sh))? Use --reset-then-reuse-values so the helm upgrade doesn't drop the values the installer applied:

helm upgrade <workspace> tracebloc/client -n <namespace> --reset-then-reuse-values

Append --version <version-number> to pin a specific chart version.

2. Stage your data on the cluster's shared PVC.

The chart doesn't transport data into the cluster; it points at data already on the cluster's shared PVC (client-pvc by default, mounted at /data/shared/ inside the ingestor Pod). Get your raw files there before installing. For a small dataset, the simplest pattern is a throwaway Pod that mounts the PVC and that you copy files into with kubectl cp; for production you'd typically use an init container that syncs from cloud storage. Full staging recipe and manifests → tracebloc/client/ingestor/README.md#stage-your-data-on-the-shared-pvc.
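To keep track of where a staged file will appear from the ingestor's point of view, a small helper can map a path relative to the PVC root to the in-Pod path. This is only a sketch: it assumes the default /data/shared mount described above with no subPath in between, and in_pod_path is a hypothetical name.

```python
from pathlib import PurePosixPath

# Default mount point of the shared PVC inside the ingestor Pod (from the text above).
SHARED_MOUNT = PurePosixPath("/data/shared")


def in_pod_path(relative: str, mount: PurePosixPath = SHARED_MOUNT) -> str:
    """Map a path relative to the PVC root to the path the ingestor Pod sees."""
    rel = PurePosixPath(relative)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"expected a path relative to the PVC root: {relative!r}")
    return str(mount / rel)


print(in_pod_path("cats-dogs/labels.csv"))  # /data/shared/cats-dogs/labels.csv
```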

3. Write your ingest.yaml.

The example below is for image_classification. Other categories require different fields: tabular_classification, for example, has no images: field and instead needs a typed schema: block. Don't copy this one blindly; grab the matching file from examples/yaml/ (one per category) and edit from there. Per-category sample data and READMEs live under templates/.

apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label

The top-level shape (apiVersion, kind, category, table, intent, label) is the same for every category. The category field picks the validator set, file-extension defaults, and column conventions; the data-source fields (csv:, images:, schema:, …) vary per category. All paths are resolved inside the ingestor Pod, so they must point into the PVC mount you populated in step 2.
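Before running helm install, you may want a quick local check that the shared top-level fields are present. The sketch below is not the ingestor's own validation: parse_flat_yaml and missing_fields are hypothetical helpers, and the parser only understands flat key: value lines like the example above (a nested schema: block needs a real YAML parser).

```python
# Top-level keys the text above lists as common to every category.
REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "category", "table", "intent", "label")


def parse_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` lines only; nested blocks are out of scope."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def missing_fields(config: dict) -> list:
    """Return the required top-level keys that are absent or empty."""
    return [key for key in REQUIRED_TOP_LEVEL if not config.get(key)]


EXAMPLE = """\
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
"""

print(missing_fields(parse_flat_yaml(EXAMPLE)))  # []
```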

4. Install once per dataset.

helm install my-cats-dogs tracebloc/ingestor \
  --namespace tracebloc \
  --set-file ingestConfig=./ingest.yaml

The ingestor runs once: it validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset. Customers never build an image, never write a Dockerfile, never track digest versions; the cluster's auto-upgrade flow keeps the official image current.
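Since the install is repeated once per dataset, it can be convenient to generate the commands from a list. The sketch below only builds and prints the helm install command shown above, using the same flags; the second release name and path are hypothetical, and helm_install_cmd is not part of the package.

```python
import shlex


def helm_install_cmd(release: str, config_path: str, namespace: str = "tracebloc") -> list:
    """Build the argv for one `helm install` of the ingestor subchart."""
    return [
        "helm", "install", release, "tracebloc/ingestor",
        "--namespace", namespace,
        "--set-file", f"ingestConfig={config_path}",
    ]


# Hypothetical release names and paths: one ingest.yaml per dataset.
datasets = {
    "my-cats-dogs": "./ingest.yaml",
    "my-second-dataset": "./second/ingest.yaml",
}

for release, path in datasets.items():
    # Printed rather than executed; pass the list to subprocess.run(..., check=True) to run it.
    print(shlex.join(helm_install_cmd(release, path)))
```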

Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → tracebloc/client/ingestor/README.md.

Advanced: custom processors (legacy Python pattern)

Use this when the declarative schema can't express what your data needs, typically when you have non-trivial preprocessing logic, a custom validator, or a BaseProcessor subclass.

1. Install the package.

pip install tracebloc-ingestor

2. Pick a template + adapt the script.

cp templates/image_classification/image_classification.py .

The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator, etc.) and the Database / APIClient helpers. See examples/ for working scripts.

3. Build + deploy as a Kubernetes Job.

The legacy Dockerfile and ingestor-job.yaml remain the canonical pattern for custom-processor flows:

docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml

The Job needs these environment variables (set in ingestor-job.yaml):

Variable                     What it is
CLIENT_ID, CLIENT_PASSWORD   Tracebloc client credentials
CLIENT_PVC                   PVC name shared with the client (must match values.yaml)
MYSQL_HOST                   Hostname of the client's MySQL service
SRC_PATH                     Where your raw data is mounted in the ingestor pod
LABEL_FILE                   Path to labels (e.g. Xy_train.csv)
TABLE_NAME                   Destination table name in the client database
TITLE (optional)             Human-readable dataset name
LOG_LEVEL (optional)         INFO, WARNING, ERROR
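A custom script can fail fast when one of these variables is missing instead of erroring halfway through a run. The following is a minimal sketch using only the variable names from the table; load_job_env is a hypothetical helper, and defaulting LOG_LEVEL to INFO when unset is an assumption, not documented behavior.

```python
import os

# Required and optional variables as listed in the table above.
REQUIRED = (
    "CLIENT_ID", "CLIENT_PASSWORD", "CLIENT_PVC", "MYSQL_HOST",
    "SRC_PATH", "LABEL_FILE", "TABLE_NAME",
)
LOG_LEVELS = ("INFO", "WARNING", "ERROR")


def load_job_env(environ=None) -> dict:
    """Collect the Job's settings, raising if a required variable is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED}
    settings["TITLE"] = environ.get("TITLE")  # optional, may be None
    # Assumption: fall back to INFO when LOG_LEVEL is unset.
    level = environ.get("LOG_LEVEL", "INFO").upper()
    if level not in LOG_LEVELS:
        raise RuntimeError(f"LOG_LEVEL must be one of {LOG_LEVELS}, got {level!r}")
    settings["LOG_LEVEL"] = level
    return settings
```

Calling load_job_env() with no argument reads os.environ; passing a plain dict makes the function easy to exercise locally without touching the real environment.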

Running custom-processor flows under Pod Security Standards (restricted)

If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), Pods built from the stock Dockerfile and ingestor-job.yaml will be rejected at admission. (The declarative path's image is already compatible with the restricted profile; this section only applies to custom Dockerfiles built from this repo.) Two changes are needed.

Check first:

kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq

Look for pod-security.kubernetes.io/enforce: restricted. If it is absent, the stock files are admitted as-is and you can skip this section.
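If you want to script this check, the JSON printed by the kubectl command above can be inspected directly. A minimal sketch, assuming the command's output has been captured as a string; enforces_restricted is a hypothetical helper and the sample labels are made up for illustration.

```python
import json

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforces_restricted(labels_json: str) -> bool:
    """Return True if the namespace labels enforce the restricted profile."""
    # An empty string means the namespace has no labels at all.
    labels = json.loads(labels_json) if labels_json.strip() else {}
    return labels.get(ENFORCE_LABEL) == "restricted"


# Illustrative label set, not output from a real cluster.
sample = json.dumps({
    "kubernetes.io/metadata.name": "tracebloc",
    "pod-security.kubernetes.io/enforce": "restricted",
})
print(enforces_restricted(sample))  # True
```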

1. Dockerfile: drop root. Append before ENTRYPOINT:

# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001

2. ingestor-job.yaml: add a hardened securityContext at both the pod level and the container level:

spec:
  template:
    spec:
      securityContext:                    # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        # ... existing container spec ...
        securityContext:                  # container-level
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
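To confirm an edited manifest carries the settings shown above, you could load it into a dict (with any YAML parser) and check the fields. The sketch below is not a full Pod Security admission check: it only mirrors the fields from the snippet, and missing_hardening is a hypothetical helper.

```python
def missing_hardening(job: dict) -> list:
    """List the settings from the snippet above that a Job manifest lacks."""
    pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
    pod_ctx = pod_spec.get("securityContext", {})
    problems = []
    if pod_ctx.get("runAsNonRoot") is not True:
        problems.append("pod: runAsNonRoot is not true")
    if pod_ctx.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        problems.append("pod: seccompProfile.type is not RuntimeDefault")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation is not false")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop does not include ALL")
    return problems


# The hardened manifest from above, written as a Python dict.
hardened = {
    "spec": {"template": {"spec": {
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1001,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [{
            "name": "api",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    }}},
}
print(missing_hardening(hardened))  # []
```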

Subclassing BaseIngestor

For data that doesn't fit any of the existing templates, subclass BaseIngestor:

from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()
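The body of transform depends entirely on your data. As a purely illustrative sketch of the kind of preprocessing that could go there, the function below cleans one record, assuming records arrive as plain dicts (an assumption; check the package's examples/ for the actual record type). clean_record is a hypothetical name and does not use the library.

```python
def clean_record(record: dict) -> dict:
    """Trim whitespace from string values and drop keys whose value is empty."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue  # skip empty fields rather than ingesting blanks
        cleaned[key] = value
    return cleaned


print(clean_record({"path": " data/part-0.parquet ", "label": "cat", "note": ""}))
```

Keeping the logic in a plain function like this makes it testable on its own before it is wired into a transform method.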

Prerequisites

Links

Platform · Docs · Data preparation guide · Discord

Maintainers: see RELEASING.md for the release procedure.

License

Apache 2.0; see LICENSE.

Questions? support@tracebloc.io or open an issue.
