A flexible data ingestion library for various file formats
Data Ingestors
Move your data into the tracebloc training environment: validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.
How it works
   Your raw data
         │
         ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│ Data ingestor    │────►│ Your Kubernetes cluster          │
│                  │     │                                  │
│ Validates        │     │ Validated dataset                │
│ Preprocesses     │     │ (ready for training)             │
│ Transfers        │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                                  Metadata only
                                        │
                                        ▼
                          ┌──────────────────────────┐
                          │ tracebloc web app        │
                          │ (dataset management UI)  │
                          └──────────────────────────┘
Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
Supported data types
| Type | Templates |
|---|---|
| Image | image_classification, object_detection |
| Text / NLP | text_classification |
| Tabular | tabular_classification, tabular_regression |
| Time series | time_series_forecasting, time_to_event_prediction |
Each template is a runnable starting point: copy it, point it at your data, ship it.
Quickstart
1. Install
pip install tracebloc-ingestor
2. Pick a template
cp templates/image_classification/ingestor.py .
Each template builds on the same primitives (BaseIngestor, CSVIngestor, validators) and overrides the parts that vary by data type.
3. Deploy as a Kubernetes Job
The ingestor runs inside your cluster, next to a tracebloc client. The provided Dockerfile and ingestor-job.yaml are the canonical pattern:
docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml
The Job needs these environment variables (set in ingestor-job.yaml):
| Variable | What it is |
|---|---|
| CLIENT_ID, CLIENT_PASSWORD | Tracebloc client credentials |
| CLIENT_PVC | PVC name shared with the client (must match values.yaml) |
| MYSQL_HOST | Hostname of the client's MySQL service |
| SRC_PATH | Where your raw data is mounted in the ingestor pod |
| LABEL_FILE | Path to labels (e.g. Xy_train.csv) |
| TABLE_NAME | Destination table name in the client database |
| TITLE | (optional) Human-readable dataset name |
| LOG_LEVEL | (optional) INFO, WARNING, ERROR |
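A fail-fast check at the top of an ingestor script can surface a misconfigured Job before any data is read. The sketch below is not part of tracebloc-ingestor: `load_job_config` is a hypothetical helper, only the variable names come from the table above, and the INFO default for LOG_LEVEL is an assumption.

```python
import os

# Variable names come from the table above; the helper itself is hypothetical.
REQUIRED = (
    "CLIENT_ID", "CLIENT_PASSWORD", "CLIENT_PVC", "MYSQL_HOST",
    "SRC_PATH", "LABEL_FILE", "TABLE_NAME",
)
OPTIONAL = {"TITLE": None, "LOG_LEVEL": "INFO"}  # defaults are assumptions


def load_job_config(env=None):
    """Return the Job settings as a dict, or raise if a required one is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL.items():
        config[name] = env.get(name) or default
    return config
```

Calling `load_job_config()` with no arguments reads the pod's real environment; passing a plain dict makes the check easy to exercise locally.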
Running under Pod Security Standards (restricted)
If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), pods built from the stock Dockerfile and ingestor-job.yaml will be rejected at admission. Two changes are needed.
Check first:
kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq
Look for pod-security.kubernetes.io/enforce: restricted. If it is absent, the stock files are admitted as-is and you can skip this section.
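For scripted deployments, the same check can be done in code. This is a minimal sketch, assuming the label map printed by the kubectl command above is passed in as JSON text; `enforces_restricted` is a hypothetical helper name, not a tracebloc or Kubernetes API.

```python
import json

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforces_restricted(labels_json):
    """True if the namespace labels (JSON text) enforce the restricted standard."""
    # An empty string means the namespace has no labels at all.
    labels = json.loads(labels_json) if labels_json.strip() else {}
    return labels.get(ENFORCE_LABEL) == "restricted"
```

The function only inspects the enforce label; the audit and warn labels do not block admission, so they are deliberately ignored here.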
1. Dockerfile: drop root. Append before ENTRYPOINT:
# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001
2. ingestor-job.yaml: add a hardened securityContext at both pod level and container level:
spec:
  template:
    spec:
      securityContext:            # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          # ... existing container spec ...
          securityContext:        # container-level
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
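Before applying the Job, it can help to confirm the edited manifest actually carries these settings. The sketch below checks only the four fields shown in the snippet above, in the places the snippet puts them; it is not a full implementation of the restricted profile, and `restricted_violations` is a hypothetical helper. It takes the manifest as an already-parsed dict so the example stays dependency-free.

```python
def restricted_violations(job):
    """List the settings from the snippet above that a Job manifest (dict) lacks."""
    pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext") or {}
    problems = []
    # Pod-level settings.
    if pod_sc.get("runAsNonRoot") is not True:
        problems.append("pod: runAsNonRoot must be true")
    if (pod_sc.get("seccompProfile") or {}).get("type") != "RuntimeDefault":
        problems.append("pod: seccompProfile.type must be RuntimeDefault")
    # Container-level settings, checked for every container in the pod.
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in ((sc.get("capabilities") or {}).get("drop") or []):
            problems.append(f"{name}: capabilities.drop must include ALL")
    return problems
```

An empty list means the four settings are present. The dict can come from any YAML parser, for example PyYAML's `yaml.safe_load` applied to ingestor-job.yaml.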
Writing a custom ingestor
For data that doesn't fit a template, subclass BaseIngestor:
from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()
The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator) and the Database / APIClient helpers. See examples/ for working scripts.
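As an illustration of what the preprocessing step in a custom ingestor's transform method might do, here is a small standalone sketch. It assumes each record is a dict-like row, which this README does not confirm, and `clean_record` is a hypothetical helper rather than part of the package.

```python
def clean_record(record):
    """Return a copy with string values stripped and empty strings set to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            # Whitespace-only strings collapse to "" and then to None.
            value = value.strip() or None
        cleaned[key] = value
    return cleaned
```

Under that assumption, the body of `transform` could simply be `return clean_record(record)`. Keeping the logic in a plain function makes it testable without a cluster or a tracebloc client.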
Prerequisites
- Python 3.8+
- A tracebloc account
- A running tracebloc client on your infrastructure
Links
Platform · Docs · Data preparation guide · Discord
License
Apache 2.0. See LICENSE.
Questions? support@tracebloc.io or open an issue.