ADC toolkit, here to help you.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ADC-Jakob ivan-adc noorment

These details have not been verified by PyPI

Project description

ADC Toolkit

A Python framework for validated data pipelines in data science and ML projects.

Python License Coverage Code style: ruff Type checked: mypy

Overview
Key Features
Installation
Quick Start
Initializing the Catalog
Configuration Files
Using the ValidatedDataCatalog
Data Validation
- Choosing a Validator
Processing Pipeline
Logging
Module Architecture
API Reference
CLI Reference
Development
Project Structure
Limitations
Contributing
License
Authors

Overview

adc-toolkit is a Python framework designed for data science and machine learning projects. It provides a structured approach to data handling with built-in validation, ensuring data quality throughout your pipeline.

The toolkit combines the power of Kedro's DataCatalog for data I/O with automatic schema validation using either Great Expectations or Pandera. This means every time you load or save data, it gets validated against your defined schema automatically.

Core value proposition:

Load data with automatic validation
Save data with automatic validation
No manual validation code required
Schema "freezing" on first run prevents silent data drift

Key Features

ValidatedDataCatalog - Combines Kedro DataCatalog (I/O) with data validators for seamless validated data pipelines
Dual Validator Support - Choose between Great Expectations (GX) or Pandera for schema validation
Auto Schema Detection - Automatically generates and enforces schemas on first data load
Processing Pipeline - Chainable, reusable data transformation steps
Flexible Logging - Unified logging interface with Python logging and Loguru backends
Hydra Configuration - YAML-based hierarchical configuration management
Cloud Support - AWS, GCP, and Azure data context support for Great Expectations
Spark Support - PySpark DataFrame validation (optional)
CLI Tools - Scaffold catalog structure with a single command

Installation

Prerequisites

Python 3.10 or higher (< 3.14)
uv (recommended) or pip

Basic Installation

# Using pip
pip install adc-toolkit

# Using uv (recommended)
uv add adc-toolkit

Optional Dependencies

The toolkit uses optional dependency groups to keep the base installation lightweight. Install only what you need:

Group	Description	Install Command
`kedro`	Kedro DataCatalog for data I/O	`uv add adc-toolkit[kedro]`
`gx`	Great Expectations validation	`uv add adc-toolkit[gx]`
`pandera`	Pandera validation	`uv add adc-toolkit[pandera]`
`eda`	Exploratory data analysis tools	`uv add adc-toolkit[eda]`
`spark`	PySpark support	`uv add adc-toolkit[spark]`
`gcp`	Google Cloud Platform integration	`uv add adc-toolkit[gcp]`
`logging`	Loguru logging backend	`uv add adc-toolkit[logging]`
`preprocessing`	scikit-learn transformations	`uv add adc-toolkit[preprocessing]`

Recommended installation for most use cases:

# Install with Kedro and Great Expectations
uv add adc-toolkit[kedro,gx]

# Or with Kedro and Pandera
uv add adc-toolkit[kedro,pandera]

# Install all optional dependencies
uv add adc-toolkit[kedro,gx,pandera,eda,spark,gcp,logging,preprocessing]

Development Installation

git clone https://github.com/ADC-Consulting/adc-toolkit.git
cd adc-toolkit
uv sync --all-extras

Quick Start

Using the CLI

# Initialize the catalog folder structure
adc-toolkit init-catalog ./config

# This creates:
# config/base/globals.yml
# config/base/catalog.yml
# config/local/credentials.yml

Using Python

from adc_toolkit.data.catalog import ValidatedDataCatalog

# Initialize catalog pointing to your config directory
catalog = ValidatedDataCatalog.in_directory("./config")

# Load data with automatic validation
df = catalog.load("my_dataset.raw")

# Save data with automatic validation
catalog.save("my_dataset.processed", df)

Initializing the Catalog

Before using the ValidatedDataCatalog, you need to set up the configuration folder structure.

Required Folder Structure

config/
├── base/
│   ├── globals.yml      # Global variables (bucket paths, dataset types)
│   └── catalog.yml      # Dataset definitions
└── local/
    └── credentials.yml  # Secrets and credentials (gitignored)

Method 1: Using the CLI (Recommended)

# Initialize in current directory
adc-toolkit init-catalog

# Initialize in a specific path
adc-toolkit init-catalog ./config

# Overwrite existing files
adc-toolkit init-catalog ./config --overwrite

# Skip specific files
adc-toolkit init-catalog ./config --no-credentials

Method 2: Using Python

from adc_toolkit.data.catalogs.kedro import KedroDataCatalog

result = KedroDataCatalog.init_catalog(
    "./config",
    overwrite=False,
    include_globals=True,
    include_catalog=True,
    include_credentials=True,
)

print(f"Created {len(result.created_files)} files")
print(f"Skipped {len(result.skipped_files)} files (already exist)")

Generated Files Description

File	Purpose
`base/globals.yml`	Global variables for templating (bucket names, dataset type mappings, folder paths)
`base/catalog.yml`	Kedro dataset definitions with file paths and types
`local/credentials.yml`	Secrets like API keys and database credentials (auto-added to .gitignore)

Configuration Files

globals.yml

Define global variables that can be referenced in catalog.yml:

# Global configuration for the Kedro data catalog
# Variables are referenced using ${globals:variable_name} syntax

bucket_prefix: "gs" # "gs" for GCP, "s3" for AWS, "abfs" for Azure
bucket_name: "your-project-data-bucket"

datasets:
  pandas:
    csv: "pandas.CSVDataset"
    parquet: "pandas.ParquetDataset"
    bq: "pandas.GBQQueryDataset"
  spark:
    parquet: "spark.ParquetDataset"

folders:
  raw: "raw"
  int: "intermediate"
  pri: "primary"
  fea: "feature"

catalog.yml

Define your datasets with their types and file paths:

# Dataset definitions
# See: https://docs.kedro.org/en/stable/data/data_catalog.html

# Load from local CSV file
my_data.extract:
  type: "${globals:datasets.pandas.csv}"
  filepath: data/raw/my_data.csv

# Load from SQL query (BigQuery example)
my_data.from_sql:
  type: "${globals:datasets.pandas.bq}"
  sql: data/queries/my_data.sql

# Save to cloud storage with versioning
my_data.raw:
  type: "${globals:datasets.pandas.parquet}"
  filepath: "${globals:bucket_prefix}://${globals:bucket_name}/${globals:folders.raw}/my_data.parquet"
  versioned: true

my_data.processed:
  type: "${globals:datasets.pandas.parquet}"
  filepath: "${globals:bucket_prefix}://${globals:bucket_name}/${globals:folders.pri}/my_data.parquet"
  versioned: true

# Pattern-based factory (matches any dataset_name.layer combination)
"{dataset_name}.{layer}@pandas":
  type: "${globals:datasets.pandas.parquet}"
  filepath: "${globals:bucket_prefix}://${globals:bucket_name}/{layer}/{dataset_name}.parquet"
  versioned: true

credentials.yml

Store sensitive information (never commit this file):

# Credentials Configuration
# IMPORTANT: This file should NEVER be committed to version control

gcp:
  client_kwargs:
    project: your-gcp-project-id

aws_s3:
  client_kwargs:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY

database:
  con: postgresql://user:password@host:port/database

Using the ValidatedDataCatalog

Basic Usage

from adc_toolkit.data.catalog import ValidatedDataCatalog

# Initialize with default validators (GX if available, then Pandera)
catalog = ValidatedDataCatalog.in_directory("./config")

# Load data - automatically validated after loading
df = catalog.load("house_prices.raw")

# Save data - automatically validated before saving
catalog.save("house_prices.processed", df)

Specifying a Validator

from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.gx import GXValidator
from adc_toolkit.data.validators.pandera import PanderaValidator

# Use Great Expectations
catalog = ValidatedDataCatalog.in_directory(
    "./config",
    validator_class=GXValidator
)

# Use Pandera
catalog = ValidatedDataCatalog.in_directory(
    "./config",
    validator_class=PanderaValidator
)

Dynamic Query Parameters

For SQL-based datasets, you can pass parameters at load time:

# If your SQL file contains placeholders like {min_price}
df = catalog.load(
    "house_prices.from_sql",
    min_price=100000,
    max_price=500000
)

How Validation Works

First Load: The validator "freezes" the DataFrame schema (columns and types)
Subsequent Loads/Saves: Data is validated against the frozen schema
Schema Drift Detection: If the data structure changes, validation fails with a clear error

Data Validation

Great Expectations (GX)

Great Expectations is the default validator when installed. It provides powerful data validation with automatic expectation suite generation.

How GX Validation Works

On first validation, GX creates an expectation suite named {dataset_name}_suite
The schema is "frozen" with column and type expectations
Every load/save runs a checkpoint to validate the data

Adding Custom Expectations

from adc_toolkit.data.validators.gx import (
    BatchManager,
    ConfigurationBasedExpectationAddition,
)
from adc_toolkit.data.validators.gx.data_context.repo import RepoDataContext

# Load your data
catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("house_prices.raw")

# Create batch manager for adding expectations
data_context = RepoDataContext("./config").create()
batch_manager = BatchManager("house_prices.raw", df, data_context)

# Define custom expectations
expectations = [
    {"expect_column_values_to_not_be_null": {"column": "SalePrice"}},
    {"expect_column_values_to_be_between": {
        "column": "SalePrice",
        "min_value": 0,
        "max_value": 1000000,
    }},
    {"expect_column_values_to_be_unique": {"column": "Id"}},
]

# Add expectations to the suite
expectation_adder = ConfigurationBasedExpectationAddition()
expectation_adder.add_expectations(batch_manager, expectations)

Pandera

Pandera provides DataFrame validation using Python type annotations and is lighter-weight than GX.

How Pandera Validation Works

On first validation, Pandera auto-generates a schema from the DataFrame
The schema is saved as a Python file in config/pandera_schemas/{dataset_name}.py
Every load/save validates against the stored schema

Pandera Folder Structure

config/
├── base/
│   └── ...
├── local/
│   └── ...
└── pandera_schemas/           # Auto-created by Pandera
    ├── house_prices.raw.py
    ├── house_prices.processed.py
    └── ...

Configuring Pandera

from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator
from adc_toolkit.data.validators.pandera.parameters import PanderaParameters

# Disable lazy validation
parameters = PanderaParameters(lazy=False)

# Create validator with custom parameters
validator = PanderaValidator.in_directory("./config", parameters=parameters)

# Create catalog with custom validator
catalog = ValidatedDataCatalog(
    catalog=KedroDataCatalog.in_directory("./config"),
    validator=validator,
)

Choosing a Validator

Which validator you choose is largely a matter of preference. Both Great Expectations and Pandera are mature, well-maintained libraries that will serve you well.

If your project already uses one of them, stick with it. There's no need to switch or maintain two validation systems.

If you're starting fresh and have no preference, we recommend Pandera for most use cases:

Easier to use - Pandera schemas are defined in Python code, making them easier to write, read, and maintain
Script-based validation - Schemas are stored as .py files that you can version control, review in PRs, and modify directly
Lighter weight - Pandera has fewer dependencies and a simpler setup

Great Expectations is a better fit if you need:

JSON/YAML-based expectation suites (useful for non-technical stakeholders)
Built-in data documentation and profiling features
Integration with the broader GX ecosystem (Data Docs, checkpoints, etc.)

Processing Pipeline

The processing module provides chainable data transformation steps.

Basic Pipeline Usage

from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import (
    fill_missing_values,
    remove_duplicates,
    make_columns_snake_case,
    select_columns,
    filter_rows,
)

# Create and chain pipeline steps
pipeline = ProcessingPipeline()
pipeline = (
    pipeline
    .add(remove_duplicates, subset=["id"])
    .add(fill_missing_values, method="median", columns=["price", "area"])
    .add(make_columns_snake_case)
    .add(filter_rows, condition=lambda df: df["price"] > 0)
    .add(select_columns, columns=["id", "price", "area", "bedrooms"])
)

# Run the pipeline
df_processed = pipeline.run(df)

Available Processing Steps

Function	Description	Key Parameters
`remove_duplicates`	Remove duplicate rows	`subset`, `keep`
`fill_missing_values`	Fill NaN values	`method` (mean/median/mode/constant/interpolate), `columns`
`make_columns_snake_case`	Normalize column names to snake_case	-
`filter_rows`	Filter rows by condition	`condition` (callable)
`select_columns`	Select specific columns	`columns` (list)
`scale_data`	Scale numeric columns	`columns`, `method` (minmax/standard)
`encode_categorical`	Encode categorical variables	`columns`, `method` (onehot/label)
`divide_one_column_by_another`	Create ratio column	`numerator`, `denominator`, `new_column_name`
`group_and_aggregate`	Group by and aggregate	`group_by_columns`, `agg_funcs`

Custom Steps

The pipeline accepts any callable that takes a DataFrame as its first argument and returns a DataFrame. This means you can easily create custom steps or use pandas functions directly:

import pandas as pd
from adc_toolkit.processing.pipeline import ProcessingPipeline

# Define a custom processing function
def add_price_category(df: pd.DataFrame, threshold: int = 200000) -> pd.DataFrame:
    df = df.copy()
    df["price_category"] = df["price"].apply(lambda x: "high" if x > threshold else "low")
    return df

# Create pipeline with custom steps and pandas functions
pipeline = ProcessingPipeline()
pipeline = (
    pipeline
    .add(add_price_category, threshold=300000)       # Custom function with kwargs
    .add(lambda df: df.dropna(subset=["price"]))     # Lambda wrapping pandas method
    .add(pd.merge, right=other_df, on="id")          # Pandas function (df is passed as first arg)
    .add(lambda df: df.reset_index(drop=True))       # Lambda for quick transformations
)

df_processed = pipeline.run(df)

Any function with the signature (df: DataFrame, **kwargs) -> DataFrame works as a pipeline step. The DataFrame is always passed as the first positional argument.

Logging

The toolkit provides a unified logging interface with support for multiple backends.

Basic Usage

from adc_toolkit.logger import Logger

# Create logger
logger = Logger()

# Log messages
logger.debug("Debug message")
logger.info("Info message")
logger.warning("Warning message")
logger.error("Error message")

Configuration

from adc_toolkit.logger import Logger

# Set global log level
Logger.set_level("debug")  # or "info", "warning", "error"

# Configure format options
Logger.set_options(
    use_log_filepath=True,
    log_filepath="./logs/app.log"
)

# Reset to defaults
Logger.reset_options()

Backend Selection

The logger automatically selects the best available backend:

Loguru (if installed via adc-toolkit[logging]) - Colorful, feature-rich logging
Python logging (fallback) - Standard library logging

Module Architecture

Module	Description	Key Classes/Functions
`adc_toolkit.data`	Validated data catalog	`ValidatedDataCatalog`
`adc_toolkit.data.catalogs.kedro`	Kedro integration	`KedroDataCatalog`
`adc_toolkit.data.validators.gx`	Great Expectations	`GXValidator`, `BatchManager`
`adc_toolkit.data.validators.pandera`	Pandera validation	`PanderaValidator`
`adc_toolkit.processing`	Data pipelines	`ProcessingPipeline`, `PipelineStep`
`adc_toolkit.logger`	Logging interface	`Logger`
`adc_toolkit.eda`	Exploratory analysis	`TimeSeries` (in development)
`adc_toolkit.configuration`	Hydra configs	Base YAML templates

Architecture Diagram

                    ValidatedDataCatalog
                            |
            +---------------+---------------+
            |                               |
       DataCatalog                    DataValidator
       (Kedro I/O)                   (Validation)
            |                               |
     +------+------+              +---------+---------+
     |             |              |         |         |
   Local        Cloud           GX      Pandera   NoValidator
   Files       Storage
     |             |
   CSV/         GCS/S3/
  Parquet       Azure

API Reference

Full API documentation is available at: adc-toolkit API Reference

CLI Reference

adc-toolkit --version                    # Show version
adc-toolkit --help                       # Show help

adc-toolkit init-catalog [PATH]          # Initialize catalog structure
    --overwrite                          # Overwrite existing files
    --no-globals                         # Skip globals.yml
    --no-catalog                         # Skip catalog.yml
    --no-credentials                     # Skip credentials.yml

Examples

# Initialize in current directory
adc-toolkit init-catalog

# Initialize in specific path
adc-toolkit init-catalog ./my_project/config

# Reinitialize and overwrite existing files
adc-toolkit init-catalog ./config --overwrite

# Initialize without credentials file
adc-toolkit init-catalog ./config --no-credentials

Development

Prerequisites

Python 3.10+
uv package manager
Java 17 (for Spark tests only)

Setup

git clone https://github.com/ADC-Consulting/adc-toolkit.git
cd adc-toolkit
uv sync --all-extras

Running Tests

# Run all tests
make test

# Run tests with coverage report
make coverage

# Run specific test file
uv run pytest src/adc_toolkit/data/tests/test_catalog.py

# Run specific test function
uv run pytest src/adc_toolkit/data/tests/test_catalog.py::test_function_name

# CI/CD command (90% coverage required)
uv run pytest --cov=src/adc_toolkit/ --cov-fail-under=90

Code Quality

# Run all linters and pre-commit hooks
make lint

# Or directly
uv run pre-commit run --all --all-files

Project Structure

adc-toolkit/
├── src/adc_toolkit/
│   ├── __init__.py
│   ├── cli/                    # Command-line interface
│   ├── data/                   # Data catalog and validation
│   │   ├── catalog.py          # ValidatedDataCatalog
│   │   ├── catalogs/kedro/     # Kedro integration
│   │   └── validators/         # GX, Pandera, NoValidator
│   ├── eda/                    # Exploratory data analysis
│   ├── logger/                 # Logging infrastructure
│   ├── processing/             # Data transformation pipelines
│   ├── configuration/          # Hydra config templates
│   └── utils/                  # Utility functions
├── examples/
│   ├── configs/                # Example configurations
│   ├── notebooks/              # Example Jupyter notebooks
│   └── scripts/                # Example scripts
├── pyproject.toml
└── README.md

Limitations

EDA Module: The exploratory data analysis module is partially implemented and shows a FutureWarning on import
Great Expectations Version: GX is constrained to version <1.0.0 due to API changes in GX 1.0
Kedro Required: The kedro optional dependency is required for the ValidatedDataCatalog to function

Contributing

Code Style Requirements

Ruff for linting and formatting (line length: 120)
mypy for type checking (strict mode)
90%+ test coverage enforced in CI/CD
Google-style docstrings required for all public functions

Pre-commit Hooks

The following checks run automatically on commit:

Ruff linting with auto-fix
autoflake (removes unused imports)
mypy type checking
isort import sorting
black-jupyter formatting
nbstripout (removes notebook outputs)

Pull Request Process

Fork the repository
Create a feature branch
Make your changes with tests
Ensure make lint and make test pass
Submit a pull request

License

MIT License - see LICENCE file for details.

Authors

Ivan Konovalov (ivan.k@adc-consulting.com)
Jakob Damen (jakob.d@adc-consulting.com)
Marc Nientker (marc@adc-consulting.com)
Martijn Blom (martijn.b@adc-consulting.com)
Niek van der Laan (niek.vdl@adc-consulting.com)
Nils Jaranovs (nils.j@adc-consulting.com)
Rutger Lit (rutger.l@adc-consulting.com)
Noortje Mentink (noortje.m@adc-consulting.com)
Ron Kremer (ron.k@adc-consulting.com)

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ADC-Jakob ivan-adc noorment

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

1.1.0

Jan 26, 2026

1.0.2

Jan 12, 2026

1.0.1

Apr 15, 2025

1.0.0

Apr 2, 2025

0.0.1a3 pre-release

Apr 2, 2025

0.0.1a1 pre-release

Mar 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adc_toolkit-1.1.0.tar.gz (585.2 kB view details)

Uploaded Jan 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

adc_toolkit-1.1.0-py3-none-any.whl (332.5 kB view details)

Uploaded Jan 26, 2026 Python 3

File details

Details for the file adc_toolkit-1.1.0.tar.gz.

File metadata

Download URL: adc_toolkit-1.1.0.tar.gz
Upload date: Jan 26, 2026
Size: 585.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adc_toolkit-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`63af5c0d00189ee47a82ccc103d21484603ff7ee89131f65db7182b46e933166`
MD5	`fae72d4b4d457370524640eea7d9e459`
BLAKE2b-256	`94d9aef5a2e71f8a74b5995b5509c677b359231939878504c0299da427213e88`

See more details on using hashes here.

Provenance

The following attestation bundles were made for adc_toolkit-1.1.0.tar.gz:

Publisher: release.yaml on ADC-Consulting/adc-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adc_toolkit-1.1.0.tar.gz
- Subject digest: 63af5c0d00189ee47a82ccc103d21484603ff7ee89131f65db7182b46e933166
- Sigstore transparency entry: 855166448
- Sigstore integration time: Jan 26, 2026
Source repository:
- Permalink: ADC-Consulting/adc-toolkit@abf4fa2433e4a23c17006fd558e2c6559b8adf9c
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/ADC-Consulting
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@abf4fa2433e4a23c17006fd558e2c6559b8adf9c
- Trigger Event: push

File details

Details for the file adc_toolkit-1.1.0-py3-none-any.whl.

File metadata

Download URL: adc_toolkit-1.1.0-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 332.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adc_toolkit-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`df5c03a04da30d43583d0b62f2e544de4134b56281b1e9069e75a336b5e42939`
MD5	`8e7b9aa69743440d35def711a587dbd8`
BLAKE2b-256	`bfa306d92eea19e2c9b4bd4dc058fc5f70fc300af5612f7f4250fde9fe1aaaa9`

See more details on using hashes here.

Provenance

The following attestation bundles were made for adc_toolkit-1.1.0-py3-none-any.whl:

Publisher: release.yaml on ADC-Consulting/adc-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adc_toolkit-1.1.0-py3-none-any.whl
- Subject digest: df5c03a04da30d43583d0b62f2e544de4134b56281b1e9069e75a336b5e42939
- Sigstore transparency entry: 855166450
- Sigstore integration time: Jan 26, 2026
Source repository:
- Permalink: ADC-Consulting/adc-toolkit@abf4fa2433e4a23c17006fd558e2c6559b8adf9c
- Branch / Tag: refs/tags/1.1.0
- Owner: https://github.com/ADC-Consulting
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@abf4fa2433e4a23c17006fd558e2c6559b8adf9c
- Trigger Event: push

adc-toolkit 1.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Project description

ADC Toolkit

Table of Contents

Overview

Key Features

Installation

Prerequisites

Basic Installation

Optional Dependencies

Development Installation

Quick Start

Using the CLI

Using Python

Initializing the Catalog

Required Folder Structure

Method 1: Using the CLI (Recommended)

Method 2: Using Python

Generated Files Description

Configuration Files

globals.yml

catalog.yml

credentials.yml

Using the ValidatedDataCatalog

Basic Usage

Specifying a Validator

Dynamic Query Parameters

How Validation Works

Data Validation

Great Expectations (GX)

How GX Validation Works

Adding Custom Expectations

Pandera

How Pandera Validation Works

Pandera Folder Structure

Configuring Pandera

Choosing a Validator

Processing Pipeline

Basic Pipeline Usage

Available Processing Steps

Custom Steps

Logging

Basic Usage

Configuration

Backend Selection

Module Architecture

Architecture Diagram

API Reference

CLI Reference

Examples

Development

Prerequisites

Setup

Running Tests

Code Quality

Project Structure

Limitations

Contributing

Code Style Requirements

Pre-commit Hooks

Pull Request Process

License

Authors

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution