A library for validating and comparing datasets in Spark using PySpark.

spark-data-test

Overview

spark-data-test provides utilities to compare two Spark DataFrames or datasets, generating detailed reports on matches, mismatches, and missing records. It is designed for data validation, ETL testing, and regression testing in Spark pipelines.

Configuration Dataclasses

Below are the main dataclasses used for configuration in spark-data-test. You can use these directly in Python or as a reference for your YAML/JSON configs.

DatasetParams

Defines parameters for a single dataset comparison.

from dataclasses import dataclass, field

@dataclass
class DatasetParams:
    dataset_name: str                  # Name of the dataset/table
    primary_keys: list                 # List of primary key column names
    select_cols: list = field(default_factory=lambda: ["*"])  # Columns to select (default: all)
    drop_cols: list = field(default_factory=list)             # Columns to drop (default: none)
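To illustrate how select_cols and drop_cols might combine, here is a minimal sketch. It redeclares DatasetParams locally so that it runs without the package installed. The resolve_columns helper is hypothetical and not part of spark-data-test, and the order it applies (select first, then drop) is an assumption; the library's actual behavior may differ.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetParams:
    # Local copy of the dataclass shown above, so the sketch is self-contained.
    dataset_name: str
    primary_keys: list
    select_cols: list = field(default_factory=lambda: ["*"])
    drop_cols: list = field(default_factory=list)


def resolve_columns(available_cols, params):
    """Hypothetical helper: columns that would take part in a comparison.

    Assumes "*" means "all available columns" and that drop_cols is applied
    after select_cols.
    """
    if params.select_cols == ["*"]:
        selected = list(available_cols)
    else:
        selected = [c for c in available_cols if c in params.select_cols]
    return [c for c in selected if c not in params.drop_cols]


params = DatasetParams(
    dataset_name="my_table",
    primary_keys=["id"],
    drop_cols=["updated_at"],  # illustrative column name
)
columns = resolve_columns(["id", "colA", "colB", "updated_at"], params)
print(columns)
```

Because the list defaults use default_factory, each DatasetParams instance gets its own select_cols and drop_cols lists rather than sharing one mutable default.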

DataframeConfig

Defines how to read a DataFrame from storage.

from dataclasses import dataclass, field

@dataclass
class DataframeConfig:
    path: str                          # Path to the data (e.g., file or table)
    file_format: str = "parquet"       # File format (parquet, csv, etc.)
    spark_options: dict = field(default_factory=dict)  # Spark read options (e.g., {"header": "true"})
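The three fields map naturally onto Spark's DataFrameReader. The sketch below shows one way such a config could be turned into a read call; read_dataframe is a hypothetical helper written for illustration, not the function the library uses internally, and the CSV path is made up.

```python
from dataclasses import dataclass, field


@dataclass
class DataframeConfig:
    # Local copy of the dataclass shown above, so the sketch is self-contained.
    path: str
    file_format: str = "parquet"
    spark_options: dict = field(default_factory=dict)


def read_dataframe(spark, config):
    """Hypothetical helper: read the DataFrame described by a DataframeConfig.

    `spark` is expected to be an active SparkSession. Only standard
    DataFrameReader methods are used: format(), options(), and load().
    """
    return (
        spark.read.format(config.file_format)
        .options(**config.spark_options)
        .load(config.path)
    )


# Illustrative config for a CSV source with a header row.
csv_config = DataframeConfig(
    path="/data/source/table1.csv",
    file_format="csv",
    spark_options={"header": "true", "inferSchema": "true"},
)

# With an active SparkSession named `spark`:
# source_df = read_dataframe(spark, csv_config)
```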

OutputConfig

Defines output options for writing comparison results.

from dataclasses import dataclass, field

@dataclass
class OutputConfig:
    output_dir: str                    # Directory to write output files
    output_file_format: str = "parquet" # Output file format
    spark_options: dict = field(default_factory=dict)  # Spark write options
    no_of_partitions: int = -1         # Number of partitions for output (-1 for default)

DatasetConfig

Groups together the configs for a single dataset comparison.

from dataclasses import dataclass

@dataclass
class DatasetConfig:
    params: DatasetParams              # Dataset parameters
    source_config: DataframeConfig     # Source DataFrame config
    target_config: DataframeConfig     # Target DataFrame config

ComparisonJobConfig

Top-level config for a comparison job (can include multiple datasets).

from dataclasses import dataclass

@dataclass
class ComparisonJobConfig:
    job_name: str                      # Name of the comparison job
    dataset_configs: list[DatasetConfig] # List of dataset configs to compare
    output_config: OutputConfig        # Output config for all results
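The dataclasses nest in the same shape as a YAML/JSON config. As a rough sketch of that correspondence, the code below builds a ComparisonJobConfig from a plain dict. The dataclasses are redeclared locally so the sketch runs on its own, and job_config_from_dict is a hypothetical helper, not a function provided by spark-data-test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetParams:
    dataset_name: str
    primary_keys: list
    select_cols: list = field(default_factory=lambda: ["*"])
    drop_cols: list = field(default_factory=list)


@dataclass
class DataframeConfig:
    path: str
    file_format: str = "parquet"
    spark_options: dict = field(default_factory=dict)


@dataclass
class OutputConfig:
    output_dir: str
    output_file_format: str = "parquet"
    spark_options: dict = field(default_factory=dict)
    no_of_partitions: int = -1


@dataclass
class DatasetConfig:
    params: DatasetParams
    source_config: DataframeConfig
    target_config: DataframeConfig


@dataclass
class ComparisonJobConfig:
    job_name: str
    dataset_configs: List[DatasetConfig]
    output_config: OutputConfig


def job_config_from_dict(raw):
    """Hypothetical helper: convert a nested dict into the dataclasses above."""
    dataset_configs = [
        DatasetConfig(
            params=DatasetParams(**d["params"]),
            source_config=DataframeConfig(**d["source_config"]),
            target_config=DataframeConfig(**d["target_config"]),
        )
        for d in raw["dataset_configs"]
    ]
    return ComparisonJobConfig(
        job_name=raw["job_name"],
        dataset_configs=dataset_configs,
        output_config=OutputConfig(**raw["output_config"]),
    )


raw_config = {
    "job_name": "multi_dataset_job",
    "dataset_configs": [
        {
            "params": {"dataset_name": "table1", "primary_keys": ["id"]},
            "source_config": {"path": "/data/source/table1"},
            "target_config": {"path": "/data/target/table1"},
        }
    ],
    "output_config": {"output_dir": "/tmp/comparison_results"},
}
job_config = job_config_from_dict(raw_config)
print(job_config.dataset_configs[0].params.primary_keys)
```

Keys omitted from the dict fall back to the dataclass defaults, so a minimal config only needs the required fields.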

Usage

1. Compare DataFrames Directly

Use run_comparison_job_from_dfs to compare two Spark DataFrames directly.

Function Signature

run_comparison_job_from_dfs(
    spark: SparkSession,
    job_name: str,
    source_df: DataFrame,
    target_df: DataFrame,
    params: DatasetParams,
    output_config: OutputConfig
)

Parameters

  • spark: The active SparkSession.
  • job_name: Name for the comparison job (used in output paths).
  • source_df: Source DataFrame.
  • target_df: Target DataFrame.
  • params: An instance of DatasetParams specifying dataset name, primary keys, columns to select/drop, etc.
  • output_config: An instance of OutputConfig specifying output directory, file format, Spark write options, etc.

Example

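from pyspark.sql import SparkSession

from spark_data_test.jobs.comparison_job import run_comparison_job_from_dfs
from spark_data_test.entities.config import DatasetParams, OutputConfig

# Reuse the active SparkSession, or create one if none exists.
spark = SparkSession.builder.appName("spark-data-test-example").getOrCreate()

# Two small sample DataFrames that share the primary key column "id".
df1 = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "colA"])
df2 = spark.createDataFrame([(1, "foo"), (2, "baz")], ["id", "colA"])

params = DatasetParams(
    dataset_name="my_table",
    primary_keys=["id"],
)
output_config = OutputConfig(
    output_dir="/tmp/comparison_results",
)

run_comparison_job_from_dfs(spark, "my_job", df1, df2, params, output_config)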

2. Compare Using Config (YAML/JSON/dict)

Use run_comparison_job to compare multiple datasets using a configuration dictionary or object.

Function Signature

run_comparison_job(
    spark: SparkSession,
    config: Union[dict, ComparisonJobConfig]
)

Parameters

  • spark: The active SparkSession.
  • config: A dictionary or ComparisonJobConfig instance describing one or more datasets to compare, their source/target configs, and output config.

Example

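from pyspark.sql import SparkSession

from spark_data_test.jobs.comparison_job import run_comparison_job

# Reuse the active SparkSession, or create one if none exists.
spark = SparkSession.builder.appName("spark-data-test-example").getOrCreate()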

config = {
    "job_name": "multi_dataset_job",
    "dataset_configs": [
        {
            "params": {
                "dataset_name": "table1",
                "primary_keys": ["id"]
            },
            "source_config": {
                "path": "/data/source/table1",
                "file_format": "parquet"
            },
            "target_config": {
                "path": "/data/target/table1",
                "file_format": "parquet"
            }
        }
    ],
    "output_config": {
        "output_dir": "/tmp/comparison_results"
    }
}

run_comparison_job(spark, config)
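Since run_comparison_job accepts a plain dictionary, a config kept in a JSON file can be loaded with the standard library and passed through. The sketch below writes the same config to a temporary file and reads it back; the file name comparison_job.json is an arbitrary choice for illustration, and the final call is left as a comment because it needs the package and a SparkSession.

```python
import json
import os
import tempfile

config = {
    "job_name": "multi_dataset_job",
    "dataset_configs": [
        {
            "params": {"dataset_name": "table1", "primary_keys": ["id"]},
            "source_config": {"path": "/data/source/table1", "file_format": "parquet"},
            "target_config": {"path": "/data/target/table1", "file_format": "parquet"},
        }
    ],
    "output_config": {"output_dir": "/tmp/comparison_results"},
}

# Write the config to a JSON file (hypothetical file name).
config_path = os.path.join(tempfile.mkdtemp(), "comparison_job.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

# Read it back as a plain dict, ready to hand to the comparison job.
with open(config_path, encoding="utf-8") as f:
    loaded_config = json.load(f)

# With the package installed and an active SparkSession named `spark`:
# run_comparison_job(spark, loaded_config)
```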

Output Files

After running a comparison job, the following files/directories are generated under the specified output_dir and job_name:

overall_test_report

Summary DataFrame with row counts, matched counts, duplicate counts, missing rows, and test status for each dataset.

| dataset_name | count | matched_count | duplicate_count | missing_rows | test_status |
|---|---|---|---|---|---|
| table1 | {"source": 100, "target": 98} | 97 | {"source": 0, "target": 1} | {"source": 1, "target": 3} | PASSED |

col_lvl_test_report

Column-level report showing the count of unmatched values for each non-key column.

| dataset_name | column_name | unmatched_rows_count |
|---|---|---|
| table1 | colA | 2 |
| table1 | colB | 0 |
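To make the meaning of unmatched_rows_count concrete, here is a plain-Python sketch of a per-column mismatch count over rows joined on a primary key. It is an illustration only, not the library's Spark implementation. It assumes that only keys present on both sides are compared, and the sample values are made up.

```python
def unmatched_counts(source_rows, target_rows, key, columns):
    """Count, per column, the keys present on both sides whose values differ."""
    source_by_key = {row[key]: row for row in source_rows}
    target_by_key = {row[key]: row for row in target_rows}
    # Assumption: rows missing on either side are not counted as mismatches.
    common_keys = source_by_key.keys() & target_by_key.keys()
    return {
        col: sum(
            1
            for k in common_keys
            if source_by_key[k][col] != target_by_key[k][col]
        )
        for col in columns
    }


source = [
    {"id": 1, "colA": "x", "colB": 10},
    {"id": 5, "colA": "foo", "colB": 20},
    {"id": 8, "colA": "baz", "colB": 30},
]
target = [
    {"id": 1, "colA": "x", "colB": 10},
    {"id": 5, "colA": "bar", "colB": 20},
    {"id": 8, "colA": "qux", "colB": 30},
]

counts = unmatched_counts(source, target, "id", ["colA", "colB"])
print(counts)  # {'colA': 2, 'colB': 0}
```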

row_lvl_test_report

Row-level report with primary keys, duplicate count, missing row status, and match status for each row.

| dataset_name | id | duplicate_count | missing_row_status | all_rows_matched |
|---|---|---|---|---|
| table1 | 1 | 0 | PRESENT_IN_BOTH | true |
| table1 | 2 | 0 | MISSING_AT_TARGET | false |
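The missing_row_status values can be understood as a set comparison of primary keys. The sketch below classifies keys in plain Python. PRESENT_IN_BOTH and MISSING_AT_TARGET appear in the report above; MISSING_AT_SOURCE is assumed here by symmetry and may not be the exact label the library writes.

```python
def row_statuses(source_keys, target_keys):
    """Classify each primary key by where it appears."""
    source_set = set(source_keys)
    target_set = set(target_keys)
    statuses = {}
    for k in sorted(source_set | target_set):
        if k in source_set and k in target_set:
            statuses[k] = "PRESENT_IN_BOTH"
        elif k in source_set:
            statuses[k] = "MISSING_AT_TARGET"
        else:
            # Assumed label, mirroring MISSING_AT_TARGET.
            statuses[k] = "MISSING_AT_SOURCE"
    return statuses


statuses = row_statuses(source_keys=[1, 2], target_keys=[1, 3])
print(statuses)
```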

unmatched_rows/

Directory containing one file per column with all rows where that column did not match between source and target.

Example for unmatched_rows/colA:

| dataset_name | id | colA_src | colA_target |
|---|---|---|---|
| table1 | 5 | foo | bar |
| table1 | 8 | baz | qux |

All outputs are written in the format specified by output_file_format (default: parquet).
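Once a job has finished, the reports can be read back with Spark for inspection. The sketch below only builds candidate paths. It assumes a layout of output_dir/job_name/report_name, which this README does not spell out, so check the actual directory structure before relying on it. The report_path helper is hypothetical.

```python
import posixpath

REPORT_NAMES = (
    "overall_test_report",
    "col_lvl_test_report",
    "row_lvl_test_report",
)


def report_path(output_dir, job_name, report_name):
    """Hypothetical helper: assumed location of a report directory."""
    return posixpath.join(output_dir, job_name, report_name)


paths = {
    name: report_path("/tmp/comparison_results", "my_job", name)
    for name in REPORT_NAMES
}
for name, path in paths.items():
    print(name, "->", path)

# With an active SparkSession named `spark` and the default parquet output:
# spark.read.format("parquet").load(paths["overall_test_report"]).show()
```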


Notes

  • The package requires PySpark and is intended for use in Spark environments.
  • For more details on configuration options, see the entities/config.py dataclasses.
