Dataform Unit Tests
This Python package is designed to be executed as part of GitHub Actions CI/CD for Dataform-enabled GCP projects to perform unit tests on Dataform models:
- Asserts that each defined unit test validates the behaviour of the relevant Dataform model.
- Reports any issues identified in your Dataform code.
- Fails the CI/CD pipeline if any issues are detected.
Why unit test your Dataform Model?
Unit tests are a key aspect of any software engineering practice, including data engineering. They validate that your code behaves the way you intend it to: you define the cases your code should handle and check whether it actually does. This lets you find bugs more quickly and, more importantly, in isolation in a non-production environment.
Dataform Unit Tests lets you do just that: you define unit tests in JSON, and it fails your GitHub Actions CI/CD workflow if any test fails. In principle it can be used in any CI/CD pipeline, as long as it is executed as a module.
This helps you build more robust data models and transformations before releasing any models into production for anyone to consume.
How does it work?
It's designed to be executed as a Python module (i.e. python3 -m dataform_unit_testing) within a step of your GitHub Actions workflow. You set two OS environment variables: one containing your Dataform repository's latest compilation result, and one containing the ID of the GCP project your Dataform repository exists in. The module then discovers any unit tests you have written and executes them one by one.
It reports in the GitHub Actions log whether any of your tests failed and, if so, why. If any test fails, it fails the GitHub Actions workflow; otherwise the workflow continues on to the next steps (whatever they may be).
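The contract described above can be pictured with a minimal sketch. The two environment variable names come from this guide; the function itself is an illustration of the handshake, not the package's actual code:

```python
import os
import sys


def read_required_config() -> dict:
    """Read the two environment variables the module expects.

    Exits non-zero if either is missing, which is what makes the
    surrounding GitHub Actions step (and hence the workflow) fail.
    """
    config = {}
    for var in ("COMPILATION_RESULT", "GCP_PROJECT_ID"):
        value = os.environ.get(var)
        if not value:
            print(f"Missing required environment variable: {var}", file=sys.stderr)
            sys.exit(1)
        config[var] = value
    return config
```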
User Guide
Pre-requisites
- You need a GCP Service Account to authenticate to GCP in your GitHub Actions workflow (not covered by this user guide).
- The same GCP Service Account needs the BigQuery Data Viewer, BigQuery Job User, BigQuery Resource Viewer, and Dataform Editor IAM roles in the same GCP project as your Dataform repository.
- You have experience with the Dataform API and are able to generate a compilation result through it based on your latest Dataform workspace (also not covered by this user guide; please refer to the official Google Dataform API documentation).
- You have a GitHub Actions workflow that executes on the on: pull_request event.
Installing Dataform Unit Tests
In your GitHub Actions workflow, you need to have a step to install the Python package, something that looks like this:
```yaml
- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.11'
    cache: 'pip'
- name: Install Dataform Unit Tests
  run: pip install dataform-unit-tests
```
Using Dataform Unit Tests in GitHub Actions
After installing the package, you need to have a step to execute the package as a module, which will look something like this:
```yaml
- name: Execute Dataform Unit Tests
  run: python -m dataform_unit_testing
  env:
    COMPILATION_RESULT: ${{ steps.compile_dataform.outputs.COMPILATION_RESULT }}
    GCP_PROJECT_ID: <your-gcp-project-id>
```
COMPILATION_RESULT should be set to the compilation result string output by an earlier step (please refer to the Pre-requisites section). The format should look something like projects/<your-gcp-project-id>/locations/<your-dataform-gcp-region>/repositories/<your-dataform-repo-name>/compilationResults/4a4e7155-d24c-4643-8a2e-623a78cb2308. GCP_PROJECT_ID should be set to the ID of the GCP project your Dataform repository exists in.
The above code sample will execute all unit tests you have written. If one or more unit tests fail, this step will fail and your GitHub Actions workflow will fail. Otherwise, it will pass and the rest of your workflow will continue.
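If you want to fail fast on a malformed compilation result string, its resource-name shape can be sanity-checked before the step runs. The regex below is an illustration derived from the format shown above; it is not something the package provides:

```python
import re

# Matches: projects/<project>/locations/<region>/repositories/<repo>/compilationResults/<uuid>
COMPILATION_RESULT_PATTERN = re.compile(
    r"^projects/[^/]+"
    r"/locations/[^/]+"
    r"/repositories/[^/]+"
    r"/compilationResults/[0-9a-f-]+$"
)


def looks_like_compilation_result(value: str) -> bool:
    """Return True if the string matches the Dataform resource-name format."""
    return COMPILATION_RESULT_PATTERN.match(value) is not None
```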
Your GitHub Actions YAML should look something like this:
```yaml
name: Dataform CICD Checks
on:
  push:
    branches:
      - '*'
      - '!master'
  pull_request:
    branches:
      - '*'
      - '!master'
jobs:
  check-branch:
    runs-on: ubuntu-latest
    environment: preview
    concurrency: preview
    steps:
      - name: Checkout code into workspace directory
        uses: actions/checkout@v4
        with:
          fetch-depth: 10
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: '${{ secrets.SERVICE_ACCOUNT_KEY }}'
      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v2
      - name: Change GCP project
        run: gcloud config set project <your-gcp-project-id>
      - name: Compile Dataform
        id: compile_dataform
        run: <whatever code compiles your Dataform>
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install Dataform Unit Tests
        run: pip install dataform-unit-tests
      - name: Execute Dataform Unit Tests
        run: python -m dataform_unit_testing
        env:
          COMPILATION_RESULT: ${{ steps.compile_dataform.outputs.COMPILATION_RESULT }}
          GCP_PROJECT_ID: <your-gcp-project-id>
```
How to write unit tests
All tests are written in JSON and have the following structure. The file name should be test_<your_name_of_choice>.json: any file with a test_ prefix and a .json extension is "discoverable" by the Python package and will be executed. You write your tests within the Dataform UI and can place them in any directory you like.
```json
{
  "name": "The name of your unit test",
  "model_to_test": "schema.your_dataform_model",
  "input_data": {
    "schema.input_table_1": [
      {
        "col_a": "col_a_value_1",
        "col_b": "col_b_value_1",
        "col_c": "col_c_value_1"
      },
      {
        "col_a": "col_a_value_2",
        "col_b": "col_b_value_2",
        "col_c": "col_c_value_2"
      }
    ],
    "schema.input_table_2": [
      {
        "col_x": "col_x_value_1",
        "col_y": "col_y_value_1",
        "col_z": "col_z_value_1"
      },
      {
        "col_x": "col_x_value_2",
        "col_y": "col_y_value_2",
        "col_z": "col_z_value_2"
      }
    ]
  },
  "expected_output": [
    {
      "col_a": "col_a_value_1",
      "col_b": "col_b_value_1",
      "col_c": "col_c_value_1",
      "col_x": "col_x_value_1",
      "col_y": "col_y_value_1",
      "col_z": "col_z_value_1"
    },
    {
      "col_a": "col_a_value_2",
      "col_b": "col_b_value_2",
      "col_c": "col_c_value_2",
      "col_x": "col_x_value_2",
      "col_y": "col_y_value_2",
      "col_z": "col_z_value_2"
    }
  ]
}
```
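Before writing many tests, it can help to lint a test file's structure locally. The sketch below checks the required top-level fields named in the structure above; the validation function itself is hypothetical and not part of the package:

```python
import json

# Top-level fields taken from the test structure described in this guide.
REQUIRED_KEYS = {"name", "model_to_test", "input_data", "expected_output"}


def validate_test_file(text: str) -> list:
    """Parse a unit-test JSON document and return a list of problems (empty = OK)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not isinstance(doc.get("input_data"), dict):
        problems.append("input_data must be an object mapping table names to row lists")
    if not isinstance(doc.get("expected_output"), list):
        problems.append("expected_output must be a list of row objects")
    return problems
```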
In practice, if you have a Dataform model called staging.customer that looks like this:
```sqlx
config {
  type: "table",
  schema: "staging",
  description: "A table containing customer data"
}

SELECT
  c.customer_id,
  c.customer_first_name,
  c.customer_last_name,
  a.address_1,
  a.address_2,
  a.city,
  a.post_code
FROM ${ref("customer", "customer_personal_details")} AS c
INNER JOIN ${ref("customer", "customer_address")} AS a
  ON a.customer_id = c.customer_id
```
You could write a unit test, called test_staging_customer.json, to verify that the right address is associated with the right customer.
```json
{
  "name": "Test Staging Customer",
  "description": "A ",
  "model_to_test": "staging.customer",
  "input_data": {
    "customer.customer_personal_details": [
      {
        "customer_id": "1",
        "customer_first_name": "John",
        "customer_last_name": "Doe"
      },
      {
        "customer_id": "2",
        "customer_first_name": "Jane",
        "customer_last_name": "Doe"
      }
    ],
    "customer.customer_address": [
      {
        "customer_id": "1",
        "address_1": "123 Fake Street",
        "address_2": "Fake Town",
        "city": "Fake City",
        "post_code": "F4 K3"
      },
      {
        "customer_id": "2",
        "address_1": "456 Another Fake Street",
        "address_2": "Another Fake Town",
        "city": "Another Fake City",
        "post_code": "FF4 KK3"
      }
    ]
  },
  "expected_output": [
    {
      "customer_id": "1",
      "customer_first_name": "John",
      "customer_last_name": "Doe",
      "address_1": "123 Fake Street",
      "address_2": "Fake Town",
      "city": "Fake City",
      "post_code": "F4 K3"
    },
    {
      "customer_id": "2",
      "customer_first_name": "Jane",
      "customer_last_name": "Doe",
      "address_1": "456 Another Fake Street",
      "address_2": "Another Fake Town",
      "city": "Another Fake City",
      "post_code": "FF4 KK3"
    }
  ]
}
```
The above unit test mocks the two input tables used in staging.customer under the input_data section:
- customer.customer_personal_details
- customer.customer_address
Each mocked table consists of 2 rows of test data. These mocked tables are used in place of the real tables your Dataform model reads from.
The expected_output section defines what you expect the Dataform model to produce; in this case, 2 rows of data with the columns and values defined.
The unit test framework then executes the Dataform model against the input tables you have mocked and compares the actual results produced by the model with the expected results you defined.
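The comparison step can be pictured as matching two lists of row dictionaries. Below is a simplified sketch; the package's real comparison logic may differ (for example in how it handles ordering or type coercion):

```python
def rows_match(expected: list, actual: list) -> bool:
    """Compare two result sets of rows (dicts of column -> value), ignoring row order.

    Sorting both sides by a stable string key makes the comparison
    order-insensitive, since SQL results have no guaranteed ordering.
    """
    if len(expected) != len(actual):
        return False
    key = lambda row: sorted((str(k), str(v)) for k, v in row.items())
    return sorted(expected, key=key) == sorted(actual, key=key)
```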
How do you submit your unit test?
To submit your unit test, you simply create a Pull Request from Dataform. Your GitHub Actions workflow will then execute all defined unit tests under the step Execute Dataform Unit Tests (or whatever you named this step), one by one. If all tests pass, the step passes.
If one or more tests fail, this step fails and fails the workflow. It also reports which unit tests failed, showing the results that were actually produced versus the results you were expecting.
The failure output highlights which rows were expected and which rows were actually produced, so you can see what differed and correct your code as needed. For example, if the value null was expected for a column forecast_lag_1 but 27.21 was actually produced, that difference would cause the unit test to fail.
Where do unit tests live?
Your unit tests can exist anywhere within your Dataform file repository. As long as the file naming convention of test_<name of your choosing>.json is used, the framework will discover the test and execute it.
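The naming convention above is enough to reproduce the discovery behaviour. Here is a sketch using pathlib; it is a plausible reading of "discoverable", not the package's verbatim code:

```python
from pathlib import Path


def discover_unit_tests(repo_root: str) -> list:
    """Find every test_*.json file anywhere under the repository root.

    Sorting makes the execution order deterministic across runs.
    """
    return sorted(Path(repo_root).rglob("test_*.json"))
```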
Testing tables that have columns which are RECORDs (i.e. ARRAY<STRUCT<>>)
You may have source tables with columns that represent a STRUCT/RECORD and want to test those columns. This is possible by creating a nested record like below:
```json
{
  "name": "Test Staging Customers",
  "model_to_test": "staging.customers",
  "input_data": {
    "customer.customer_personal_details": [
      {
        "customer_id": "1",
        "customer_first_name": "John",
        "customer_last_name": "Doe",
        "customer_addresses": [
          {
            "address_1": "123 Fake Street",
            "address_2": "Fake Town",
            "city": "Fake City",
            "post_code": "F4 K3"
          },
          {
            "address_1": "456 Another Fake Street",
            "address_2": "Another Fake Town",
            "city": "Another Fake City",
            "post_code": "FF4 KK3"
          }
        ]
      }
    ]
  },
  "expected_output": [
    {
      "customer_id": "1",
      "customer_first_name": "John",
      "customer_last_name": "Doe",
      "customer_latest_address_1": "123 Fake Street",
      "customer_latest_address_2": "Fake Town",
      "customer_latest_city": "Fake City",
      "customer_latest_post_code": "F4 K3"
    }
  ]
}
```
NULL values
If you want to represent a null, simply provide null as the value, like below:
```json
{
  "name": "Test Staging Customers",
  "model_to_test": "staging.customers",
  "input_data": {
    "customer.customer_personal_details": [
      {
        "customer_id": "1",
        "customer_first_name": "John",
        "customer_last_name": null
      }
    ]
  },
  "expected_output": [
    {
      "customer_id": "1",
      "customer_first_name": "John",
      "customer_last_name": null
    }
  ]
}
```
Which BigQuery datatypes are handled by Dataform Unit Tests?
The following BigQuery datatypes are handled by this package:
- Basic datatypes (e.g. STRING, FLOAT, NUMERIC, INTEGER, BOOL, etc.)
- RECORD (i.e. ARRAY<STRUCT<>>)
- ARRAY - e.g. "array_column": "[1, 2, 3]"
- JSON - e.g. "json_column": "{\"key1\": \"value1\"}"
- BYTES - e.g. "bytes_column": "q4ah4e9w3/l5WQZ7cjxcJA=="
- GEOGRAPHY - e.g. "geography_column": "POINT(-0.349498 51.48198)"
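To make the string encodings above concrete, here is a hedged sketch of how such mocked values could be rendered as BigQuery SQL expressions. The type labels and rendering rules are illustrative assumptions (the package's internal rendering is not documented here), though FROM_BASE64, ST_GEOGFROMTEXT, and PARSE_JSON are real BigQuery functions:

```python
def render_literal(value: str, bq_type: str) -> str:
    """Render a mocked JSON string value as a BigQuery SQL expression.

    Assumes `value` contains no unescaped single quotes; a real
    implementation would need proper SQL escaping.
    """
    if bq_type == "ARRAY":
        # "[1, 2, 3]" is already valid BigQuery array literal syntax.
        return value
    if bq_type == "JSON":
        return f"PARSE_JSON('{value}')"
    if bq_type == "BYTES":
        return f"FROM_BASE64('{value}')"
    if bq_type == "GEOGRAPHY":
        return f"ST_GEOGFROMTEXT('{value}')"
    # Default: treat the value as a plain string literal.
    return f"'{value}'"
```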