Environment diff tool for dbt
Project description
The data validation toolkit
for teams that care about building better data
Introduction
Recce
is a toolkit for performing data validation in the context of data modeling and pull request (PR) review for dbt projects.
Data validations, or 'checks', are a method of impact assessment performed by comparing the data in two dbt environments, such as development and production schemas. Recce
enables the identifying and investigating of data change throughout the pull request process, from data modeling to stakeholder approval.
Who's using Recce
?
Recce
is useful for validating your own work or the work of others, and can also be used to share data impact with non-technical stakeholders to approve data checks.
-
Data engineers can use
Recce
to ensure the structural integrity of the data and understand the scope of impact before merging. -
Analysts can use
Recce
to self-review and understand how data modeling changes have changed the data. -
Stakeholders can use
Recce
to sign-off on data after updates have been made
Why Recce
dbt has brought many software best practices to data projects, such as:
- Version controlled code
- Modular SQL
- Reproducible pipelines
Even so, 'bad merges' still happen, in which updated data models are merged into production without proper checks; resulting in erroneous data, silent errors, or even data downtime. This is due to the unique situation encountered by reviewers of having to review both the code (data models) and the resultant data.
In addition, the dbt culture of 'self-serve analytics' has opened up data pipelines to the wider data team. This means that many roles, all with differing priorities, are modifying the dbt pipeline and making the job of reviewing pull requests even more difficult. Impact assessment and data QA is also made difficult by the growing size and complexity of dbt projects.
The best way to understand the impact of code changes on data is to compare the data before-and-after the changes. To ensure:
- historical data is not impacted (does not change) and
- intentional modifications have the desired impact.
This is where Recce
comes in, by providing a self-serve review environment for the self-serve data platform specifically geared towards surfacing and understanding data impact from code changes.
Features
The core concept of Recce
is comparing data between your pull request and a known-good base, such as production, or ideally a staging environment. Therefore, to suite of tools, or diffs, in Recce
are designed to help you find and record this change.
Lineage Diff
Lineage Diff is the main interface to Recce
is the Lineage Diff chart. It shows which nodes in the lineage have been added, removed, or modified.
Structural Diffs
See if columns have been added or removed, and perform row count diffs for selected models to see if data has been lost.
- Schema Diff - Show the struture of the table including added or removed columns
- Row Count Diff - Compares the row count for tables
Advanced Diffs
Advanced Diffs provide high level statistics about data change:
- Profile Diff: Compares stats such as count, distinct count, min, max, average.
- Value Diff: The matched count and percentage for each column in the table.
- Top-K Diff: Compares the distribution of a categorical column.
- Histogram Diff: Compares the distribution of a numeric column in an overlay histogram chart.
Query Diff
Query Diff compares the results of any ad-hoc query, and supports the use of dbt macros.
Checklist
The checklist provides a way to record the results of your data validation process.
- Save the results of checks
- Re-run checks
- Annotate checks to add context
- Share the results of checks
- (
Recce Cloud
) Sync checks and check results across Recce instances - (
Recce Cloud
) Block PR merging until checks have been approved
Recce
Use cases
Recce is designed for reviewing data change, which can happen at many stages of the PR process:
Development
Start a Recce server locally, during development, to inspect and investigate how your data modeling changes are impacting the data in your development schema compared to prod.
After performing data checks and create a checklist:
- Copy the results of your development-time data validations into the PR comment to show that you have done your due diligence and performed QA on your work.
- Share the Recce state file by exporting and attaching it to your PR comment.
PR Review
The PR reviewer can:
- review the checks manually added by the PR author
- download the Recce state file that has been attached to the PR by either CI automation, or the PR author.
After downloading the state file, Recce can be run in Review Mode, which does not require the dbt project to be present.
Continuous Integration (CI)
Get full coverage of your pipeline by setting up preset checks that run on all
Recce Cloud
users can take advantage of our GitHub app that will enable the blocking of merging and alerting to unapproved checks.
Automated Impact Summary
Recce
can output an Impact Summmary which can be posted as a PR comment as part of your CI automation. If data change from any of your checks is detected then the Recce
Summary will show which ones. Run Recce
in review mode to investigate the impact.
Other uses
Recce
is also useful for troubleshooting root causes by using Query Diff to compare ad-hoc queries and comparing data at the row level.
Demo
We provide three online demo instances of Recce, each with a related pull request. Use these demo instances to inspect the impact of the data modeling changes on the project lineage and data.
For each demo, review the following:
- The pull request comment
- The code changes
- How the lineage and data has changed in
Recce
This will enable you to validate if the intention of the PR has been successfully implemented and no unintended impact has occurred.
[!TIP] Don't forget to click the
Checks
tab to view the data validation checklist, or perform your own checks for further investigation.
Demo 1: Calculation logic change
This pull request adjusts the logic for how customer lifetime value is calculated:
Demo 2: Refactoring
This pull request performs some refactoring on the customers model by turning two CTEs into intermediate models, enhancing readability and maintainability:
Demo 3: Analysis
This pull request introduces a new Rounding Effect Analysis feature, aimed at analyzing and reporting the impacts of rounding in our data processing.
How to use Recce
The following is a basic example of installing and using Recce
. For a full introduction to using Recce
please visit the documentation.
Recce
requires at least two output schemas in your dbt profiles.yml
, for example:
jaffle_shop:
target: dev
outputs:
dev:
type: duckdb
path: jaffle_shop.duckdb
schema: dev
prod:
type: duckdb
path: jaffle_shop.duckdb
schema: main
Install Recce with Pip:
pip install recce
Generate the dbt artifacts for your base
environment. This is the stable environment that your PR data will be compared against.
git checkout main
dbt run --target prod
dbt docs generate --target prod --target-path target-base/
[!NOTE]
You must build these artifacts into a directory namedtarget-base
Generate the artifacts for your target
(PR) environment:
git checkout feature/my-awesome-feature
dbt run
dbt docs generate
Start the Recce
server:
recce server
This will launch a Recce Instance for you to start your data validation work. Take a look at the Hands-On Impact Analysis blog that shows how to use various Recce features to validate a PR.
Data Security
Recce
consists of a local server application that you run on your own device or compute services.
- Diffs or queries that are performed by
Recce
happen either in your data warehouse, or in the browser itself. Recce
does not store or transmit your data.
For Recce Cloud
users:
- An encrypted version of your
Recce
state file is storedonRecce Cloud
. This file is encrypted before transmission.
Recce Cloud
Recce Cloud
provides a backbone of supporting services that make Recce usage more suitable for teams reviewing multiple pull requests.
With Recce Cloud
:
Recce
environments can be instantiated directly from a PR- data checks are automatically synced across
Recce
instances - unapproved checks can block merging of pull requests
Recce Cloud is currently in early-access private beta.
To find out how you can get access please book an appointment for a short meeting.
Documentation
The Recce Documentation covers everything you need to get started from a 5-minute tutorial using Jaffle Shop, to advice on best practices for your dbt environments.
Community & Support
Here's where you can get in touch with the Recce
team and find support:
- dbt Slack in the #tools-recce channel
- Recce Discord
- Email us product@datarecce.io
If you believe you have found a bug, or there is some missing functionality in Recce, please open a GitHub Issue.
Recce on the web
You can follow along with news about Recce
and blogs from our team in the following places:
- Medium blog
- @datarecce on X
- @DataRecce@mastodon.social on Mastodon
- @datarecce.bsky.social on BlueSky
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for recce_nightly-0.39.0.20241029.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7aab08792a40730b5dc72853a7879a5dab9d6e37512b179d0fb2d3f9db34b9be |
|
MD5 | 7171ffbac1e036a49f1ef5c9011d4942 |
|
BLAKE2b-256 | b8b833d40fefafce064beeee5d6acb0424d713ef2f3b11b1a72ff4406d5d047c |
Hashes for recce_nightly-0.39.0.20241029-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 93fbcc8c750460ef7456d8fb95b0e09dc9ef52ba11ddcb06f3217e2467a866bf |
|
MD5 | f79d3e2c5f55a38686e638a40a644c2a |
|
BLAKE2b-256 | 5298ae12727a2c777b6983a7eeee70c43ae1674f4c737fb3c2d3678b5edd497f |