re_data - data quality framework
re_data
re_data is a tool for improving data quality in your organization, built on top of dbt.
- create a data quality project for your organization to monitor and improve the quality of your data,
- compute data quality metrics for all your tables and add your own code to compute more,
- look for anomalies in all your metrics and investigate problematic data.
Key features
Data quality metrics
re_data creates a data quality metrics schema in your data warehouse, containing metrics for all your tables (or only those you would like to monitor). The metrics schema contains information about:
- time since last records were added
- number of records added
- number of missing values in columns over time
- min/max/avg of values in all your columns
- string lengths in all your columns
Think of it as an INFORMATION_SCHEMA on steroids :muscle:
This is just a start: in your project you can compute many other data quality metrics specific to your organization.
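To make the metric list above concrete, here is a minimal sketch in plain Python. It is not re_data's implementation (re_data computes its metrics in your warehouse through dbt). The function `compute_metrics`, the metric names, and the in-memory `rows` standing in for a table are all hypothetical and chosen only for illustration.

```python
from datetime import datetime
from statistics import mean


def compute_metrics(rows, time_column, now):
    """Compute illustrative quality metrics for one table and one time window.

    `rows` is a list of dicts standing in for warehouse rows (an assumption
    made for this sketch; real metrics would be computed with SQL).
    """
    metrics = {"row_count": len(rows), "freshness_seconds": None, "columns": {}}
    if not rows:
        return metrics

    # Time since the last record was added.
    latest = max(row[time_column] for row in rows)
    metrics["freshness_seconds"] = (now - latest).total_seconds()

    column_names = sorted({name for row in rows for name in row} - {time_column})
    for name in column_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        column = {"nulls_count": len(values) - len(present)}

        # min/max/avg for numeric values (bool is excluded on purpose).
        numbers = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numbers:
            column.update(min=min(numbers), max=max(numbers), avg=mean(numbers))

        # String length statistics for text values.
        lengths = [len(v) for v in present if isinstance(v, str)]
        if lengths:
            column.update(
                min_length=min(lengths),
                max_length=max(lengths),
                avg_length=mean(lengths),
            )

        metrics["columns"][name] = column
    return metrics
```

Calling `compute_metrics(rows, "created_at", datetime.now())` on a small sample returns one nested dict per table, which mirrors the idea of storing one metric value per table, column, and time window.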
Detecting anomalies
re_data looks at the metrics it has gathered and alerts you when they look suspicious compared to data seen in the past. This means that situations like these will be detected:
- sudden drops or increases in the volume of new records added to your tables
- longer than expected breaks between data arrivals
- an increase in NULL values in one of your columns
- different maximum/minimum/average values in any of your table columns

All data, including anomalies, is saved directly into your data warehouse, so you can easily integrate any existing alerting with it.
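The comparison of a new metric value against its past values can be sketched as follows. This text does not say which statistical method re_data uses, so the z-score check below is an assumed, swapped-in technique, and `is_anomaly` and its `threshold` parameter are hypothetical names used only for illustration.

```python
from statistics import mean, stdev


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it is far from the past values of the same metric.

    Uses a plain z-score as an illustrative technique (an assumption,
    not necessarily what re_data does).
    """
    if len(history) < 2:
        # Not enough past data to estimate spread; do not alert.
        return False

    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values were identical, so any change is suspicious.
        return value != mu

    return abs(value - mu) / sigma > threshold
```

For example, with a daily row count history of `[100, 102, 98, 101, 99]`, a new value of `40` is flagged while `101` is not. Raising `threshold` makes the check less sensitive.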
Data testing
re_data supports writing data tests by adding the dbt_expectations library (and some of our own test macros) to the dbt project it creates. We recommend using it to test both:
- tables you are monitoring
- metrics about your data created by re_data
Getting started
Follow our getting started toy shop tutorial here! 🎈🚙 🦄
Docs
More details on the tables created by re_data through its dbt package are available on the project's GitHub, https://github.com/re-data/dbt-re-data, and in the docs for this package: here
Community
Join our Slack to ask questions about using re_data and to talk with the people building it :slightly_smiling_face:
Integrations
We support all the main data warehouses supported by dbt, and we plan to add support for Spark (now officially supported by dbt). Other databases may work after installing the dbt extension for them. We do not currently test re_data against those, so use them at your own risk.
Integration | Status
---|---
BigQuery | Supported
PostgreSQL | Supported
Redshift | Supported
Snowflake | Supported
Apache Spark | Planned
License
re_data is licensed under the MIT license. See the LICENSE file for licensing information.
Contributing
We love all contributions, big and small :heart_eyes:
Check out the current list of issues here and see if anything there interests you. You are also welcome to join our Slack to suggest ideas, or to set up a live session here.
And if you got this far and like what we are building, support us! Star https://github.com/re-data/re-data on GitHub :star_struck: