A lightweight data contracts framework
Project description
Wimsey 🔍
A lightweight and flexible data contract library.
Wimsey is designed as a very lightweight data contracts library, similar to great-expectations or soda-core, that is built on top of Narwhals. It is designed to have minimal import times and dependencies.
What is a data contract?
As well as being a good buzzword to mention at your next data event, data contracts are a good way of testing data values at boundary points. Ideally, all data would be usable when you receive it, but you've probably already figured out that's not always the case.
A data contract is an expression of what should be true of some data, such as that it should 'only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts on a dataframe during python runtime.
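To make that concrete, here's a rough sketch of those two example statements written as plain python checks. It's only an illustration of the idea, not how Wimsey works internally: the data dictionary and the check_contract helper are made up for this example, and column "x" stands in for column "a".

```python
# Hypothetical data: a mapping of column name to values (not a Wimsey object).
data = {"x": [0.2, 0.9, 0.5], "y": [1, 2, 3]}


def check_contract(table, allowed_columns, column, max_value):
    """Return a list of readable failures (an empty list means the contract holds)."""
    failures = []
    # 'only have columns x and y'
    if set(table) != set(allowed_columns):
        failures.append(
            f"expected columns {sorted(allowed_columns)}, got {sorted(table)}"
        )
    # 'the values of column a should never exceed 1' (here applied to `column`)
    if column in table and any(value > max_value for value in table[column]):
        failures.append(f"column {column!r} has values above {max_value}")
    return failures


failures = check_contract(data, allowed_columns={"x", "y"}, column="x", max_value=1)
print(failures or "contract holds")
```

A contract library saves you from hand-writing checks like these for every dataframe and keeps the expectations in a file rather than scattered through your code.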
Quick Demo
Let's start by taking a look at an example data contract. Wimsey supports reading json or yaml files, or just plain old python dictionaries. Here's an example of a yaml contract:
- column: awesome_column
  test: mean_should
  be_greater_than: -10
  be_less_than: 100
- column: another_great_column
  test: null_count_should
  be_exactly: 0
- test: row_count_should
  be_less_than_or_equal_to: 50000
- column: neato_column
  test: type_should
  be_one_of:
    - int64
    - float64
Note you'll need pyyaml installed to support reading this. The same data can be stored as json without needing the extra dependency, if you're trying to keep things lightweight.
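If you'd rather skip the yaml dependency, here's a small sketch that writes the same contract as a plain python list of dictionaries and serialises it with the standard library. It assumes the json form mirrors the yaml structure one-to-one; the variable names are just for illustration.

```python
import json

# The same four tests as the yaml contract, as plain python dictionaries.
contract = [
    {
        "column": "awesome_column",
        "test": "mean_should",
        "be_greater_than": -10,
        "be_less_than": 100,
    },
    {"column": "another_great_column", "test": "null_count_should", "be_exactly": 0},
    {"test": "row_count_should", "be_less_than_or_equal_to": 50000},
    {
        "column": "neato_column",
        "test": "type_should",
        "be_one_of": ["int64", "float64"],
    },
]

contract_json = json.dumps(contract, indent=2)
print(contract_json)
# To save it to disk: pathlib.Path("tests.json").write_text(contract_json)
```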
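Here we have four tests. First, we're checking that the mean of "awesome_column" is between -10 and 100, then that "another_great_column" has no null entries. After that, we're checking that there are no more than 50,000 rows, and finally that "neato_column" is either an int64 or a float64.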
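For a feel of what keywords like be_greater_than and be_less_than_or_equal_to mean, here's a toy evaluator in plain python. It's a sketch of the semantics only (strict versus inclusive bounds), not Wimsey's actual code, and the COMPARISONS table and passes function are names invented for this example.

```python
import operator
from statistics import mean

# Keyword names come from the contract; the mapping to operators is an assumption.
COMPARISONS = {
    "be_exactly": operator.eq,
    "be_less_than": operator.lt,
    "be_less_than_or_equal_to": operator.le,
    "be_greater_than": operator.gt,
}


def passes(value, **conditions):
    """True if `value` satisfies every keyword condition, e.g. be_less_than=100."""
    return all(COMPARISONS[name](value, bound) for name, bound in conditions.items())


awesome_column = [3, 12, 40]  # made-up values
print(passes(mean(awesome_column), be_greater_than=-10, be_less_than=100))
print(passes(mean([150, 250]), be_greater_than=-10, be_less_than=100))
```

The first call passes because the mean sits inside the bounds, while the second fails because a mean of 200 is not less than 100.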
In terms of using the Wimsey library, there are essentially only two functions you'll need: validate and/or test.
Because Wimsey uses Narwhals under the hood, you can run these tests directly on your dataframe library of choice (pandas, polars, dask etc) as long as it's supported via Narwhals. Here's an example of using "validate" with pandas, which will throw an exception if tests fail, and otherwise pass back your data frame so you can continue happily:
import pandas as pd
import wimsey

df = (
    pd.read_csv("hopefully_nice_data.csv")
    .pipe(wimsey.validate, "tests.json")
    .groupby(["name", "type"]).sum()
)
Similarly, here's an example with polars, but instead using test, which will return a final_results object with a success boolean.
import polars as pl
import wimsey

df = pl.read_csv("hopefully_nice_data.csv")
results = wimsey.test(df, "tests.yaml")
if results.success:
    print("Yay we have good data! 🥳")
else:
    print("Oh nooo, something up! 😭")
    print(results)
Project Status
Wimsey is veeeery, veeerrrry early. There's only a small number of supported tests, and even less documentation. Feedback, contributions and requests are all welcome!
Comparison
Tool | Import Time | PyPI Size | Dependencies | Has a GUI Framework |
---|---|---|---|---|
Great Expectations | 2.7 seconds | 5367KB | 25 | Yes |
Soda Core | 0.4 seconds | 145KB | 11 | Yes (non open source) |
Wimsey | 0.02 seconds | 6KB | 2 | No |
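If you'd like to check import times on your own machine, one rough approach is to time a fresh python process importing the module. This isn't necessarily how the figures above were measured, so treat it as a sketch. The time_import helper is made up for this example, and its result includes interpreter start-up, so it's best compared against a baseline such as a standard-library module.

```python
import subprocess
import sys
import time


def time_import(module_name):
    """Seconds taken by a fresh interpreter to start up and import `module_name`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
    return time.perf_counter() - start


baseline = time_import("json")  # standard-library module as a reference point
print(f"json: {baseline:.3f}s")
# Swap in "wimsey" (or any other installed package) to compare with the baseline.
```

Running each measurement a few times and taking the smallest value helps smooth out disk caching and other noise.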