qualink
Blazing fast data quality framework for Python, built on Apache DataFusion.
Features
- High Performance: Leverages Apache DataFusion for fast data processing and validation.
- Flexible Constraints: Supports various data quality constraints including completeness, uniqueness, and custom assertions.
- YAML Configuration: Define validation suites declaratively using YAML files.
- Multiple Output Formats: Results can be formatted as human-readable text, JSON, or Markdown.
- Async Support: Built with asyncio for non-blocking operations.
- Easy Integration: Simple API for defining and running validation suites.
Installation
Install qualink using pip:
pip install qualink
Or using uv:
uv add qualink
Quick Start
Here's a basic example of using qualink to validate a CSV file:
import asyncio

from datafusion import SessionContext

from qualink.checks import Check, Level
from qualink.constraints import Assertion
from qualink.core import ValidationSuite
from qualink.formatters import MarkdownFormatter


async def main() -> None:
    # Register the CSV file as a DataFusion table named "users".
    ctx = SessionContext()
    ctx.register_csv("users", "examples/users.csv")

    result = await (
        ValidationSuite()
        .on_data(ctx, "users")
        .with_name("User Data Quality")
        # ERROR-level check: user_id must be complete (no nulls).
        .add_check(Check.builder("Critical Checks").with_level(Level.ERROR).is_complete("user_id").build())
        # WARNING-level check: name must have a completeness ratio of at least 0.95.
        .add_check(
            Check.builder("Data Quality")
            .with_level(Level.WARNING)
            .has_completeness("name", Assertion.greater_than_or_equal(0.95))
            .build()
        )
        .run()
    )
    print(MarkdownFormatter().format(result))


if __name__ == "__main__":
    asyncio.run(main())
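The Quick Start reads examples/users.csv, which is not included in this description. A minimal sketch for generating a small sample file with Python's standard library follows. The column names user_id, name, and email are taken from the checks in this README; the helper name write_sample_users and the row values are made up for illustration. Whether a blank field is read as a null or as an empty string depends on the CSV reader, so treat the blank name below as a placeholder rather than a guaranteed null.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; only the column names come from the README.
SAMPLE_ROWS = [
    {"user_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"user_id": 2, "name": "Grace", "email": "grace@example.com"},
    {"user_id": 3, "name": "", "email": "linus@example.com"},  # blank name
]


def write_sample_users(path: str = "examples/users.csv") -> Path:
    """Write a tiny users CSV and return its path (illustrative helper)."""
    target = Path(path)
    # Create the parent directory (e.g. "examples/") if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["user_id", "name", "email"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return target
```

Calling write_sample_users() once before running the Quick Start creates the file at the path the example expects.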
YAML Configuration
You can also define validation suites using YAML files for a declarative approach:
suite:
  name: "User Data Quality"

data_source:
  type: csv
  path: "examples/users.csv"
  table_name: users

checks:
  - name: "Critical Checks"
    level: error
    rules:
      - is_complete: user_id
      - is_unique: email
      - has_size:
          gt: 0
  - name: "Data Quality"
    level: warning
    rules:
      - has_completeness:
          column: name
          gte: 0.95
Run the YAML configuration:
import asyncio

from qualink.config import run_yaml
from qualink.formatters import HumanFormatter


async def main() -> None:
    result = await run_yaml("path/to/your/config.yaml")
    print(HumanFormatter().format(result))


if __name__ == "__main__":
    asyncio.run(main())
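The rules in the YAML example attach comparison keys such as gt and gte to a metric. As an illustration only, and not qualink's implementation, the sketch below shows one plausible reading of those keys: gt as strictly greater than and gte as greater than or equal. The function name threshold_passes is hypothetical, and only the two keys that appear in this README are covered.

```python
import operator

# Assumed meaning of the comparison keys that appear in the YAML example.
# Only gt and gte occur in this README; other keys are not covered here.
COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
}


def threshold_passes(metric: float, rule: dict) -> bool:
    """Return True if the metric satisfies every comparison key in the rule.

    Keys that are not comparators (for example "column") are ignored, so a
    rule without any comparison key passes trivially.
    """
    bounds = {key: bound for key, bound in rule.items() if key in COMPARATORS}
    return all(COMPARATORS[key](metric, bound) for key, bound in bounds.items())
```

Under this reading, a completeness of 0.97 satisfies {"column": "name", "gte": 0.95}, while a row count of 0 fails {"gt": 0}.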
Constraints
qualink supports the following constraint types:
- Completeness: Ensures a column has no null values or meets a minimum completeness ratio.
- Uniqueness: Checks for duplicate values in a column.
- Assertion: Custom assertions using SQL expressions.
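To make the first two constraint types concrete, here is a plain-Python sketch of the underlying metrics. It does not use qualink or DataFusion. It assumes that completeness is the share of non-null values, that an empty column counts as fully complete, and that uniqueness means no non-null value repeats. The function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def completeness(values: Sequence[Optional[object]]) -> float:
    """Fraction of values that are not None (1.0 for an empty column, by assumption)."""
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def is_unique(values: Sequence[Optional[object]]) -> bool:
    """True if no non-null value occurs more than once."""
    counts = Counter(v for v in values if v is not None)
    return all(n == 1 for n in counts.values())
```

For example, completeness(["Ada", "Grace", None, "Linus"]) gives 0.75, which would fall short of a 0.95 threshold.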
Formatters
Results can be formatted using:
- HumanFormatter: Human-readable text output.
- JsonFormatter: JSON format for programmatic processing.
- MarkdownFormatter: Markdown tables for documentation.
Development
To set up the development environment:
git clone https://github.com/gopidesupavan/qualink.git
cd qualink
uv sync
Run tests:
uv run pytest
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
- Apache DataFusion for the query engine
- AWS Deequ for the inspiration
- Term Guard