A linting tool for xarray datasets
xinter
A comprehensive linting and data quality checking tool for xarray datasets.
Overview
xinter runs automated data quality checks on xarray datasets, helping you identify issues such as missing values, outliers, unparsable units, and other data anomalies. Its extensible architecture lets you add custom checkers for your own data validation needs.
Features
- 25 Built-in Checkers: Comprehensive checks for data quality including:
  - Missing values (NaNs)
  - Statistical properties (mean, std, skewness, kurtosis)
  - Outlier detection (IQR method)
  - Data type validation
  - Units verification and parsing
  - Coordinate uniformity checks
  - Shape and size validation
  - And many more...
- Extensible Architecture: Easily add custom checkers using a simple decorator pattern
- Rich CLI Output: Beautiful terminal output with tables showing results
- DataFrame Export: Convert results to pandas DataFrames for further analysis
- Coordinate Checking: Optionally check coordinate arrays in addition to data variables
- Group Support: Handle datasets with hierarchical groups (e.g., Zarr, NetCDF4)
Installation
pip install xinter
Or install from source:
git clone https://github.com/samueljackson92/xinter.git
cd xinter
pip install -e .
Quick Start
Command Line Interface
Lint a single file:
xl mydata.zarr
Lint multiple files:
xl file1.nc file2.zarr file3.nc
Check coordinates in addition to data variables:
xl mydata.zarr --coords
Specify a group within the dataset:
xl mydata.zarr --group=/equilibrium
Python API
from xinter.cli import lint_dataset, reports_to_dataframe
# Lint a dataset
reports = lint_dataset("mydata.zarr", check_coords=True)
# Convert to DataFrame for analysis
df = reports_to_dataframe(reports)
# Filter for failed checks
failures = df[~df["success"]]
print(failures)
# Export to CSV
df.to_csv("lint_report.csv", index=False)
Built-in Checkers
| Checker | Description |
|---|---|
| NaNs | Proportion of NaN values |
| Mean | Mean value |
| Standard deviation | Standard deviation |
| IQR outliers | Proportion of values outside IQR range |
| Range | Range of values (max - min) |
| Max | Maximum value |
| Min | Minimum value |
| Duplicate values | Proportion of duplicate values |
| Negative values | Proportion of negative values |
| Zero values | Proportion of zero values |
| Constant values | Whether all values are constant |
| Infinite values | Proportion of infinite values |
| Skewness | Skewness of the distribution |
| Kurtosis | Kurtosis of the distribution |
| Entropy | Shannon entropy of the distribution |
| Data type | Data type of the variable |
| Units | Units attribute |
| Units parsable | Whether units can be parsed by pint |
| Diff | Mean of first differences |
| Diff constant | Whether differences are constant (coordinates only) |
| Shape | Shape of the variable |
| Size | Total number of elements |
| Variable name | Name of the variable |
| Dimension names | Names of the dimensions |
| Constant along dimension | Whether values are constant along the first dimension |
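To make a few of the table's entries concrete, the sketch below computes three of them with plain NumPy. It is an illustration only, not xinter's implementation: the function names are hypothetical, the outlier rule assumes the conventional 1.5 x IQR fences, and the constant-difference check assumes a relative floating-point tolerance, none of which the table specifies. For an xarray variable, `var.values` supplies the NumPy array.

```python
import numpy as np


def nan_fraction(values) -> float:
    """Proportion of elements that are NaN (0.0 for an empty array)."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return 0.0
    return float(np.isnan(values).mean())


def iqr_outlier_fraction(values, k: float = 1.5) -> float:
    """Proportion of finite values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional choice and is an assumption here.
    """
    values = np.asarray(values, dtype=float)
    finite = values[np.isfinite(values)]  # ignore NaN and +/-inf
    if finite.size == 0:
        return 0.0
    q1, q3 = np.percentile(finite, [25, 75])
    iqr = q3 - q1
    outside = (finite < q1 - k * iqr) | (finite > q3 + k * iqr)
    return float(outside.mean())


def diffs_are_constant(values, rtol: float = 1e-9) -> bool:
    """True if all consecutive differences match the first one within rtol."""
    diffs = np.diff(np.asarray(values, dtype=float))
    if diffs.size == 0:
        return True
    return bool(np.allclose(diffs, diffs[0], rtol=rtol, atol=0.0))


data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, np.nan, np.nan])
print(nan_fraction(data))          # two of the eight values are NaN
print(iqr_outlier_fraction(data))  # 100.0 falls outside the fences
print(diffs_are_constant(np.arange(0.0, 1.0, 0.1)))  # evenly spaced grid
```

The tolerance in `diffs_are_constant` matters for coordinates built from floating-point steps, where exact equality of differences would often fail.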
Creating Custom Checkers
You can easily extend xinter with custom checkers:
from xinter.linters import DataArrayChecker, LinterRegistry, CheckerResult
import xarray as xr

@LinterRegistry.register()
class MyCustomChecker(DataArrayChecker):
    """Check if values are within expected range."""

    name = "Value range check"
    description = "Checks if values fall within [0, 100]"

    def check(self, var: xr.DataArray) -> CheckerResult:
        min_val = var.min().item()
        max_val = var.max().item()
        in_range = 0 <= min_val and max_val <= 100
        return CheckerResult(
            value=in_range,
            message=f"Range: [{min_val}, {max_val}]",
            success=in_range,
        )
Your custom checker will automatically be included in all linting operations.
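Automatic inclusion of this kind typically relies on registration happening when the class is defined. The sketch below illustrates that general decorator-registry pattern in plain Python. It is not xinter's source code: `SketchRegistry`, `SketchResult`, and `run_all` are hypothetical names, chosen to avoid confusion with the real `LinterRegistry` and `CheckerResult`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence


@dataclass
class SketchResult:
    """Hypothetical stand-in with the same three fields used above."""

    value: Any
    message: str
    success: bool


class SketchRegistry:
    """Hypothetical registry that collects checker classes as they are defined."""

    _checkers: List[type] = []

    @classmethod
    def register(cls) -> Callable[[type], type]:
        def decorator(checker_cls: type) -> type:
            cls._checkers.append(checker_cls)  # remember the class
            return checker_cls  # hand the class back unchanged

        return decorator

    @classmethod
    def run_all(cls, values: Sequence[float]) -> Dict[str, SketchResult]:
        """Instantiate every registered checker and run it on the values."""
        return {c.name: c().check(values) for c in cls._checkers}


@SketchRegistry.register()
class NonNegative:
    name = "Non-negative check"

    def check(self, values: Sequence[float]) -> SketchResult:
        ok = all(v >= 0 for v in values)
        return SketchResult(value=ok, message=f"min={min(values)}", success=ok)


results = SketchRegistry.run_all([3.0, 0.5, -1.0])
print(results["Non-negative check"])
```

Because a decorator like this runs when the module defining the class is imported, a custom checker presumably has to be imported before linting starts for the registration to take effect.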
Output Format
The reports_to_dataframe() function produces a DataFrame with the following columns:
- file_path: Path to the dataset file
- target_type: Either "data_vars" or "coords"
- variable_name: Name of the variable
- checker_name: Name of the checker
- value: The check result value
- message: Descriptive message about the result
- success: Boolean indicating if the check passed
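With these columns, standard pandas operations are enough to summarize a report. The sketch below uses a small hand-written DataFrame in place of real reports_to_dataframe() output; the rows, file name, values, and messages are invented purely for illustration and do not reflect how xinter decides success for any checker.

```python
import pandas as pd

# Invented rows that follow the column layout listed above.
df = pd.DataFrame(
    [
        ("demo.nc", "data_vars", "temperature", "NaNs", 0.0, "0% NaN", True),
        ("demo.nc", "data_vars", "temperature", "Units parsable", True, "units parsed", True),
        ("demo.nc", "data_vars", "pressure", "NaNs", 0.4, "40% NaN", False),
        ("demo.nc", "data_vars", "pressure", "Units parsable", False, "units not parsed", False),
        ("demo.nc", "data_vars", "density", "NaNs", 0.1, "10% NaN", False),
        ("demo.nc", "coords", "time", "Diff constant", False, "irregular spacing", False),
    ],
    columns=[
        "file_path",
        "target_type",
        "variable_name",
        "checker_name",
        "value",
        "message",
        "success",
    ],
)

# Keep only the failed checks.
failures = df[~df["success"]]

# Which checkers fail most often?
failures_per_checker = (
    failures.groupby("checker_name").size().sort_values(ascending=False)
)

# Fraction of checks passed per variable.
pass_rate = df.groupby("variable_name")["success"].mean()

print(failures_per_checker)
print(pass_rate)
```

Grouping by file_path instead of variable_name gives the same kind of summary per dataset when several files are linted together.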
Web Dashboard (GUI)
xinter includes a modern, interactive web-based dashboard for visualizing linting results. The dashboard provides:
- 📊 Interactive Charts: Explore data quality metrics with beautiful Plotly visualizations
- 🔍 Real-time Filtering: Filter results by file and group
- 📈 Comprehensive Analytics: NaN distribution, data types, statistical distributions, entropy analysis, and more
Installation
Install with GUI support (run from a source checkout, since this is an editable install):
pip install -e ".[gui]"
Usage
Launch the dashboard for any linting report:
xl-gui linting_report.parquet
Or with custom options:
xl-gui thomson_scattering.parquet --port 8080 --title "Thomson Scattering Analysis"
By default the dashboard opens in your browser at http://localhost:8050; pass --port to serve it on a different port.
See GUI_README.md for detailed documentation and examples.
Development
# Clone the repository
git clone https://github.com/samueljackson92/xinter.git
cd xinter
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
ruff format .
# Lint code
ruff check .
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Authors
- Samuel Jackson (samuel.jackson@ukaea.uk)
Acknowledgments
xinter builds on the excellent work of the xarray, pandas, and pint communities.