A Streamlit-first Python package for detecting and visualizing data quality issues
LavenderTown
A Streamlit-first Python package for detecting and visualizing "data ghosts": type inconsistencies, nulls, invalid values, schema drift, and anomalies in tabular datasets.
LavenderTown helps you quickly identify data quality issues in your datasets through an intuitive, interactive Streamlit interface. Perfect for data scientists, analysts, and engineers who need to understand their data quality before diving into analysis.
✨ Features
- 🔍 Zero-config data quality insights - Get started with minimal setup
- 📊 Streamlit-native UI - No HTML embeds, fully integrated with Streamlit
- 🎯 Interactive ghost detection - Drill down into problematic rows
- 🐼 Pandas & Polars support - Works with your existing data pipelines
- 📤 Exportable findings - Download results as JSON or CSV with one click
- 🔄 Dataset Comparison - Detect schema and distribution drift between datasets
- ⚙️ Custom Rules - Create and manage custom data quality rules via UI
- 🚀 High Performance - Optimized for datasets up to millions of rows
📦 Installation
Install LavenderTown using pip:
pip install lavendertown
For Polars support, install with the optional dependency:
pip install lavendertown[polars]
🚀 Quick Start
Basic Usage
import streamlit as st
from lavendertown import Inspector
import pandas as pd
# Load your data
df = pd.read_csv("your_data.csv")
# Create inspector and render
inspector = Inspector(df)
inspector.render() # This must be called within a Streamlit app context
That's it! Save this code in a file (e.g., app.py) and run streamlit run app.py to see the interactive data quality dashboard.
Using Polars
LavenderTown works seamlessly with Polars DataFrames:
import streamlit as st
from lavendertown import Inspector
import polars as pl
# Load your data with Polars
df = pl.read_csv("your_data.csv")
# Create inspector and render (works with Polars too!)
inspector = Inspector(df)
inspector.render() # This must be called within a Streamlit app context
Standalone CSV Upload App
For quick analysis without writing code, use the included Streamlit app:
streamlit run examples/app.py
This opens a web interface where you can:
- Upload CSV files via drag-and-drop or file browser
- Preview your data before analysis
- View interactive data quality insights
- Export findings with download buttons
See the examples directory and examples/README.md for more usage examples and detailed instructions.
📚 Usage Examples
Dataset Comparison (Drift Detection)
Compare datasets to detect schema and distribution changes:
from lavendertown import Inspector
import pandas as pd
baseline_df = pd.read_csv("baseline.csv")
current_df = pd.read_csv("current.csv")
inspector = Inspector(current_df)
drift_findings = inspector.compare_with_baseline(
    baseline_df=baseline_df,
    comparison_type="full",  # or "schema_only", "distribution_only"
)
# Drift findings have ghost_type="drift"
for finding in drift_findings:
    if finding.ghost_type == "drift":
        print(f"{finding.column}: {finding.description}")
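To make the idea of schema drift concrete, here is a minimal, library-independent sketch. It is not LavenderTown's implementation: the function name `schema_drift` and the representation of a schema as a dict of column names to dtype strings are assumptions chosen for illustration.

```python
def schema_drift(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Compare two {column: dtype} mappings and describe the differences."""
    messages = []
    # Columns present in the baseline but missing from the current dataset
    for column in sorted(baseline.keys() - current.keys()):
        messages.append(f"{column}: removed (was {baseline[column]})")
    # Columns that only exist in the current dataset
    for column in sorted(current.keys() - baseline.keys()):
        messages.append(f"{column}: added as {current[column]}")
    # Columns present in both whose dtype changed
    for column in sorted(baseline.keys() & current.keys()):
        if baseline[column] != current[column]:
            messages.append(
                f"{column}: dtype changed {baseline[column]} -> {current[column]}"
            )
    return messages


# Hypothetical schemas for illustration only
baseline_schema = {"id": "int64", "price": "float64", "region": "object"}
current_schema = {"id": "int64", "price": "object", "channel": "object"}

drift_messages = schema_drift(baseline_schema, current_schema)
for message in drift_messages:
    print(message)
```

With pandas, a mapping like this could be built from `df.dtypes`, for example `{name: str(dtype) for name, dtype in df.dtypes.items()}`.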
Custom Data Quality Rules
Create custom rules through the Streamlit UI:
- Click "Manage Rules" in the sidebar
- Create rules of different types:
  - Range rules: Validate numeric values within min/max bounds
  - Regex rules: Pattern matching for string columns
  - Enum rules: Allow only specific values in a column
- Rules execute automatically with each analysis
- Export/import rules as JSON for reuse across projects
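The three rule types can be illustrated with a short plain-Python sketch. The rule dictionaries below use a hypothetical format invented for this example; LavenderTown's own rule objects and JSON schema may look different, and the helper names `violates` and `check_rows` are assumptions as well.

```python
import json
import re

# Hypothetical rule format for illustration; not LavenderTown's actual schema.
rules = [
    {"type": "range", "column": "age", "min": 0, "max": 120},
    {"type": "regex", "column": "email", "pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
    {"type": "enum", "column": "status", "allowed": ["active", "inactive"]},
]


def violates(rule: dict, value) -> bool:
    """Return True when a non-null value breaks the rule."""
    if value is None:
        return False  # nulls are treated as a completeness issue, not a rule violation
    if rule["type"] == "range":
        return not (rule["min"] <= value <= rule["max"])
    if rule["type"] == "regex":
        return re.fullmatch(rule["pattern"], value) is None
    if rule["type"] == "enum":
        return value not in rule["allowed"]
    raise ValueError(f"unknown rule type: {rule['type']}")


def check_rows(rows: list[dict], rules: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, column) pairs for every violation."""
    return [
        (index, rule["column"])
        for index, row in enumerate(rows)
        for rule in rules
        if violates(rule, row.get(rule["column"]))
    ]


rows = [
    {"age": 34, "email": "ana@example.com", "status": "active"},
    {"age": 150, "email": "not-an-email", "status": "active"},
    {"age": 28, "email": "bo@example.com", "status": "archived"},
]
violations = check_rows(rows, rules)
print(violations)

# Rules kept as plain data round-trip through JSON unchanged,
# which is what makes export/import across projects straightforward.
restored_rules = json.loads(json.dumps(rules))
```

Keeping rules as data rather than code is one plausible reason they can be exported and re-imported as JSON.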
Programmatic Usage
Use LavenderTown in your Python scripts:
from lavendertown import Inspector, GhostFinding
import pandas as pd
df = pd.read_csv("data.csv")
inspector = Inspector(df)
# Get findings programmatically
findings = inspector.detect()
# Filter by severity
errors = [f for f in findings if f.severity == "error"]
warnings = [f for f in findings if f.severity == "warning"]
# Access finding details
for finding in errors:
    print(f"Column: {finding.column}")
    print(f"Type: {finding.ghost_type}")
    print(f"Description: {finding.description}")
    if finding.row_indices:
        print(f"Affected rows: {len(finding.row_indices)}")
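Once findings are available as a list, ordinary Python tools are enough to summarize them. The sketch below uses stand-in objects so that it runs without a dataset; the `ghost_type` strings are placeholders, and only the attribute names `column`, `ghost_type`, and `severity` are taken from the example above.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in findings for illustration; real ones would come from inspector.detect().
# The ghost_type values here are placeholders, not confirmed LavenderTown names.
findings = [
    SimpleNamespace(column="age", ghost_type="outlier", severity="warning"),
    SimpleNamespace(column="email", ghost_type="null", severity="error"),
    SimpleNamespace(column="age", ghost_type="type", severity="error"),
]

# How many findings exist at each severity level
severity_counts = Counter(f.severity for f in findings)

# Which columns have at least one error-level finding
columns_with_errors = sorted({f.column for f in findings if f.severity == "error"})

print(dict(severity_counts))
print(columns_with_errors)
```

A summary like this could, for instance, drive a pass/fail check in a scheduled data pipeline.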
👻 Ghost Categories
LavenderTown detects four main categories of data quality issues:
- Structural Ghosts - Mixed dtypes, schema drift, unexpected nullability
- Value Ghosts - Out-of-range values, regex violations, enum violations
- Completeness Ghosts - Null density thresholds, conditional nulls
- Statistical Ghosts - Outliers (IQR method), distribution shifts
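Two of these checks are simple enough to sketch in standard-library Python. This is an illustration of the general techniques (IQR fences and null density), not LavenderTown's code: the function names, the multiplier `k = 1.5`, and the quartile convention used by `statistics.quantiles` are assumptions and may differ from what the package does.

```python
import statistics


def iqr_outlier_indices(values: list[float], k: float = 1.5) -> list[int]:
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def null_density(values: list) -> float:
    """Return the fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


# Hypothetical column values for illustration
prices = [10, 12, 11, 13, 12, 11, 95, 12]
outlier_indices = iqr_outlier_indices(prices)
print(outlier_indices)

notes = ["ok", None, "late", None]
density = null_density(notes)
print(density)
```

Different libraries interpolate quartiles differently, so the exact fences can vary slightly between implementations.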
Each finding includes:
- Ghost type: Category of the issue
- Column: Affected column name
- Severity: info, warning, or error
- Description: Human-readable explanation
- Row indices: Specific rows affected (when applicable)
- Metadata: Additional diagnostic information
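As a rough mental model, a finding can be pictured as a small record with the fields listed above. The dataclass below is a hypothetical stand-in written for this example; it is not the package's `GhostFinding` class, whose exact definition may differ, and the `ghost_type` value used is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

ALLOWED_SEVERITIES = {"info", "warning", "error"}


@dataclass
class FindingSketch:
    """Hypothetical record mirroring the fields listed above."""

    ghost_type: str
    column: str
    severity: str
    description: str
    row_indices: Optional[list[int]] = None  # None when no specific rows apply
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject severities outside the three documented levels
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


finding = FindingSketch(
    ghost_type="outlier",  # placeholder value
    column="price",
    severity="warning",
    description="3 values fall outside the IQR fences",
    row_indices=[4, 17, 42],
    metadata={"method": "iqr"},
)

# A flat record like this serializes directly to JSON
payload = json.dumps(asdict(finding))
print(payload)
```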
🏗️ Architecture
LavenderTown is built with a plugin-based architecture:
- Inspector: Main orchestrator that coordinates detection and rendering
- Detectors: Stateless, UI-agnostic modules for detecting specific ghost types
  - NullGhostDetector: Detects excessive null values
  - TypeGhostDetector: Identifies type inconsistencies
  - OutlierGhostDetector: Finds statistical outliers using IQR method
  - RuleBasedDetector: Executes custom user-defined rules
- UI Components: Streamlit-native visualization components
- Export Layer: JSON and CSV export functionality
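The orchestrator-plus-detectors layout can be sketched in a few lines. Everything below is hypothetical and simplified: the class names ending in `Sketch`, the `detect` signature, and the dict-of-lists data format are assumptions made for illustration and do not reflect LavenderTown's real interfaces.

```python
from typing import Protocol


class Detector(Protocol):
    """Anything with a detect() method can be plugged into the inspector."""

    def detect(self, columns: dict[str, list]) -> list[str]: ...


class NullDetectorSketch:
    """Flags columns whose share of None values exceeds a threshold."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def detect(self, columns: dict[str, list]) -> list[str]:
        return [
            f"{name}: null density above {self.threshold}"
            for name, values in columns.items()
            if values and sum(v is None for v in values) / len(values) > self.threshold
        ]


class MixedTypeDetectorSketch:
    """Flags columns that hold more than one Python type (ignoring None)."""

    def detect(self, columns: dict[str, list]) -> list[str]:
        messages = []
        for name, values in columns.items():
            kinds = {type(v).__name__ for v in values if v is not None}
            if len(kinds) > 1:
                messages.append(f"{name}: mixed types {sorted(kinds)}")
        return messages


class InspectorSketch:
    """Runs every registered detector and collects the results."""

    def __init__(self, columns: dict[str, list], detectors: list[Detector]) -> None:
        self.columns = columns
        self.detectors = detectors

    def detect(self) -> list[str]:
        findings: list[str] = []
        for detector in self.detectors:
            findings.extend(detector.detect(self.columns))
        return findings


# Hypothetical data for illustration
data = {"id": [1, 2, "3", 4], "note": [None, None, None, "ok"]}
inspector = InspectorSketch(data, [NullDetectorSketch(), MixedTypeDetectorSketch()])
results = inspector.detect()
for line in results:
    print(line)
```

Because the detectors never touch the UI, the same list of results could feed a dashboard, a script, or an export step.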
🛠️ Development
Installation for Development
git clone https://github.com/eddiethedean/lavendertown.git
cd lavendertown
pip install -e ".[dev]"
Running Tests
pytest tests/
Code Quality
# Format code
ruff format .
# Lint
ruff check .
# Type checking
mypy lavendertown/
📊 Performance
LavenderTown is optimized for performance:
- Small datasets (<10k rows): Near-instantaneous analysis
- Medium datasets (10k-100k rows): Sub-second analysis
- Large datasets (100k-1M rows): Optimized with caching and vectorized operations
Benchmark results and optimization recommendations are documented in docs/PERFORMANCE.md.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with Streamlit for the UI
- Powered by Pandas and Polars for data processing
- Visualizations created with Altair
🔗 Links
- Homepage: https://github.com/eddiethedean/lavendertown
- Repository: https://github.com/eddiethedean/lavendertown
- Issues: https://github.com/eddiethedean/lavendertown/issues
Made with ❤️ for the data quality community