MATILDA - TGD Rule Discovery Algorithm CLI
MATILDA: Tuple-Generating Dependencies Discovery
MATILDA (Mining Approximate Tuple-Generating Dependencies in Large Databases) is a research-grade command-line tool that automatically discovers Tuple-Generating Dependencies (TGDs) in relational databases and presents the results in a Rich-based terminal interface.
New: Modern CLI Interface
MATILDA now features a gorgeous, colorful interface powered by Rich:
- Beautiful ASCII art banner with MATILDA branding
- Live progress bars with spinners and time tracking
- Elegant tables for displaying discovered rules
- Color-coded panels for configuration and results
- Status indicators with emojis for better UX
Try it: python demo_cli.py for a quick preview!
See CLI_MODERNIZATION.md for full details.
Academic Context
This tool was developed as part of post-doctoral research in database theory and knowledge discovery. It implements novel algorithms for mining logical rules from relational data, with applications in:
- Data Quality: Detecting inconsistencies and integrity violations
- Schema Design: Understanding implicit constraints and relationships
- Knowledge Extraction: Discovering hidden patterns in enterprise databases
- Database Reverse Engineering: Reconstructing business logic from legacy systems
Key Features
- Automated TGD Discovery: Extracts logical rules without manual specification
- Confidence Metrics: Ranks discovered rules by support and accuracy
- Multi-Database Support: Works with SQLite, MySQL, PostgreSQL via SQLAlchemy
- Efficient Pruning: Uses constraint graphs and heuristics to scale to large schemas
- MLflow Integration: Optional experiment tracking for research workflows
- Resource Management: Built-in memory limits and timeout controls
- Comprehensive Reporting: Generates JSON and Markdown outputs
Installation
Quick Install
# Run from the root of a source checkout
pip install -e .
From Source
git clone https://github.com/Fran-cois/MATILDA.git
cd MATILDA
pip install -e .
With MLflow Support
pip install -e ".[mlflow]"
Development Installation
pip install -e ".[dev]"
Quick Start
Demo Mode (Quickest Way to Try MATILDA)
# Run with pre-configured university demo database
matilda --demo imperfect_database
This will:
- Auto-create a demo SQLite database with 50 students, 81 enrollments
- Run TGD discovery on realistic data with built-in violations
- Generate reports in results/
See DEMO_GUIDE.md for details.
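A violation of a dependency is a row for which the rule body matches but no matching head tuple exists. The following is a minimal sketch of that idea, not the demo's own code: the table and column names are hypothetical and the real demo schema may differ. It lists the violations of a rule of the form enrollment(S, C) -> course(C, T) with a NOT EXISTS query.

```python
import sqlite3

# Hypothetical toy schema; the real demo database may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course (course_id TEXT, title TEXT);
    CREATE TABLE enrollment (student_id TEXT, course_id TEXT);
    INSERT INTO course VALUES ('c1', 'Databases'), ('c2', 'Logic');
    INSERT INTO enrollment VALUES ('s1', 'c1'), ('s2', 'c2'), ('s3', 'c9');
""")

# Rule: enrollment(S, C) -> exists T such that course(C, T).
# A violation is a body match with no witness for the head.
violations = conn.execute("""
    SELECT e.student_id, e.course_id
    FROM enrollment AS e
    WHERE NOT EXISTS (
        SELECT 1 FROM course AS c WHERE c.course_id = e.course_id
    )
    ORDER BY e.student_id
""").fetchall()

print(violations)
conn.close()
```

Here the enrollment of s3 in the non-existent course c9 is the only row reported.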
Basic Usage
# Run with default config.yaml
matilda
# Run with specific config
matilda --config /path/to/config.yaml
Configuration
Create a config.yaml file:
monitor:
  memory_threshold: 16106127360  # 15GB
  timeout: 3600                  # 1 hour
database:
  path: "data/db/"
  name: "database.db"
logging:
  log_dir: "logs"
results:
  output_dir: "results"
mlflow:
  use: false
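The memory threshold is given in bytes and the timeout in seconds, so the value 16106127360 is 15 x 1024^3 (15 GiB). As a small sketch for deriving such values, the helpers below are hypothetical conveniences and not part of MATILDA:

```python
def gib_to_bytes(gib: int) -> int:
    """Convert gibibytes to bytes (1 GiB = 1024**3 bytes)."""
    return gib * 1024**3


def hours_to_seconds(hours: int) -> int:
    """Convert hours to seconds."""
    return hours * 3600


print(gib_to_bytes(15))     # value to use for monitor.memory_threshold
print(hours_to_seconds(1))  # value to use for monitor.timeout
```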
Usage Examples
Quick Demo with Fake Database
Try MATILDA instantly with our pre-built university database:
# Generate the demo database
python scripts/generate_fake_university_db.py
# Run the beautiful CLI demo
python ai_doc/demo_cli.py
# Inspect the database
python scripts/inspect_university_db.py
# Run MATILDA on the demo database
python -m matilda_cli --database data/university.db
The fake database contains:
- 5 departments (CS, Math, Physics, Business, Engineering)
- 10 professors
- 15 students
- 13 courses
- 42 enrollments + teaching assignments + advisor relationships
- 113 total tuples across 7 tables
See data/README.md for more details.
Example 1: Basic Discovery
# Run with default configuration
matilda --config config.yaml
Sample Output:
INFO - Discovered 47 rules
INFO - Top rule: student(X, Y) ∧ enrollment(Y, Z) → course(Z, W) [support: 156, confidence: 0.94]
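This README does not define how MATILDA computes support and confidence. Under a common convention, assumed here, support is the number of body matches for which the head also holds, and confidence is that count divided by the number of body matches. A minimal sketch on made-up toy data for a rule with the same shape as the sample output:

```python
# Toy relations (made-up data):
# student(name, sid), enrollment(sid, cid), course(cid, title)
student = [("ann", 1), ("bob", 2), ("eve", 3)]
enrollment = [(1, "db"), (2, "db"), (2, "ml"), (3, "xx")]
course = [("db", "Databases"), ("ml", "Machine Learning")]

# Body: student(X, Y) AND enrollment(Y, Z), joined on Y.
body_matches = [
    (x, y, z)
    for (x, y) in student
    for (y2, z) in enrollment
    if y == y2
]

# Head: course(Z, W) must hold for some W.
course_ids = {cid for (cid, _title) in course}
head_matches = [m for m in body_matches if m[2] in course_ids]

# Assumed definitions, not confirmed by the MATILDA documentation.
support = len(head_matches)
confidence = support / len(body_matches)
print(f"support={support}, confidence={confidence:.2f}")
```

Three of the four body matches have a matching course, so this toy rule has support 3 and confidence 0.75.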
Example 2: Custom Parameters
# config.yaml
monitor:
  memory_threshold: 16106127360  # 15GB
  timeout: 3600                  # 1 hour
database:
  path: "data/databases/"
  name: "university.db"
algorithm:
  nb_occurrence: 5  # Minimum support
  max_table: 4      # Max tables per rule
  max_vars: 8       # Max variables per rule
logging:
  log_dir: "logs"
results:
  output_dir: "results"
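A custom configuration usually overrides only some settings. As a rough sketch of how overrides could be layered over defaults, the `deep_merge` helper below is hypothetical and is not claimed to be how MATILDA's config loader works. The default values are the ones listed in the configuration reference of this README.

```python
from copy import deepcopy

# Defaults as documented in the configuration reference.
DEFAULTS = {
    "monitor": {"memory_threshold": 16106127360, "timeout": 3600},
    "algorithm": {"nb_occurrence": 3, "max_table": 3, "max_vars": 6},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A partial override: max_vars and the monitor section keep their defaults.
user = {"algorithm": {"nb_occurrence": 5, "max_table": 4}}
config = deep_merge(DEFAULTS, user)
print(config["algorithm"])
print(config["monitor"]["timeout"])
```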
Example 3: With MLflow Experiment Tracking
# Install with MLflow support
pip install -e ".[mlflow]"
# Configure MLflow in config.yaml
cat > config.yaml << EOF
mlflow:
  use: true
  tracking_uri: "http://localhost:5000"
  experiment_name: "MATILDA_University_DB"
database:
  path: "data/"
  name: "university.db"
EOF
# Start MLflow server (separate terminal)
mlflow server --host 127.0.0.1 --port 5000
# Run MATILDA with tracking
matilda --config config.yaml
Example 4: Programmatic Usage
from pathlib import Path
from matilda_cli.database.alchemy_utility import AlchemyUtility
from matilda_cli.algorithms.matilda import MATILDA
# Connect to database
db_path = Path("data/university.db")
db = AlchemyUtility(db_path)
# Configure algorithm
settings = {
    "nb_occurrence": 3,
    "max_table": 3,
    "max_vars": 6,
}
# Discover rules
matilda = MATILDA(db, settings)
for rule in matilda.discover_rules():
    print(f"{rule} [support: {rule.support}, confidence: {rule.confidence:.2f}]")
# Cleanup
db.close()
Configuration Options
Complete Configuration Reference
# Monitor Settings
monitor:
  memory_threshold: 16106127360  # Maximum memory usage (bytes, default: 15GB)
  timeout: 3600                  # Maximum execution time (seconds, default: 1h)
# Database Settings
database:
  path: "data/databases/"  # Directory containing database files
  name: "database.db"      # Database filename
# Algorithm Settings
algorithm:
  nb_occurrence: 3  # Minimum rule support (default: 3)
  max_table: 3      # Maximum tables per rule (default: 3)
  max_vars: 6       # Maximum variables per rule (default: 6)
# Logging Settings
logging:
  log_dir: "logs"  # Directory for log files
  level: "INFO"    # Log level: DEBUG, INFO, WARNING, ERROR
# Results Settings
results:
  output_dir: "results"  # Directory for output files
# MLflow Settings (optional)
mlflow:
  use: false                             # Enable/disable MLflow tracking
  tracking_uri: "http://localhost:5000"  # MLflow server URL
  experiment_name: "MATILDA Discovery"   # Experiment name
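The monitor settings bound a run by memory and wall-clock time. The sketch below illustrates that kind of check and is not MATILDA's actual monitor: the `ResourceGuard` class is hypothetical, and the memory reading is injected so the example stays within the standard library. A real probe could be `psutil.Process().memory_info().rss`, since psutil is listed among the dependencies.

```python
import time
from typing import Callable


class ResourceGuard:
    """Raise when a time or memory budget is exceeded (illustrative only)."""

    def __init__(self, memory_threshold: int, timeout: float,
                 read_memory: Callable[[], int]) -> None:
        self.memory_threshold = memory_threshold  # bytes
        self.timeout = timeout                    # seconds
        self.read_memory = read_memory            # returns bytes in use
        self.start = time.monotonic()

    def check(self) -> None:
        """Call periodically from the long-running loop."""
        elapsed = time.monotonic() - self.start
        if elapsed > self.timeout:
            raise TimeoutError(f"exceeded {self.timeout}s")
        used = self.read_memory()
        if used > self.memory_threshold:
            raise MemoryError(f"{used} bytes exceeds {self.memory_threshold}")


# A fake reader reporting 2 GiB stands in for a real memory probe.
guard = ResourceGuard(memory_threshold=16106127360, timeout=3600,
                      read_memory=lambda: 2 * 1024**3)
guard.check()  # within both budgets, so nothing is raised
```

Injecting the reader keeps the budget logic testable without depending on the actual memory use of the process.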
Algorithm Parameters Tuning Guide
| Parameter | Effect | Recommendation |
|---|---|---|
| nb_occurrence | Higher = more general rules, fewer results | Start with 3-5 for exploration |
| max_table | Higher = more complex rules, slower execution | 3 for most cases, 4-5 for complex schemas |
| max_vars | Higher = more expressive rules, larger search space | 6 is balanced, increase to 8-10 for research |
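To get a feel for why raising max_table slows execution, the rough illustration below counts the multisets of tables a rule could mention (a table may appear more than once) in a 7-table schema. This counts table combinations only. It is an assumption-laden upper-level view, not MATILDA's actual candidate enumeration, which is pruned with constraint graphs and heuristics.

```python
from math import comb


def table_combinations(n_tables: int, max_table: int) -> int:
    """Count multisets of 1..max_table tables drawn from n_tables."""
    return sum(comb(n_tables + k - 1, k) for k in range(1, max_table + 1))


for max_table in (3, 4, 5):
    print(max_table, table_combinations(7, max_table))
```

For 7 tables the count grows from 119 at max_table = 3 to 791 at max_table = 5, before variable assignments are even considered.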
Output Format
MATILDA generates three types of output:
1. JSON Rules File
{
  "rules": [
    {
      "body": ["student(X, Y)", "enrollment(Y, Z)"],
      "head": ["course(Z, W)"],
      "support": 156,
      "confidence": 0.94,
      "accuracy": 0.94,
      "tgd_string": "student(X, Y) ∧ enrollment(Y, Z) → course(Z, W)"
    }
  ],
  "metadata": {
    "database": "university.db",
    "total_rules": 47,
    "execution_time": "12.3s"
  }
}
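A file with this structure can be post-processed with the standard library. The sketch below parses an inline sample instead of a real results file, because the output filename is not stated here. The second rule is made up for illustration, and the support cutoff of 5 is an arbitrary choice.

```python
import json

# Inline sample mirroring the documented structure; the second rule is invented.
raw = """
{
  "rules": [
    {"body": ["student(X, Y)", "enrollment(Y, Z)"], "head": ["course(Z, W)"],
     "support": 156, "confidence": 0.94},
    {"body": ["professor(X, Y)"], "head": ["department(Y, Z)"],
     "support": 40, "confidence": 0.99}
  ],
  "metadata": {"database": "university.db", "total_rules": 2}
}
"""
data = json.loads(raw)

# Keep well-supported rules and rank them by confidence, then support.
ranked = sorted(
    (r for r in data["rules"] if r["support"] >= 5),
    key=lambda r: (r["confidence"], r["support"]),
    reverse=True,
)
for r in ranked:
    print(" AND ".join(r["body"]), "->", " AND ".join(r["head"]), r["confidence"])
```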
2. Markdown Report
Auto-generated summary with:
- Execution statistics
- Top 5 rules by confidence
- Database schema overview
- Timestamps and configuration
3. Detailed Logs
2025-11-18 10:15:23 - INFO - Starting MATILDA discovery
2025-11-18 10:15:24 - INFO - Loaded database: university.db (3 tables, 12 attributes)
2025-11-18 10:15:25 - INFO - Generated 45 candidate rules
2025-11-18 10:15:26 - INFO - Pruned 12 rules (support threshold)
2025-11-18 10:15:27 - INFO - Discovered 47 valid TGDs
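Log lines in this "timestamp - level - message" layout are easy to parse for later analysis. A minimal sketch, assuming every line follows exactly the layout of the sample above:

```python
import re
from datetime import datetime

# Pattern for "YYYY-MM-DD HH:MM:SS - LEVEL - message".
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?P<level>[A-Z]+) - (?P<msg>.*)$"
)

lines = [
    "2025-11-18 10:15:23 - INFO - Starting MATILDA discovery",
    "2025-11-18 10:15:27 - INFO - Discovered 47 valid TGDs",
]

records = []
for line in lines:
    match = LOG_LINE.match(line)
    if match:  # skip lines that do not follow the layout
        records.append((
            datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
            match["level"],
            match["msg"],
        ))

elapsed = (records[-1][0] - records[0][0]).total_seconds()
print(f"{len(records)} records, {elapsed:.0f}s between first and last")
```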
Development
Project Architecture
matilda_cli/
├── matilda_cli/                          # Main package
│   ├── __init__.py
│   ├── __main__.py                       # CLI entry point
│   ├── algorithms/                       # MATILDA algorithm implementation
│   │   ├── matilda.py                    # Main algorithm class
│   │   └── MATILDA/                      # Core discovery logic
│   │       ├── tgd_discovery.py          # TGD mining algorithms
│   │       ├── constraint_graph.py       # Graph-based representation
│   │       └── candidate_rule_chains.py  # Rule generation
│   ├── database/                         # Database utilities
│   │   ├── alchemy_utility.py            # SQLAlchemy interface
│   │   ├── query_utility.py              # Query optimization
│   │   └── data_exporter.py              # Export utilities
│   └── utils/                            # Shared utilities
│       ├── config_loader.py              # YAML configuration
│       ├── rules.py                      # Rule data structures
│       └── monitor.py                    # Resource monitoring
├── tests/                                # Comprehensive test suite
│   ├── algorithms/                       # Algorithm tests (61 tests passing)
│   ├── database/                         # Database utilities tests
│   ├── fixtures/                         # Test databases and fixtures
│   └── benchmarks/                       # Performance benchmarks
├── scripts/                              # Development utilities
│   ├── check_tests.py                    # Test integrity verification
│   ├── run_benchmarks.py                 # Performance benchmarking
│   └── generate_test_report.py           # Test coverage reports
├── docs/                                 # Documentation (auto-generated)
├── setup.py                              # Package configuration
├── pyproject.toml                        # Modern Python packaging
└── pytest.ini                            # Test configuration
Running Tests
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=matilda_cli tests/
# Run specific test categories
pytest tests/algorithms/ # Algorithm tests
pytest tests/database/ # Database tests
pytest -m benchmark tests/ # Performance benchmarks only
# Run tests with detailed output
pytest -v --tb=short tests/
Test Suite Status
- Total Tests: 61 (all passing)
- Coverage: 45%
- Test Categories:
  - Unit Tests: 35 tests
  - Integration Tests: 20 tests
  - Benchmarks: 6 tests
- Databases: 3 test schemas (university, employee, retail)
Code Quality Tools
# Type checking
mypy matilda_cli/
# Code formatting
black matilda_cli/ tests/
# Linting
ruff check matilda_cli/
# Run all checks
./scripts/check_tests.py
Requirements
- Python: 3.10 or higher (required for type hints and modern language features)
- Core Dependencies:
  - SQLAlchemy >= 2.0
  - PyYAML >= 6.0
  - psutil >= 5.9
  - tqdm >= 4.66
  - colorama >= 0.4.6
- Optional:
  - mlflow >= 2.0 (for experiment tracking)
  - pytest >= 7.0 (for development)
Documentation
- Quick Start Guide - Get started in 5 minutes
- Testing Guide - Comprehensive testing documentation
- Test Status - Current test suite status
- API Documentation - Full API reference (auto-generated)
Contributing
Contributions are welcome! This is an academic research project, so please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Add tests for new functionality
- Ensure all tests pass (pytest tests/)
- Document your changes
- Commit with clear messages (git commit -m 'Add amazing feature')
- Push to your branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone repository
git clone https://github.com/Fran-cois/MATILDA.git
cd MATILDA
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in development mode with all extras
pip install -e ".[dev,mlflow]"
# Verify installation
matilda --help
pytest tests/
Related Datasets
Test databases are based on schemas from:
- Zenodo Repository: DOI 10.5281/zenodo.17644035
Acknowledgments
- Database Theory Community: For foundational work on TGDs and dependencies
- SQLAlchemy Project: For the excellent database abstraction layer
- Open Source Contributors: For tools and libraries that made this possible
- Academic Reviewers: For feedback on algorithm design and implementation
Roadmap
Current Version (0.1.0)
- Core TGD discovery algorithm
- SQLite, MySQL, PostgreSQL support
- MLflow integration
- Comprehensive test suite
Planned Features (0.2.0)
- Parallel rule discovery
- Interactive rule refinement
- Web-based visualization
- Incremental discovery for large databases
Future Research Directions
- Probabilistic TGDs
- Temporal dependency discovery
- Multi-database federation
- Machine learning-guided pruning
Academic Impact
This tool is designed to support:
- PhD Students: As a baseline for dependency discovery research
- Database Researchers: For experimental evaluation and comparison
- Industry Practitioners: For data quality assessment and schema understanding
- Educators: As a teaching tool for database theory concepts