MATILDA - TGD Rule Discovery Algorithm CLI

MATILDA: Tuple-Generating Dependencies Discovery

Python 3.10+ | License: Apache 2.0 | Code style: black

MATILDA (Mining Approximate Tuple-Generating Dependencies in Large Databases) is a research-grade command-line tool that automatically discovers Tuple-Generating Dependencies (TGDs) in relational databases and presents the results in a modern terminal interface.

✨ New: Modern CLI Interface

MATILDA now features a gorgeous, colorful interface powered by Rich:

  • 🎨 Beautiful ASCII art banner with MATILDA branding
  • 📊 Live progress bars with spinners and time tracking
  • 📋 Elegant tables for displaying discovered rules
  • 🎯 Color-coded panels for configuration and results
  • ✨ Status indicators with emojis for better UX

Try it: python demo_cli.py for a quick preview!

See CLI_MODERNIZATION.md for full details.

🎓 Academic Context

This tool was developed as part of post-doctoral research in database theory and knowledge discovery. It implements novel algorithms for mining logical rules from relational data, with applications in:

  • Data Quality: Detecting inconsistencies and integrity violations
  • Schema Design: Understanding implicit constraints and relationships
  • Knowledge Extraction: Discovering hidden patterns in enterprise databases
  • Database Reverse Engineering: Reconstructing business logic from legacy systems

✨ Key Features

  • 🔍 Automated TGD Discovery: Extracts logical rules without manual specification
  • 📊 Confidence Metrics: Ranks discovered rules by support and accuracy
  • 🗄️ Multi-Database Support: Works with SQLite, MySQL, PostgreSQL via SQLAlchemy
  • ⚡ Efficient Pruning: Uses constraint graphs and heuristics to scale to large schemas
  • 📈 MLflow Integration: Optional experiment tracking for research workflows
  • 🛡️ Resource Management: Built-in memory limits and timeout controls
  • 📝 Comprehensive Reporting: Generates JSON and Markdown outputs
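
The resource-management feature can be pictured as a guard wrapped around the discovery loop. The sketch below only illustrates that general idea and is not MATILDA's monitor: `guarded` is a hypothetical helper, and it relies on the standard-library `tracemalloc` (which sees Python allocations only), whereas the tool itself lists `psutil` as a dependency.

```python
import time
import tracemalloc
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def guarded(items: Iterable[T], timeout: float, memory_threshold: int) -> Iterator[T]:
    """Yield items until the time budget or the traced-memory budget is exceeded."""
    started = time.monotonic()
    tracemalloc.start()
    try:
        for item in items:
            # Abort once the wall-clock budget (seconds) is used up.
            if time.monotonic() - started > timeout:
                raise TimeoutError(f"exceeded {timeout}s")
            # Abort once traced Python memory (bytes) passes the threshold.
            current, _peak = tracemalloc.get_traced_memory()
            if current > memory_threshold:
                raise MemoryError(f"exceeded {memory_threshold} bytes")
            yield item
    finally:
        tracemalloc.stop()


# Usage: stop a stand-in discovery loop after one hour or 15 GiB.
for candidate in guarded(range(3), timeout=3600, memory_threshold=16106127360):
    print("checking candidate", candidate)
```

Checking the budgets once per item keeps the guard cheap, at the cost of not interrupting a single long-running step.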

📦 Installation

Quick Install

pip install matilda-cli

From Source

git clone https://github.com/Fran-cois/MATILDA.git matilda_cli
cd matilda_cli
pip install -e .

With MLflow Support

pip install -e ".[mlflow]"

Development Installation

pip install -e ".[dev]"

🎯 Quick Start

Demo Mode (Quickest Way to Try MATILDA)

# Run with pre-configured university demo database
matilda --demo imperfect_database

This will:

  • ✅ Auto-create a demo SQLite database with 50 students, 81 enrollments
  • 🔍 Run TGD discovery on realistic data with built-in violations
  • 📊 Generate reports in results/

See DEMO_GUIDE.md for details.

Basic Usage

# Run with default config.yaml
matilda

# Run with specific config
matilda --config /path/to/config.yaml

Configuration

Create a config.yaml file:

monitor:
  memory_threshold: 16106127360  # 15GB
  timeout: 3600  # 1 hour

database:
  path: "data/db/"
  name: "database.db"

logging:
  log_dir: "logs"

results:
  output_dir: "results"

mlflow:
  use: false

📖 Usage Examples

Quick Demo with Fake Database

Try MATILDA instantly with our pre-built university database:

# Generate the demo database
python scripts/generate_fake_university_db.py

# Run the beautiful CLI demo
python ai_doc/demo_cli.py

# Inspect the database
python scripts/inspect_university_db.py

# Run MATILDA on the demo database
python -m matilda_cli --database data/university.db

The fake database contains:

  • 5 departments (CS, Math, Physics, Business, Engineering)
  • 10 professors
  • 15 students
  • 13 courses
  • 42 enrollments + teaching assignments + advisor relationships
  • 113 total tuples across 7 tables

See data/README.md for more details.

Example 1: Basic Discovery

# Run with default configuration
matilda --config config.yaml

Sample Output:

INFO - Discovered 47 rules
INFO - Top rule: student(X, Y) ∧ enrollment(Y, Z) → course(Z, W) [support: 156, confidence: 0.94]
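
To make the two numbers in the sample output concrete, here is a minimal sketch that scores the same rule shape on a toy in-memory SQLite database. The schema, the data, and the definitions used (support = body matches that also satisfy the head; confidence = support divided by all body matches) are assumptions for illustration; MATILDA's exact metric definitions may differ.

```python
import sqlite3

# Hypothetical toy schema and data; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student(student_id, enroll_id);
    CREATE TABLE enrollment(enroll_id, course_id);
    CREATE TABLE course(course_id, title);
    INSERT INTO student VALUES (1, 10), (2, 20), (3, 30), (4, 40);
    INSERT INTO enrollment VALUES (10, 'DB'), (20, 'ML'), (30, 'DB'), (40, 'XX');
    INSERT INTO course VALUES ('DB', 'Databases'), ('ML', 'Machine Learning');
""")

# Bindings (X, Y, Z) satisfying the body: student(X, Y) AND enrollment(Y, Z).
BODY = """
    SELECT DISTINCT s.student_id AS x, s.enroll_id AS y, e.course_id AS z
    FROM student s JOIN enrollment e ON s.enroll_id = e.enroll_id
"""

body_count = conn.execute(f"SELECT COUNT(*) FROM ({BODY})").fetchone()[0]

# Body bindings for which some W exists with course(Z, W).
support = conn.execute(
    f"""SELECT COUNT(*) FROM ({BODY}) AS b
        WHERE EXISTS (SELECT 1 FROM course c WHERE c.course_id = b.z)"""
).fetchone()[0]

confidence = support / body_count if body_count else 0.0
print(f"support: {support}, confidence: {confidence:.2f}")
# support: 3, confidence: 0.75

# Body bindings with no matching head tuple are the rule's violations.
violations = conn.execute(
    f"""SELECT x, y, z FROM ({BODY}) AS b
        WHERE NOT EXISTS (SELECT 1 FROM course c WHERE c.course_id = b.z)"""
).fetchall()
print(violations)  # [(4, 40, 'XX')]
conn.close()
```

The violating binding is the enrollment that points at a course code with no row in `course`, which is the kind of inconsistency a data-quality check would flag.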

Example 2: Custom Parameters

# config.yaml
monitor:
  memory_threshold: 16106127360  # 15GB
  timeout: 3600  # 1 hour

database:
  path: "data/databases/"
  name: "university.db"

algorithm:
  nb_occurrence: 5      # Minimum support
  max_table: 4          # Max tables per rule
  max_vars: 8           # Max variables per rule

logging:
  log_dir: "logs"

results:
  output_dir: "results"

Example 3: With MLflow Experiment Tracking

# Install with MLflow support
pip install -e ".[mlflow]"

# Configure MLflow in config.yaml
cat > config.yaml << EOF
mlflow:
  use: true
  tracking_uri: "http://localhost:5000"
  experiment_name: "MATILDA_University_DB"
  
database:
  path: "data/"
  name: "university.db"
EOF

# Start MLflow server (separate terminal)
mlflow server --host 127.0.0.1 --port 5000

# Run MATILDA with tracking
matilda --config config.yaml

Example 4: Programmatic Usage

from pathlib import Path

from matilda_cli.algorithms.matilda import MATILDA
from matilda_cli.database.alchemy_utility import AlchemyUtility

# Connect to the database
db_path = Path("data/university.db")
db = AlchemyUtility(db_path)

# Configure the algorithm (same keys as the algorithm section of config.yaml)
settings = {
    "nb_occurrence": 3,  # Minimum support
    "max_table": 3,      # Max tables per rule
    "max_vars": 6,       # Max variables per rule
}

try:
    # Discover rules and print each one with its metrics
    matilda = MATILDA(db, settings)
    for rule in matilda.discover_rules():
        print(f"{rule} [support: {rule.support}, confidence: {rule.confidence:.2f}]")
finally:
    # Always release the connection, even if discovery fails
    db.close()

🔧 Configuration Options

Complete Configuration Reference

# Monitor Settings
monitor:
  memory_threshold: 16106127360  # Maximum memory usage (bytes, default: 15GB)
  timeout: 3600                   # Maximum execution time (seconds, default: 1h)

# Database Settings  
database:
  path: "data/databases/"         # Directory containing database files
  name: "database.db"             # Database filename

# Algorithm Settings
algorithm:
  nb_occurrence: 3                # Minimum rule support (default: 3)
  max_table: 3                    # Maximum tables per rule (default: 3)
  max_vars: 6                     # Maximum variables per rule (default: 6)

# Logging Settings
logging:
  log_dir: "logs"                 # Directory for log files
  level: "INFO"                   # Log level: DEBUG, INFO, WARNING, ERROR

# Results Settings
results:
  output_dir: "results"           # Directory for output files
  
# MLflow Settings (optional)
mlflow:
  use: false                      # Enable/disable MLflow tracking
  tracking_uri: "http://localhost:5000"  # MLflow server URL
  experiment_name: "MATILDA Discovery"   # Experiment name
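
Since several settings have documented defaults, a partial config.yaml only needs to list what differs. The following sketch shows one way such a merge could work. The default values are taken from the reference above, but `merge_config` is a hypothetical helper and not MATILDA's actual config loader.

```python
from copy import deepcopy

# Defaults as documented above; the merge logic is an illustrative assumption.
DEFAULTS = {
    "monitor": {"memory_threshold": 16106127360, "timeout": 3600},
    "algorithm": {"nb_occurrence": 3, "max_table": 3, "max_vars": 6},
}


def merge_config(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of the defaults, overridden section by section."""
    merged = deepcopy(defaults)
    for section, values in user.items():
        # Unknown sections are added as-is; known ones are updated key by key.
        merged.setdefault(section, {}).update(values)
    return merged


config = merge_config(
    {"algorithm": {"max_table": 4}, "database": {"name": "university.db"}}
)
print(config["algorithm"])
# {'nb_occurrence': 3, 'max_table': 4, 'max_vars': 6}
```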

Algorithm Parameters Tuning Guide

| Parameter | Effect | Recommendation |
|---|---|---|
| nb_occurrence | Higher = more general rules, fewer results | Start with 3-5 for exploration |
| max_table | Higher = more complex rules, slower execution | 3 for most cases, 4-5 for complex schemas |
| max_vars | Higher = more expressive rules, larger search space | 6 is balanced, increase to 8-10 for research |
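
To get a feel for why a larger `max_table` slows execution, one can count how many table combinations a rule could draw on. This is a rough upper-bound sketch under our own assumptions (tables may repeat, join conditions and variables are ignored); it does not describe how MATILDA enumerates or prunes candidates.

```python
from math import comb


def table_combinations(n_tables: int, max_table: int) -> int:
    """Count multisets of 1..max_table tables drawn from n_tables.

    A table may appear more than once in a rule (self-joins), hence
    combinations with repetition: C(n + k - 1, k) for each size k.
    """
    return sum(comb(n_tables + k - 1, k) for k in range(1, max_table + 1))


# 7 tables, the size of the demo university database.
for max_table in (3, 4, 5):
    print(max_table, table_combinations(7, max_table))
# 3 119
# 4 329
# 5 791
```

Each extra table more than doubles the count here, before variable assignments are even considered, which is why the guide suggests starting at 3.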

📊 Output Format

MATILDA generates three types of output:

1. JSON Rules File

{
  "rules": [
    {
      "body": ["student(X, Y)", "enrollment(Y, Z)"],
      "head": ["course(Z, W)"],
      "support": 156,
      "confidence": 0.94,
      "accuracy": 0.94,
      "tgd_string": "student(X, Y) ∧ enrollment(Y, Z) → course(Z, W)"
    }
  ],
  "metadata": {
    "database": "university.db",
    "total_rules": 47,
    "execution_time": "12.3s"
  }
}
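
A rules file in this shape is easy to post-process. The sketch below filters rules by confidence using only the standard library. The JSON is inlined to keep the example self-contained, the second rule is invented for illustration, and `strong_rules` is a hypothetical helper.

```python
import json

# Inline document in the shape shown above. The second rule is made up
# for this example; real output would be read from the results directory.
report = json.loads("""
{
  "rules": [
    {"body": ["student(X, Y)", "enrollment(Y, Z)"], "head": ["course(Z, W)"],
     "support": 156, "confidence": 0.94},
    {"body": ["course(Z, W)"], "head": ["enrollment(Y, Z)"],
     "support": 40, "confidence": 0.61}
  ]
}
""")


def strong_rules(doc: dict, min_confidence: float) -> list[dict]:
    """Return rules at or above min_confidence, highest confidence first."""
    kept = [r for r in doc["rules"] if r["confidence"] >= min_confidence]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)


for rule in strong_rules(report, 0.9):
    print(" ∧ ".join(rule["body"]), "→", " ∧ ".join(rule["head"]))
# student(X, Y) ∧ enrollment(Y, Z) → course(Z, W)
```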

2. Markdown Report

Auto-generated summary with:

  • Execution statistics
  • Top 5 rules by confidence
  • Database schema overview
  • Timestamps and configuration

3. Detailed Logs

2025-11-18 10:15:23 - INFO - Starting MATILDA discovery
2025-11-18 10:15:24 - INFO - Loaded database: university.db (3 tables, 12 attributes)
2025-11-18 10:15:25 - INFO - Generated 45 candidate rules
2025-11-18 10:15:26 - INFO - Pruned 12 rules (support threshold)
2025-11-18 10:15:27 - INFO - Discovered 47 valid TGDs

🛠️ Development

Project Architecture

matilda_cli/
├── matilda_cli/              # Main package
│   ├── __init__.py
│   ├── __main__.py           # CLI entry point
│   ├── algorithms/           # MATILDA algorithm implementation
│   │   ├── matilda.py        # Main algorithm class
│   │   └── MATILDA/          # Core discovery logic
│   │       ├── tgd_discovery.py       # TGD mining algorithms
│   │       ├── constraint_graph.py    # Graph-based representation
│   │       └── candidate_rule_chains.py  # Rule generation
│   ├── database/             # Database utilities
│   │   ├── alchemy_utility.py         # SQLAlchemy interface
│   │   ├── query_utility.py           # Query optimization
│   │   └── data_exporter.py           # Export utilities
│   └── utils/                # Shared utilities
│       ├── config_loader.py           # YAML configuration
│       ├── rules.py                   # Rule data structures
│       └── monitor.py                 # Resource monitoring
├── tests/                    # Comprehensive test suite
│   ├── algorithms/           # Algorithm tests (61 tests passing)
│   ├── database/             # Database utilities tests
│   ├── fixtures/             # Test databases and fixtures
│   └── benchmarks/           # Performance benchmarks
├── scripts/                  # Development utilities
│   ├── check_tests.py        # Test integrity verification
│   ├── run_benchmarks.py     # Performance benchmarking
│   └── generate_test_report.py  # Test coverage reports
├── docs/                     # Documentation (auto-generated)
├── setup.py                  # Package configuration
├── pyproject.toml            # Modern Python packaging
└── pytest.ini                # Test configuration

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=matilda_cli tests/

# Run specific test categories
pytest tests/algorithms/         # Algorithm tests
pytest tests/database/           # Database tests
pytest -m benchmark tests/       # Performance benchmarks only

# Run tests with detailed output
pytest -v --tb=short tests/

Test Suite Status

  • Total Tests: 61 ✅
  • Coverage: 45%
  • Test Categories:
    • Unit Tests: 35 tests
    • Integration Tests: 20 tests
    • Benchmarks: 6 tests
  • Databases: 3 test schemas (university, employee, retail)

Code Quality Tools

# Type checking
mypy matilda_cli/

# Code formatting
black matilda_cli/ tests/

# Linting
ruff check matilda_cli/

# Run all checks
./scripts/check_tests.py

📋 Requirements

  • Python: 3.10 or higher (required for type hints and modern language features)
  • Core Dependencies:
    • SQLAlchemy >= 2.0
    • PyYAML >= 6.0
    • psutil >= 5.9
    • tqdm >= 4.66
    • colorama >= 0.4.6
  • Optional:
    • mlflow >= 2.0 (for experiment tracking)
    • pytest >= 7.0 (for development)

📚 Documentation

🤝 Contributing

Contributions are welcome! This is an academic research project, so please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass (pytest tests/)
  5. Document your changes
  6. Commit with clear messages (git commit -m 'Add amazing feature')
  7. Push to your branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Development Setup

# Clone repository
git clone https://github.com/Fran-cois/MATILDA.git matilda_cli
cd matilda_cli

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode with all extras
pip install -e ".[dev,mlflow]"

# Verify installation
matilda --help
pytest tests/

Related Datasets

Test databases are based on schemas from:

🙏 Acknowledgments

  • Database Theory Community: For foundational work on TGDs and dependencies
  • SQLAlchemy Project: For the excellent database abstraction layer
  • Open Source Contributors: For tools and libraries that made this possible
  • Academic Reviewers: For feedback on algorithm design and implementation

🗺️ Roadmap

Current Version (0.1.0)

  • ✅ Core TGD discovery algorithm
  • ✅ SQLite, MySQL, PostgreSQL support
  • ✅ MLflow integration
  • ✅ Comprehensive test suite

Planned Features (0.2.0)

  • 🔄 Parallel rule discovery
  • 🔄 Interactive rule refinement
  • 🔄 Web-based visualization
  • 🔄 Incremental discovery for large databases

Future Research Directions

  • 🔮 Probabilistic TGDs
  • 🔮 Temporal dependency discovery
  • 🔮 Multi-database federation
  • 🔮 Machine learning-guided pruning

🏆 Academic Impact

This tool is designed to support:

  • PhD Students: As a baseline for dependency discovery research
  • Database Researchers: For experimental evaluation and comparison
  • Industry Practitioners: For data quality assessment and schema understanding
  • Educators: As a teaching tool for database theory concepts
