
A cleaning and preprocessing library for conceptual model datasets


MCP4CM - Model Cleansing Package for Conceptual Models

Python 3.8+ · MIT License

A comprehensive library for cleaning and preprocessing conceptual model datasets, with a focus on UML models.

Overview

MCP4CM is a Python package for cleaning, filtering, and analyzing conceptual model datasets. Currently focused on UML models, the library provides tools for:

  • Loading and parsing UML model datasets
  • Cleaning models by filtering out empty or invalid files
  • Detecting and removing duplicate models
  • Filtering models based on naming patterns and quality metrics
  • Language detection for model content
  • Extracting metadata and statistical information from models

Installation

pip install mcp4cm

Quick Start

from mcp4cm import load_dataset

# Load a UML model dataset
dataset = load_dataset("modelset", path="path/to/modelset", uml_type="genmymodel")

# Filter empty or invalid files
from mcp4cm import uml_filter_empty_or_invalid_files
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)

# Filter models with generic class patterns
from mcp4cm import uml_filter_classes_by_generic_pattern
filtered_dataset = uml_filter_classes_by_generic_pattern(filtered_dataset)

# Get duplicate models based on hash
from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash
unique_models, duplicate_groups = detect_duplicates_by_hash(filtered_dataset)

Main Components

Base Models

  • Model: Base class for all model objects with common attributes
  • Dataset: Container class for collections of models
  • DatasetType: Enum for different dataset types
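To make the relationship between these classes concrete, here is a minimal sketch of how they could be modeled. The field names and enum members are assumptions for illustration; MCP4CM's actual attributes may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DatasetType(Enum):
    """Illustrative dataset-type enum; the library's actual members may differ."""
    UML = "uml"


@dataclass
class Model:
    """Minimal model record with the common attributes described above.

    Field names here are illustrative assumptions, not MCP4CM's API.
    """
    id: str
    name: Optional[str] = None
    text: str = ""


@dataclass
class Dataset:
    """Container holding a collection of models."""
    type: DatasetType
    models: List[Model] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.models)
```

UMLModel and UMLDataset would then subclass these base types to add UML-specific properties.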

UML-Specific Components

  • UMLModel: Extended model class with UML-specific properties
  • UMLDataset: Container for UML models with specialized methods

Data Filtering

MCP4CM provides various filtering methods to clean datasets:

  • Filter empty or invalid files
  • Filter models without proper names
  • Filter models with dummy class names
  • Filter models with generic patterns
  • Filter by name length or frequency
  • Filter models with sequential naming patterns
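The generic-pattern filters above can be illustrated with a small, self-contained sketch. The regex and the 50% cutoff are assumptions chosen for this example, not the patterns or thresholds MCP4CM ships with.

```python
import re
from typing import Dict, List

# Placeholder class names such as "Class1", "class_2", "Untitled", "MyClass".
# These patterns are illustrative assumptions, not MCP4CM's built-in rules.
GENERIC_NAME = re.compile(r"^(class|untitled|myclass|newclass)[\s_]*\d*$", re.IGNORECASE)


def is_generic(name: str) -> bool:
    """Return True if a class name matches a generic placeholder pattern."""
    return bool(GENERIC_NAME.match(name.strip()))


def filter_generic_models(
    models: Dict[str, List[str]], max_generic_ratio: float = 0.5
) -> Dict[str, List[str]]:
    """Drop empty models and models whose class names are mostly placeholders."""
    kept = {}
    for model_id, class_names in models.items():
        if not class_names:
            continue  # empty models are filtered out as well
        generic = sum(is_generic(n) for n in class_names)
        if generic / len(class_names) <= max_generic_ratio:
            kept[model_id] = class_names
    return kept
```

A model like `{"b": ["Class1", "Class2", "class_3"]}` would be dropped, while one with meaningful names like `["Customer", "Order", "Invoice"]` is kept.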

Duplicate Detection

  • Hash-based duplicate detection
  • TF-IDF based near-duplicate detection
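The hash-based approach can be sketched in a few lines: normalize each model's text, hash it, and group models that share a digest. This is a toy re-implementation for illustration, not MCP4CM's internal code.

```python
import hashlib
from typing import Dict, List, Tuple


def detect_exact_duplicates(models: Dict[str, str]) -> Tuple[List[str], List[List[str]]]:
    """Group models by the SHA-256 digest of their whitespace-normalized text.

    Returns one representative id per distinct content hash, plus the groups
    of ids that share a hash (the exact duplicates).
    """
    by_hash: Dict[str, List[str]] = {}
    for model_id, text in models.items():
        normalized = " ".join(text.split())  # collapse whitespace differences
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_hash.setdefault(digest, []).append(model_id)
    unique = [ids[0] for ids in by_hash.values()]
    duplicate_groups = [ids for ids in by_hash.values() if len(ids) > 1]
    return unique, duplicate_groups
```

Hashing catches only byte-identical (after normalization) duplicates; the TF-IDF detector below is needed for near-duplicates that differ slightly.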

Language Detection

  • Detect languages used in model text
  • Extract non-English models
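As a rough illustration of the idea, the sketch below flags models whose text contains almost no common English stopwords. This is a toy heuristic; MCP4CM presumably uses a proper language-detection model, and the threshold and stopword list here are assumptions.

```python
from typing import Dict

# A tiny stopword set for illustration only.
ENGLISH_STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}


def english_score(text: str) -> float:
    """Fraction of tokens that are common English stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_STOPWORDS for t in tokens) / len(tokens)


def extract_non_english(models: Dict[str, str], threshold: float = 0.1) -> Dict[str, str]:
    """Return the models whose text scores below the English threshold."""
    return {mid: text for mid, text in models.items() if english_score(text) < threshold}
```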

Examples

Loading and Basic Filtering

from mcp4cm import load_dataset, uml_filter_empty_or_invalid_files, uml_filter_models_without_names

# Load dataset
dataset = load_dataset("modelset", path="path/to/modelset")
print(f"Original dataset size: {len(dataset.models)} models")

# Apply basic filters
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)
filtered_dataset = uml_filter_models_without_names(filtered_dataset)
print(f"Filtered dataset size: {len(filtered_dataset.models)} models")

Analyzing Name Statistics

from mcp4cm.uml.data_extraction import get_word_counts_from_dataset, get_name_length_distribution

# Get word frequency statistics
most_common_names = get_word_counts_from_dataset(dataset, plt_fig=True)

# Get name length distribution
name_lengths = get_name_length_distribution(dataset, plt_fig=True)
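The statistics these helpers report can be reproduced from scratch with `collections.Counter`; this standalone sketch (not MCP4CM code) shows what the word-frequency and name-length computations boil down to, given a flat list of element names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def word_counts(names: Iterable[str], top_n: int = 5) -> List[Tuple[str, int]]:
    """Most frequent lowercase tokens across element names."""
    counter: Counter = Counter()
    for name in names:
        counter.update(name.lower().split())
    return counter.most_common(top_n)


def name_length_distribution(names: Iterable[str]) -> Counter:
    """Histogram of name lengths in characters."""
    return Counter(len(n) for n in names)
```

Unusually short names or a spike of one repeated token are typical signals of low-quality models worth filtering.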

Detecting and Removing Duplicates

from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash, tfidf_near_duplicate_detector

# Hash-based duplicates
unique_models, duplicate_groups = detect_duplicates_by_hash(dataset, inplace=True)

# Near-duplicates using TF-IDF
unique_models, near_duplicate_groups = tfidf_near_duplicate_detector(dataset, threshold=0.85, inplace=True)
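To show what TF-IDF near-duplicate detection means in practice, here is a dependency-free sketch of TF-IDF weighting plus cosine similarity. It only illustrates the technique; MCP4CM's `tfidf_near_duplicate_detector` may tokenize and weight differently.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Compute smoothed TF-IDF weight vectors for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency per term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Pairs whose cosine similarity exceeds a threshold (0.85 in the example above) are treated as near-duplicates.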

Documentation

Each module and function includes detailed documentation and usage examples. For more information on specific functions, please refer to the docstrings in the code.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Andjela Djelic and Syed Juned Ali

Citation

If you use MCP4CM in your research, please cite:

@software{mcp4cm2025,
  author = {Djelic, Andjela and Ali, Syed Juned},
  title = {MCP4CM: Model Cleansing Package for Conceptual Models},
  url = {https://github.com/borkdominik/model-cleansing},
  version = {1.0.1},
  year = {2025}
}


