A conceptual models dataset cleaning and preprocessing library
MCP4CM - Model Cleansing Package for Conceptual Models
A comprehensive library for cleaning and preprocessing conceptual model datasets, with a focus on UML models.
Overview
MCP4CM is a Python package designed to facilitate the cleaning, filtering, and analysis of conceptual model datasets. Currently focused on UML models, the library provides tools for:
- Loading and parsing UML model datasets
- Cleaning models by filtering out empty or invalid files
- Detecting and removing duplicate models
- Filtering models based on naming patterns and quality metrics
- Detecting the language of model content
- Extracting metadata and statistical information from models
Installation
pip install mcp4cm
Quick Start
from mcp4cm import load_dataset
# Load a UML model dataset
dataset = load_dataset("modelset", path="path/to/modelset", uml_type="genmymodel")
# Filter empty or invalid files
from mcp4cm import uml_filter_empty_or_invalid_files
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)
# Filter models with generic class patterns
from mcp4cm import uml_filter_classes_by_generic_pattern
filtered_dataset = uml_filter_classes_by_generic_pattern(filtered_dataset)
# Get duplicate models based on hash
from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash
unique_models, duplicate_groups = detect_duplicates_by_hash(filtered_dataset)
Main Components
Base Models
- Model: Base class for all model objects with common attributes
- Dataset: Container class for collections of models
- DatasetType: Enum for different dataset types
UML-Specific Components
- UMLModel: Extended model class with UML-specific properties
- UMLDataset: Container for UML models with specialized methods
Data Filtering
MCP4CM provides various filtering methods to clean datasets:
- Filter empty or invalid files
- Filter models without proper names
- Filter models with dummy class names
- Filter models with generic patterns
- Filter by name length or frequency
- Filter models with sequential naming patterns
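To make the name-based filters more concrete, here is a minimal standalone sketch of how dummy-name and sequential-name checks could work. It does not use MCP4CM's API; the `DUMMY_NAMES` set, the regular expression, the `min_run` parameter, and all function names are hypothetical and chosen only for illustration, so the library's actual rules may differ.

```python
import re

# Hypothetical placeholder names; the library's own list may differ.
DUMMY_NAMES = {"class", "foo", "bar", "test", "untitled", "newclass"}

# Matches a stem followed by a numeric suffix, e.g. "Class1", "Class_2", "Class 3".
SEQUENTIAL = re.compile(r"^(?P<stem>[A-Za-z_]+?)[\s_]*(?P<index>\d+)$")


def is_dummy_name(name: str) -> bool:
    """True if the name is empty or a known placeholder such as 'foo'."""
    cleaned = name.strip().lower()
    return not cleaned or cleaned in DUMMY_NAMES


def has_sequential_names(names: list[str], min_run: int = 3) -> bool:
    """True if at least `min_run` names share a stem and differ only by a number."""
    counts: dict[str, int] = {}
    for name in names:
        match = SEQUENTIAL.match(name.strip())
        if match:
            stem = match.group("stem").lower()
            counts[stem] = counts.get(stem, 0) + 1
    return any(count >= min_run for count in counts.values())


def keep_model(class_names: list[str]) -> bool:
    """Keep a model only if it has meaningful, non-sequential class names."""
    meaningful = [n for n in class_names if not is_dummy_name(n)]
    return bool(meaningful) and not has_sequential_names(class_names)


print(keep_model(["Order", "Customer"]))            # meaningful names
print(keep_model(["Class1", "Class2", "Class3"]))   # auto-generated names
```

A threshold such as `min_run` is a design choice: a single name like `Invoice1` is usually harmless, while a run of `Class1`, `Class2`, `Class3` suggests an unedited template.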
Duplicate Detection
- Hash-based duplicate detection
- TF-IDF-based near-duplicate detection
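As an illustration of the hash-based approach, the sketch below groups model files by a SHA-256 digest of their normalized text. It is a standalone example using only the standard library, not MCP4CM's implementation; the `normalize` step, the `group_by_hash` name, and the sample file names are assumptions.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences hash alike."""
    return " ".join(text.lower().split())


def group_by_hash(models: dict[str, str]) -> tuple[list[str], list[list[str]]]:
    """Return (ids of first occurrences, groups of ids with identical content)."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for model_id, text in models.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        buckets[digest].append(model_id)
    unique_ids = [ids[0] for ids in buckets.values()]
    duplicate_groups = [ids for ids in buckets.values() if len(ids) > 1]
    return unique_ids, duplicate_groups


# Hypothetical sample data: file name -> model text.
models = {
    "a.uml": "class Order { id }",
    "b.uml": "class  order {  id }",  # same content, different spacing and case
    "c.uml": "class Customer { name }",
}
unique_ids, duplicate_groups = group_by_hash(models)
print(unique_ids)        # ['a.uml', 'c.uml']
print(duplicate_groups)  # [['a.uml', 'b.uml']]
```

Hashing only catches exact matches after normalization. Models that differ by even one token end up in different buckets, which is where a similarity-based method such as TF-IDF comes in.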
Language Detection
- Detect languages used in model text
- Extract non-English models
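The sketch below shows one simple way to separate English from non-English model text. It deliberately swaps in a crude stopword-counting heuristic so that it runs with the standard library alone; it is not the detector MCP4CM uses, and the stopword lists, the `guess_language` name, and the sample data are all hypothetical.

```python
import re
from typing import Optional

# Tiny illustrative stopword lists; a real detector would use a trained model.
STOPWORDS = {
    "en": {"the", "of", "and", "is", "for", "with", "to"},
    "de": {"der", "die", "das", "und", "ist", "für", "mit"},
    "es": {"el", "la", "los", "y", "es", "para", "con"},
}


def guess_language(text: str) -> Optional[str]:
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical sample data: file name -> text extracted from the model.
texts = {
    "shop.uml": "the list of orders for the customer",
    "konto.uml": "die Liste der Bestellungen und das Konto",
    "bare.uml": "Order Customer Invoice",
}
non_english = [name for name, text in texts.items()
               if guess_language(text) not in (None, "en")]
print(non_english)  # ['konto.uml']
```

Model text often consists of short identifiers with no function words, so a heuristic like this frequently returns `None`. Treating undetermined models as "keep" rather than "non-English" avoids discarding them by accident.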
Examples
Loading and Basic Filtering
from mcp4cm import load_dataset, uml_filter_empty_or_invalid_files, uml_filter_models_without_names
# Load dataset
dataset = load_dataset("modelset", path="path/to/modelset")
print(f"Original dataset size: {len(dataset.models)} models")
# Apply basic filters
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)
filtered_dataset = uml_filter_models_without_names(filtered_dataset)
print(f"Filtered dataset size: {len(filtered_dataset.models)} models")
Analyzing Name Statistics
from mcp4cm.uml.data_extraction import get_word_counts_from_dataset, get_name_length_distribution
# Get word frequency statistics
most_common_names = get_word_counts_from_dataset(dataset, plt_fig=True)
# Get name length distribution
name_lengths = get_name_length_distribution(dataset, plt_fig=True)
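For readers who want to see what such statistics involve, here is a standalone sketch that computes word frequencies and a name-length distribution with `collections.Counter`. It does not call MCP4CM; the `split_identifier` helper and the sample names are assumptions, and the library's own tokenization may differ.

```python
import re
from collections import Counter


def split_identifier(name: str) -> list[str]:
    """Split 'CustomerOrder_item' into lowercase words: ['customer', 'order', 'item']."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


# Hypothetical class names collected from a dataset.
names = ["CustomerOrder", "OrderItem", "Customer", "Class1"]

# How often each word appears across all names.
word_counts = Counter(word for name in names for word in split_identifier(name))

# How many names have each character length.
length_distribution = Counter(len(name) for name in names)

print(word_counts["customer"], word_counts["order"])  # 2 2
print(sorted(length_distribution.items()))            # [(6, 1), (8, 1), (9, 1), (13, 1)]
```

Very frequent generic words and unusually short or long names are the kind of signal the name-based filters can then act on.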
Detecting and Removing Duplicates
from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash, tfidf_near_duplicate_detector
# Hash-based duplicates
unique_models, duplicate_groups = detect_duplicates_by_hash(dataset, inplace=True)
# Near-duplicates using TF-IDF
unique_models, near_duplicate_groups = tfidf_near_duplicate_detector(dataset, threshold=0.85, inplace=True)
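To show what a similarity threshold such as 0.85 means in practice, the following standalone sketch builds TF-IDF vectors by hand and flags document pairs whose cosine similarity reaches the threshold. It uses only the standard library and is not MCP4CM's implementation; the whitespace tokenization, the smoothed IDF formula, and the function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF stays positive even for terms found in every document.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def near_duplicate_pairs(docs: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose similarity is at least `threshold`."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical model texts reduced to their element names.
docs = [
    "order customer invoice payment product address",
    "order customer invoice payment product address shipment",
    "library book author",
]
print(near_duplicate_pairs(docs))                  # [(0, 1)]
print(near_duplicate_pairs(docs, threshold=0.95))  # []
```

The first two documents differ by one term and are flagged at 0.85 but not at 0.95. Raising the threshold keeps more models at the cost of missing looser near-duplicates. The pairwise loop is quadratic in the number of models, so large datasets would need a vectorized or indexed approach.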
Documentation
Each module and function includes detailed documentation and usage examples. For more information on specific functions, please refer to the docstrings in the code.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Andjela Djelic - andjela.djelic@tuwien.ac.at
- Syed Juned Ali - syed.juned.ali@tuwien.ac.at
Citation
If you use MCP4CM in your research, please cite:
@software{mcp4cm2025,
author = {Djelic, Andjela and Ali, Syed Juned},
title = {MCP4CM: Model Cleansing Package for Conceptual Models},
url = {https://github.com/borkdominik/model-cleansing},
version = {1.0.1},
year = {2025}
}
File details
Details for the file mcp4cm-1.0.2.tar.gz.
File metadata
- Download URL: mcp4cm-1.0.2.tar.gz
- Upload date:
- Size: 21.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8dd52e8e04ea2917e3eefef20786d73a6f55e1c25db79550223dd2a9a84be6ff |
| MD5 | 38132bc09b887444cdd2741e14b95b46 |
| BLAKE2b-256 | f144ac2d79613a354ccd25bc6630032e11675406f4c847a74efdda6c98567dc7 |
File details
Details for the file mcp4cm-1.0.2-py3-none-any.whl.
File metadata
- Download URL: mcp4cm-1.0.2-py3-none-any.whl
- Upload date:
- Size: 22.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2206d060e6f11f912b8dde081f97ba31496f71efa3bc4ad0d88a9b83d5b1f86e |
| MD5 | 61313b2c5b5a146b0156d1521bd950bf |
| BLAKE2b-256 | 0888f6d47fcd97523e3121e0ccd6ed66a37104d38666a6b1ddb3b9a300189e66 |