A library for cleaning and preprocessing conceptual model datasets

MCP4CM - Model Cleansing Package for Conceptual Models

A comprehensive library for cleaning and preprocessing conceptual model datasets, with a focus on UML models.

Overview

MCP4CM is a Python package designed to facilitate the cleaning, filtering, and analysis of conceptual model datasets. Currently focused on UML models, the library provides tools for:

Loading and parsing UML model datasets
Cleaning models by filtering out empty or invalid files
Detecting and removing duplicate models
Filtering models based on naming patterns and quality metrics
Detecting the language of model content
Extracting metadata and statistical information from models

Installation

pip install mcp4cm

Quick Start

from mcp4cm import (
    load_dataset,
    uml_filter_classes_by_generic_pattern,
    uml_filter_empty_or_invalid_files,
)
from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash

# Load a UML model dataset
dataset = load_dataset("modelset", path="path/to/modelset", uml_type="genmymodel")

# Filter empty or invalid files
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)

# Filter models with generic class patterns
filtered_dataset = uml_filter_classes_by_generic_pattern(filtered_dataset)

# Get duplicate models based on hash
unique_models, duplicate_groups = detect_duplicates_by_hash(filtered_dataset)

Main Components

Base Models

Model: Base class for all model objects with common attributes
Dataset: Container class for collections of models
DatasetType: Enum for different dataset types

UML-Specific Components

UMLModel: Extended model class with UML-specific properties
UMLDataset: Container for UML models with specialized methods

Data Filtering

MCP4CM provides various filtering methods to clean datasets:

Filter empty or invalid files
Filter models without proper names
Filter models with dummy class names
Filter models with generic patterns
Filter by name length or frequency
Filter models with sequential naming patterns
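
To make the name-based filters above more concrete, here is a minimal standalone sketch of how dummy and sequential class names could be detected. It does not use or reproduce MCP4CM's implementation: the placeholder list `DUMMY_NAMES`, the `SEQUENTIAL` pattern, the helper names, and the 0.5 thresholds are all assumptions chosen for illustration.

```python
import re

# Hypothetical placeholder names; MCP4CM's actual lists may differ.
DUMMY_NAMES = {"class", "class1", "untitled", "newclass", "test", "foo", "bar"}

# Matches a stem followed by a running number, e.g. "Class1" or "Entity_02".
SEQUENTIAL = re.compile(r"^([A-Za-z_]+?)_?(\d+)$")


def dummy_ratio(names):
    """Fraction of names that are known placeholders (case-insensitive)."""
    if not names:
        return 0.0
    hits = sum(1 for n in names if n.strip().lower() in DUMMY_NAMES)
    return hits / len(names)


def sequential_ratio(names):
    """Fraction of names that look like a stem plus a running number."""
    if not names:
        return 0.0
    hits = sum(1 for n in names if SEQUENTIAL.match(n.strip()))
    return hits / len(names)


def keep_model(names, max_dummy=0.5, max_sequential=0.5):
    """Keep a model only if it has names and few are dummy or sequential."""
    if not names:
        return False
    return (dummy_ratio(names) <= max_dummy
            and sequential_ratio(names) <= max_sequential)


# Toy input: model id -> class names found in that model.
models = {
    "shop": ["Customer", "Order", "Invoice"],
    "scratch": ["Class1", "Class2", "Class3"],
    "empty": [],
}
kept = [model_id for model_id, names in models.items() if keep_model(names)]
print(kept)
```

Working with ratios instead of a single match keeps a model that contains one stray `Class1` next to otherwise meaningful names, which is usually the safer default when cleaning a dataset.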

Duplicate Detection

Hash-based duplicate detection
TF-IDF based near-duplicate detection
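
As an illustration of the hash-based approach, the sketch below groups models whose text is identical after whitespace normalization. It is a generic technique written with the standard library only, not MCP4CM's code; the function names and the choice to normalize whitespace before hashing are assumptions.

```python
import hashlib
from collections import defaultdict


def content_hash(text):
    """SHA-256 of the text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_hash(models):
    """Split {model_id: text} into first-seen unique ids and duplicate groups."""
    buckets = defaultdict(list)
    for model_id, text in models.items():
        buckets[content_hash(text)].append(model_id)
    unique_ids = [ids[0] for ids in buckets.values()]
    duplicate_groups = [ids for ids in buckets.values() if len(ids) > 1]
    return unique_ids, duplicate_groups


models = {
    "a": "class Customer { name }",
    "b": "class  Customer {  name }",  # same content, different spacing
    "c": "class Order { total }",
}
unique_ids, duplicate_groups = group_by_hash(models)
print(unique_ids)
print(duplicate_groups)
```

Exact hashing only catches models that are byte-identical after normalization, which is why a similarity-based method such as TF-IDF is listed alongside it for near-duplicates.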

Language Detection

Detect languages used in model text
Extract non-English models
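
The idea of flagging non-English models can be sketched with a deliberately naive stopword heuristic. This is not how MCP4CM detects languages, and the tiny `STOPWORDS` lists and the `guess_language` helper are hypothetical; a production setup would rely on a trained language-identification model.

```python
import re

# Tiny illustrative stopword lists (assumption, far from complete).
STOPWORDS = {
    "en": {"the", "of", "and", "is", "to", "a", "in", "for", "with"},
    "de": {"der", "die", "das", "und", "ist", "ein", "eine", "mit", "für"},
    "es": {"el", "la", "los", "las", "y", "es", "un", "una", "con", "para"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None.

    Ties are resolved in favor of the language listed first in STOPWORDS.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


models = {
    "m1": "the name of the customer",
    "m2": "der Kunde und die Bestellung",
    "m3": "Customer Order Invoice",  # no stopwords, language unknown
}
non_english = [
    model_id
    for model_id, text in models.items()
    if guess_language(text) not in (None, "en")
]
print(non_english)
```

Model text often consists of bare identifiers rather than sentences, so a heuristic like this returns `None` for many inputs; treating "unknown" separately from "non-English" avoids discarding such models by accident.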

Examples

Loading and Basic Filtering

from mcp4cm import load_dataset, uml_filter_empty_or_invalid_files, uml_filter_models_without_names

# Load dataset
dataset = load_dataset("modelset", path="path/to/modelset")
print(f"Original dataset size: {len(dataset.models)} models")

# Apply basic filters
filtered_dataset = uml_filter_empty_or_invalid_files(dataset)
filtered_dataset = uml_filter_models_without_names(filtered_dataset)
print(f"Filtered dataset size: {len(filtered_dataset.models)} models")

Analyzing Name Statistics

from mcp4cm.uml.data_extraction import get_word_counts_from_dataset, get_name_length_distribution

# Get word frequency statistics
most_common_names = get_word_counts_from_dataset(dataset, plt_fig=True)

# Get name length distribution
name_lengths = get_name_length_distribution(dataset, plt_fig=True)
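
For readers who want to see what such statistics look like without the library, the following sketch computes word frequencies and a name length distribution from a plain list of names. The helpers `split_identifier`, `word_counts`, and `length_distribution` are hypothetical and do not mirror the signatures or return types of the MCP4CM functions above; plotting is omitted.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split camelCase, PascalCase, and snake_case names into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def word_counts(names):
    """Count how often each word occurs across all names."""
    return Counter(word for name in names for word in split_identifier(name))


def length_distribution(names):
    """Map name length in characters to the number of names of that length."""
    return Counter(len(name) for name in names)


names = ["CustomerOrder", "OrderItem", "customer_address"]
print(word_counts(names).most_common(2))
print(sorted(length_distribution(names).items()))
```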

Detecting and Removing Duplicates

from mcp4cm.uml.duplicate_detection import detect_duplicates_by_hash, tfidf_near_duplicate_detector

# Hash-based duplicates
unique_models, duplicate_groups = detect_duplicates_by_hash(dataset, inplace=True)

# Near-duplicates using TF-IDF
unique_models, near_duplicate_groups = tfidf_near_duplicate_detector(dataset, threshold=0.85, inplace=True)
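
To show what a similarity threshold such as 0.85 means in practice, here is a small pure-Python sketch of TF-IDF vectors compared by cosine similarity. It is a generic illustration under stated assumptions (whitespace tokenization, a smoothed IDF, the helper names below) and not the implementation behind `tfidf_near_duplicate_detector`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def near_duplicate_pairs(texts, threshold=0.85):
    """Return index pairs whose similarity is at least the threshold."""
    docs = [text.lower().split() for text in texts]
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


texts = [
    "customer order invoice payment address",
    "customer order invoice payment shipment",  # four of five words shared
    "vehicle engine wheel",
]
print(near_duplicate_pairs(texts, threshold=0.85))
print(near_duplicate_pairs(texts, threshold=0.6))
```

In this toy input the first two texts have a similarity of roughly 0.70, so they are reported at a threshold of 0.6 but not at 0.85. A higher threshold removes fewer models and reduces the risk of discarding distinct ones.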

Documentation

Each module and function includes detailed documentation and usage examples. For more information on specific functions, please refer to the docstrings in the code.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Andjela Djelic - andjela.djelic@tuwien.ac.at
Syed Juned Ali - syed.juned.ali@tuwien.ac.at

Citation

If you use MCP4CM in your research, please cite:

@software{mcp4cm2025,
  author = {Djelic, Andjela and Ali, Syed Juned},
  title = {MCP4CM: Model Cleansing Package for Conceptual Models},
  url = {https://github.com/borkdominik/model-cleansing},
  version = {1.0.1},
  year = {2025}
}

Release history

1.0.4 (Jul 23, 2025)
1.0.3 (Jul 23, 2025)
1.0.2 (Apr 1, 2025, this version)
1.0.1 (Apr 1, 2025)
1.0.0 (Apr 1, 2025)

Download files

Source Distribution

mcp4cm-1.0.2.tar.gz (21.3 kB)

Uploaded Apr 1, 2025 Source

Built Distribution

mcp4cm-1.0.2-py3-none-any.whl (22.4 kB)

Uploaded Apr 1, 2025 Python 3

File details

Details for the file mcp4cm-1.0.2.tar.gz.

File metadata

Download URL: mcp4cm-1.0.2.tar.gz
Upload date: Apr 1, 2025
Size: 21.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mcp4cm-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`8dd52e8e04ea2917e3eefef20786d73a6f55e1c25db79550223dd2a9a84be6ff`
MD5	`38132bc09b887444cdd2741e14b95b46`
BLAKE2b-256	`f144ac2d79613a354ccd25bc6630032e11675406f4c847a74efdda6c98567dc7`

File details

Details for the file mcp4cm-1.0.2-py3-none-any.whl.

File metadata

Download URL: mcp4cm-1.0.2-py3-none-any.whl
Upload date: Apr 1, 2025
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mcp4cm-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2206d060e6f11f912b8dde081f97ba31496f71efa3bc4ad0d88a9b83d5b1f86e`
MD5	`61313b2c5b5a146b0156d1521bd950bf`
BLAKE2b-256	`0888f6d47fcd97523e3121e0ccd6ed66a37104d38666a6b1ddb3b9a300189e66`
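
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a generic sketch: the helper names are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA-256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("mcp4cm-1.0.2.tar.gz", "8dd52e8e04ea2917e3eefef20786d73a6f55e1c25db79550223dd2a9a84be6ff")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.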
