# MCP4CM - Model Cleansing Pipeline for Conceptual Models

A dataset cleaning and preprocessing library for conceptual models.

## Overview
mcp4cm is a Python library dedicated to cleaning and processing conceptual modeling datasets. It specifically supports UML and ArchiMate datasets, providing a streamlined workflow for dataset loading, filtering, data extraction, and deduplication.
## Key Features
- Dataset Loading: Supports UML (MODELSET) and ArchiMate (EAMODELSET) datasets.
- Data Filtering: Provides comprehensive filters to remove invalid or irrelevant data.
- Data Extraction: Enables detailed analysis of dataset contents, including naming conventions and class structures.
- Deduplication: Offers both exact and near-duplicate detection techniques using hashing and TF-IDF-based approaches.
## Installation

To use the mcp4cm library, follow these steps.

### Create a virtual environment with the uv package manager

To install uv, follow the instructions on the uv installation page. On Linux or macOS the installation is straightforward. On Windows, first follow the Windows instructions on the installation page; if you get an error due to the execution policy, run PowerShell as administrator and change the policy with this command:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

You should now be able to run the `uv` command in your terminal.
Once you have uv installed, create and activate a virtual environment:

```shell
uv init
uv venv
source .venv/bin/activate
```
### Install the required packages

With the virtual environment activated, install the required packages:

```shell
uv pip install -r requirements.txt
```
### Downloading the data

To use the library, you need to download the datasets. They are not included in the repository due to their size. You can download the zip file of the datasets from the following drive link:

- MCP4CM Dataset: MCP4CM Datasets

Unzip the data folder in the root directory of the repository:

```shell
unzip data.zip
```
The structure should look like this:

```
mcp4cm/
  data/
    modelset/
    eamodelset/
  README.md
  requirements.txt
  LICENSE
```
## Testing the Library
You can test the library in the Jupyter notebook `test_mcp4cm.ipynb`. This notebook contains examples of how to use the library for dataset loading, filtering, data extraction, and deduplication.
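Conceptually, a cleaning pass chains filters over a collection of models, each returning the subset (or the trimmed form) that survives. The toy sketch below shows this idea on plain Python dicts; all names and data here are hypothetical illustrations, not the mcp4cm API:

```python
# Toy model-cleaning pipeline on hypothetical data (not the mcp4cm API).
DUMMY_NAMES = {"test", "untitled", "asd", "foo"}

models = [
    {"id": 1, "names": ["Customer", "Order", "Invoice"]},
    {"id": 2, "names": []},                 # no names -> dropped
    {"id": 3, "names": ["test", "asd"]},    # only dummy names -> dropped
    {"id": 4, "names": ["Account", "x"]},   # short name trimmed, model kept
]

def drop_models_without_names(models):
    """Remove models that have no named elements at all."""
    return [m for m in models if m["names"]]

def drop_dummy_only_models(models):
    """Remove models whose names are all placeholder/dummy words."""
    return [m for m in models
            if any(n.lower() not in DUMMY_NAMES for n in m["names"])]

def filter_short_names(models, min_len=2):
    """Strip individual names of min_len characters or fewer."""
    return [{**m, "names": [n for n in m["names"] if len(n) > min_len]}
            for m in models]

cleaned = filter_short_names(drop_dummy_only_models(drop_models_without_names(models)))
print([m["id"] for m in cleaned])
```

The real library applies the same pattern with much richer heuristics (stopwords, name-length distributions, generic class patterns), as shown in the API calls in the sections that follow.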
## Usage
### Dataset Loading
```python
from mcp4cm.dataloading import load_dataset
from mcp4cm.base import DatasetType
uml_dataset = load_dataset(DatasetType.MODELSET, 'data/modelset')
archimate_dataset = load_dataset(DatasetType.EAMODELSET, 'data/eamodelset')
```

### Filtering and Data Extraction

#### UML Dataset

```python
from mcp4cm.uml.data_extraction import (
filter_empty_or_invalid_files,
filter_models_without_names,
filter_models_by_name_count,
filter_models_with_empty_class_names,
find_files_with_comments,
extract_names_counts_from_dataset,
get_word_counts_from_dataset,
get_name_length_distribution,
filter_models_by_name_length_or_stopwords,
filter_dummy_names,
filter_dummy_classes,
filter_classes_by_generic_pattern,
filter_models_by_sequential_and_dummy_words
)
filter_empty_or_invalid_files(uml_dataset)
filter_models_without_names(uml_dataset)
filter_models_by_name_count(uml_dataset)
filter_models_with_empty_class_names(uml_dataset)
find_files_with_comments(uml_dataset)
extract_names_counts_from_dataset(uml_dataset, plt_figs=True)
get_word_counts_from_dataset(uml_dataset, plt_fig=True, topk=20)
get_name_length_distribution(uml_dataset, plt_fig=True)
filter_models_by_name_length_or_stopwords(uml_dataset)
filter_dummy_names(uml_dataset)
filter_dummy_classes(uml_dataset)
filter_classes_by_generic_pattern(uml_dataset)
filter_models_by_sequential_and_dummy_words(uml_dataset)
```

#### ArchiMate Dataset

```python
from mcp4cm.archimate.data_extraction import (
extract_names_counts_from_dataset,
get_word_counts_from_dataset,
get_name_length_distribution,
filter_models_by_name_length_or_stopwords,
filter_dummy_names
)
extract_names_counts_from_dataset(archimate_dataset, plt_figs=True)
get_word_counts_from_dataset(archimate_dataset, plt_fig=True, topk=20)
get_name_length_distribution(archimate_dataset, plt_fig=True)
filter_models_by_name_length_or_stopwords(archimate_dataset)
filter_dummy_names(archimate_dataset)
```

### Deduplication

```python
from mcp4cm.generic.duplicate_detection import (
detect_duplicates_by_hash,
tfidf_near_duplicate_detector
)
detect_duplicates_by_hash(uml_dataset, plt_fig=True)
# TF-IDF-based near duplicate detection
tfidf_near_duplicate_detector(uml_dataset, key='names', plt_fig=True)
tfidf_near_duplicate_detector(archimate_dataset, key='names', plt_fig=True)
tfidf_near_duplicate_detector(archimate_dataset, key='names_with_layers_and_types', plt_fig=True)
```

### Visualization

The library includes built-in visualization options (pass `plt_fig=True`) for quick insights into dataset characteristics.
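The two deduplication techniques named above can be illustrated independently of the library. The following is a minimal, self-contained sketch with hypothetical data; it shows the general idea (exact duplicates via content hashing, near duplicates via TF-IDF cosine similarity), not mcp4cm's actual implementation, which may differ in tokenization, weighting, and thresholds:

```python
import hashlib
import math
from collections import Counter

docs = {
    "m1": "customer order invoice payment",
    "m2": "customer order invoice payment",   # exact duplicate of m1
    "m3": "customer order invoice shipment",  # near duplicate of m1
    "m4": "state transition guard action",
}

# Exact duplicates: group documents by the hash of their content.
by_hash = {}
for mid, text in docs.items():
    by_hash.setdefault(hashlib.sha256(text.encode()).hexdigest(), []).append(mid)
exact = [ids for ids in by_hash.values() if len(ids) > 1]
print("exact duplicate groups:", exact)

# Near duplicates: TF-IDF vectors compared by cosine similarity.
tokenized = {mid: text.split() for mid, text in docs.items()}
n = len(tokenized)
df = Counter(w for toks in tokenized.values() for w in set(toks))
idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed IDF

def tfidf(toks):
    tf = Counter(toks)
    return {w: (c / len(toks)) * idf[w] for w, c in tf.items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

vecs = {mid: tfidf(toks) for mid, toks in tokenized.items()}
ids = list(vecs)
near = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
        if cosine(vecs[a], vecs[b]) > 0.5]
print("near duplicate pairs:", near)
```

Note that exact hashing only catches byte-identical content, while the TF-IDF pass also flags m3, which shares most of its vocabulary with m1; the similarity threshold is the knob that trades precision against recall.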
## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request.

## License

This project is licensed under the MIT License; see the LICENSE file for details.
## File details

Details for the file `mcp4cm-1.0.3.tar.gz`.

### File metadata

- Download URL: mcp4cm-1.0.3.tar.gz
- Upload date:
- Size: 28.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ea4251183aadb6aee795518edccd406f5c6c73a0630b337e7e298f603b3807e9 |
| MD5 | 9245ddca1217fafe2b1ccbb078ed5069 |
| BLAKE2b-256 | d5bfe7890b6776f56969bdf3029574c9fcd88cdfb6bfdafb26a5eaf8ee3bc088 |
## File details

Details for the file `mcp4cm-1.0.3-py3-none-any.whl`.

### File metadata

- Download URL: mcp4cm-1.0.3-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | e4e29acb568f25319667749ed0c080c6cfcccf7ead5b380adab637fea4f627a6 |
| MD5 | 1c53d8d7d8989bb67f92ef22f75a8a60 |
| BLAKE2b-256 | 69ab3d0b70f1ec3dd85016e6350267dc49a0b4d6e2ce44c0066c4ffbe42e6ab1 |
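To check a downloaded file against the digests listed above, you can hash it locally. A small sketch using Python's standard `hashlib` module; the commented-out digest is the published SHA256 of the sdist from the table above:

```python
import hashlib

def sha256_file(path, chunk_size=8192):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: verify a downloaded archive against the published digest.
# expected = "ea4251183aadb6aee795518edccd406f5c6c73a0630b337e7e298f603b3807e9"
# assert sha256_file("mcp4cm-1.0.3.tar.gz") == expected
```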