
A cleaning and preprocessing library for conceptual-model datasets


MCP4CM - Model Cleansing Pipeline for Conceptual Models

Overview

mcp4cm is a Python library dedicated to cleaning and processing conceptual modeling datasets. It specifically supports UML and ArchiMate datasets, providing a streamlined workflow for dataset loading, filtering, data extraction, and deduplication.

Key Features

  • Dataset Loading: Supports UML (MODELSET) and ArchiMate (EAMODELSET) datasets.
  • Data Filtering: Provides comprehensive filters to remove invalid or irrelevant data.
  • Data Extraction: Enables detailed analysis of dataset contents, including naming conventions and class structures.
  • Deduplication: Offers both exact and near-duplicate detection techniques using hashing and TF-IDF-based approaches.
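The exact-duplicate side of the deduplication feature can be sketched independently of the library: hash a normalized form of each model's text and group identical digests. The `models` mapping, the `exact_duplicate_groups` helper, and the normalization step below are illustrative assumptions, not mcp4cm's API:

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(models):
    """Group model ids whose normalized text content is byte-identical."""
    groups = defaultdict(list)
    for model_id, text in models.items():
        # Normalize whitespace and case so trivially different copies still match.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(model_id)
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]

models = {
    "m1": "Class Order  Class Customer",
    "m2": "class order class customer",   # duplicate of m1 after normalization
    "m3": "Class Invoice",
}
print(exact_duplicate_groups(models))  # → [['m1', 'm2']]
```

Near-duplicate detection is more involved; the TF-IDF approach is sketched under Deduplication below.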

Setup

To set up the mcp4cm library, follow these steps:

Create a virtual environment

```bash
virtualenv .venv
source .venv/bin/activate
```

Install the required packages

After creating the virtual environment, you need to install the required packages. You can do this by running the following command:

```bash
pip install -r requirements.txt
```

Downloading the data

To use the library, you first need to download the datasets. They are not included in the repository because of their size; a zip file of the datasets is available at the following drive link:

  • MCP4CM Dataset: MCP4CM Datasets

Unzip the archive in the root directory of the repository:

```bash
unzip data.zip
```

The structure should look like this:

```
mcp4cm/
├── data/                      # Datasets used by the library
│   ├── modelset/              # UML dataset
│   └── eamodelset/            # ArchiMate dataset
├── dataset_generation.ipynb   # Generates the datasets used in the reproducibility studies via the mcp4cm library
├── test_mcp4cm.ipynb          # Demonstrates the library's functionality
├── README.md
├── requirements.txt
└── LICENSE
```


## Generating the datasets for the reproducibility studies

Publicly archived on Zenodo: [10.5281/zenodo.16285770](https://zenodo.org/records/16285770)

The dataset_generation.ipynb notebook generates the datasets used in the reproducibility studies, walking through the step-by-step filtering of the modelset dataset.
Because the library was developed after those datasets were first generated, the notebook also serves as evidence of the library's reusability: for each filtering step it pairs a library-based code snippet with the original "without-library" snippet and shows that both produce the same results.

In the dataset_generation.ipynb notebook, you will find cells that validate the consistency of the results obtained using the library with those obtained without it. These cells are marked with comments indicating the consistency checks.
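The pattern behind such a consistency check can be sketched as follows; the `assert_same_models` helper and the id lists are hypothetical illustrations, not code taken from the notebook:

```python
def assert_same_models(library_kept, manual_kept):
    """Fail loudly if two filtering pipelines retain different model ids."""
    library_ids, manual_ids = set(library_kept), set(manual_kept)
    only_library = library_ids - manual_ids
    only_manual = manual_ids - library_ids
    assert not only_library and not only_manual, (
        f"only in library result: {sorted(only_library)}; "
        f"only in manual result: {sorted(only_manual)}"
    )

# Hypothetical id lists standing in for the two pipelines' outputs.
assert_same_models(["m1", "m3", "m7"], ["m7", "m1", "m3"])
print("pipelines agree")  # → pipelines agree
```

Comparing sets rather than lists makes the check insensitive to ordering, which typically differs between pipelines.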

## Testing the Library
You can test the library in the Jupyter notebook test_mcp4cm.ipynb. This notebook contains examples of how to use the library for dataset loading, filtering, data extraction, and deduplication.
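To give a flavor of the kind of analysis the extraction utilities perform, here is a small, library-independent sketch of counting the words that occur in model names. The splitting heuristic and example names are illustrative assumptions, not mcp4cm's implementation:

```python
import re
from collections import Counter

def top_words(names, k=5):
    """Split identifiers on camelCase/underscores and count word frequency."""
    counter = Counter()
    for name in names:
        # "OrderLine_Item" -> ["Order", "Line", "Item"]
        parts = re.findall(r"[A-Za-z][a-z]*", name)
        counter.update(w.lower() for w in parts)
    return counter.most_common(k)

names = ["OrderLine", "Order", "CustomerOrder", "customer_address"]
print(top_words(names))
# → [('order', 3), ('customer', 2), ('line', 1), ('address', 1)]
```

Word-frequency statistics like these help spot dummy or auto-generated names (e.g. a dataset dominated by "class1", "test", "foo").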

## Usage

### Dataset Loading

```python
from mcp4cm.dataloading import load_dataset
from mcp4cm.base import DatasetType

uml_dataset = load_dataset(DatasetType.MODELSET, 'data/modelset')
archimate_dataset = load_dataset(DatasetType.EAMODELSET, 'data/eamodelset')

```

### Filtering and Data Extraction

#### UML Dataset

```python
from mcp4cm.uml.data_extraction import (
    filter_empty_or_invalid_files,
    filter_models_without_names,
    filter_models_by_name_count,
    filter_models_with_empty_class_names,
    find_files_with_comments,
    extract_names_counts_from_dataset,
    get_word_counts_from_dataset,
    get_name_length_distribution,
    filter_models_by_name_length_or_stopwords,
    filter_dummy_names,
    filter_dummy_classes,
    filter_classes_by_generic_pattern,
    filter_models_by_sequential_and_dummy_words
)

filter_empty_or_invalid_files(uml_dataset)
filter_models_without_names(uml_dataset)
filter_models_by_name_count(uml_dataset)
filter_models_with_empty_class_names(uml_dataset)
find_files_with_comments(uml_dataset)
extract_names_counts_from_dataset(uml_dataset, plt_figs=True)
get_word_counts_from_dataset(uml_dataset, plt_fig=True, topk=20)
get_name_length_distribution(uml_dataset, plt_fig=True)
filter_models_by_name_length_or_stopwords(uml_dataset)
filter_dummy_names(uml_dataset)
filter_dummy_classes(uml_dataset)
filter_classes_by_generic_pattern(uml_dataset)
filter_models_by_sequential_and_dummy_words(uml_dataset)
```

#### ArchiMate Dataset

```python
from mcp4cm.archimate.data_extraction import (
    extract_names_counts_from_dataset,
    get_word_counts_from_dataset,
    get_name_length_distribution,
    filter_models_by_name_length_or_stopwords,
    filter_dummy_names
)

extract_names_counts_from_dataset(archimate_dataset, plt_figs=True)
get_word_counts_from_dataset(archimate_dataset, plt_fig=True, topk=20)
get_name_length_distribution(archimate_dataset, plt_fig=True)
filter_models_by_name_length_or_stopwords(archimate_dataset)
filter_dummy_names(archimate_dataset)
```

### Deduplication

```python
from mcp4cm.generic.duplicate_detection import (
    detect_duplicates_by_hash,
    tfidf_near_duplicate_detector
)

# Exact duplicate detection via hashing
detect_duplicates_by_hash(uml_dataset, plt_fig=True)

# TF-IDF-based near-duplicate detection
tfidf_near_duplicate_detector(uml_dataset, key='names', plt_fig=True)
tfidf_near_duplicate_detector(archimate_dataset, key='names', plt_fig=True)
tfidf_near_duplicate_detector(archimate_dataset, key='names_with_layers_and_types', plt_fig=True)
```

### Visualization

The library includes built-in visualization options (plt_fig=True) for quick insights into dataset characteristics.
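For readers curious how TF-IDF-based near-duplicate detection works in principle, here is a self-contained sketch: build tf-idf vectors over each model's name tokens and flag pairs whose cosine similarity meets a threshold. This is a conceptual illustration under assumed token lists, not the implementation behind `tfidf_near_duplicate_detector`:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors for documents given as lists of tokens."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # document frequency per token
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def near_duplicates(token_lists, threshold=0.9):
    """Return index pairs of documents whose similarity meets the threshold."""
    vecs = tfidf_vectors(token_lists)
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                pairs.append((i, j))
    return pairs

docs = [
    ["order", "customer", "invoice"],
    ["order", "customer", "invoice"],   # near-identical to doc 0
    ["library", "book", "member"],
]
print(near_duplicates(docs))  # → [(0, 1)]
```

Unlike exact hashing, this catches models that share most of their vocabulary while differing in a few names; the threshold trades recall against false positives.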

Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
