Implementation of novel metrics for measuring inter-dataset similarity based on PCA.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Inter-Dataset Similarity Metric Based on PCA

This package implements two novel metrics for measuring inter-dataset similarity based on principal component analysis (PCA). The metrics are proposed in our paper, "Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation", which has been accepted to the 2025 SIAM International Conference on Data Mining (SDM).

Jupyter notebooks are provided for the experiments presented in the paper. You can run the code to reproduce the results.

Notebook	Description
Pre-Investigation	Investigation of general properties of the new metrics
Use Case 1	Examples from the paper for evaluation of synthetic tabular data
- Figure 3	Code is in a separate repository because of licensing
- Figure 4	Code is in a separate repository to avoid duplicating information
Use Case 2	Experiments from the paper on feature selection evaluation

Installation

You can install the package using pip:

pip install pcametric
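
To confirm that the installation succeeded, a small check along the following lines may help. This is only a sketch: `is_installed` is a hypothetical helper written for this example, not part of pcametric, and it relies only on the standard library.

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pcametric"):
    # metadata.version looks up the installed distribution's version string.
    print("pcametric version:", metadata.version("pcametric"))
else:
    print("pcametric is not installed in this environment")
```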

Usage

Below is an example of how to use the metrics:

from pcametric import PCAMetric
import pandas as pd 

# Loading the datasets
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# Setting parameters
num_components = 1
normalization = "precise"
preprocess = "std"

# Calculate the values of the two metrics, namely Difference in Explained Variance and Angle Difference
result, _, _ = PCAMetric(df1, df2, num_components, normalization, preprocess)
edv, ad = result['exp_var_diff'], result['comp_angle_diff']
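
For intuition, the idea behind the two metrics can be sketched without the package. The code below is not pcametric's implementation. It assumes one principal component, standardized features, and that the two quantities are the absolute difference in explained-variance ratio and the angle between the leading components; the package's normalization and preprocessing options may behave differently. All function names are hypothetical, and the leading component is found with plain power iteration so that the sketch needs only the standard library.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance; constant columns become 0."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # `or 1.0` avoids dividing by zero for constant columns.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows whose columns are already centred."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def first_component(cov, iters=200):
    """Leading eigenvector and its explained-variance ratio, by power iteration.

    Assumes at least one column is not constant, so the total variance is > 0.
    """
    d = len(cov)
    start = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start vector
    scale = math.sqrt(sum(x * x for x in start))
    v = [x / scale for x in start]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # start vector happened to lie in the null space
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue / sum(cov[i][i] for i in range(d))


def pca_similarity_sketch(rows_a, rows_b):
    """Return (explained-variance difference, angle in radians) for one component."""
    comp_a, ratio_a = first_component(covariance(standardize(rows_a)))
    comp_b, ratio_b = first_component(covariance(standardize(rows_b)))
    # The sign of a principal component is arbitrary, so compare |cosine|.
    cosine = min(1.0, abs(sum(a * b for a, b in zip(comp_a, comp_b))))
    return abs(ratio_a - ratio_b), math.acos(cosine)


# Two small datasets given as lists of rows (placeholder values, not real data).
data_a = [[1, 1], [2, 2], [3, 3], [4, 4]]
data_b = [[1, 4], [2, 3], [3, 2], [4, 1]]
var_diff, angle = pca_similarity_sketch(data_a, data_b)
```

Taking the absolute value of the cosine keeps the angle between 0 and 90 degrees, which reflects that a principal component and its negation describe the same direction.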

The Average Angle Difference (AAD) metric is also implemented and can be used as a model-agnostic approach for evaluating the performance of feature selection:

from pcametric import AAD
import pandas as pd 

# Loading the dataset
df = pd.read_csv('df.csv')

# Indices of the selected features
selected_features = [2, 5, 11, 17, 22, 31, 40]

# Calculate AAD
aad = AAD(df, selected_features)
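
If the selected features are known by name rather than by position, they can be converted to indices first. The sketch below assumes that `selected_features` holds zero-based column positions, which the example above suggests but does not state; `names_to_indices` is a hypothetical helper, not part of pcametric.

```python
def names_to_indices(columns, selected_names):
    """Map feature names to zero-based column positions, in ascending order."""
    position = {name: i for i, name in enumerate(columns)}
    missing = [name for name in selected_names if name not in position]
    if missing:
        raise KeyError(f"Unknown feature names: {missing}")
    return sorted(position[name] for name in selected_names)


# With pandas, `columns` would typically be list(df.columns).
columns = ["age", "height", "weight", "income"]  # placeholder column names
selected_features = names_to_indices(columns, ["income", "height"])
```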

For all of the metrics above, a lower value indicates greater similarity to the actual data.
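
Because lower is better, comparing several candidates comes down to sorting their metric values in ascending order. The snippet below illustrates this with a hypothetical helper, `rank_by_similarity`; the numbers are placeholders for illustration only, not results from the paper or the package.

```python
def rank_by_similarity(scores):
    """Return (name, value) pairs sorted from most to least similar (lowest first)."""
    return sorted(scores.items(), key=lambda item: item[1])


# Placeholder AAD values for three hypothetical feature subsets.
aad_by_subset = {"subset_a": 0.42, "subset_b": 0.17, "subset_c": 0.30}
ranking = rank_by_similarity(aad_by_subset)
best_name, best_value = ranking[0]  # the candidate with the lowest value
```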

Citation

If you use our metrics in your research, please cite the original paper:

@inproceedings{rajabinasab2025interdatasetsimilarity,
  title={Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation},
  author={Rajabinasab, Muhammad and Lautrup, Anton D. and Zimek, Arthur},
  booktitle={Proceedings of the 2025 SIAM International Conference on Data Mining (SDM)},
  pages={TBD},
  year={2025},
  organization={SIAM}
}

Release history

Version	Release date
1.2.0	Mar 30, 2026
1.1.0	Mar 27, 2026
1.0.5	Feb 16, 2026
1.0.4	May 5, 2025
1.0.3	Jan 9, 2025
1.0.2 (this version)	Jan 9, 2025
1.0.1	Dec 28, 2024

Download files

Source Distribution

pcametric-1.0.2.tar.gz (4.3 kB)

Uploaded Jan 9, 2025 Source

Built Distribution

pcametric-1.0.2-py3-none-any.whl (4.3 kB)

Uploaded Jan 9, 2025 Python 3

File details

Details for the file pcametric-1.0.2.tar.gz.

File metadata

Download URL: pcametric-1.0.2.tar.gz
Upload date: Jan 9, 2025
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for pcametric-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`c391b1a42fceec0cf02d2a6353163ff98c9dd3e1b6aee8d3c8b59c9b9d880460`
MD5	`ec1e10475cf26dcb89670ebe30e31e91`
BLAKE2b-256	`e3a873df82c84871c0599c6b45c6e366b3e90162f9772ed65cab6d1922a32b6d`

File details

Details for the file pcametric-1.0.2-py3-none-any.whl.

File metadata

Download URL: pcametric-1.0.2-py3-none-any.whl
Upload date: Jan 9, 2025
Size: 4.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for pcametric-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5c22e395104175ae86db23d2af467e3194252c62dfc7d09ee28598863c616bc1`
MD5	`22e5038ddb0769fdeefc70298725dd87`
BLAKE2b-256	`cb26e90e5155857f6829652b09d0c89b05c1b3331c4669221a2c3989ba7e8bcc`
