Implementation of novel metrics for measuring inter-dataset similarity based on PCA.

Inter-Dataset Similarity Metric Based on PCA

This package implements two novel PCA-based metrics for measuring inter-dataset similarity. The metrics are proposed in our paper, "Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation", which has been accepted to the 2025 SIAM International Conference on Data Mining (SDM).

Jupyter notebooks are provided for the experiments presented in the paper; run them to reproduce the results.

| Notebook | Description |
| --- | --- |
| Pre-Investigation | Investigation of general properties of the new metrics |
| Use Case 1 | Examples from the paper for evaluation of synthetic tabular data |
| - Figure 3 | Codebook in a different repository due to licensing |
| - Figure 4 | Codebook in a different repository to avoid duplicate information |
| Use Case 2 | Experiments from the paper on feature selection evaluation |

Installation

You can install the package using pip:

pip install pcametric
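
To confirm that the installation succeeded, the installed version can be queried from Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of pcametric.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if pcametric is installed, otherwise None.
print(installed_version("pcametric"))
```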

Usage

Below is an example of how to use the metrics:

from pcametric import PCAMetric
import pandas as pd 

# Loading the datasets
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# Setting parameters
num_components = 1
normalization = "precise"
preprocess = "std"

# Calculate the values of the two metrics, namely Difference in Explained Variance and Angle Difference
result, _, _ = PCAMetric(df1, df2, num_components, normalization, preprocess)
edv, ad = result['exp_var_diff'], result['comp_angle_diff']
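
The exact definitions of the two metrics are given in the paper. As a rough intuition for the single-component case, the sketch below compares the first principal components of two datasets with NumPy alone. It is not the pcametric implementation, and its outputs need not match the package. The function names are hypothetical, column-wise standardization is assumed to be comparable to preprocess = "std", and the "precise" normalization is not modeled at all.

```python
import numpy as np


def first_component(X):
    """Return the first principal direction and its explained-variance ratio."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (assumption: similar in spirit to preprocess="std").
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # SVD of the centered data: rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = s[0] ** 2 / np.sum(s ** 2)
    return Vt[0], ratio


def pca_similarity_sketch(X1, X2):
    """Illustrative only: explained-variance gap and angle (radians) of the first PCs."""
    v1, r1 = first_component(X1)
    v2, r2 = first_component(X2)
    var_diff = abs(r1 - r2)
    # The sign of a principal direction is arbitrary, so compare |cos|.
    cos = np.clip(abs(np.dot(v1, v2)), 0.0, 1.0)
    return var_diff, float(np.arccos(cos))


x = np.linspace(0.0, 1.0, 50)
noise = 0.1 * np.sin(37 * x)
A = np.column_stack([x, x + noise])    # positively correlated columns
B = np.column_stack([x, -x + noise])   # negatively correlated columns

print(pca_similarity_sketch(A, A))  # both values are (close to) zero
print(pca_similarity_sketch(A, B))  # the angle is close to pi / 2
```

Both datasets must have the same number of columns for the angle to be defined. In this sketch, identical datasets score zero on both quantities.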

The Average Angle Difference (AAD) metric is also implemented and can be used as a model-agnostic approach for evaluating the performance of feature selection:

from pcametric import AAD
import pandas as pd 

# Loading the dataset
df = pd.read_csv('df.csv')

# Indices of the selected features
selected_features = [2, 5, 11, 17, 22, 31, 40]

# Calculate AAD
aad = AAD(df, selected_features)
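
Feature selection methods often report column names rather than positions. The helper below is a small sketch for converting names into the integer indices used in the example above. It assumes that the indices are zero-based column positions, which the example does not state explicitly. The name `feature_positions` is hypothetical and not part of pcametric.

```python
def feature_positions(columns, selected_names):
    """Map column names to their zero-based positions in `columns`."""
    position = {name: i for i, name in enumerate(columns)}
    missing = [name for name in selected_names if name not in position]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [position[name] for name in selected_names]


# Hypothetical column names, for illustration only.
columns = ["age", "income", "height", "weight"]
print(feature_positions(columns, ["income", "weight"]))  # [1, 3]
```

With a pandas DataFrame, `list(df.columns)` can be passed as `columns`. Column names are assumed to be unique.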

For all of the metrics above, a lower value indicates greater similarity to the actual data.

Citation

If you use our metrics in your research, please cite the original paper:

@inbook{doi:10.1137/1.9781611978520.57,
author = {Muhammad Rajabinasab and Anton Lautrup and Arthur Zimek},
title = {Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation},
booktitle = {Proceedings of the 2025 SIAM International Conference on Data Mining (SDM)},
chapter = {},
pages = {527-537},
doi = {10.1137/1.9781611978520.57},
URL = {https://epubs.siam.org/doi/abs/10.1137/1.9781611978520.57},
eprint = {https://epubs.siam.org/doi/pdf/10.1137/1.9781611978520.57},
}
