Python Package for complexity measures

pycol: Python Class Overlap Library

The Python Class Overlap Library (pycol) assembles a set of data complexity measures associated with the problem of class overlap.

The combination of class imbalance and overlap is currently one of the most challenging issues in machine learning. However, the identification and characterisation of class overlap in imbalanced domains is a subject that still troubles researchers in the field as, to this point, there is no clear, standard, well-formulated definition and measurement of class overlap for real-world domains.

This library characterises the problem of class overlap according to multiple sources of complexity, where four main class overlap representations are acknowledged: Feature Overlap, Instance Overlap, Structural Overlap, and Multiresolution Overlap.

Existing open-source implementations of complexity measures include DCoL (C++), ECoL, and the more recent ImbCoL, SCoL, and mfe packages (R), as well as pymfe (Python). For class overlap, these packages implement the following measures: F1, F1v, F2, F3, F4, N1, N2, N3, N4, T1 and LSCAvg. ImbCoL additionally decomposes the original measures by class, and SCoL focuses on simulated complexity measures. To foster the study of a more comprehensive set of class overlap measures, we provide an extended Python library that comprises the class overlap measures included in these packages together with a set of measures proposed in recent years. The library also implements adaptations of complexity measures to class imbalance.

Overall, pycol characterises class overlap as a heterogeneous concept, comprising distinct sources of complexity, and the following measures are implemented:

Feature Overlap:

  • F1: Maximum Fisher's Discriminant Ratio
  • F1v: Directional Vector Maximum Fisher's Discriminant Ratio
  • F2: Volume of Overlapping Region
  • F3: Maximum Individual Feature Efficiency
  • F4: Collective Feature Efficiency
  • IN: Input Noise
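
To make the idea behind F1 concrete, the sketch below computes the textbook two-class Fisher's discriminant ratio per feature, (mean_a - mean_b)^2 / (var_a + var_b), and takes the maximum over features. It is an illustration only: the function names and the toy data are assumptions made here, and pycol's own F1 may use a different formulation or normalisation.

```python
from statistics import fmean, pvariance


def fisher_ratio_per_feature(X, y):
    """Two-class Fisher ratio for every feature (illustrative helper, not pycol's API)."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = labels
    ratios = []
    for j in range(len(X[0])):
        col_a = [row[j] for row, lab in zip(X, y) if lab == a]
        col_b = [row[j] for row, lab in zip(X, y) if lab == b]
        gap = (fmean(col_a) - fmean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread > 0:
            ratios.append(gap / spread)
        else:
            # Zero variance in both classes: separable if the means differ.
            ratios.append(float("inf") if gap > 0 else 0.0)
    return ratios


def max_fisher_ratio(X, y):
    """Largest per-feature ratio; higher values suggest less feature overlap."""
    return max(fisher_ratio_per_feature(X, y))


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [7.0, 2.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
ratios = fisher_ratio_per_feature(X, y)
print(ratios, max_fisher_ratio(X, y))
```

Some packages report a transformed value such as 1 / (1 + max ratio) so that larger numbers mean more overlap; check which convention a given implementation follows before comparing results.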

Instance Overlap:

  • R-value
  • Raug: Augmented R-value
  • degOver
  • N3: Error Rate of the Nearest Neighbour Classifier
  • SI: Separability Index
  • N4: Non-Linearity of the Nearest Neighbour Classifier
  • kDN: K-Disagreeing Neighbours
  • D3: Class Density in the Overlap Region
  • CM: Complexity Metric Based on k-nearest neighbours
  • wCM: Weighted Complexity Metric
  • dwCM: Dual Weighted Complexity Metric
  • Borderline Examples
  • IPoints: Number of Invasive Points
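
As an illustration of an instance-level measure, N3 can be read as the leave-one-out error rate of a 1-nearest-neighbour classifier. The sketch below is a minimal pure-Python version assuming Euclidean distance; the function name and the toy data are made up for this example, and pycol's implementation may differ in distance function and tie handling.

```python
import math


def n3_error_rate(X, y):
    """Leave-one-out 1-NN error rate (illustrative helper, not pycol's API)."""
    errors = 0
    for i, xi in enumerate(X):
        # Nearest neighbour of point i among all other points.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        if y[nearest] != y[i]:
            errors += 1
    return errors / len(X)


# Toy data (made up): one class-1 point sits next to the class-0 group.
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.2], [3.5, 0.0], [10.0, 0.0], [11.0, 0.3]]
y = [0, 0, 0, 1, 1, 1]
rate = n3_error_rate(X, y)
print(rate)
```

Only the point at (3.5, 0.0) has a nearest neighbour from the other class, so one of the six points is misclassified.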

Structural Overlap:

  • N1: Fraction of Borderline Points
  • T1: Fraction of Hyperspheres Covering Data
  • Clst: Number of Clusters
  • ONB: Overlap Number of Balls
  • LSCAvg: Local Set Average Cardinality
  • DBC: Decision Boundary Complexity
  • N2: Ratio of Intra/Extra Class Nearest Neighbour Distance
  • NSG: Number of Samples per Group
  • ICSV: Inter-Class Scale Variation
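
The intuition behind N2 can be sketched as follows: for every point, take the distance to its nearest neighbour of the same class and to its nearest neighbour of a different class, then divide the summed intra-class distances by the summed extra-class distances. This is a minimal sketch assuming Euclidean distance; the function name and data are hypothetical, and pycol may apply a further normalisation.

```python
import math


def n2_ratio(X, y):
    """Sum of intra-class over sum of extra-class nearest-neighbour distances.

    Illustrative helper, not pycol's API. Requires at least two classes
    and at least two points per class; otherwise min() raises ValueError.
    """
    intra_total = 0.0
    extra_total = 0.0
    for i, xi in enumerate(X):
        intra_total += min(
            math.dist(xi, X[j]) for j in range(len(X)) if j != i and y[j] == y[i]
        )
        extra_total += min(
            math.dist(xi, X[j]) for j in range(len(X)) if y[j] != y[i]
        )
    return intra_total / extra_total


# Toy one-dimensional data (made up).
X = [[0.0], [1.0], [4.0], [6.0]]
y = [0, 0, 1, 1]
ratio = n2_ratio(X, y)
print(ratio)
```

Here the intra-class distances sum to 6 and the extra-class distances to 15. Lower ratios indicate classes that are compact relative to the gap between them.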

Multiresolution Overlap:

  • MRCA: Multiresolution Complexity Analysis
  • C1: Case Base Complexity Profile
  • C2: Similarity-Weighted Case Base Complexity Profile
  • Purity
  • Neighbourhood Separability

Usage Example:

The dataset folder contains some datasets with binary and multi-class problems. All datasets are numerical and have no missing values. The complexity.py module implements the complexity measures. To run the measures, the Complexity class is instantiated and the results may be obtained as follows:

from pycol_complexity import complexity
complexity = complexity.Complexity("dataset/61_iris.arff", distance_func="default", file_type="arff")

# Feature Overlap
print(complexity.F1())
print(complexity.F1v())
print(complexity.F2())
# (...)

# Instance Overlap
print(complexity.R_value())
print(complexity.deg_overlap())
print(complexity.CM())
# (...)

# Structural Overlap
print(complexity.N1())
print(complexity.T1())
print(complexity.Clust())
# (...)

# Multiresolution Overlap
print(complexity.MRCA())
print(complexity.C1())
print(complexity.purity())
# (...)

Developer notes:

To report bugs or request features, open an issue on the project's issue tracker.

Licence:

The project is licensed under the MIT License - see the License file for details.

Acknowledgements:

Some complexity measures implemented in pycol are based on the pymfe implementation. We also thank José Daniel Pascual-Triana for providing the implementation of ONB.
