A Python package for feature selection on a simulated data stream

pystreamfs is an open-source Python package for quick and simple comparison of feature selection algorithms on a simulated data stream.

The user can simulate data streams with varying batch size on any dataset provided as a numpy.ndarray. pystreamfs applies a specified feature selection algorithm to every batch and computes performance metrics for the selected feature set at every time t. pystreamfs can also be used to plot the performance metrics.
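
The batching described above can be sketched as follows. This is only an illustration, not the package's implementation: iter_batches is a hypothetical helper, and it assumes the stream is simulated by slicing the rows of the dataset in order.

```python
import numpy as np


def iter_batches(X, Y, batch_size):
    """Yield (t, X_batch, Y_batch) slices in row order (hypothetical helper)."""
    for t, start in enumerate(range(0, X.shape[0], batch_size)):
        stop = start + batch_size
        yield t, X[start:stop], Y[start:stop]


# Toy data: 10 samples with 3 features each
X = np.arange(30).reshape(10, 3)
Y = np.array([0, 1] * 5)

for t, X_t, Y_t in iter_batches(X, Y, batch_size=4):
    # A feature selection algorithm would be applied to (X_t, Y_t) here
    print(t, X_t.shape, Y_t.shape)
```

In this sketch the last batch is shorter when the number of samples is not a multiple of the batch size. How pystreamfs itself treats an incomplete final batch is not stated here.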

pystreamfs comes with 5 built-in feature selection algorithms for data streams. Additionally, you can find 3 datasets ready for download on GitHub. pystreamfs has a modular structure and is thus easy to extend (see Section 2.5 for more information).

License: MIT License
Upcoming changes:

  • ability to simulate feature streams
  • ability to generate artificial data streams
  • ability to test multiple feature selection algorithms at once

1 Getting started

1.1 Prerequisites

The following Python modules need to be installed (versions older than those indicated might also work):

  • python >= 3.7.1
  • numpy >= 1.15.4
  • psutil >= 5.4.7
  • matplotlib >= 2.2.3
  • scikit-learn >= 0.20.1
  • ... any modules required by the feature selection algorithm
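
One way to check an environment against the list above is sketched below. It is not part of pystreamfs: version_tuple and check_requirements are hypothetical helpers, the version comparison is deliberately simplistic (it only looks at leading digits), and importlib.metadata requires Python 3.8 or newer.

```python
import itertools
from importlib import metadata

# Minimum versions from the list above (the Python version is not checked here)
REQUIRED = {
    "numpy": "1.15.4",
    "psutil": "5.4.7",
    "matplotlib": "2.2.3",
    "scikit-learn": "0.20.1",
}


def version_tuple(version):
    """Turn '1.15.4' into (1, 15, 4); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(required=REQUIRED):
    """Return {package: 'ok' | 'too old' | 'missing'}."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else "too old"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```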

1.2 How to get pystreamfs

Using pip: pip install pystreamfs
Or: download and unpack the .tar.gz file in /dist, navigate to the unpacked folder, and execute python setup.py install.

2 The Package

2.1 Files

The main module is /pystreamfs/pystreamfs.py. Feature selection algorithms are stored in /algorithms.

2.2 Main module: pystreamfs.py

pystreamfs.py provides the following functions:

  • X, Y = prepare_data(data, target, shuffle)
    • Description: Prepare the data set for the simulation of a data stream: optionally shuffle the rows of the data matrix, then extract the target variable Y and the features X
    • Input:
      • data: numpy.ndarray, data set
      • target: int, index of the target variable
      • shuffle: bool, if True, shuffle the samples (rows) randomly
    • Output:
      • X: numpy.ndarray, features
      • Y: numpy.ndarray, target variable
  • stats = simulate_stream(X, Y, fs_algorithm, model, param)
    • Description: Iterate over all data points in the dataset to simulate a data stream. Apply the given feature selection algorithm to every batch and return performance statistics.
    • Input:
      • X: numpy.ndarray, this is the X returned by prepare_data()
      • Y: numpy.ndarray, this is the Y returned by prepare_data()
      • fs_algorithm: function, feature selection algorithm
      • model: object, the machine learning model used to compute the accuracy score (remark on KNN: the batch size has to be greater than or equal to the number of neighbours)
      • param: dict, includes:
        • num_features: integer, the number of features you want returned
        • batch_size: integer, the number of instances processed in one iteration
        • ... additional algorithm specific parameters
    • Output:
      • stats: dict
        • features: list of lists, set of selected features for every batch
        • time_avg: float, average computation time for one execution of the feature selection
        • time_measures: list, time measures for every batch
        • memory_avg: float, average memory usage after one execution of the feature selection, uses psutil.Process(os.getpid()).memory_full_info().uss
        • memory_measures: list, memory measures for every batch
        • acc_avg: float, average accuracy for classification with the selected feature set
        • acc_measures: list, accuracy measures for every batch
        • fscr_avg: float, average feature selection change rate (fscr) per time window. The fscr is the fraction of selected features that changes in t with respect to t-1 (fscr=0 if all selected features remain the same, fscr=1 if all selected features change)
        • fscr_measures: list, fscr measures for every batch
  • plt = plot_stats(stats, ftr_names, param, fs_name, model_name)
    • Description: Plot the statistics for time, memory, fscr and selected features over all time windows.
    • Input:
      • stats: dict (see stats of simulate_stream())
      • ftr_names: numpy.ndarray, contains all feature names
      • param: dict, parameters
      • fs_name: string, name of feature selection algorithm
      • model_name: string, name of machine learning model
    • Output:
      • plt: pyplot object, the statistics plots
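
The feature selection change rate (fscr) described above can be illustrated with a small worked example. This is one reading of the definition, not the package's code: the function fscr below is hypothetical, it assumes the selected features are given as lists of indices, and it starts at the second batch because the handling of the first batch is not specified here.

```python
def fscr(previous, current):
    """Fraction of the currently selected features that were not selected before."""
    current = set(current)
    if not current:
        return 0.0
    changed = current - set(previous)
    return len(changed) / len(current)


# Selected feature indices for four consecutive batches (made-up values)
selected = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 4, 5], [6, 7, 8, 9]]

fscr_measures = [fscr(prev, cur) for prev, cur in zip(selected, selected[1:])]
fscr_avg = sum(fscr_measures) / len(fscr_measures)

print(fscr_measures)  # [0.0, 0.5, 1.0]
print(fscr_avg)       # 0.5
```

The order of the selected features does not matter in this sketch, since only set membership is compared.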

2.3 Built-in feature selection algorithms

  • Online Feature Selection (OFS) based on the Perceptron algorithm by Wang et al. (2013) - link to paper
  • Unsupervised Feature Selection on Data Streams (FSDS) using matrix sketching by Huang et al. (2015) - link to paper
  • Feature Selection based on Micro Cluster Nearest Neighbors by Hamoodi et al. (2018) - link to paper
  • Extremal Feature Selection based on a Modified Balanced Winnow classifier by Carvalho et al. (2006) - link to paper
  • CancelOut Feature Selection based on a Neural Network by Vadim Borisov (Github)

2.4 Downloadable datasets

All datasets are cleaned and normalized. The target variable of all datasets is moved to the first column.

  • German Credit Score (link)
  • Binary version of Human Activity Recognition (link).
    • The original HAR dataset has a multi-class target. For its binary version we defined the class "WALKING" as our positive class (label=1) and all other classes as the negative (non-walking) class. We combined the 1722 samples of the original "WALKING" class with a random sample of 3000 instances from all other classes.
  • Usenet (link)
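
Because the target variable sits in the first column, the target index for these datasets is 0. The sketch below shows what such a split could look like with NumPy. split_target is a hypothetical stand-in written for illustration, not the package's prepare_data, and the tiny array is made up. Note that np.random.default_rng needs a newer NumPy (1.17+) than the minimum listed in Section 1.1.

```python
import numpy as np


def split_target(data, target=0, shuffle=False, seed=None):
    """Return (X, Y) from a 2-D array (hypothetical stand-in for prepare_data)."""
    if shuffle:
        rng = np.random.default_rng(seed)
        data = rng.permutation(data)  # shuffles the rows and returns a copy
    Y = data[:, target]
    X = np.delete(data, target, axis=1)  # all columns except the target
    return X, Y


# Tiny made-up dataset: column 0 is the target, columns 1-2 are features
data = np.array([[1.0, 0.2, 0.7],
                 [0.0, 0.9, 0.1],
                 [1.0, 0.4, 0.5]])

X, Y = split_target(data, target=0)
print(X.shape, Y.shape)  # (3, 2) (3,)
```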

2.5 How to add a feature selection algorithm

If you want to use pystreamfs to test your own feature selection algorithm, you have to encapsulate your algorithm in a function with the following format:

def your_fs_algorithm(X, Y, w, param):
    """Your feature selection algorithm

    :param numpy.ndarray X: current data batch
    :param numpy.ndarray Y: labels of current batch
    :param numpy.ndarray w: feature weights
    :param dict param: any parameters the algorithm requires
    :return: w (updated feature weights), param
    :rtype: numpy.ndarray, dict
    """

    # ... do feature selection ...

    return w, param

Afterwards you can import and test your feature selection algorithm in the same way as for any built-in algorithm (see the example).
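
As a concrete illustration of this format, here is a minimal sketch of a custom algorithm that weights each feature by its absolute Pearson correlation with the labels. It is not one of the built-in algorithms. The name run_abs_correlation and the 'decay' entry in param are hypothetical, and the sketch assumes that w arrives as a 1-D array with one weight per feature and that Y holds numeric labels.

```python
import numpy as np


def run_abs_correlation(X, Y, w, param):
    """Weight each feature by |Pearson correlation| with the labels.

    'decay' is a hypothetical, algorithm-specific entry in param.
    """
    decay = param.get('decay', 0.5)

    # Center features and labels, then compute the correlation per feature
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (Yc ** 2).sum())
    corr = np.zeros(X.shape[1])
    # Constant features have denom == 0 and keep a correlation of 0
    np.divide(Xc.T @ Yc, denom, out=corr, where=denom > 0)

    # Blend with the weights from the previous batch (exponential average)
    w = decay * w + (1 - decay) * np.abs(corr)
    return w, param


# Made-up batch: feature 0 follows the label, feature 1 is constant
X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 0.0],
              [4.0, 5.0, 1.0]])
Y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])
w, param = run_abs_correlation(X, Y, w, {'decay': 0.0})
print(int(np.argmax(w)))  # 0
```

Returning param unchanged keeps the signature intact. An algorithm that needs state between batches could store it in param instead.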

3 Example

from pystreamfs import pystreamfs
import numpy as np
import pandas as pd
from pystreamfs.algorithms import ofs
from sklearn.neighbors import KNeighborsClassifier

# Load a dataset
data = pd.read_csv('../datasets/har.csv')
feature_names = np.array(data.drop(columns='target').columns)  # all column names except the target
data = np.array(data)

# Extract features and target variable
X, Y = pystreamfs.prepare_data(data, 0, False)

# Load a FS algorithm
fs_algorithm = ofs.run_ofs

# Define parameters
param = dict()
param['num_features'] = 5  # number of features to return
param['batch_size'] = 50  # batch size

# Define ML model
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)

# Data stream simulation
stats = pystreamfs.simulate_stream(X, Y, fs_algorithm, model, param)

# Plot statistics
pystreamfs.plot_stats(stats, feature_names, param, 'Online feature selection (OFS)', 'K Nearest Neighbor').show()
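
Besides plotting, the returned stats dictionary can be inspected directly. The sketch below prints a short summary using the keys documented in Section 2.2. The summarize helper is hypothetical, and the stats values are made-up placeholders so that the snippet runs without a dataset. They are not results produced by pystreamfs.

```python
def summarize(stats):
    """Return a short text summary of a stats dictionary (hypothetical helper)."""
    lines = []
    for key in ("time", "memory", "acc", "fscr"):
        avg = stats[key + "_avg"]
        measures = stats[key + "_measures"]
        lines.append(
            f"{key}: avg={avg:.3f} min={min(measures):.3f} max={max(measures):.3f}"
        )
    lines.append(f"features in last batch: {stats['features'][-1]}")
    return "\n".join(lines)


# Placeholder values only; in practice use the dict from simulate_stream()
stats = {
    "features": [[0, 1], [0, 2]],
    "time_avg": 0.02, "time_measures": [0.01, 0.03],
    "memory_avg": 150.0, "memory_measures": [140.0, 160.0],
    "acc_avg": 0.8, "acc_measures": [0.75, 0.85],
    "fscr_avg": 0.25, "fscr_measures": [0.0, 0.5],
}

print(summarize(stats))
```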
