A Python package for feature selection on a simulated data stream
pystreamfs is an open-source Python package that allows for quick and simple comparison of feature selection algorithms on a simulated data stream.
The user can simulate data streams with varying batch size on any dataset provided as a numpy.ndarray. pystreamfs applies a specified feature selection algorithm to every batch and computes performance metrics for the selected feature set at every time step t. pystreamfs can also be used to plot the performance metrics.
The package currently includes 3 built-in datasets and 4 built-in feature selection algorithms. Its modular structure makes it easy to extend.
License: MIT License
Upcoming changes:
- additional built-in datasets, feature selection algorithms and classifiers
- ability to simulate feature streams
- ability to generate artificial data streams
1 Getting started
1.1 Prerequisites
The following Python modules need to be installed (versions older than those indicated might also work):
- python >= 3.7.1
- numpy >= 1.15.4
- psutil >= 5.4.7
- matplotlib >= 2.2.3
- scikit-learn >= 0.20.1
- ... any modules required by the feature selection algorithm
1.2 How to get pystreamfs
Using pip: pip install pystreamfs
OR: Download and unpack the .zip (Windows) or .tar.gz (Linux) file in /dist. Then navigate to the unpacked folder and execute: python setup.py install
2 The Package
2.1 Files
The main module is /pystreamfs/pystreamfs.py. Feature selection algorithms are stored in /algorithms. Datasets are stored in /datasets. Examples can be found in /examples.
2.2 Main module: pystreamfs.py
pystreamfs.py provides the following functions:
X, Y = prepare_data(data, target, shuffle)
- Description: Prepare the data set for the simulation of a data stream: optionally sort the rows of the data matrix randomly, then extract the target variable Y and the features X.
- Input:
  - data: numpy.ndarray, data set
  - target: int, index of the target variable
  - shuffle: bool, if True sort samples randomly
- Output:
  - X: numpy.ndarray, features
  - Y: numpy.ndarray, target variable
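The behaviour described for prepare_data() can be illustrated with a few lines of numpy. This is a minimal sketch under stated assumptions, not the package's actual implementation: the function name prepare_data_sketch and the seed argument are hypothetical, and target is assumed to be a column index.

```python
import numpy as np


def prepare_data_sketch(data, target, shuffle, seed=None):
    """Split a data matrix into features X and target Y (illustrative only)."""
    if shuffle:
        # Permute the row order; `seed` is a hypothetical extra argument
        # added here only to make the shuffle reproducible.
        rng = np.random.RandomState(seed)
        data = data[rng.permutation(data.shape[0])]
    Y = data[:, target]                  # column holding the target variable
    X = np.delete(data, target, axis=1)  # all remaining columns are features
    return X, Y


# Toy data with the target in the first column, as in the built-in datasets.
toy = np.array([[1.0, 10.0, 20.0],
                [0.0, 30.0, 40.0],
                [1.0, 50.0, 60.0]])
X, Y = prepare_data_sketch(toy, target=0, shuffle=False)
```

Because rows are permuted before the split, every sample in X stays aligned with its label in Y.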
stats = simulate_stream(X, Y, fs_algorithm, model, param)
- Description: Iterate over all datapoints in the dataset to simulate a data stream. Perform the given feature selection algorithm and return performance statistics.
- Input:
  - X: numpy array, this is the X returned by prepare_data()
  - Y: numpy array, this is the Y returned by prepare_data()
  - fs_algorithm: function, feature selection algorithm
  - ml_model: object, the machine learning model to use for the computation of the accuracy score (remark on KNN: the number of neighbours must not exceed the batch size)
  - param: python dict(), includes:
    - num_features: integer, the number of features you want returned
    - batch_size: integer, number of instances processed in one iteration
    - ... additional algorithm specific parameters
- Output:
  - stats: python dictionary
    - features: set of selected features for every batch
    - time_avg: float, average computation time for one execution of the feature selection
    - time_measures: list, time measures for every batch
    - memory_avg: float, average memory usage after one execution of the feature selection, uses psutil.Process(os.getpid()).memory_full_info().uss
    - memory_measures: list, memory measures for every batch
    - acc_avg: float, average accuracy for classification with the selected feature set
    - acc_measures: list, accuracy measures for every batch
    - fscr_avg: float, average feature selection change rate (fscr) per time window. The fscr is the percentage of selected features that changes in t with respect to t-1 (fscr=0 if all selected features remain the same, fscr=1 if all selected features change)
    - fscr_measures: list, fscr measures for every batch
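The batch loop and the fscr metric described above can be sketched in plain Python. This is an illustration only, not the package's code: iter_batches, fscr, simulate_sketch and the select_features callback are hypothetical names, the callback does not mirror the real fs_algorithm signature, and keeping a shorter final batch is an assumption.

```python
import time


def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data in consecutive batches."""
    for start in range(0, n_samples, batch_size):
        # Assumption: a shorter trailing batch is kept rather than dropped.
        yield start, min(start + batch_size, n_samples)


def fscr(previous, current):
    """Fraction of the features selected at t that were not selected at t-1."""
    previous, current = set(previous), set(current)
    if not current:
        return 0.0
    return len(current - previous) / len(current)


def simulate_sketch(n_samples, batch_size, select_features):
    """Run a stand-in selector on every batch and collect time and fscr."""
    features, time_measures, fscr_measures = [], [], []
    for start, stop in iter_batches(n_samples, batch_size):
        t0 = time.perf_counter()
        selected = select_features(start, stop)
        time_measures.append(time.perf_counter() - t0)
        if features:  # fscr needs a previous batch to compare against
            fscr_measures.append(fscr(features[-1], selected))
        features.append(selected)
    return {
        "features": features,
        "time_measures": time_measures,
        "time_avg": sum(time_measures) / len(time_measures) if time_measures else 0.0,
        "fscr_measures": fscr_measures,
        "fscr_avg": sum(fscr_measures) / len(fscr_measures) if fscr_measures else 0.0,
    }


def toy_selector(start, stop):
    """Stand-in selector: features 0 and 1 on the first batch, 1 and 2 afterwards."""
    return [0, 1] if start == 0 else [1, 2]


sketch_stats = simulate_sketch(n_samples=120, batch_size=50, select_features=toy_selector)
```

With equally sized feature sets, this fscr is 0 when the selection is unchanged and 1 when every selected feature is replaced, matching the definition given for fscr_avg.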
plt = plot_stats(stats, ftr_names, param, fs_name, model_name)
- Description: Plot the statistics for time, memory, fscr and selected features over all time windows.
- Input:
  - stats: python dictionary (see stats of simulate_stream())
  - ftr_names: numpy array, contains all feature names
  - param: python dict(), parameters
  - fs_name: string, name of FS algorithm
  - model_name: string, name of ML model
- Output:
  - plt: pyplot object, statistic plots
2.3 Built-in feature selection algorithms
- Online Feature Selection (OFS) by Wang et al. (paper)
- Unsupervised Feature Selection on Data Streams (FSDS) by Huang et al. (paper)
- Feature Selection based on Micro Cluster Nearest Neighbors by Hamoodi et al. (paper)
- CancelOut Feature Selection based on a Neural Network by Vadim Borisov (more information will be included)
2.4 Built-in datasets
All datasets are cleaned and normalized. The target variable of all datasets is moved to the first column.
- German Credit Score (link)
- Binary version of Human Activity Recognition (link)
  - The original HAR dataset has a multi-class target. For its binary version we defined the class "WALKING" as our positive class (label=1) and all other classes as the negative (non-walking) class. We combined the 1722 samples of the original "WALKING" class with a random sample of 3000 instances from all other classes.
- Usenet (link)
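The construction of the binary HAR dataset described above (keep every "WALKING" sample, add a random sample of the remaining classes) can be sketched as follows. This is an illustration under assumptions, not the script used to build the dataset: the function name, the seed argument and the negative label 0 are hypothetical.

```python
import random


def binarize_and_subsample(labels, positive_label, n_negative, seed=0):
    """Keep all positive samples plus a random subset of the other classes.

    Returns the kept row indices and their binary labels (1 = positive).
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == positive_label]
    neg_idx = [i for i, y in enumerate(labels) if y != positive_label]
    # Raises ValueError if fewer than n_negative negative samples exist.
    keep_neg = rng.sample(neg_idx, n_negative)
    keep = sorted(pos_idx + keep_neg)
    # Assumption: the negative class is encoded as 0.
    binary = [1 if labels[i] == positive_label else 0 for i in keep]
    return keep, binary


# Toy labels; for the dataset described above n_negative would be 3000.
toy_labels = ["WALKING"] * 3 + ["SITTING"] * 5 + ["LAYING"] * 4
kept_rows, binary_labels = binarize_and_subsample(toy_labels, "WALKING", n_negative=4)
```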
3 Example
from pystreamfs import pystreamfs
import numpy as np
import pandas as pd
from pystreamfs.algorithms import ofs
from sklearn.neighbors import KNeighborsClassifier
# Load a dataset
data = pd.read_csv('../datasets/credit.csv')
feature_names = np.array(data.drop(columns='target').columns)  # keyword form; a positional axis argument fails in pandas >= 2.0
data = np.array(data)
# Extract features and target variable
X, Y = pystreamfs.prepare_data(data, 0, False)
Y[Y == 0] = -1 # change 0 to -1, required by ofs
# Load a FS algorithm
fs_algorithm = ofs.run_ofs
# Define parameters
param = dict()
param['num_features'] = 5 # number of features to return
param['batch_size'] = 50 # batch size
# Define ML model
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
# Data stream simulation
stats = pystreamfs.simulate_stream(X, Y, fs_algorithm, model, param)
# Plot statistics
pystreamfs.plot_stats(stats, feature_names, param, 'Online feature selection (OFS)', 'K Nearest Neighbor').show()
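Besides plotting, the averaged metrics can be read directly from the stats dictionary using the keys documented in section 2.2. The helper below is a hypothetical convenience function, not part of pystreamfs, and the dictionary values are placeholders for illustration, not real results.

```python
def summarize_stats(stats):
    """Format the averaged metrics of a stats dictionary as one line."""
    keys = ("time_avg", "memory_avg", "acc_avg", "fscr_avg")
    return ", ".join(f"{key}={stats[key]:.3f}" for key in keys)


# Placeholder values standing in for the output of simulate_stream().
example_stats = {
    "time_avg": 0.012,
    "memory_avg": 1.5e8,
    "acc_avg": 0.7,
    "fscr_avg": 0.25,
}
print(summarize_stats(example_stats))
```

In the example above, the same call would be summarize_stats(stats) on the dictionary returned by pystreamfs.simulate_stream().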