End-to-End Time Series Model Evaluation and Feature Engineering Library

Overview

This project was developed as part of a data science challenge.

The objectives include:

Building a binary classifier to predict whether a visit to repair or maintain a node will succeed or fail.
Providing intuitive explanations for the classifier's predictions to ensure interpretability.

This library offers tools, tailored to the requirements of the challenge, for preprocessing data, engineering features, evaluating models, and interpreting predictions.

The library includes:

FeatureEngineeringProcessor: Comprehensive feature engineering for complex datasets.
TimeSeriesModelEvaluator: Evaluates machine learning models using time series split cross-validation.
Utility Functions: Essential utilities for data manipulation, visualization, and model analysis.

Features

FeatureEngineeringProcessor

A comprehensive, end-to-end class designed to process the visits.txt dataset and transform it into a feature-rich, model-ready format for predictive analysis.

Encoding: Converts categorical variables into numerical features.
Lag Features: Generates lag-based features to uncover sequential patterns and relationships in the data.
Network Features: Derives features from the network structure, such as degree, closeness centrality, and PageRank.
Text Analysis: Encodes engineering notes using co-occurrence metrics, token frequency analysis, and related measures.
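
As an illustration of the lag features listed above, the sketch below attaches to each visit the outcome of the previous visit to the same node, using only rows that come earlier in time. The record keys (node, timestamp, outcome) and the function name add_previous_outcome are hypothetical; the actual columns of visits.txt and the processor's internals are not documented here.

```python
from typing import Dict, List


def add_previous_outcome(visits: List[dict]) -> List[dict]:
    """Return visits in time order, each with a 'prev_outcome' lag feature.

    Assumes hypothetical keys 'node', 'timestamp', and 'outcome'.
    """
    last_outcome: Dict[str, int] = {}
    enriched = []
    for visit in sorted(visits, key=lambda v: v["timestamp"]):
        row = dict(visit)
        # Look up the outcome recorded before this row; None on a first visit.
        row["prev_outcome"] = last_outcome.get(visit["node"])
        # Update the state only after the feature is computed, to avoid leakage.
        last_outcome[visit["node"]] = visit["outcome"]
        enriched.append(row)
    return enriched


visits = [
    {"node": "A", "timestamp": 3, "outcome": 1},
    {"node": "A", "timestamp": 1, "outcome": 0},
    {"node": "B", "timestamp": 2, "outcome": 1},
]
lagged = add_previous_outcome(visits)
for row in lagged:
    print(row)
```

Computing the feature before updating the per-node state is what keeps the current row's own label out of its inputs.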

TimeSeriesModelEvaluator

Time Series Cross-Validation: Supports a configurable number of splits, test sizes, and gaps.
Model Evaluation: Built-in support for evaluating Logistic Regression and SVM with configurable hyperparameters.
Feature Selection: Supports chi-squared feature selection to choose the model inputs.
Metrics Reporting: Automatically computes AUC, F1-score, recall, and ROC curves for the best configuration.
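
To make the roles of the number of splits, the test size, and the gap concrete, here is a minimal pure-Python sketch of an expanding-window time series split. It mirrors the semantics of scikit-learn's TimeSeriesSplit, which accepts parameters with the same names; whether this library wraps that class is an assumption, and time_series_splits is a hypothetical helper.

```python
from typing import Iterator, List, Tuple


def time_series_splits(
    n_samples: int, n_splits: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for an expanding-window split."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("Not enough samples for the requested splits.")
    for test_start in range(first_test_start, n_samples, test_size):
        # Training data ends `gap` rows before the test block begins.
        train = list(range(0, test_start - gap))
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(time_series_splits(n_samples=20, n_splits=3, test_size=4, gap=1))
for train, test in folds:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Every test block lies strictly after its training block, so the model is never evaluated on data older than what it was trained on.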

Utility Functions

SHAP Analysis: Provides SHAP explanations with waterfall and summary plots.
Visualization: ROC curves, feature importance plots, and more.

Installation

pip install FSM-Challenge

Complete Usage Example

Below is a complete example of using the library to preprocess data, evaluate models, and interpret predictions.

Step 1: Import Necessary Modules

from TimeSeriesModelEvaluator import TimeSeriesModelEvaluator
from utils import *
from FeatureEngineering import *

Step 2: Define File Paths

VISITS_FILE = 'visits.txt'
NETWORK_FILE = 'network.adjlist'
OUTPUT_FILE = 'preprocessed_data.csv'
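
The .adjlist extension suggests the plain-text adjacency-list format, in which each line names a source node followed by its neighbors and # starts a comment. That reading is an assumption about network.adjlist. Under it, a minimal sketch of parsing such a file and deriving one network feature (degree centrality) could look like this; parse_adjlist and degree_centrality are hypothetical helpers, not part of the library.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_adjlist(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build an undirected graph from adjacency-list lines."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        source, *targets = line.split()
        _ = graph[source]  # register the node even if it has no neighbors
        for target in targets:
            graph[source].add(target)
            graph[target].add(source)
    return dict(graph)


def degree_centrality(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


sample_lines = ["# hypothetical network", "a b c", "b c", "d"]
graph = parse_adjlist(sample_lines)
centrality = degree_centrality(graph)
print(centrality)
```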

Step 3: Data Preprocessing

Use the FeatureEngineeringProcessor to preprocess the data and generate features:

if __name__ == "__main__":
    data_processor = FeatureEngineeringProcessor(VISITS_FILE, NETWORK_FILE, OUTPUT_FILE)
    data_processor.process_data(add_netork_features=True, add_engineer_note_features=True, add_lag_features=True)

Step 4: Initialize and Evaluate Models

Set up the TimeSeriesModelEvaluator for cross-validation and evaluate configurations:

evaluator = TimeSeriesModelEvaluator(data_path=OUTPUT_FILE, n_splits=5, test_size=1000, gap=0)
evaluator.build_configurations(model_types=['LogisticRegression'], K=[50, 100, 150, 200, 300])
evaluator.run_evaluation()
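
The call above presumably expands the model types and K values into one configuration per combination, with K read here as the number of features kept by chi-squared selection. Both points are assumptions. A minimal sketch of such a grid, using a hypothetical build_grid helper and dictionary layout:

```python
from itertools import product
from typing import Dict, List, Sequence


def build_grid(model_types: Sequence[str], k_values: Sequence[int]) -> List[Dict]:
    """Return one configuration per (model type, K) combination."""
    return [
        {"model_type": model_type, "k": k}
        for model_type, k in product(model_types, k_values)
    ]


grid = build_grid(["LogisticRegression"], [50, 100, 150, 200, 300])
for config in grid:
    print(config)
```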

Step 5: Find and Save the Best Model Configuration

Determine the best-performing configuration and save it:

best_configuration = find_best_configuration(evaluator.configurations, evaluator.n_splits)
print_results_metrics(best_configuration)
save_best_config(best_configuration)
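
One plausible selection rule is to average a metric over the folds and keep the configuration with the highest mean. The sketch below assumes that rule, AUC as the metric, and a hypothetical fold_scores layout; the criterion find_best_configuration actually applies is not stated in this description.

```python
from statistics import mean
from typing import Dict, List


def pick_best(configurations: List[Dict], n_splits: int, metric: str = "auc") -> Dict:
    """Return the configuration with the highest mean metric across folds."""
    # Ignore configurations that were not evaluated on every fold.
    complete = [c for c in configurations if len(c["fold_scores"]) == n_splits]
    if not complete:
        raise ValueError("No configuration has results for every fold.")
    return max(complete, key=lambda c: mean(f[metric] for f in c["fold_scores"]))


configurations = [
    {"k": 50, "fold_scores": [{"auc": 0.70}, {"auc": 0.74}]},
    {"k": 100, "fold_scores": [{"auc": 0.78}, {"auc": 0.72}]},
    {"k": 150, "fold_scores": [{"auc": 0.90}]},  # incomplete, so skipped
]
best = pick_best(configurations, n_splits=2)
print(best["k"])
```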

Step 6: Evaluate a Trivial Classifier for Baseline Comparison

Compare the best model with a trivial majority classifier:

trivial_results = evaluate_majority_classifier(evaluator.X, evaluator.y, evaluator.tscv)
plot_configuration_results(best_configuration, trivial_classifier_results=trivial_results)
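
A majority classifier always predicts the most frequent label seen in the training fold. The following sketch shows that idea on toy data with a hypothetical majority_baseline_accuracy helper; it reports accuracy only, while the metrics evaluate_majority_classifier returns are not documented here.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_baseline_accuracy(
    y: Sequence[int], splits: Sequence[Tuple[List[int], List[int]]]
) -> List[float]:
    """Accuracy per fold when always predicting the training majority label."""
    scores = []
    for train_idx, test_idx in splits:
        # The majority label is learned from the training fold only.
        majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if y[i] == majority)
        scores.append(correct / len(test_idx))
    return scores


y = [1, 1, 0, 1, 1, 0, 0, 1]
splits = [([0, 1, 2, 3], [4, 5]), ([0, 1, 2, 3, 4, 5], [6, 7])]
baseline = majority_baseline_accuracy(y, splits)
print(baseline)
```

Because such a classifier gives every instance the same score, its ROC AUC is 0.5, which makes it a natural floor for comparison.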

Step 7: Analyze Individual Predictions with SHAP

Retrieve specific instances and generate SHAP explanations:

instance_idx_1 = 1  # Instance indices are 1-based: 1 selects the first instance
instance_idx_2 = 2

sample1 = get_instance(instance_idx_1, evaluator.X, evaluator.y)
sample2 = get_instance(instance_idx_2, evaluator.X, evaluator.y)

save_shap_summary(best_configuration, evaluator.X, evaluator.y)
shap_explainer(configuration=best_configuration, dataX=evaluator.X, dataY=evaluator.y, sample=sample1, name='Instance1')
shap_explainer(configuration=best_configuration, dataX=evaluator.X, dataY=evaluator.y, sample=sample2, name='Instance2')
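
For intuition about what a waterfall plot shows for a linear model: if features are treated as independent, the SHAP value of feature i in log-odds space reduces to w_i * (x_i - mean(x_i)), and the values sum to the gap between the sample's score and the average score. The sketch below checks that identity on toy numbers. It is not the code behind shap_explainer, and all names and values in it are hypothetical.

```python
from typing import List, Sequence


def log_odds(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Linear score of a logistic regression before the sigmoid."""
    return bias + sum(w * v for w, v in zip(weights, x))


def linear_shap_values(
    weights: Sequence[float],
    background: Sequence[Sequence[float]],
    sample: Sequence[float],
) -> List[float]:
    """Per-feature contributions w_i * (x_i - mean_i), assuming independence."""
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(weights))]
    return [w * (x - m) for w, x, m in zip(weights, sample, means)]


weights, bias = [2.0, -1.0], 0.5
background = [[0.0, 1.0], [2.0, 3.0]]
sample = [2.0, 1.0]

phi = linear_shap_values(weights, background, sample)
base_value = sum(log_odds(weights, bias, row) for row in background) / len(background)
print(phi, base_value, log_odds(weights, bias, sample))
```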

Step 8: Visualize Model Insights

Generate a visualization of the most important features:

plot_logistic_regression_top_weights(best_configuration, top_n=15)
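
Ranking logistic regression features usually means sorting coefficients by absolute value and keeping the largest top_n. A minimal sketch of that ranking follows, with hypothetical feature names and coefficients; the plotting function's internals are not described here.

```python
from typing import List, Sequence, Tuple


def top_weights(
    feature_names: Sequence[str], coefficients: Sequence[float], top_n: int = 15
) -> List[Tuple[str, float]]:
    """Return the top_n (name, coefficient) pairs by absolute coefficient."""
    ranked = sorted(
        zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True
    )
    return ranked[:top_n]


names = ["degree", "prev_outcome", "note_length", "pagerank"]
coefs = [0.2, -1.5, 0.05, 0.9]
top = top_weights(names, coefs, top_n=2)
print(top)
```

Coefficient magnitudes are comparable across features only when the features are on similar scales, so standardizing the inputs before reading such a ranking is advisable.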

Project details

Release history

1.0.1 (this version): Dec 1, 2024
1.0.0: Dec 1, 2024

Download files

Source Distribution

fsm_challenge-1.0.1.tar.gz (14.7 kB), uploaded Dec 1, 2024

Built Distribution

FSM_Challenge-1.0.1-py3-none-any.whl (16.1 kB), uploaded Dec 1, 2024, Python 3

File details

Details for the file fsm_challenge-1.0.1.tar.gz.

File metadata

Download URL: fsm_challenge-1.0.1.tar.gz
Upload date: Dec 1, 2024
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.20

File hashes

Hashes for fsm_challenge-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`4863a939635e9d574e1d4ab22662e0e19bb5dcbcb827a6d9f804d3b5ff4024cb`
MD5	`342e5074d2bca7171d13374cc4ab3163`
BLAKE2b-256	`c3f63a54fdbc98e113438e2c5fdd08ca23e825a0ad36ae84352271f237eaae13`


File details

Details for the file FSM_Challenge-1.0.1-py3-none-any.whl.

File metadata

Download URL: FSM_Challenge-1.0.1-py3-none-any.whl
Upload date: Dec 1, 2024
Size: 16.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.20

File hashes

Hashes for FSM_Challenge-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0528052b928764fa90f5e42c08a021c6a2ea7429e2baf27d69472de20fb2b849`
MD5	`817350a43294cff221d96ed3f19cd03c`
BLAKE2b-256	`c862c5cb1068393e1b073cad8c8ecc1be435902fa368392a2e470ba3ca0036ca`

