Finite-Interval Forecasting Engine for Spark: Machine learning models for discrete-time survival analysis and multivariate time series forecasting for Apache Spark
The Finite-Interval Forecasting Engine for Spark (FIFEforSpark) is an adaptation of the Finite-Interval Forecasting Engine (FIFE) for the Apache Spark environment. Currently, it provides machine learning models (specifically, a gradient-boosted tree model) for discrete-time survival analysis.
If you are already familiar with FIFE, you'll recognize the following explanation of how FIFEforSpark approaches survival analysis. Many sections borrow heavily from the FIFE documentation, because this package adapts FIFE to the Spark environment using exactly the same methodology. For more information on FIFE, see the FIFE documentation. For more on FIFEforSpark, see the FIFEforSpark documentation.
Suppose you have a dataset that looks like this:
ID | period | feature_1 | feature_2 | feature_3 | ... |
---|---|---|---|---|---|
0 | 2016 | 7.2 | A | 2AX | ... |
0 | 2017 | 6.4 | A | 2AX | ... |
0 | 2018 | 6.6 | A | 1FX | ... |
0 | 2019 | 7.1 | A | 1FX | ... |
1 | 2016 | 5.3 | B | 1RM | ... |
1 | 2017 | 5.4 | B | 1RM | ... |
2 | 2017 | 6.7 | A | 1FX | ... |
2 | 2018 | 6.9 | A | 1RM | ... |
2 | 2019 | 6.9 | A | 1FX | ... |
3 | 2017 | 4.3 | B | 2AX | ... |
3 | 2018 | 4.1 | B | 2AX | ... |
4 | 2019 | 7.4 | B | 1RM | ... |
... | ... | ... | ... | ... | ... |
The entities with IDs 0, 2, and 4 are observed in the dataset in 2019.
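To make the structure of such an unbalanced panel concrete, here is a minimal plain-Python sketch (no Spark required) that takes the ID and period columns from the rows shown in the table and finds which entities are still observed in the final period. The names `rows`, `last_period_by_id`, and `observed_in_final_period` are illustrative and are not part of FIFEforSpark.

```python
# (ID, period) pairs copied from the rows shown in the example table;
# feature columns are omitted because they do not affect this check.
rows = [
    (0, 2016), (0, 2017), (0, 2018), (0, 2019),
    (1, 2016), (1, 2017),
    (2, 2017), (2, 2018), (2, 2019),
    (3, 2017), (3, 2018),
    (4, 2019),
]

# The final period is the latest period that appears anywhere in the panel.
final_period = max(period for _, period in rows)

# Track the last period in which each entity appears.
last_period_by_id = {}
for entity_id, period in rows:
    previous = last_period_by_id.get(entity_id, period)
    last_period_by_id[entity_id] = max(previous, period)

# Entities whose last observation falls in the final period are still observed.
observed_in_final_period = sorted(
    entity_id
    for entity_id, last in last_period_by_id.items()
    if last == final_period
)
print(observed_in_final_period)  # [0, 2, 4]
```

Entities 1 and 3 drop out before 2019, which is what makes the panel unbalanced: entities enter and leave the data in different periods.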
While FIFE offers a significantly larger suite of models designed to answer a variety of questions, FIFEforSpark focuses on one question: what is each individual's probability of being observed in any given future year? FIFEforSpark can estimate the answer to this question for any unbalanced panel dataset.
Like FIFE, FIFEforSpark unifies survival analysis and multivariate time series analysis. Tools for the former neglect future states of survival; tools for the latter neglect the possibility of discontinuation. Traditional forecasting approaches for each, such as proportional hazards and vector autoregression (VAR), respectively, impose restrictive functional forms that limit forecasting performance. FIFEforSpark instead uses gradient-boosted trees (via MMLSpark's LightGBM), a flexible approach that avoids those functional-form restrictions.
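As a rough illustration of what a discrete-time survival forecast looks like, the sketch below applies the standard identity that the probability of surviving k periods ahead is the product of the per-period conditional survival probabilities. This is a general property of discrete-time survival analysis, not a description of FIFEforSpark's internals, and the numbers in `conditional_survival` are invented for the example.

```python
from itertools import accumulate
from operator import mul

# Hypothetical values for one individual: the probability of being observed
# in period t + k, given that the individual was observed in period t + k - 1.
conditional_survival = [0.9, 0.8, 0.95]

# The running product gives the probability of still being observed
# k periods ahead, for k = 1, 2, 3.
survival_curve = list(accumulate(conditional_survival, mul))

for horizon, probability in enumerate(survival_curve, start=1):
    print(f"{horizon} period(s) ahead: {probability:.3f}")
```

Because each factor is at most 1, the resulting curve can never increase with the horizon, which matches the intuition that an individual cannot become more likely to remain as time passes.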
FIFEforSpark is simple to use, and its syntax is almost identical to that of FIFE. However, because it is meant to run in the Spark environment in a Python notebook, there are some notable differences. First, the package 'mmlspark' must already be installed and attached to the cluster. Unfortunately, the PyPI version of MMLSpark is not compatible with FIFEforSpark. As such, FIFEforSpark is best used in a Databricks notebook. A tutorial on installing MMLSpark on Databricks is available in the MMLSpark documentation.
FIFEforSpark is available on PyPI (the Python Package Index), so you can install it either by entering the package name in the 'Create Library' tab on Databricks (with Library Source set to PyPI) or by running the following statement at the command prompt:
pip install fifeforspark
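Since FIFEforSpark depends on packages that must be installed separately, it can help to confirm they are importable before going further. The following is a small sketch using only the Python standard library; the helper `missing_packages` is hypothetical and not part of FIFEforSpark, and the import names `pyspark`, `mmlspark`, and `fifeforspark` are assumed to match those used in this README.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed from this README; adjust if your environment differs.
missing = missing_packages(["pyspark", "mmlspark", "fifeforspark"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Note that this only checks that the packages can be located. It does not verify that the installed MMLSpark build is one that is compatible with FIFEforSpark.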
Once installed, generating forecasts is simple. If you are working in a Databricks Python notebook, you can run something like the following code, where 'your_table' is the name of your table.
from fifeforspark.processors import PanelDataProcessor
from fifeforspark.lgb_modelers import LGBSurvivalModeler
# `spark` is the SparkSession that Databricks notebooks provide automatically.
data_processor = PanelDataProcessor(data=spark.sql("select * from your_table"))
data_processor.build_processed_data()  # prepare the panel data for modeling
modeler = LGBSurvivalModeler(data=data_processor.data)
modeler.build_model()  # train the model
forecasts = modeler.forecast()  # generate the forecasts
If you are working in a Python IDE and have both pyspark and MMLSpark successfully installed, you can run the following:
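import findspark
findspark.init()
import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
from fifeforspark.processors import PanelDataProcessor
from fifeforspark.lgb_modelers import LGBSurvivalModeler
spark = SparkSession.builder.getOrCreate()
# path_to_your_data is a placeholder for the path to your CSV file.
# Assuming the file has a header row: header=True keeps the column names and
# inferSchema=True keeps numeric types (otherwise every column is read as a string).
panel_data = spark.read.csv(path_to_your_data, header=True, inferSchema=True)
data_processor = PanelDataProcessor(data=panel_data)
data_processor.build_processed_data()  # prepare the panel data for modeling
modeler = LGBSurvivalModeler(data=data_processor.data)
modeler.build_model()  # train the model
forecasts = modeler.forecast()  # generate the forecasts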
Here's a notebook with real data, where we forecast when world leaders will lose power: REIGN Example Notebook
If you would like more information on FIFEforSpark, you can read the documentation here: FIFEforSpark Documentation