Facilitating machine learning in the Norwegian Avalanche Forecasting Service

Project description

avaml - helper functions for avalanche machine learning

This package contains functions used to prepare data from the Norwegian Avalanche Forecasting Service to facilitate machine learning.

Installation

To install using pip:

pip install avaml
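
To verify that the package can be imported:

python -c "import avaml"

The example program below also uses regobslib, pandas and scikit-learn,
which can be installed with pip in the same way.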

Example program

Searching for data

Here is a short example program using the package to prepare data for training a RandomForestClassifier:

import sys

from regobslib import SnowRegion
import avaml
import datetime as dt
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

DAYS_IN_TIMELINE = 4
START_DATE = dt.date(2017, 11, 1)
STOP_DATE = dt.date(2021, 7, 1)
TRAINING_REGIONS = [
    SnowRegion.VEST_FINNMARK,
    SnowRegion.NORD_TROMS,
    SnowRegion.LYNGEN,
    SnowRegion.SOR_TROMS,
    SnowRegion.INDRE_TROMS,
    SnowRegion.LOFOTEN_OG_VESTERALEN,
    SnowRegion.SVARTISEN,
    SnowRegion.HELGELAND,
    SnowRegion.TROLLHEIMEN,
    SnowRegion.ROMSDAL,
    SnowRegion.SUNNMORE,
    SnowRegion.INDRE_FJORDANE,
    SnowRegion.JOTUNHEIMEN,
    SnowRegion.HALLINGDAL,
    SnowRegion.INDRE_SOGN,
    SnowRegion.VOSS,
    SnowRegion.HEIANE,
]
VALIDATION_REGIONS = [
    SnowRegion.TROMSO,
    SnowRegion.SALTEN,
    SnowRegion.VEST_TELEMARK,
]
TEST_REGIONS = [
    SnowRegion.FINNMARKSKYSTEN,
    SnowRegion.OFOTEN,
    SnowRegion.HARDANGER,
]
REGIONS = sorted(TRAINING_REGIONS + VALIDATION_REGIONS + TEST_REGIONS)

# fetch_and_prepare_aps_varsom() does a number of things:
# * It checks whether a call with the same parameters has been made
#   earlier. If so, it loads the prepared CSVs from disk.
# * If there is no prepared data on disk, it checks whether raw
#   data is available instead.
# * If no previously downloaded data is found, it downloads data
#   from APS and Varsom.
# * It saves the data as CSVs to 'cache_dir'.
# * It prepares the data:
#   * The Varsom DataFrame is transformed to have the same index
#     as the APS DataFrame, replicating each forecast for every
#     elevation level. Avalanche problems that do not exist at a
#     row's elevation level are removed, and the elevation data is
#     stripped from the avalanche problems. All rows that do not
#     contain complete APS data are removed.
#   * A timeline is created out of the APS DataFrame: the APS data
#     is concatenated onto shifted versions of itself, so that each
#     APS row contains several days' worth of data. Rows that become
#     incomplete in the process are removed (see the sketch after
#     the call below).
aps, varsom = avaml.prepare.fetch_and_prepare_aps_varsom(START_DATE,
                                                         STOP_DATE,
                                                         DAYS_IN_TIMELINE,
                                                         REGIONS,
                                                         print_to=sys.stdout,  # Print progression to terminal
                                                         cache_dir=".",  # Use current dir for csv-files
                                                         read_cache=True,
                                                         write_cache=True)
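
# Illustration only: the timeline construction described in the comment
# above is roughly equivalent to the following pandas operations on a
# hypothetical frame with one row per (region, date). avaml's internal
# layout differs, e.g. it also handles elevation levels:
#
#     shifted = [frame.groupby(level="region").shift(day)
#                for day in range(DAYS_IN_TIMELINE)]
#     timeline = pd.concat(shifted, axis=1,
#                          keys=range(DAYS_IN_TIMELINE)).dropna()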

print("Training and predicting problems")
# split_avalanche_problems() takes the Varsom DataFrame and splits it into
# several DataFrames, each containing information about one avalanche
# problem. If a problem row contains some values but also some NaNs, it is
# invalid and is removed from both Varsom and APS.
#
# A 3-tuple is returned: (problems_X: Dict, problems_Y: Dict, problems: List).
# * problems_X is a dict with the problem as key and a DataFrame of input data as value
# * problems_Y is a dict with the problem as key and a DataFrame of labels as value
# * problems is a list of avalanche problem names
problems_X, problems_Y, problems = avaml.prepare.split_avalanche_problems(aps, varsom)
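
# Note: each problems_X[problem] / problems_Y[problem] pair shares the
# same row index, which is what lets the boolean masks below select
# matching rows from both frames. Illustration only:
#
#     assert problems_X[problems[0]].index.equals(problems_Y[problems[0]].index)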
f1_dict = {}
for problem in problems:
    X = problems_X[problem]
    Y = problems_Y[problem]

    # Splitting data into TRAINING and VALIDATION sets.
    # Y has one column per label attribute; any(axis=1) collapses them
    # into a single boolean per row: whether the avalanche problem is
    # present at all.
    training_index = Y.index.isin(TRAINING_REGIONS, level="region")
    validation_index = Y.index.isin(VALIDATION_REGIONS, level="region")
    training_X = X.loc[training_index]
    training_Y = Y.loc[training_index].any(axis=1)
    validation_X = X.loc[validation_index]
    validation_Y = Y.loc[validation_index].any(axis=1)

    # Training and validating
    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(training_X.values, training_Y)
    prediction = pd.Series(classifier.predict(validation_X.values), index=validation_X.index)

    # Calculating and storing scores to dict. Predictions are made per
    # elevation level; unstack() moves the innermost index level (the
    # elevation level) into columns, and any(axis=1) then aggregates the
    # per-elevation values into a single value per forecast.
    elevation_prediction = prediction
    elevation_ground_truth = validation_Y
    problem_prediction = elevation_prediction.unstack().any(axis=1)
    problem_ground_truth = elevation_ground_truth.unstack().any(axis=1)
    training_elevation_ground_truth = training_Y
    training_problem_ground_truth = training_elevation_ground_truth.unstack().any(axis=1)
    elevation_f1 = metrics.f1_score(elevation_ground_truth, elevation_prediction)
    problem_f1 = metrics.f1_score(problem_ground_truth, problem_prediction)
    f1_dict[problem] = {
        ("f1", "per_elevation"): elevation_f1,
        ("f1", "per_forecast"): problem_f1,
        ("training_n_true", "per_elevation"):
            f"{training_elevation_ground_truth.sum()}/{len(training_elevation_ground_truth)}",
        ("training_n_true", "per_forecast"):
            f"{training_problem_ground_truth.sum()}/{len(training_problem_ground_truth)}",
        ("validation_n_true", "per_elevation"):
            f"{elevation_ground_truth.sum()}/{len(elevation_ground_truth)}",
        ("validation_n_true", "per_forecast"):
            f"{problem_ground_truth.sum()}/{len(problem_ground_truth)}",
    }

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.expand_frame_repr', False):
    print(pd.DataFrame(f1_dict).T)
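
The TEST_REGIONS defined at the top of the program are held out for a final
evaluation after model selection. As a minimal sketch of how that last step
could look, reusing the objects from the program above (illustration only,
not part of the avaml API):

for problem in problems:
    X = problems_X[problem]
    Y = problems_Y[problem]

    train_index = Y.index.isin(TRAINING_REGIONS + VALIDATION_REGIONS, level="region")
    test_index = Y.index.isin(TEST_REGIONS, level="region")

    # Refit on training + validation data, then score once on the test set
    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X.loc[train_index].values, Y.loc[train_index].any(axis=1))
    test_prediction = classifier.predict(X.loc[test_index].values)
    print(problem, metrics.f1_score(Y.loc[test_index].any(axis=1), test_prediction))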

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

avaml-0.0.6.tar.gz (5.9 kB)

Built Distribution

avaml-0.0.6-py3-none-any.whl (6.0 kB)

File details

Details for the file avaml-0.0.6.tar.gz.

File metadata

  • Download URL: avaml-0.0.6.tar.gz
  • Upload date:
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.6.4 pkginfo/1.8.2 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.10

File hashes

Hashes for avaml-0.0.6.tar.gz

  • SHA256: 75473fd4f1aad92c6aea9aed989f10f9be05e7271d481c200722435619fbe291
  • MD5: 824ba4aa04af91b32652953d6b2a6251
  • BLAKE2b-256: 2ae871532685210c00eb60e9f09b12c3324f3c4b1c77b6ae3ea01b9c377b6e2b

File details

Details for the file avaml-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: avaml-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.6.4 pkginfo/1.8.2 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.10

File hashes

Hashes for avaml-0.0.6-py3-none-any.whl

  • SHA256: df2ae1d269147704c5208e5f74bbb75062939be9ede0116b40c4ce83f0c2ca94
  • MD5: 23802d3e18920f18063fa916f2fa998a
  • BLAKE2b-256: a89908fbae45a97ff6ea71f8bccca0b3bea0a675d0070443d64c6acc2e93e3ea
