
The library for GLM and Ensemble Tree model explanation

Project description

The "Transparency" Library

Scalable and fast local (single-prediction level) and global (population level) explanation of predictions from:

  • Ensemble trees (e.g., XGBoost, GBM, RF, and decision trees)
  • Generalized linear models (GLM), with support for various families, link powers, and variance powers (e.g., logistic regression)

implemented for models in:

  • Python (Scikit-Learn)
  • Spark (Scala and PySpark)

Installation:

  • pip install transparency

An additional step is required for Spark users:

Transformer Set

- Scikit-Learn Ensemble Tree Explainer Transformer

from transparency.python.explainer.ensemble_tree import EnsembleTreeExplainerTransformer
expl = EnsembleTreeExplainerTransformer(estimator)
X_test_df = expl.transform(X_test_df)
  • estimator: a trained ensemble tree estimator (e.g., random forest, GBM, or XGBoost)
  • X_test_df: a Pandas DataFrame with features as columns and samples as rows

The resulting X_test_df has 3 added columns: 'prediction', 'feature_contributions', and 'intercept_contribution':

  • 'feature_contributions': a column of nested arrays with feature contributions (one array per row)
  • 'intercept_contribution': a column repeating the same scalar value, the contribution of the intercept; for each row, sum(feature_contributions) + intercept_contribution equals the prediction for that row
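The additivity property described above can be illustrated with a small sketch. This is a hand-rolled, Saabas-style decomposition for a single scikit-learn decision tree (the transformer generalizes this idea to ensembles; none of the code below is the library's actual implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

x = X[:1]                                  # explain a single sample
node_ids = tree.decision_path(x).indices   # nodes visited, root -> leaf
values = tree.tree_.value.squeeze()        # mean target value at each node
features = tree.tree_.feature              # split feature per internal node

intercept_contribution = values[0]         # root node mean (the "bias")
feature_contributions = np.zeros(X.shape[1])
for parent, child in zip(node_ids[:-1], node_ids[1:]):
    # the change in node value from parent to child is credited
    # to the feature the parent node splits on
    feature_contributions[features[parent]] += values[child] - values[parent]

prediction = tree.predict(x)[0]
# contributions plus the intercept telescope to the leaf value
assert np.isclose(feature_contributions.sum() + intercept_contribution, prediction)
```

The per-feature credits telescope from the root value to the leaf value, which is exactly why sum(feature_contributions) + intercept_contribution reproduces the prediction.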

- Scikit-Learn Generalized Linear Model (e.g., Logistic regression) Explainer Transformer

from transparency.python.explainer.glm import GLMExplainerTransformer
expl = GLMExplainerTransformer(estimator)
X_test_df = expl.transform(X_test_df, output_proba=False)
  • estimator: a trained GLM estimator (e.g., logistic regression)
  • X_test_df: a Pandas DataFrame with features as columns and samples as rows

The resulting X_test_df has 3 added columns: 'prediction', 'feature_contributions', and 'intercept_contribution':

  • 'feature_contributions': a column of nested arrays with feature contributions (one array per row)
  • 'intercept_contribution': a column repeating the same scalar value, the contribution of the intercept; for each row, sum(feature_contributions) + intercept_contribution equals the prediction for that row
  • if output_proba is set to True, then in the logistic regression case the output prediction and its corresponding explanation are probabilities instead of classification results
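For a GLM the decomposition is exact on the link scale: each feature's contribution is its coefficient times its value. A minimal sketch for logistic regression (plain scikit-learn and NumPy, not the library's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

x = X[0]
feature_contributions = clf.coef_[0] * x     # one contribution per feature
intercept_contribution = clf.intercept_[0]

# on the link (logit) scale the contributions are exactly additive
logit = feature_contributions.sum() + intercept_contribution
assert np.isclose(logit, clf.decision_function(X[:1])[0])

# with output_proba=True the explained quantity is the probability:
# the logit is pushed through the inverse link (the sigmoid)
proba = 1.0 / (1.0 + np.exp(-logit))
assert np.isclose(proba, clf.predict_proba(X[:1])[0, 1])
```

This also shows why output_proba changes the explanation: on the probability scale the contributions are no longer strictly additive, so the transformer has to map them through the inverse link.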

- Pyspark Ensemble Tree Explainer Transformer

 from transparency.spark.prediction.explainer.tree import EnsembleTreeExplainTransformer
 EnsembleTreeExplainTransformer(predictionView=predictions_view, 
                                featureImportanceView=features_importance_view,
                                modelPath=rf_model_path, 
                                label=label_column,
                                dropPathColumn=True, 
                                isClassification=classification, 
                                ensembleType=ensemble_type)
  • modelPath: the path from which the trained model is loaded

  • Supported ensembleType values:

    1. dct
    2. gbt
    3. rf
    4. xgboost4j
  • featureImportanceView: a view of the feature importances extracted from the Apache Spark model metadata (see the python script testutil.common.get_feature_importance for reference), with columns:

    1. Feature_Index
    2. Feature
    3. Original_Feature
    4. Importance
  • The transformer appends 3 main columns to the prediction view:

    1. contrib_column ==> f"prediction_{label_column}_contrib": array of contributions
    2. contrib_column_sum ==> f"{contrib_column}_sum"
    3. contrib_column_intercept ==> f"{contrib_column}_intercept"

- Pyspark Generalized Linear Model (GLM) Explainer Transformer

  from transparency.spark.prediction.explainer.tree import GLMExplainTransformer
  GLMExplainTransformer(predictionView=predictions_view, 
                        coefficientView=coefficients_view,
                        linkFunctionType=link_function_type, 
                        label=label_column, nested=True,
                        calculateSum=True, 
                        family=family, 
                        variancePower=variance_power, 
                        linkPower=link_power)
  • Supported linkFunctionType values:

    1. logLink
    2. powerHalfLink
    3. identityLink
    4. logitLink
    5. inverseLink
    6. otherPowerLink
  • coefficientView: a view of the feature coefficients extracted from the Apache Spark model metadata (see the python script testutil.common.get_feature_coefficients for reference), with columns:

    1. Feature_Index
    2. Feature
    3. Original_Feature
    4. Coefficient
  • The transformer appends 3 main columns to the prediction view:

    1. contrib_column ==> f"prediction_{label_column}_contrib": array of contributions
    2. contrib_column_sum ==> f"{contrib_column}_sum"
    3. contrib_column_intercept ==> f"{contrib_column}_intercept"
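The linkFunctionType values above correspond to the standard GLM inverse-link transformations that map the additive linear predictor (sum of contributions plus intercept) back to the prediction scale. A plain-Python sketch of the standard mappings (the exact column arithmetic lives inside the transformer, and the otherPowerLink behavior below is an assumption based on the configurable linkPower):

```python
import math

# eta = sum(feature contributions) + intercept contribution (link scale)
inverse_links = {
    "identityLink":  lambda eta: eta,                      # mu = eta
    "logLink":       lambda eta: math.exp(eta),            # g(mu) = log(mu)
    "logitLink":     lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # g(mu) = logit(mu)
    "inverseLink":   lambda eta: 1.0 / eta,                # g(mu) = 1/mu
    "powerHalfLink": lambda eta: eta ** 2,                 # g(mu) = sqrt(mu)
    # otherPowerLink presumably uses the configured linkPower p: mu = eta ** (1/p)
}

eta = 0.5
predictions = {name: inv(eta) for name, inv in inverse_links.items()}
```

Because the contributions are additive only on the link scale, the prediction itself is obtained by pushing their sum through the matching inverse link.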

Example Notebooks

Authors

License

Apache License Version 2.0
