---
title: "Evaluation and explanation"
author: "Alex Zwanenburg"
date: "2026-04-22"
output:
rmarkdown::html_vignette:
includes:
in_header: familiar_logo.html
after_body: license.html
toc: TRUE
rmarkdown::github_document:
html_preview: FALSE
includes:
in_header: familiar_logo.html
after_body: license.html
toc: TRUE
bibliography: "refs.bib"
vignette: >
%\VignetteIndexEntry{Evaluation and explanation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
``` r
library(familiar)
library(data.table)
set.seed(19)
```
Familiar will automatically evaluate created models, and export corresponding
tables and plots. The table below lists all available analyses. Several
analyses allow for determining distributions and confidence intervals to provide
an indication of the value spread for a given dataset and models (if
applicable). The `estimation_type` argument defines this behaviour:
- `point`: Compute point estimates, i.e. single values.
- `bias_correction` or `bc`: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates.
- `bootstrap_confidence_interval` or `bci`: Bias-corrected estimates with
bootstrap confidence intervals [@Efron2016-ws]. This estimation type allows for
plotting confidence intervals in plots. The number of point estimates required
depends on the `confidence_level` parameter. Familiar uses a rule of thumb of
n=20/(1-`confidence_level`), i.e. at least 400 points for
`confidence_level=0.95`.
Another argument, `detail_level` determines how the required point estimates are
formed:
- `ensemble`: Point estimates are computed at the ensemble level, i.e. over
all models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20 bootstraps
and computing the model performance of the ensemble model for each bootstrap.
- `hybrid`: Point estimates are computed from the models in an ensemble.
This means that, for example, bias-corrected estimates of model performance are
directly computed using the models in the ensemble. If there are at least 20
trained models in the ensemble, performance is computed for each model, in
contrast to `ensemble` where performance is computed for the ensemble of models.
If there are less than 20 trained models in the ensemble, bootstraps are created
so that at least 20 point estimates can be made.
- `model`: Point estimates are computed at the model level. This means that,
for example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model for
each bootstrap. Familiar will not automatically create plots in this case.
The more point estimates are required, the more computationally
intensive the analysis is. Some analyses also become computationally more
expensive with more samples. The `sample_limit` argument can be used to specify
the maximum number of samples that should be used.
Some analyses, such as model predictions and individual conditional expectation
plots, do not benefit from bootstraps as they are solely based on values
predicted by the (ensemble) models. If you want to compute bias-corrected
estimates or bootstrap confidence intervals for these analyses you should ensure
that sufficient models are created. Experimental designs such as
`experimental_design="bs(fs+mb,400)+ev"` allow for this, but are computationally
expensive.
Several methods do not require any more information than that provided in a
prediction table or dataset.
| Name | Plot function | Export function | Est. type | Detail level | Sample limit | Model | Pred. table | Dataset |
|------------------------------------------|---------------------------------------------------------------------------------------|---------------------------------------|:---------:|:------------:|:------------:|:-----:|:-----------:|:-------:|
| AUC-PR^a^ | `plot_auc_precision_recall_curve` | `export_auc_data` | × | × | | × | × | |
| AUC-ROC^a^ | `plot_auc_roc_curve` | `export_auc_data` | × | × | | × | × | |
| Calibration info^c^ | | `export_calibration_info` | | × | | × | | |
| Confusion matrix^a^ | `plot_confusion_matrix` | `export_confusion_matrix_data` | | × | | × | × | |
| Decision curve analysis^ab^ | `plot_decision_curve` | `export_decision_curve_analysis_data` | × | × | | × | × | |
| Feature expression^c^ | `plot_sample_clustering` | `export_feature_expressions` | | | | × | | × |
| Feature selection variable importance^c^ | `plot_feature_selection_occurrence`; `plot_feature_selection_variable_importance` | `export_fs_vimp` | | | | × | | |
| Feature similarity^c^ | `plot_feature_similarity` | `export_feature_similarity` | × | | | × | | × |
| Hyperparameters^c^ | | `export_hyperparameters` | | | | × | | |
| Individual conditional expectation^c^ | `plot_ice` | `export_ice_data` | ×^d^ | × | × | × | | |
| Model calibration^ce^ | `plot_calibration_data` | `export_calibration_data` | × | × | | × | | |
| Model performance^c^ | `plot_model_performance` | `export_model_performance` | × | × | | × | × | |
| Model predictions^c^ | | `export_prediction_data` | ×^d^ | × | | × | x | |
| Model variable importance^ce^ | `plot_model_signature_occurrence`; `plot_model_signature_variable_importance` | `export_model_vimp` | | | | × | | |
| Partial dependence^c^ | `plot_pd` | `export_partial_dependence_data` | ×^d^ | × | × | × | | |
| Permutation variable importance^c^ | `plot_permutation_variable_importance` | `export_permutation_vimp` | × | × | | × | | |
| Risk stratification info^b^ | | `export_risk_stratification_info` | | × | | × | | |
| Risk stratification data^b^ | `plot_kaplan_meier` | `export_risk_stratification_data` | | | | × | × | |
| Sample similarity^c^ | | `export_sample_similarity` | × | | × | × | | × |
| SHAP values | `plot_shap_summary`, `plot_shap_waterfall`, `plot_shap_force`, `plot_shap_dependence` | `export_shap` | | × | × | × | | |
| Univariate analysis^c^ | `plot_univariate_importance` | `export_univariate_analysis_data` | | | | × | | |
: Evaluation and explanation steps.\
^a^ Available for binomial and multinomial outcomes.\
^b^ Available for survival outcomes.\
^c^ Available for all outcomes.\
^d^ Estimation types other than `point` require sufficient models.\
^e^ May not be available for all models.
When familiar is run from `summon_familiar` plots and tables are exported
automatically. The corresponding plot and export functions can all be used externally as well.
This vignette shows their use.
We will first create two models. The first model aims to predict the risk for
ischemic stroke based on clinical and image features. This is a classification problem.
For illustration purposes, we will use the first 99 samples as development data,
and the remaining as a holdout set.
``` r
ischemic_stroke_data <- data.table::as.data.table(modeldata::ischemic_stroke)
# Set categorical variables.
ischemic_stroke_data$male <- factor(
ischemic_stroke_data$male, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$smoking_history <- factor(
ischemic_stroke_data$smoking_history, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$atrial_fibrillation <- factor(
ischemic_stroke_data$atrial_fibrillation, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$coronary_artery_disease <- factor(
ischemic_stroke_data$coronary_artery_disease, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$diabetes_history <- factor(
ischemic_stroke_data$diabetes_history, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$hypercholesterolemia_history <- factor(
ischemic_stroke_data$hypercholesterolemia_history, levels = c(0, 1), labels = c("no", "yes")
)
ischemic_stroke_data$hypertension_history <- factor(
ischemic_stroke_data$hypertension_history, levels = c(0, 1), labels = c("no", "yes")
)
# Create training and hold-out test set.
ischemic_stroke_data_train <- ischemic_stroke_data[1L:99L, ]
ischemic_stroke_data <- ischemic_stroke_data[100L:126L, ]
# Call familiar to create models. We don't use summon_familiar here, but call
# train_familiar instead, which returns familiar models.
ischemic_stroke_model <- familiar::train_familiar(
data = ischemic_stroke_data_train,
outcome_column = "stroke",
class_levels = c("no", "yes"),
vimp_method = "none",
learner = "random_forest_ranger_default",
parallel = FALSE,
verbose = FALSE
)
```
The second model is a survival model, and is trained on a colon
chemotherapy clinical trial dataset [@Moertel1995-bh] to predict recurrence
after treatment.
``` r
survival_data <- data.table::as.data.table(survival::colon)
# Select data corresponding to tumour recurrence.
survival_data <- survival_data[etype == 1L]
# Remove unncessary columns.
survival_data[, ":=" ("id" = NULL, "study" = NULL, "etype" = NULL, "node4" = NULL)]
# Set categorical variables.
survival_data$sex <- factor(
survival_data$sex, levels = c(0, 1), labels = c("female", "male")
)
survival_data$obstruct <- factor(
survival_data$obstruct, levels = c(0, 1), labels = c("no", "yes")
)
survival_data$adhere <- factor(
survival_data$adhere, levels = c(0, 1), labels = c("no", "yes")
)
survival_data$differ <- factor(
survival_data$differ,
levels = c(1, 2, 3),
labels = c("well", "moderate", "poor"),
ordered = TRUE
)
survival_data$extent <- factor(
survival_data$extent,
levels = c(1, 2, 3, 4),
labels = c("submucosa", "muscle", "serosa", "contiguous structures"),
ordered = TRUE
)
survival_data$surg <- factor(
survival_data$surg, levels = c(0, 1), labels = c("short", "long")
)
# Create training and hold-out test set.
survival_data_train <- survival_data[1L:799L, ]
survival_data <- survival_data[800L:929L, ]
survival_model <- familiar::train_familiar(
data = survival_data_train,
outcome_column = c("status", "time"),
vimp_method = "none",
learner = "random_forest_ranger_default",
parallel = FALSE,
verbose = FALSE
)
```
# Model performance
We assess model performance to evaluate accuracy of a model. Model accuracy is
quantified using performance metrics that are documented in the *Performance metrics*
vignette, and can be set using the `evaluation_metric` argument of the
`summon_familiar` function or `metric` for plot and export functions. For
example, this results in the following plot for the ischemic stroke dataset:
``` r
plots <- familiar::plot_model_performance(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
metric = c("auc", "brier", "f1_score"),
x_axis_by = "metric"
)
```
The plot shows distributions of the area under the receiver operating
characteristic curve, the brier score and the f1-score, based on bootstrap
resampling of model predictions.
Model performance can also be determined directly from predictions, which makes
it possible to assess predictions made by non-familiar models. This involves
conversion of the predictions to a prediction table object using `as_prediction_table`,
as shown below:
``` r
predictions <- predict(
object = ischemic_stroke_model,
newdata = ischemic_stroke_data
)
# Create a prediction table object using as_prediction_table. The minimum
# you need to provide here are: prediction probabilities for the positive class
# (yes), the reference labels, the type of prediction (classification),
# and providing the class levels (no, yes).
predictions <- familiar::as_prediction_table(
x = predictions$yes,
y = ischemic_stroke_data$stroke,
type = "classification",
class_levels = c("no", "yes")
)
#> Error in `familiar::as_prediction_table()`:
#> ! (converted from warning) The positive class cannot be directly inferred from names of the provided prediction data, and is assumed to be yes.
# The prediction table object is then provided as object
plots <- familiar::plot_model_performance(
object = predictions,
metric = c("auc", "brier", "f1_score"),
x_axis_by = "metric"
)
#> Error in `.local()`:
#> ! data_element expects one of feature_expressions, risk_stratification_data, feature_similarity and sample_similarity as input. Found the following unknown values: model_performance
```
This produces the exact same plot. Model performance can also be exported to a
table using using `export_model_performance`.
``` r
results <- familiar::export_model_performance(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
metric = c("auc", "brier", "f1_score"),
aggregate_results = TRUE
)
results[[1L]]@data
#> data_set vimp_method learner ensemble_model_name metric value ci_low ci_up
#>
#> 1: 173529_2wFvaYlrMM6XXSNOt9rE none random_forest_ranger_default ensemble.NA.NA auc 0.7387626 0.5166193 0.9146674
#> 2: 173529_2wFvaYlrMM6XXSNOt9rE none random_forest_ranger_default ensemble.NA.NA brier 0.2041370 0.1639073 0.2492457
#> 3: 173529_2wFvaYlrMM6XXSNOt9rE none random_forest_ranger_default ensemble.NA.NA f1_score 0.7407407 0.5152258 0.8967186
```
In the above table, `data_set` is randomly generated because we call this method
on a (technically) *new* dataset. `ensemble_model_name` contains a placeholder
for the same reason. `vimp_method` and `learner` are derived from the model, and
metrics with 95% confidence intervals are shown.
## Receiver-operating characteristic curve
The receiver-operating characteristic (ROC) curve visualises concordance between
predicted class probabilities and the observed classes [@Hand2001-il]. It shows
the trade-off between sensitivity and specificity of the model. The
receiver-operating characteristic curve for the ischemic stroke dataset is as
follows:
``` r
plots <- familiar::plot_auc_roc_curve(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
Under ideal circumstances, the curve would lie in the top-left corner of the
plot. If a model does not predict better than at random, the curve would lie
along the diagonal, which corresponds to an AUC-ROC value of 0.50. The plot
above indicates that the model can somewhat differentiate between patients with
and without stroke.
Familiar creates an ROC curve by ascendingly sorting instances by the predicted
probability of the positive class, and computing the number of true positive,
true negative, false positive, false negative cases observed at each instance.
These are then used to compute sensitivity and specificity.
The above plot shows the average of 400 single curves, with 95% confidence
intervals. To achieve this, familiar interrogates each curve over the complete
range of potential specificity values ([0.000, 1.000]) with a step size of
0.005. Linear interpolation is used for this purpose [@Davis2006-xb]. This
allows for assessing the distribution of sensitivity at fixed specificity
values.
The above approach is the general approach used by familiar. However, familiar
will plot an exact ROC curve if the point estimate (`estimation_type="point"`)
of a single model or the complete ensemble of models is evaluated
(`detail_level="ensemble"`):
``` r
plots <- familiar::plot_auc_roc_curve(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
estimation_type = "point"
)
```
Note that we did not specify the detail level to create the above plot because
only a single model is evaluated.
## Precision-recall curve
The precision-recall (PR) curve shows the trade-off between precision (positive
predictive value) and recall (sensitivity). An important distinction between the
PR curve and the ROC curve described above is that precision does not assess
true negatives but false positives. In case the number of negative instances
greatly exceeds the number of positive instances, a large change in the number
of false positives reduces the ROC curve somewhat, but this change is far easier
observed in the PR curve [@Davis2006-xb]. Hence the PR curve is commonly
assessed if the positive class is more important than and rarer compared to the
negative class. For the ischemic stroke dataset, the precision-recall curve is as
follows:
``` r
plots <- familiar::plot_auc_precision_recall_curve(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
Under ideal circumstances, the curve would lie in the top-right part of the
plot. A random classifier would yield a curve that is mostly horizontal and
located at the fraction of positive instances. This would equal a precision of
56% for the current dataset.
Like the ROC curve, familiar forms the PR curve by ascendingly sorting instances
by the predicted probability of the positive class and computing the number of
true positive and false positive negative cases observed at each instance. These
values are then used to compute recall and precision.
The above plot shows the average of 400 single curves, with 95% confidence
intervals. To achieve this, familiar interrogates each curve over the complete
range of recall values ([0.000, 1.000]) with a step size of 0.005. We use
linear interpolation for this purpose, treating the PR curve as a piecewise
function to manage the situation identified by Davis and Goadrich
[@Davis2006-xb]. Interpolating in this manner allows for assessing the
distribution of precision values at fixed recall values.
Again, this is the general approach used by familiar. Familiar can plot an exact
PR curve when evaluating the point estimate (`estimation_type="point"`) of a
single model or of the complete ensemble of models (`detail_level="ensemble"`).
In the ischemic strok dataset, this produces the following curve:
``` r
plots <- familiar::plot_auc_precision_recall_curve(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
estimation_type = "point"
)
```
## Confusion matrix
A confusion matrix schematically represents the number of predicted (expected)
class instances and their actual (observed) values. This may help identify
potential weaknesses of classifiers, i.e. false positives or negatives.
``` r
plots <- familiar::plot_confusion_matrix(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
Familiar selects the class with the highest predicted class probability as the
expected class. It is currently not possible in familiar to obtain confusion
matrices for set probability thresholds, such as the threshold that maximises
the Youden's Index.
## Kaplan-Meier survival curves
Kaplan-Meier survival curves are a standard method for assessing survival over time in
a population, or for assessing survival differences between groups.
Familiar applies one or more thresholds to stratify instances to risk groups.
These thresholds are created during training to avoid bias. The number
and value of these thresholds is determined by two parameters:
`stratification_method` and `stratification_threshold`. Both parameters are set
when calling `summon_familiar` or `train_familiar`.
`stratification_method` has three options:
- `median`: Familiar uses the median predicted value in the development cohort
to stratify instances into two risk groups.
- `fixed`: Instances are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the
`stratification_threshold` parameter.
- `optimised`: Use maximally selected rank statistics to determine the optimal
threshold [@Lausen1992-qh; @Hothorn2003-qn] to stratify instances into two
optimally separated risk groups.
The `stratification_threshold` parameter is only used when
`stratification_method = "fixed"`. It allows for specifying the sample quantiles
that are used to determine the thresholds. For example
`stratification_threshold = c(0.25,0.75)` creates two thresholds that stratifies
the development dataset into three groups: one with 25% of the instances with
the highest risk, a group with 50% of the instances with medium risk, and a
third group with 25% of instances with lowest risk.
Applying the survival model to the colon cancer dataset yields the following
Kaplan-Meier plot:
``` r
plots <- familiar::plot_kaplan_meier(
object = survival_model,
data = survival_data
)
```
In the plot above, familiar divides instances into two groups by the median
threshold in the development dataset. The risk groups are shown with 95%
confidence intervals. Familiar uses the `survival::survfit` function to create
both the survival curves and the confidence intervals. The low-risk group has
significantly better survival than the high-risk group. This is also indicated
in the bottom-left of the main plot, which shows the result of a logrank test
between the two groups. Crosses indicate right-censored patients, of which there
were few. The number of surviving patients in both groups at the time points
along the x-axis are shown below the main plot.
# Model calibration
Evaluating model performance is important as it allows us to assess how
accurate a model is. However, it often does not tell us how well we can rely on
predictions for individual instances, e.g. how accurate an instance class
probability is. Model calibration is assessed for this purpose. The model for
the ischemic stroke dataset produces the following calibration plot:
``` r
plots <- familiar::plot_calibration_data(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
The calibration curve shows the expected (predicted) probability of ischemic
stroke versus the observed proportion of ischemic stroke, together with a 95%
confidence interval. In an ideal calibration plot the calibration curve should
lie on the diagonal (dashed) line. Low expected probabilities tend to
overestimate the observed probability, whereas high expected probabilities tend
to underestimate the observed probability. The density plot in the top panel
indicates that predicted probabilities were distributed around 0.5, with
relatively few low- and high-probability cases being predicted.
Familiar also performs several calibration tests to assess various degrees of
calibration [@Van_Calster2016-ok]. First of all, we assess two aspects of the
linear fit (solid line in the plot): the intercept (calibration-in-the-large)
and slope. These are shown in the top-left of the main plot.
The intercept ideally is 0.00. In our example, it lies below this point, but is
still contained in the 95% confidence interval, owing to a small number of
samples in the test set.
In an ideal case, the slope of the linear fit is 1.00. In our example, the slope
of the linear fit lies above this value.
One issue with assessing calibration through a linear fit, is that the
calibration curve can be decidedly non-linear and still produce good values for
the intercept and slope. Therefore, we also assess calibration using statistical
goodness-of-fit tests: the Hosmer-Lemeshow (HL) test for categorical and
regression outcomes [@Hosmer1997-ov], and the Nam-D'Agostino (ND)
[@DAgostino2003-ot] and Greenwood-Nam-D'Agostino (GND) [@Demler2015-yx] tests
for survival outcomes.
## Implementation details
Here we detail the implementation of the model calibration in familiar. The
statistical tests mentioned above, as well as the linear fit, rely on grouping
instances. First we order the instances in the dataset by their expected
(predicted) values. Within each group, the expected values and observed values
are compared:
* Categorical outcomes: We compare the fraction instances with the (positive)
observed class in the group against the mean expected probability for that same
class.
* Regression outcomes: Both expected and observed values are first normalised to
a $[0, 1]$ range using the value range observed in the development dataset. We
then compare the mean observed value in the group against the mean expected
value.
* Survival outcomes: We predict survival probabilities at time points indicated
by `evaluation_times`. The mean expected survival probability in the group is
then compared against the observed probability (Kaplan-Meier estimator) at the
same time points.
We can observe the points thus created in a calibration plot when
`estimation_type="point"`:
``` r
plots <- familiar::plot_calibration_data(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
estimation_type = "point"
)
```
The plot with point estimates and the earlier calibration plot with bootstrap
confidence intervals lead to the same assessment. Moreover, you may not arrive
at the same plot. This happens because familiar introduces randomness into the
assessment of calibration. First, familiar randomises the number of groups drawn
because using a fixed number of 10 groups is arbitrary. Second, the actual group
sizes are randomised by drawing group identifiers with replacement for all
instances. We then sort the set of drawn identifiers, and the instances, ordered
by increasing expected value, are then assigned to their respective group.
Linear fit intercept and slope as well as the goodness-of-fit test p-value are
determined for each single set of groups. Intercept and slopes are then averaged
(`point` and `bias_correction`) or assessed as a distribution
(`bootstrap_confidence_interval`). p-values from goodness-of-fit tests need to
be combined. These tests are not fully independent because they derive from the
same dataset. We therefore compute a harmonic mean p-value [@Wilson2019-bi] from
the individual p-values.
# Decision curve analysis
Decision curve analysis is a method to assess the net (clinical) benefit of
models, introduced by Vickers and Elkin [@Vickers2006-be]. The decision curve
compares the model benefit for different threshold probabilities against two or
more options. The two standard options are that all instances receive an
intervention, and none receive an intervention. What constitutes an intervention
is model-dependent [@Vickers2019-ar]. For our ischemic stroke cohort, intervention
might constitute treatment of all patients, versus no treatment and treatment for
patients with malignancy predicted by the model. This produces the following
figure:
``` r
plots <- familiar::plot_decision_curve(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
y_range = c(-0.2, 0.8)
)
```
```
#> Error in `ggplot2::geom_line()`:
#> ! Problem while converting geom to grob.
#> ℹ Error occurred in the 1st layer.
#> Caused by error:
#> ! (converted from warning) Removed 33 rows containing missing values or values outside the scale range (`geom_line()`).
```
The shaded curve is the decision curve produced by the model, with 95%
confidence intervals. The declining curve represents the intervention for all
strategy, which offers no benefit for a probability of ischemic stroke of 56%.
The horizontal line at a net benefit of 0.0 represent the no intervention
strategy. As may be observed, the model offers an expected net benefit compared
to either strategy, but not always significantly.
For regression problems no definition of decision curves exists. However,
familiar can perform decision curve analysis for survival outcomes, based on
Vickers et al. [@Vickers2008-uu].
# Variable importance
Not all features are strongly associated with the outcome of interest. One of
the most critical points in machine learning is selecting those features that
are important enough to incorporate into a model. Familiar assesses variable
importance using the methods detailed in the *Variable importance methods*
vignette during training. Later, during evaluation, familiar employs both
model-specific and model-agnostic methods to explaining which features are
considered important by the model itself.
## Model-specific methods
Model-specific methods rely on aspects of the models themselves to determine
variable importance. For instance, the coefficients of regression models are a
measure of variable importance when features are normalised. In the case of the
random forest model for the ischemic stroke dataset, permutation variable
importance is determined. This produces the following plot:
``` r
plots <- familiar::plot_model_signature_variable_importance(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
rotate_x_tick_labels = TRUE
)
```
```
#> Error in `ggplot2::geom_bar()`:
#> ! Problem while converting geom to grob.
#> ℹ Error occurred in the 1st layer.
#> Caused by error:
#> ! (converted from warning) Removed 13 rows containing missing values or values outside the scale range (`geom_bar()`).
```
In such plots, we always order features so that the most important features are
on the left, and the least important features on the right. Letters indicate
features that are clustered together.
In case an ensemble of models is assessed, scores are aggregated using an
aggregation method (see the *variable importance methods* vignette). This method
can be set using the `aggregation_method` argument, or the
`eval_aggregation_method` configuration parameter (`summon_familiar`).
The `plot_model_signature_occurrence` method plots the occurrence of features
among the first 5 ranks across an ensemble of models (if available). The rank
threshold can be specified using the `rank_threshold` argument or the
`eval_aggregation_rank_threshold` parameter (`summon_familiar`).
## Permutation variable importance
Permutation variable importance is a model-agnostic variable importance method.
It assesses importance by measuring the decrease in model performance caused by
shuffling values of a feature across the instances in a dataset.
For the ischemic stroke dataset, this results in the following figure:
``` r
plots <- familiar::plot_permutation_variable_importance(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
estimation_type = "bias_correction"
)
```
The figure shows the features along the y-axis, and the loss in AUC-ROC caused
by permuting the feature values along the x-axis. Accordingly,
`max_remodeling_ratio` is considered to be the most important feature in the
model.
The figure shows two bars for each feature. A well-known issue with permutation
variable importance is that the presence of highly correlated features causes
permutation variable importance to become unreliable [@Hooker2019-pv]. If two
features are both important and highly correlated, shuffling one of them does
not decrease model performance as much. For this reason, familiar will permute
features together depending on their mutual correlation, and the provided
threshold values.
By default, similarity between features is assessed using McFadden's
pseudo-$R^2$. This metric can assess similarity between different types of
features, i.e. numeric and categorical. The default thresholds are 0.23 and 1.0
that respectively cluster similar features and permute all features
individually. The dataset contains many features that are highly
similar and exceed the relatively stringent threshold of 0.23.
## SHAP summary plots
A SHAP value is the marginal contributions of a feature value to the predicted
value. For example, if you were to predict the number of daily ice cream sales,
season could be a feature. A SHAP value for the summer season is likely
positive, whereas the SHAP value for the winter season would likely be negative.
Important features have larger SHAP values than less important features. Thus
a SHAP summary plot can be used to evaluate feature importance.
``` r
plots <- familiar::plot_shap_summary(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
The SHAP summary plot shows several things: features are listed along the
y-axis, with the most important feature on top, i.e. the feature that has the
highest mean absolute SHAP value. Each point corresponds to the value in an
instance (a patient), and each point is coloured according to its (normalised)
value. For the purpose of representation, all feature values are mapped to the
[-1.0, 1.0] range.
The default presentation might provide too much detail. It is therefore possible
to change the `plot_type` or `value_representation`. For example,
the above data could be presented as violin plots:
``` r
plots <- familiar::plot_shap_summary(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
plot_type = "violinplot"
)
```
The violin plot shows the distribution of values more clearly, at the cost
of losing information concerning underlying feature values.
If just variable importance is relevant, the SHAP summary plot can also be shown
as a bar plot (`plot_type = "barplot"`) showing the mean absolute SHAP value by
default (`value_representation = "abs_mean"`):
``` r
plots <- familiar::plot_shap_summary(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
value_representation = "abs_mean"
)
```
# Feature effects
Variable importance does not explain how features influence the model, although
the SHAP summary plot already provided some hints. Familiar offers multiple
methods for assessing the effect of features.
## Partial dependence and individual conditional expectation plots
Partial dependence (PD) [@Friedman2001-jf] and individual conditional
expectation (ICE) [@Goldstein2015-rv] plots can be used to visualise how
features influence the model. As an example, we can create an ICE plot for the
`max_remodeling_ratio` feature that was found to be important for predicting
ischemic stroke.
``` r
plots <- familiar::plot_ice(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
features = "max_remodeling_ratio"
)
```
The ICE plot shows multiple curves for individual samples, as well as their
average, the partial dependence plot as a somewhat thicker curve. These curves
are created by iteratively updating the value of `max_remodeling_ratio` and then
observing the predicted outcome. To generate the sample points, familiar samples
the feature value distribution, as observed in the training dataset, at regular
percentile intervals. Also note that by default, ICE plots only show a limited
number of samples to prevent clutter. This can be specified through
`n_max_samples_shown`.
The plot also shows novelty, which is a measure of how out-of-distribution the value
is. Novelty is computed using extended isolation forests [@Hariri2019-xz] and is
the average normalised height at which a sample is found in a terminal leaf node
of the trees in the isolation forest [@Liu2008-kw]. Novelty values range between
`0.00` and `1.00`. Novelty values should be interpreted with care. While increasing
novelty values represent instances that lie further from the distribution, the
values that represent in-distribution instances may vary. In the ICE plot above,
novelty values range between `0.40` and `0.50`. None of the points in ICE plot
can therefore be considered out-of-distribution. Move to a conformal prediction
framework is planned for future versions.
It is possible to anchor the curves in the ICE plot to a specific feature value.
This allows for assessing the model response for individual instances versus a
fixed value. In effect this can reduce some of the offset caused by other
(fixed) features in each instance, and may help to elucidate the observed
behaviour:
``` r
plots <- familiar::plot_ice(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
features = "max_remodeling_ratio",
anchor_values = list("max_remodeling_ratio" = 2.5)
)
```
In the plot above, we have anchored the values at `max-remodeling-ratio = 2.5`.
All curves are then drawn relative to the predicted probability found for this
particular value.
It is also possible to show the partial dependence of two features at the same
time:
``` r
plots <- familiar::plot_ice(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
features = c("max_remodeling_ratio", "coronary_artery_disease")
)
```
The partial dependence plot above shows `max_remodeling_ratio` along the x-axis
and `coronary_artery_disease` along the y-axis. Higher intensities are
associated with a higher expected probability of malignancy. The figure thus
indicates that higher values of `max_remodeling_ratio` clearly affect the
predicted probability of ischemic stroke. The presence of
`coronary_artery_disease` can be seen to decrease the probability of ischemic
stroke - probably because these patients arrived at the hospital due to this
condition, not necessarily cerebral stroke. The average novelty score of the
underlying instances determines the size of each point.
It also possible to anchor 2D partial dependence plots:
``` r
plots <- familiar::plot_ice(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
features = c("max_remodeling_ratio", "coronary_artery_disease"),
anchor_values = list(
"max_remodeling_ratio" = 2.5,
"coronary_artery_disease" = "no"
),
show_novelty = FALSE
)
```
We disabled novelty (`show_novelty=FALSE`) for plotting the 2D anchored plot.
This changes the appearance of the plot, which now consists of coloured
rectangles. The white dots show the position at which familiar samples the
feature values.
Familiar automatically determines the values at which numerical features are
sampled based on their distribution in the development dataset. Such values can
also be manually set using the `feature_x_range` and `feature_y_range`
arguments.
## SHAP dependence plots
SHAP dependence plots function similar to dependence plots. The SHAP dependence
plot below shows how the SHAP values of `max_remodeling_ratio` are related
to its feature values.
``` r
plots <- familiar::plot_shap_dependence(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
shap_feature = "max_remodeling_ratio"
)
```
The above plot shows that with increasing value of `max_remodeling_ratio`, the
marginal contribution to the prediction changes from negative (low values
decrease the probability of the ischemic stroke) to positive (high values
increase the probability of the ischemic stroke).
SHAP dependence plots can also show interaction with other features. For
example, we can show the interaction with the `coronary_artery_disease`
feature.
``` r
plots <- familiar::plot_shap_dependence(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
shap_feature = "max_remodeling_ratio",
interaction_feature = "coronary_artery_disease"
)
```
The addition of an interaction feature does not alter the shape of the plot, but
now individual values are coloured by the value of the `coronary_artery_disease`
feature for the same sample. It appears that `coronary_artery_disease` is not
strongly interacting with `max_remodeling_ratio`.
``` r
plots <- familiar::plot_shap_dependence(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
shap_feature = "max_remodeling_ratio",
interaction_feature = "max_dilation_by_area"
)
```
The `max_dilation_by_area` feature, however, is much more strongly correlated,
as shown in the SHAP dependence plot above.
## SHAP force plots
SHAP force plots show the *force* exerted by features for individual samples.
Samples are (by default) ordered by increasing predicted value, and the
positive and negative marginal contributions are shown for each sample.
``` r
plots <- familiar::plot_shap_force(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
Force plots are not highly informative. However, one can set one or more highlight
features. For example, setting `max_remodeling_ratio` as a highlight yields the
following figure:
``` r
plots <- familiar::plot_shap_force(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
highlight_feature = "max_remodeling_ratio"
)
```
## SHAP waterfall plots
SHAP waterfall plots show how SHAP values for each feature value contribute to
the predicted value for an individual sample.
``` r
plots <- familiar::plot_shap_waterfall(
object = ischemic_stroke_model,
data = ischemic_stroke_data
)
```
The above plot shows a waterfall plot for the first sample, which has a
predicted value of $0.43$, where the average predicted value in
`ischemic_stroke_data` (i.e. the hold-out test set that we initially defined) is
$0.54$. Feature values are shown on the left, and all features are ordered by
the magnitude of their corresponding SHAP value.
# Feature and sample similarity
Highly similar features can carry redundant information for a model. This can be
visualised using similarity heatmaps:
``` r
plots <- familiar::plot_feature_similarity(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
feature_similarity_metric = "spearman",
rotate_x_tick_labels = TRUE
)
```
Because we specify `feature_similarity_metric = "spearman"` the heatmap displays
Spearman's correlation coefficient between the different features (even
categorical ones). The features are also ordered by similarity, so that clusters
of similar features can be identified. Note that the dendrograms depict cluster
distance, i.e. $1-|\rho|$ for Spearman's correlation coefficient.
It is also possible to view the sample clustering. This can be used to identify
if samples are readily stratified into clusters, which may or may not correlate
to the endpoint of interest. For the ischemic stroke dataset, this yields the
following heatmap:
``` r
plots <- familiar::plot_sample_clustering(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
feature_similarity_metric = "spearman",
rotate_x_tick_labels = TRUE
)
```
The sample clustering figure shows an absence of clear structure among the
samples, nor a clear ordering of features with their observed class label
(ischemic stroke). The top dendrogram is the same as for feature clustering, as
it should be. The right dendrogram is formed after computing Gower's distance
between samples (by default), and shows clustering of samples. The heatmap
itself shows normalised values for the features, based on normalisation and
transformation parameters obtained during model development.
Normalisation options can be altered by changing the `show_normalised_data`
argument. However, in the ischemic stroke dataset features do not have the same
value ranges, so without normalisation the figure looks less interpretable:
``` r
plots <- familiar::plot_sample_clustering(
object = ischemic_stroke_model,
data = ischemic_stroke_data,
feature_similarity_metric = "spearman",
show_normalised_data = "none",
rotate_x_tick_labels = TRUE
)
```
# References