Jun 7, 2023

Approximating Shapley Values for Machine Learning

The how and why of Shapley value approximation, explained in code

In a previous post, I explained the theory behind Shapley values. I also explained that calculating Shapley values exactly for real-world machine learning use cases is typically computationally infeasible, which is why methods that approximate them are used in practice.

In this article, we will explore a simple approach for approximating Shapley values. This sets the foundation for discussing the foremost technique for estimating Shapley values: SHAP.

Approximating Shapley Values

As explained previously, the Shapley values of a model can be computed exactly as a weighted sum of each feature's marginal contributions. This involves retraining the model \( 2^F \) times, where \( F \) is the number of features. These retrained models cover every possible combination of features, which in game theory terms is the power set of feature coalitions.
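To make the size of the power set concrete, here is a minimal sketch that enumerates every feature coalition with `itertools`. The helper name `feature_coalitions` is my own and purely illustrative.

```python
from itertools import combinations


def feature_coalitions(features):
    """Return every subset of `features`, from the empty coalition to the full set."""
    return [
        subset
        for size in range(len(features) + 1)
        for subset in combinations(features, size)
    ]


features = ["% working class", "number of rooms", "NOX concentration"]
coalitions = feature_coalitions(features)
print(len(coalitions))  # 2 ** 3 = 8 coalitions
```

With three features this is manageable, but the count doubles with every added feature: ten features already give 1,024 coalitions.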

Training so many models is usually prohibitive, so instead we must find ways to approximate this process. In this section, we explore a naive approach for approximating Shapley values that avoids model retraining.

Reintroducing the dataset

We'll explore this idea using the same simplified version of the Boston housing dataset as used in the How Shapley Values Work post. This dataset contains the prices of 506 houses, accompanied by three predictive features (Table 1).

Variable Name	Description
% working class	Percentage of the population that is working class.
number of rooms	The average number of rooms per house in the housing unit.
NOX concentration	Nitric oxides concentration (parts per 10 million).

Table 1. The model input variables used to predict house prices.

Shapley values will enable us to understand the house price predictions of a machine learning model trained on these three features.

import numpy as np
import pandas as pd

features = ["% working class", "number of rooms", "NOX concentration"]
df = pd.read_csv("data.csv")
y = df["y"].values
print(f"{len(y)} rows")
print(df[features + ["y"]].sample(5, random_state=0))

# returns:
# 506 rows
#      % working class  number of rooms  NOX concentration        y
# 329            14.68            6.333              0.460  22600.0
# 371            19.06            6.216              0.631  50000.0
# 219            21.00            6.373              0.550  23000.0
# 403            39.54            5.349              0.693   8300.0
# 78             24.68            6.232              0.437  21200.0

Training the Model

Whereas before we had to train \( 2^3 = 8 \) models, this time, we only need to train the model we're ultimately interested in: the one that uses all three features.

from sklearn.ensemble import RandomForestRegressor

X = df[features]
model = RandomForestRegressor(random_state=0).fit(X, y)
y_pred = model.predict(X)

Generating marginalised predictions

Our naive approximation approach is to use this single model for the entire power set, but for each feature that's missing in a given feature coalition, we randomly replace that feature's value with another value from the dataset. In SHAP, "removing" features using methods such as this is referred to as masking.

The function below shows what this looks like for a model generating a prediction for a single row in a pandas DataFrame.

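def marginalised_prediction(row, m=model, X=X, missing=()):
    """Generate a prediction for `row` using model `m`, replacing
    features in `missing` by sampling randomly from `X`."""
    instance = row.copy()
    for feature in missing:
        # overwrite the "missing" feature with a value drawn at random from the dataset
        instance[feature] = np.random.choice(X[feature])
    # predict on a one-row DataFrame (keeps the feature names) and return a scalar
    return m.predict(instance.to_frame().T)[0]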

Replacing the value of a feature with another value sampled at random from the dataset is referred to as sampling from the feature's marginal distribution. This can result in unrealistic combinations of feature values. More sophisticated methods sample from the conditional distribution instead, which keeps the sampled feature combinations realistic. But we'll make do with the marginal distribution for now.
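The following sketch illustrates why marginal sampling can produce unrealistic combinations. It uses synthetic data rather than the housing dataset: the `floor_area` feature and all of the numbers are hypothetical, chosen only so that two features are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features.
n_rows = 1000
rooms = rng.normal(6.0, 0.7, size=n_rows)
floor_area = 20.0 * rooms + rng.normal(0.0, 2.0, size=n_rows)

corr_original = np.corrcoef(rooms, floor_area)[0, 1]

# Marginal sampling: redraw `floor_area` for every row, ignoring `rooms`.
floor_area_marginal = rng.choice(floor_area, size=n_rows)
corr_marginal = np.corrcoef(rooms, floor_area_marginal)[0, 1]

print(f"correlation before masking: {corr_original:.2f}")
print(f"correlation after masking : {corr_marginal:.2f}")
```

After masking, the relationship between the two features largely disappears, so the model may be asked to score pairings (a small house with a very large floor area, say) that it never saw during training.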

Generating a prediction for a coalition that has only sampled each missing feature once will give highly variable results. To obtain consistent predictions, we average this process over a large number of samples, as follows:

def approximate_prediction(row, m=model, X=X, missing=(), n=100):
    """Average the results returned by `marginalised_prediction()`
    over `n` predictions."""
    predictions = []
    for _ in range(n):
        # arguments must follow the order (row, m, X, missing)
        predictions.append(marginalised_prediction(row, m, X, missing))
    return np.mean(predictions)
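To see why averaging helps, here is a small self-contained sketch of the same idea on a toy problem. The function `f` stands in for a model and `pool` for a feature's observed values. Both are assumptions made for illustration and are not part of the housing example.

```python
import numpy as np


def mc_estimate(f, pool, n, rng):
    """Average `f` over `n` values drawn at random from `pool`."""
    return f(rng.choice(pool, size=n)).mean()


def spread(f, pool, n, rng, repeats=200):
    """Standard deviation of the estimate across repeated runs."""
    return np.std([mc_estimate(f, pool, n, rng) for _ in range(repeats)])


rng = np.random.default_rng(0)
pool = rng.normal(size=5000)   # hypothetical feature values


def f(x):
    return x ** 2              # toy stand-in for a model


spread_small = spread(f, pool, n=10, rng=rng)
spread_large = spread(f, pool, n=1000, rng=rng)
print(f"spread with n=10  : {spread_small:.3f}")
print(f"spread with n=1000: {spread_large:.3f}")
```

The spread of the estimate shrinks roughly with the square root of `n`, so a larger `n` buys stability at the cost of more model calls.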

We can now compute the marginal contributions of each feature in an analogous manner to how we did when we computed them exactly via model retraining. We use the approximate_prediction() function to generate marginalised predictions for every instance in the dataset, across all eight feature coalitions. This corresponds to one coalition with zero features (for which every feature is sampled at random, so the predictions act as a baseline that plays the role of the average house price in the dataset), three coalitions with one feature, three coalitions with two features, and one coalition with all three features (i.e. the model we're interested in).

predictions = {}

# predictions with no features
predictions["none"] = X.apply(
    lambda row: approximate_prediction(
        row,
        missing=features
    ),
    axis=1
)

# predictions with one feature
for feat in features:
    predictions[feat] = X.apply(
        lambda row: approximate_prediction(
            row,
            missing=[c for c in features if c != feat]
        ),
        axis=1
    )

# predictions with two features
for i, feat1 in enumerate(features):
    for feat2 in features[i+1:]:
        predictions[f"{feat1}, {feat2}"] = X.apply(
            lambda row: approximate_prediction(
                row,
                missing=[c for c in features if c not in [feat1, feat2]]
            ),
            axis=1
        )

# predictions with all features
predictions["all"] = model.predict(X)

Approximating the features' Shapley values

Finally, we approximate the Shapley values of each feature using the same weighted averages as before. To understand this calculation, see The Mechanics of Shapley Values.

sv_pwc = 1/3 * (predictions["% working class"] -
                predictions["none"]) +\
         1/6 * (predictions["% working class, number of rooms"] -
                predictions["number of rooms"]) +\
         1/6 * (predictions["% working class, NOX concentration"] -
                predictions["NOX concentration"]) +\
         1/3 * (predictions["all"] -
                predictions["number of rooms, NOX concentration"])

sv_nor = 1/3 * (predictions["number of rooms"] -
                predictions["none"]) +\
         1/6 * (predictions["% working class, number of rooms"] -
                predictions["% working class"]) +\
         1/6 * (predictions["number of rooms, NOX concentration"] -
                predictions["NOX concentration"]) +\
         1/3 * (predictions["all"] -
                predictions["% working class, NOX concentration"])

sv_nc  = 1/3 * (predictions["NOX concentration"] -
                predictions["none"]) +\
         1/6 * (predictions["% working class, NOX concentration"] -
                predictions["% working class"]) +\
         1/6 * (predictions["number of rooms, NOX concentration"] -
                predictions["number of rooms"]) +\
         1/3 * (predictions["all"] -
                predictions["% working class, number of rooms"])
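The 1/3 and 1/6 weights above are specific to three features. As a sketch of how they generalise, the weight for a marginal contribution made to a coalition of size \( |S| \) is \( |S|!\,(F - |S| - 1)!\,/\,F! \). The function names and the toy payoffs below are hypothetical, and coalitions are keyed by `frozenset` rather than by the string keys used in the `predictions` dict.

```python
from itertools import combinations
from math import factorial


def shapley_weight(coalition_size, n_features):
    """Weight of a marginal contribution to a coalition of the given size."""
    return (
        factorial(coalition_size)
        * factorial(n_features - coalition_size - 1)
        / factorial(n_features)
    )


def shapley_values(features, value):
    """Compute Shapley values from `value`, a dict mapping each
    frozenset coalition of `features` to its payoff."""
    n = len(features)
    result = {}
    for feat in features:
        others = [f for f in features if f != feat]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                # marginal contribution of `feat` when it joins coalition `s`
                total += shapley_weight(size, n) * (value[s | {feat}] - value[s])
        result[feat] = total
    return result


# Toy additive game: each player contributes a fixed amount.
players = ["a", "b", "c"]
amounts = {"a": 1.0, "b": 2.0, "c": 3.0}
value = {
    frozenset(subset): sum(amounts[p] for p in subset)
    for size in range(len(players) + 1)
    for subset in combinations(players, size)
}
sv = shapley_values(players, value)
print(sv)
```

For three features, `shapley_weight` gives 1/3 for coalition sizes 0 and 2 and 1/6 for size 1, matching the hand-written sums. In an additive game each player's Shapley value is simply its own contribution, which makes the toy example easy to check.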

We can now use our approximated Shapley values to quantify the average impact of each feature on the model's house price predictions.

print("Mean absolute Shapley values:")
print(f"% working class  : {np.abs(sv_pwc).mean():,.1f}")
print(f"number of rooms  : {np.abs(sv_nor).mean():,.1f}")
print(f"NOX concentration: {np.abs(sv_nc).mean():,.1f}")

# Returns:
# Mean absolute Shapley values:
# % working class  : 4,051.3
# number of rooms  : 2,695.0
# NOX concentration: 1,289.9

Is this actually better than what we had before?

We've successfully estimated Shapley values without having to do any model retraining, but the astute reader has probably realised that we've just replaced one computationally intensive process with another. In fact, the above code, which prioritises clarity over efficiency, is much slower than the exact calculation of Shapley values demonstrated previously.
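Much of that cost comes from calling the model once per sample. One possible speed-up, sketched below under my own assumptions, is to build all `n` masked copies of a row first and score them in a single call. The function name, the NumPy-array interface and the linear `predict_fn` are hypothetical. With a fitted scikit-learn model, something like `model.predict` could be passed as `predict_fn` instead.

```python
import numpy as np


def approximate_prediction_batched(x, predict_fn, background, missing_idx,
                                   n=100, rng=None):
    """Average `predict_fn` over `n` copies of `x` in which the columns
    listed in `missing_idx` are resampled from `background`."""
    rng = np.random.default_rng() if rng is None else rng
    # n identical copies of the instance, one per Monte Carlo sample
    samples = np.tile(np.asarray(x, dtype=float), (n, 1))
    for j in missing_idx:
        # marginal sampling: draw column j independently of the other columns
        samples[:, j] = rng.choice(background[:, j], size=n)
    # a single vectorised call instead of n separate ones
    return predict_fn(samples).mean()


# Toy setup: a linear "model" over three features (an assumption for illustration).
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3))
weights = np.array([1.0, -2.0, 0.5])


def predict_fn(A):
    return A @ weights


x = np.array([1.0, 2.0, 3.0])
full = approximate_prediction_batched(x, predict_fn, background, [], rng=rng)
masked = approximate_prediction_batched(x, predict_fn, background, [1],
                                        n=20000, rng=rng)
print(full, masked)
```

This only removes per-call overhead. The number of coalitions still grows as \( 2^F \), which is the part that smarter masking strategies aim to address.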

This leads us to SHAP, which encompasses various methods that leverage clever masking strategies to approximate Shapley values more efficiently than the naive approach outlined above. In follow-up posts, we'll consider SHAP's two most prominent techniques: KernelSHAP and TreeSHAP.
