explain
ShapExplainer
- class ai4water.postprocessing.explain.ShapExplainer(model, data: Union[ndarray, DataFrame, List[ndarray]], train_data: Optional[Union[ndarray, DataFrame, List[ndarray]]] = None, explainer: Optional[Union[str, Callable]] = None, num_means: int = 10, path: Optional[str] = None, feature_names: Optional[list] = None, framework: Optional[str] = None, layer: Optional[Union[int, str]] = None, save: bool = True, show: bool = True)[source]
Bases:
ExplainerMixin
Wrapper around SHAP explainers and plots to draw and save all the plots for a given model.
- features
- train_summary
only for KernelExplainer
- explainer
- shap_values
- - summary_plot
- - force_plot_single_example
- - dependence_plot_single_feature
- - force_plot_all
- Examples:
>>> from ai4water.postprocessing import ShapExplainer
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import linear_model
>>> import shap
...
>>> X, y = shap.datasets.diabetes()
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
>>> lin_regr = linear_model.LinearRegression()
>>> lin_regr.fit(X_train, y_train)
>>> explainer = ShapExplainer(lin_regr, X_test, X_train, num_means=10)
>>> explainer()
- __init__(model, data: Union[ndarray, DataFrame, List[ndarray]], train_data: Optional[Union[ndarray, DataFrame, List[ndarray]]] = None, explainer: Optional[Union[str, Callable]] = None, num_means: int = 10, path: Optional[str] = None, feature_names: Optional[list] = None, framework: Optional[str] = None, layer: Optional[Union[int, str]] = None, save: bool = True, show: bool = True)[source]
- Parameters:
model – a Model/regressor/classifier from sklearn/xgboost/catboost/LightGBM/tensorflow/pytorch/ai4water. The model must have a predict method.
data – Data on which to make interpretation. Its dimension should be same as that of training data. It can be either training or test data
train_data – The data on which the model was trained. It is used to get train_summary. It can be a numpy array or a pandas DataFrame. Only required for scikit-learn based models.
explainer – str the explainer to use. If not given, the explainer will be inferred.
num_means – int Number of means passed to shap.kmeans to calculate train_summary. Only used when explainer is “KernelExplainer”
path – str path to save the plots. By default, plots will be saved in current working directory
feature_names – list Names of features. Should only be given if train/test data is numpy array.
framework – str either “DL” or “ML”. Here “DL” shows that the model is a deep learning or neural network based model and “ML” represents other models. For “DL” the explainer will be either “DeepExplainer” or “GradientExplainer”. If not given, it will be inferred. In such a case “DeepExplainer” will be prioritized over “GradientExplainer” for DL frameworks and “TreeExplainer” will be prioritized for “ML” frameworks.
layer – Union[int, str] only relevant when framework is “DL”, i.e. when the model consists of layers of neural networks.
show – whether to show the plot or not
save – whether to save the plot or not
- allowed_explainers = ['Explainer', 'DeepExplainer', 'TreeExplainer', 'KernelExplainer', 'LinearExplainer', 'AdditiveExplainer', 'GPUTreeExplainer', 'GradientExplainer', 'PermutationExplainer', 'SamplingExplainer', 'PartitionExplainer']
- beeswarm_plot(name: str = 'beeswarm', max_display: int = 10, **kwargs)[source]
Draws the beeswarm plot of shap.
- Parameters:
name – str name of saved file
max_display – maximum number of features to display
kwargs – any keyword arguments for shap.beeswarm plot
- decision_plot(indices=None, name: str = 'decision_', **decision_kwargs)[source]
decision plot. For details see this blog.
- dependence_plot_single_feature(feature, name='dependence_plot', **kwargs)[source]
Dependence plot for a single feature.
- force_plot_all(name='force_plot.html', save=True, show=True, **force_kws)[source]
Draws the force plot for all examples in the given data and saves it in an html file.
- force_plot_single_example(idx: int, name=None, **force_kws)[source]
Draws force_plot for a single example/row/sample/instance/data point.
If the data is 3d and shap values are 3d then they are unrolled/flattened before plotting
- Parameters:
idx – index of example to use. It can be any value >=0
name – name of saved file
force_kws – any keyword argument for force plot
- Returns:
plotter object
- heatmap(name: str = 'heatmap', max_display=10)[source]
Plots the heatmap and saves it
This can be drawn for xgboost/lgbm as well as for randomforest type models. CatBoostRegressor is not supported yet.
Note
The upper line plot on the heat map shows $-fx/max(abs(fx))$ where $fx$ is the mean SHAP value of all features. The length of $fx$ is equal to length of data/examples. Thus one point on this line is the mean of SHAP values of all input features for the given/one example normalized by the maximum absolute value of $fx$.
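As a rough illustration of the note above, the line can be reproduced by hand. The sketch below uses plain Python and a small, made-up matrix of SHAP values; `shap_values` and `heatmap_line` are hypothetical names, not part of ai4water or SHAP. It follows the formula as stated in the note rather than the library's internal code.

```python
# Hypothetical SHAP values: one row per example, one column per feature.
shap_values = [
    [0.2, -0.1, 0.5],
    [-0.4, 0.1, 0.0],
    [0.3, 0.3, -0.9],
]


def heatmap_line(values):
    """Return -fx / max(abs(fx)), where fx holds the per-example mean SHAP value."""
    # fx has one entry per example (row), as described in the note.
    fx = [sum(row) / len(row) for row in values]
    # Assumes at least one non-zero mean; otherwise the scale would be zero.
    scale = max(abs(v) for v in fx)
    return [-v / scale for v in fx]


line = heatmap_line(shap_values)
print(line)
```

Because of the normalization, every point on the line lies between -1 and 1, and the example with the largest absolute mean SHAP value sits exactly at one of those limits.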
- property layer
- pdp_all_features(**pdp_kws)[source]
partial dependence plot of all features.
- Parameters:
pdp_kws – any keyword arguments
- pdp_single_feature(feature_name: str, **pdp_kws)[source]
Partial dependence plot using SHAP package for a single feature.
- plot_shap_values(interpolation=None, cmap='coolwarm', name: str = 'shap_values')[source]
Plots the SHAP values.
- Parameters:
name – name of saved file
interpolation – interpolation argument to axis.imshow
cmap – color map
- scatter_plot_all_features(name='scatter_plot', **scatter_kws)[source]
draws scatter plot for all features
- scatter_plot_single_feature(feature: int, name: str = 'scatter', **scatter_kws)[source]
scatter plot for a single feature
- summary_plot(plot_type: Optional[str] = None, name: str = 'summary_plot', **kwargs)[source]
Plots the summary plot of SHAP package.
- Parameters:
plot_type – str, either “bar”, “violin” or “dot”
name – name of saved file
kwargs – any keyword arguments to shap.summary_plot
- waterfall_plot_all_examples(name: str = 'waterfall', **waterfall_kws)[source]
Plots the waterfall plot of SHAP package
It plots for all the examples/instances from test_data.
- waterfall_plot_single_example(example_index: int, name: str = 'waterfall', max_display: int = 10)[source]
Draws and saves the waterfall plot for one example.
The waterfall plots are based upon SHAP values and show the contribution by each feature in model’s prediction. It shows which feature pushed the prediction in which direction. They answer the question, why the ML model simply did not predict mean of training y instead of what it predicted. The mean of training observations that the ML model saw during training is called base value or expected value.
- Parameters:
example_index – int index of example to use
max_display – int maximum number of features to display
name – str name of plot
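The additive reading of a waterfall plot can be checked on a toy linear model. The sketch below is an illustration under stated assumptions, not ai4water code: the model is hand-written, the base value is taken to be the mean prediction over the training rows, and the contribution of feature i is the closed form w_i * (x_i - mean_i), which is what SHAP yields for a linear model when features are treated as independent. All names are hypothetical.

```python
# Hypothetical linear model: prediction = bias + sum(w_i * x_i).
weights = [2.0, -1.0]
bias = 0.5
train_x = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]


def predict(row):
    return bias + sum(w * v for w, v in zip(weights, row))


# Base (expected) value: mean prediction over the training rows (assumption).
base_value = sum(predict(row) for row in train_x) / len(train_x)

# Column means of the training data.
feature_means = [sum(col) / len(col) for col in zip(*train_x)]

# Per-feature contributions for one example, i.e. the bars of the waterfall.
example = [4.0, 1.0]
contributions = [w * (v - m) for w, v, m in zip(weights, example, feature_means)]

# The bars start at the base value and end at the model's prediction.
reconstructed = base_value + sum(contributions)
print(base_value, contributions, reconstructed, predict(example))
```

Here the first feature pushes the prediction up by 4.0 and the second by 1.0, which together move it from the base value 2.5 to the prediction 7.5.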
LimeExplainer
- class ai4water.postprocessing.explain.LimeExplainer(model, data, train_data, mode: str, explainer=None, path=None, feature_names: Optional[list] = None, verbosity: Union[int, bool] = True, save: bool = True, show: bool = True, **kwargs)[source]
Bases:
ExplainerMixin
Wrapper around LIME module.
Example
>>> from ai4water import Model
>>> from ai4water.postprocessing import LimeExplainer
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="GradientBoostingRegressor")
>>> model.fit(data=busan_beach())
>>> lime_exp = LimeExplainer(model=model,
...                          train_data=model.training_data()[0],
...                          data=model.test_data()[0],
...                          mode="regression")
>>> lime_exp.explain_example(0)
- explaination_objects
the explanation objects for each individual example/instance
- __init__(model, data, train_data, mode: str, explainer=None, path=None, feature_names: Optional[list] = None, verbosity: Union[int, bool] = True, save: bool = True, show: bool = True, **kwargs)[source]
- Parameters:
model – the model to explain. The model must have predict method.
data – the data to explain. This would typically be test data but it can be any data.
train_data – the data on which the model was trained.
mode – either regression or classification
explainer – The explainer to use. By default, LimeTabularExplainer is used.
path – path where to save all the plots. By default, plots will be saved in current working directory.
feature_names – name/names of features.
verbosity – whether to print information or not.
show – whether to show the plot or not
save – whether to save the plot or not
- explain_all_examples(plot_type='pyplot', name='lime_explaination', num_features=None, **kwargs)[source]
Draws and saves plot for all examples of test_data.
- Parameters:
plot_type –
name –
num_features –
kwargs – any keyword argument for explain_instance
An example here means an instance/sample/data point.
- explain_example(index: int, plot_type: str = 'pyplot', name: str = 'lime_explaination', num_features: Optional[int] = None, colors=None, annotate=False, **kwargs) Figure [source]
Draws and saves plot for a single example of test_data.
- Parameters:
index – index of test_data
plot_type – either pyplot or html
name – name with which to save the file
num_features –
colors –
annotate – whether to annotate figure or not
kwargs – any keyword argument for explain_instance
- Returns:
matplotlib figure if plot_type=”pyplot” and show is False.
- property mode
PermutationImportance
- class ai4water.postprocessing.explain.PermutationImportance(model: Callable, inputs: Union[ndarray, List[ndarray]], target: ndarray, scoring: Union[str, Callable] = 'r2', n_repeats: int = 14, noise: Optional[Union[str, ndarray]] = None, cat_map: Optional[dict] = None, use_noise_only: bool = False, feature_names: Optional[list] = None, path: Optional[str] = None, seed: Optional[int] = None, weights=None, save: bool = True, show: bool = True, **kwargs)[source]
Bases:
ExplainerMixin
Permutation importance answers the question: how much is the model’s prediction performance influenced by a feature? It defines feature importance as the decrease in model performance when one feature is corrupted (Molnar et al., 2021).
- importances
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import PermutationImportance
>>> data = busan_beach()
>>> model = Model(model="XGBRegressor", verbosity=0)
>>> model.fit(data=data)
>>> x_val, y_val = model.validation_data()
... # initialize the PermutationImportance class
>>> pimp = PermutationImportance(model.predict, x_val, y_val.reshape(-1,))
>>> fig = pimp.plot_1d_pimp()
- __init__(model: Callable, inputs: Union[ndarray, List[ndarray]], target: ndarray, scoring: Union[str, Callable] = 'r2', n_repeats: int = 14, noise: Optional[Union[str, ndarray]] = None, cat_map: Optional[dict] = None, use_noise_only: bool = False, feature_names: Optional[list] = None, path: Optional[str] = None, seed: Optional[int] = None, weights=None, save: bool = True, show: bool = True, **kwargs)[source]
Initiates the class and calculates the importances.
- Parameters:
model – the trained model object which is callable e.g. if you have Keras or sklearn model then you should pass model.predict instead of model.
inputs – arrays or list of arrays which will be given as input to model
target – the true outputs or labels for corresponding inputs. It must be a 1-dimensional numpy array
scoring – the performance metric to use. It can be any metric from RegressionMetrics or ClassificationMetrics or a callable. If callable, then it must take true and predicted values as input and return a float
n_repeats – number of times the permutation for each feature is performed. Number of calls to the model will be num_features * n_repeats
noise – The noise to add in the feature. It should be either an array of noise or a string of scipy distribution name defining noise.
use_noise_only – If True, the original feature will be replaced by the noise.
weights –
feature_names – names of features
seed – random seed for reproducibility. Permutation importance is strongly affected by the random seed. Therefore, if you want to reproduce your results, set this to some integer value.
path – path to save the plots
show – whether to show the plot or not
save – whether to save the plot or not
kwargs – any additional keyword arguments for model
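To make the roles of model, scoring, n_repeats and seed concrete, here is a minimal, dependency-free sketch of the permutation idea. It is not the class's implementation: `r2_score`, `permutation_importance` and `toy_model` are hypothetical helpers written for this illustration, and the noise-related options are left out. The `r2_score` function also shows the shape a custom scoring callable is expected to have (true and predicted values in, a float out).

```python
import random


def r2_score(true, pred):
    """Scoring callable: takes true and predicted values, returns a float."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, inputs, target, scoring=r2_score,
                           n_repeats=14, seed=None):
    """Return, per feature, a list of n_repeats drops in score after shuffling."""
    rng = random.Random(seed)  # fixed seed gives reproducible shuffles
    baseline = scoring(target, predict(inputs))
    importances = []
    for j in range(len(inputs[0])):
        drops = []
        for _ in range(n_repeats):
            # Corrupt feature j by shuffling its column across the examples.
            column = [row[j] for row in inputs]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(inputs, column)]
            drops.append(baseline - scoring(target, predict(permuted)))
        importances.append(drops)
    return importances


def toy_model(rows):
    # Hypothetical model that uses only the first feature.
    return [3.0 * row[0] for row in rows]


inputs = [[float(i), float((i * 7) % 5)] for i in range(20)]
target = toy_model(inputs)
scores = permutation_importance(toy_model, inputs, target, n_repeats=5, seed=0)
mean_scores = [sum(s) / len(s) for s in scores]
print(mean_scores)
```

The second feature is ignored by the toy model, so shuffling it never changes the score and its importance is zero, while shuffling the first feature lowers the score. The inner loop calls the model num_features * n_repeats times, in addition to one baseline call.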
- property noise
- plot_1d_pimp(plot_type: str = 'boxplot', **kwargs) Axes [source]
Plots the 1d permutation importance either as box-plot or as bar_chart
- Parameters:
plot_type (str, optional) – either boxplot or barchart
**kwargs – keyword arguments either for boxplot or bar_chart
- Return type:
matplotlib AxesSubplot
PartialDependencePlot
- class ai4water.postprocessing.explain.PartialDependencePlot(model: Callable, data, feature_names=None, num_points: int = 100, path=None, save: bool = True, show: bool = True, **kwargs)[source]
Bases:
ExplainerMixin
Partial dependence plots as introduced by Friedman et al., 2001
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import PartialDependencePlot
>>> data = busan_beach()
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=data)
... # get the data to explain
>>> x, _ = model.training_data()
>>> pdp = PartialDependencePlot(model.predict, x, model.input_features,
...                             num_points=14)
- __init__(model: Callable, data, feature_names=None, num_points: int = 100, path=None, save: bool = True, show: bool = True, **kwargs)[source]
Initiates the class
- Parameters:
model (Callable) – the trained/calibrated model which must be callable. It must take the data as input and return an array of predicted values. For example, if you are using a Keras/sklearn model, then you must pass model.predict
data (np.ndarray, pd.DataFrame) – The inputs to the model. It can be a numpy array or a pandas DataFrame.
feature_names (list, optional) – Names of features. Used for labeling.
num_points (int, optional) – determines the grid for evaluation of model
path (str, optional) – path to save the plots. By default the results are saved in current directory
show – whether to show the plot or not
save – whether to save the plot or not
**kwargs – any additional keyword arguments for model
- calc_pdp_1dim(data, feature, lookback=None)[source]
calculates partial dependence for 1 dimension data
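The general idea behind a one-dimensional partial dependence calculation can be sketched in a few lines of plain Python. This is not the body of calc_pdp_1dim: the helper name, the evenly spaced grid between the feature's minimum and maximum, and the toy model are all assumptions made for illustration.

```python
def partial_dependence_1d(predict, data, feature_index, num_points=100):
    """Return (grid, pdp, ice) for one feature.

    ice[k][i] is the prediction for example i when the feature is set to
    grid[k]; pdp[k] is the average of ice[k] over all examples.
    num_points must be at least 2.
    """
    column = [row[feature_index] for row in data]
    low, high = min(column), max(column)
    step = (high - low) / (num_points - 1)
    grid = [low + k * step for k in range(num_points)]

    ice = []
    for value in grid:
        # Overwrite the feature with the grid value in every example.
        modified = [row[:feature_index] + [value] + row[feature_index + 1:]
                    for row in data]
        ice.append(predict(modified))

    pdp = [sum(preds) / len(preds) for preds in ice]
    return grid, pdp, ice


def toy_model(rows):
    # Hypothetical model: y = 2 * x0 + x1
    return [2.0 * row[0] + row[1] for row in rows]


data = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
grid, pdp, ice = partial_dependence_1d(toy_model, data, feature_index=0, num_points=3)
print(grid, pdp)
```

For this toy model the partial dependence on the first feature is the straight line 2 * x0 + 3, where 3 is the average of the second feature, and each entry of `ice` holds the per-example curves that are averaged to obtain it.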
- nd_interactions(height: int = 2, ice: bool = False, show_dist: bool = False, show_minima: bool = False) Figure [source]
Plots 2d interaction plots of all features as done in skopt
- Parameters:
height – height of each subplot in inches
ice – whether to show the ice lines or not
show_dist – whether to show the distribution of data as histogram or not
show_minima – whether to show the function minima or not
- Returns:
matplotlib Figure
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import PartialDependencePlot
>>> data = busan_beach()
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> x, _ = model.training_data()
>>> pdp = PartialDependencePlot(model.predict, x, model.input_features,
...                             num_points=14)
>>> pdp.nd_interactions(show_dist=True)
- plot_1d(feature: Union[str, List[str]], show_dist: bool = True, show_dist_as: str = 'hist', ice: bool = True, feature_expected_value: bool = False, model_expected_value: bool = False, show_ci: bool = False, show_minima: bool = False, ice_only: bool = False, ice_color: str = 'lightblue', feature_name: Optional[str] = None, pdp_line_kws: Optional[dict] = None, ice_lines_kws: Optional[dict] = None, hist_kws: Optional[dict] = None)[source]
partial dependence plot in one dimension
- Parameters:
feature – the feature name for which to plot the partial dependence. For one hot encoded categorical features, provide a list
show_dist – whether to show actual distribution of data or not
show_dist_as – one of “hist” or “grid”
ice – whether to show the individual conditional expectation (ICE) lines on the plot or not
feature_expected_value – whether to show the average value of feature on the plot or not
model_expected_value – whether to show average prediction on plot or not
show_ci – whether to show confidence interval of pdp or not
show_minima – whether to indicate the minima or not
ice_only (bool, False) – whether to show only ice plots
ice_color – color for ice lines. It can also be a valid matplotlib colormap
feature_name (str) – name of the feature. If not given, then the value of feature is used.
pdp_line_kws (dict) – any keyword argument for axes.plot when plotting the pdp line
ice_lines_kws (dict) – any keyword argument for axes.plot when plotting ice lines
hist_kws – any keyword argument for axes.hist when plotting histogram
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import PartialDependencePlot
>>> model = Model(model="XGBRegressor")
>>> data = busan_beach()
>>> model.fit(data=data)
>>> x, _ = model.training_data(data=data)
>>> pdp = PartialDependencePlot(model.predict, x, model.input_features,
...                             num_points=14)
>>> pdp.plot_1d("tide_cm")
with categorical features
>>> from ai4water.datasets import mg_photodegradation
>>> data, cat_enc, an_enc = mg_photodegradation(encoding="ohe")
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=data)
>>> x, _ = model.training_data(data=data)
>>> pdp = PartialDependencePlot(model.predict, x, model.input_features,
...                             num_points=14)
>>> feature = [f for f in model.input_features if f.startswith('Catalyst_type')]
>>> pdp.plot_1d(feature)
>>> pdp.plot_1d(feature, show_dist_as="grid")
>>> pdp.plot_1d(feature, show_dist=False)
>>> pdp.plot_1d(feature, show_dist=False, ice=False)
>>> pdp.plot_1d(feature, show_dist=False, ice=False, model_expected_value=True)
>>> pdp.plot_1d(feature, show_dist=False, ice=False, feature_expected_value=True)
- plot_interaction(features: list, lookback: Optional[int] = None, ax: Optional[Axes] = None, plot_type: str = '2d', cmap=None, colorbar: bool = True, show: bool = True, save: bool = True, **kwargs) Axes [source]
Shows interaction between two features
- Parameters:
features – a list or tuple of two feature names to use
lookback (optional) – only relevant if data is 3d
ax (optional) – matplotlib axes on which to draw. If not given, current axes will be used.
plot_type (optional) – either “2d” or “surface”
cmap (optional) – color map to use
colorbar (optional) – whether to show the colorbar or not
show (bool) – whether to show the plot or not
save (bool) – whether to save the plot or not
**kwargs – any keyword argument for axes.plot_surface or axes.contourf
- Return type:
matplotlib Axes
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import PartialDependencePlot
>>> data = busan_beach()
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> x, _ = model.training_data()
>>> pdp = PartialDependencePlot(model.predict, x, model.input_features,
...                             num_points=14)
... # specifying features whose interaction is to be calculated and plotted.
>>> axis = pdp.plot_interaction(["tide_cm", "wat_temp_c"])
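The quantity behind such an interaction plot is, in general, a two-feature partial dependence surface. The sketch below shows one way to compute it in plain Python; it is not the code used by plot_interaction, and the helper names, the evenly spaced grids and the toy model are assumptions made for illustration.

```python
def linspace(low, high, num_points):
    # Evenly spaced grid; num_points must be at least 2.
    step = (high - low) / (num_points - 1)
    return [low + k * step for k in range(num_points)]


def partial_dependence_2d(predict, data, first, second, num_points=14):
    """Return (grid_a, grid_b, surface) with surface[i][j] the mean prediction
    when feature `first` is set to grid_a[i] and `second` to grid_b[j]."""
    grid_a = linspace(min(r[first] for r in data),
                      max(r[first] for r in data), num_points)
    grid_b = linspace(min(r[second] for r in data),
                      max(r[second] for r in data), num_points)
    surface = []
    for a in grid_a:
        row_means = []
        for b in grid_b:
            modified = []
            for row in data:
                new_row = list(row)
                new_row[first] = a
                new_row[second] = b
                modified.append(new_row)
            preds = predict(modified)
            row_means.append(sum(preds) / len(preds))
        surface.append(row_means)
    return grid_a, grid_b, surface


def toy_model(rows):
    # Hypothetical model with an interaction term: y = x0 * x1 + x2
    return [r[0] * r[1] + r[2] for r in rows]


data = [[0.0, 0.0, 1.0], [2.0, 4.0, 3.0]]
grid_a, grid_b, surface = partial_dependence_2d(toy_model, data, 0, 1, num_points=2)
print(surface)
```

The surface only rises in the corner where both features are large, which is the signature of an interaction. Note that the model is called num_points squared times, so smaller grids keep the cost manageable.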
explain_model
Explains the ai4water’s Model class.
- param model:
the AI4Water’s model to explain
- param features_to_explain:
the input features to explain. It must be a string or a list of strings where a string is a feature name.
- param examples_to_explain:
the examples to explain. If integer, it will be the number/index of example to explain. If float, it will be fraction of values to explain. If list/array, it will be index of examples to explain. The examples with the highest variance in prediction are chosen.
- param explainer:
the explainer to use. If None, it will be inferred based upon the model type.
- param layer:
layer to explain. Only relevant if the model consists of layers of neural networks. If integer, it will be the number of the layer to explain. If string, it will be the name of the layer to explain.
- param method:
either ‘both’, ‘shap’ or ‘lime’. If both, then the model will be explained using both lime and shap methods.
- returns:
if `method`==both, it will return a tuple of LimeExplainer and ShapExplainer otherwise it will return the instance of either LimeExplainer or ShapExplainer.
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import explain_model
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> explain_model(model, total_data=busan_beach())
explain_model_with_lime
Explains the model with LimeExplainer
- param data_to_explain:
the data to explain
- param train_data:
the data used for training.
- param total_data:
total data from which training and test data will be extracted. This is only required if data_to_explain/train data is not given.
- param model:
the AI4Water’s model to explain
- param examples_to_explain:
the examples to explain
- type examples_to_explain:
Union[int, float, list]
- rtype:
an instance of LimeExplainer
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import explain_model_with_lime
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> explain_model_with_lime(model, total_data=busan_beach())
explain_model_with_shap
Explains the model which is built by AI4Water’s Model class using SHAP.
- param model:
the model to explain.
- param data_to_explain:
the data to explain. If given, then train_data must be given as well. If not given, then total_data must be given.
- param train_data:
the data on which model was trained. If not given, then total_data must be given.
- param total_data:
raw unpreprocessed data from which train and test data will be extracted. The explanation will be done on test data. This is only required if data_to_explain and train_data are not given.
- param features_to_explain:
the features to explain.
- type features_to_explain:
Optional[Union[str, list]]
- param examples_to_explain:
the examples to explain. If integer, it will be the number of examples to explain. If float, it will be fraction of values to explain. If list/array, it will be index of examples to explain. The examples with the highest variance in prediction are chosen.
- type examples_to_explain:
Union[int, float, list]
- param explainer:
the explainer to use
- param layer:
layer to explain.
- type layer:
Optional[Union[int, str]]
- param plot_name:
name of plot to draw
- rtype:
an instance of ShapExplainer
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.postprocessing.explain import explain_model_with_shap
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> explain_model_with_shap(model, total_data=busan_beach())