Model

BaseModel

The core of AI4Water is the Model class which builds and trains the machine learning model. This class interacts with pre-processing and post-processing modules.

The Model class uses a python dictionary to build layers of neural networks.

To build Tensorflow based models using a python dictionary, see the guide for declarative model definition for tensorflow. To build pytorch based NN models using a python dictionary, see the guide for declarative model definition for pytorch.

class ai4water._main.BaseModel(model: Optional[Union[dict, str, Callable]] = None, x_transformation: Optional[Union[str, dict, list]] = None, y_transformation: Optional[Union[str, dict, list]] = None, lr: float = 0.001, optimizer='Adam', loss: Union[str, Callable] = 'mse', quantiles=None, epochs: int = 14, min_val_loss: float = 0.0001, patience: int = 100, save_model: bool = True, monitor: Optional[Union[str, list]] = None, val_metric: Optional[str] = None, cross_validator: Optional[dict] = None, wandb_config: Optional[dict] = None, seed: int = 313, prefix: Optional[str] = None, path: Optional[str] = None, verbosity: int = 1, accept_additional_args: bool = False, **kwargs)[source]

Model class that implements logic of AI4Water.

__init__(model: Optional[Union[dict, str, Callable]] = None, x_transformation: Optional[Union[str, dict, list]] = None, y_transformation: Optional[Union[str, dict, list]] = None, lr: float = 0.001, optimizer='Adam', loss: Union[str, Callable] = 'mse', quantiles=None, epochs: int = 14, min_val_loss: float = 0.0001, patience: int = 100, save_model: bool = True, monitor: Optional[Union[str, list]] = None, val_metric: Optional[str] = None, cross_validator: Optional[dict] = None, wandb_config: Optional[dict] = None, seed: int = 313, prefix: Optional[str] = None, path: Optional[str] = None, verbosity: int = 1, accept_additional_args: bool = False, **kwargs)[source]

The Model class can take a large number of possible arguments depending upon the machine learning model/algorithm used. Not all the arguments are applicable in each case. The user must define only the relevant/applicable parameters and leave the others as they are.

Parameters
  • model

    a dictionary defining the machine learning model. If you are building a non-neural-network model then this dictionary must consist of the name of the model as key and the keyword arguments to that model as a dictionary. For example, to build a decision tree based model

    >>> model = {'DecisionTreeRegressor': {"max_depth": 3,
    ...                                    "criterion": "mae"}}
    

    The key ‘DecisionTreeRegressor’ should exactly match the name of the model from one of the underlying libraries (such as scikit-learn, XGBoost, CatBoost, or LightGBM)

    The value {“max_depth”: 3, “criterion”: “mae”} is another dictionary of keyword arguments which the model (DecisionTreeRegressor in this case) accepts. The user must refer to the documentation of the underlying library (scikit-learn for DecisionTreeRegressor) to find the complete set of keyword arguments applicable to a particular model. See examples to learn how to build machine learning models. If you are building a deep learning model using tensorflow, then the key must be ‘layers’ and the value must itself be a dictionary defining the layers of the neural network. For example, an MLP can be built as follows

    >>> model = {'layers': {
    ...             "Dense_0": {'units': 64, 'activation': 'relu'},
    ...              "Flatten": {},
    ...              "Dense_3": {'units': 1}
    ...             }}
    

    The MLP in this case consists of Dense and Flatten layers. The user can define any keyword argument which is accepted by that layer in TensorFlow. For example, the Dense layer in TensorFlow accepts units and activation keyword arguments, among others. For details on how to build neural networks using such a layered API, see examples

  • x_transformation

    type of transformation to be applied on x/input data. The transformation can be any transformation name from ai4water.utils.transformations.py. The user can specify more than one transformation. Moreover, the user can also determine which transformation is to be applied on which input feature. Default is ‘minmax’. To apply a single transformation on all the data

    >>> x_transformation = 'minmax'
    

    To apply different transformations on different input features

    >>> x_transformation = [{'method': 'minmax', 'features': ['input1', 'input2']},
    ...                {'method': 'zscore', 'features': ['input3', 'input4']}
    ...                 ]
    

    Here input1, input2, input3 and input4 are the columns in the data. For more info see ai4water.preprocessing.Transformations and ai4water.preprocessing.Transformation classes.

  • y_transformation – type of transformation to be applied on y/label/output data.
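
    For example, to apply a log transformation on all output features (the available transformation names are the same as for x_transformation; this is just an illustration):

    >>> y_transformation = 'log'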
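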

  • lr (float, default 0.001) – learning rate.

  • optimizer (str/keras.optimizers like) – the optimizer to be used for neural network training. Default is ‘Adam’

  • loss (str/callable Default is mse.) – the cost/loss function to be used for training neural networks.

  • quantiles (list Default is None) – quantiles to be used when the problem is quantile regression.

  • epochs (int Default is 14) – number of epochs to be used.

  • min_val_loss (float, default 0.0001) – minimum value of validation loss/error to be used for early stopping.

  • patience (int) – number of epochs to wait before early stopping. Set this value to None if you don’t want to use EarlyStopping.

  • save_model (bool) – whether to save the model or not. For neural networks, the model will be saved only if an improvement in training/validation loss is observed; otherwise the model is not saved.

  • monitor (str/list) – metrics to be monitored. e.g. [‘nse’, ‘pbias’]

  • val_metric (str) – performance metric to be used for validation/cross_validation. This metric will be used for hyper-parameter optimization and experiment comparison. If not defined then r2_score will be used for regression and accuracy will be used for classification.

  • cross_validator (dict) –

    selects the type of cross validation to be applied. It can be any cross validator from sklearn.model_selection. Default is None, which means validation will be done using validation_data. To use kfold cross validation,

    >>> cross_validator = {'KFold': {'n_splits': 5}}
    

  • batches (str) – either 2d or 3d.

  • wandb_config (dict) –

    Only valid if the wandb package is installed. Default value is None, which means wandb will not be utilized. For the simplest case, pass a dictionary with at least two keys, namely project and entity. Otherwise use a dictionary of all the arguments for wandb.init, wandb.log and WandbCallback. For training_data and validation_data in WandbCallback, pass True instead of providing a tuple, as shown below

    >>> wandb_config = {'entity': 'entity_name', 'project': 'project_name',
    ...                 'training_data':True, 'validation_data': True}
    

  • seed (int) – random seed for reproducibility. This can be set to None. The seed is set for the os, tf, torch and random modules simultaneously. Please note that this seed is not set for numpy because that would result in constant sampling during hyperparameter optimization. If you want to seed everything, then use the following function >>> model.seed_everything()

  • prefix (str) – prefix to be used for the folder in which the results are saved. default is None, which means within ./results/model_path

  • path (str/path like) – if not given, a new model path will not be created.

  • verbosity (int, default is 1) – determines the amount of information being printed. 0 means no print information. Can be between 0 and 3. Setting this value to 0 will also result in not showing some plots such as the loss curve or regression plot. These plots will only be saved in self.path.

  • accept_additional_args (bool, default False) – If you want to pass any additional argument, then this argument must be set to True, otherwise an error will be raised.

  • **kwargs – keyword arguments for ai4water.preprocessing.DataSet.__init__()

Note

The transformations applied on x and y data using x_transformation and y_transformation are part of the model. See transformation

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> df = busan_beach()
>>> model_ = Model(input_features=df.columns.tolist()[0:-1],
...              batch_size=16,
...              output_features=df.columns.tolist()[-1:],
...              model={'layers': {'LSTM': 64, 'Dense': 1}},
... )
>>> history = model_.fit(data=df)
>>> y, obs = model_.predict()
cross_val_score(x=None, y=None, data: Optional[Union[DataFrame, ndarray, str]] = None, scoring: Optional[Union[str, list]] = None, refit: bool = False, process_results: bool = False) list[source]

computes cross validation score

Parameters
  • x – input data

  • y – output corresponding to x.

  • data – raw unprepared data which will be given to ai4water.preprocessing.DataSet to prepare x,y from it.

  • scoring – performance metric to use for cross validation. If None, it will be taken from config[‘val_metric’]

  • refit (bool, optional (default=False)) – If True, the model will be trained on the whole training+validation data after calculating the cross validation score.

  • process_results (bool, optional) – whether to process results at each cv iteration or not

Returns

cross validation score for each of metric in scoring

Return type

list

Example

>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
>>> model = Model(model="XGBRegressor",
...               cross_validator={"KFold": {"n_splits": 5}})
>>> model.cross_val_score(data=busan_beach())

Note

Currently not working for deep learning models.

eda(data, freq: Optional[str] = None)[source]

Performs comprehensive Exploratory Data Analysis.

Parameters
  • data

  • freq – if specified, small chunks of data will be plotted instead of the whole data at once. The data will NOT be resampled. This is valid only for plot_data and box_plot. Possible values are yearly, weekly, and monthly.

Return type

an instance of the ai4water.eda.EDA class
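
Example

A minimal usage sketch, assuming the busan_beach dataset used in the other examples:

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> eda = model.eda(data=busan_beach())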

evaluate(x=None, y=None, data=None, metrics=None, **kwargs)[source]

Evaluates the performance of the model on given data. Calls the evaluate method of the underlying model. If the evaluate method is not available in the underlying model, then predict is called.

Parameters
  • x – inputs

  • y – outputs/true data corresponding to x

  • data – Raw unprepared data which will be fed to ai4water.preprocessing.DataSet to prepare x and y. If x and y are given, this argument will have no meaning.

  • metrics

    the metrics to evaluate. It can be a string indicating the metric to evaluate. It can also be a list of metrics to evaluate. Any metric name from RegressionMetrics or ClassificationMetrics can be given. It can also be the name of a group of metrics to evaluate. The following groups are available

    • minimal

    • all

    • hydro_metrics

    If this argument is given, the evaluate function of the underlying class is not called. Rather the model is evaluated manually for given metrics. Otherwise, if this argument is not given, then evaluate method of underlying model is called, if available.

  • kwargs – any keyword argument for the evaluate method of the underlying model.

Returns

If metrics is not given then this method returns whatever is returned by evaluate method of underlying model. Otherwise the model is evaluated for given metric or group of metrics and the result is returned

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model={"layers": {"Dense": 1}})
>>> model.fit(data=busan_beach())

for evaluation on test data

>>> model.evaluate(data=busan_beach())
...

evaluate on any metric from SeqMetrics library

>>> model.evaluate(data=busan_beach(), metrics='pbias')
...
... # to evaluate on custom data, the user can provide their own x and y
>>> import numpy as np
>>> new_inputs = np.random.random((10, 13))
>>> new_outputs = np.random.random((10, 1, 1))
>>> model.evaluate(new_inputs, new_outputs)

Backward compatibility: since ai4water’s Model is supposed to behave the same as Keras’ Model, the following expressions are equally valid.

>>> model.evaluate(x, y=y)
>>> model.evaluate(x=x, y=y)
evaluate_on_all_data(data, metrics=None, **kwargs)[source]

evaluates the model on all i.e. training+validation+test data.

evaluate_on_test_data(data, metrics=None, **kwargs)[source]

evaluates the model on test data.

evaluate_on_training_data(data, metrics=None, **kwargs)[source]

evaluates the model on training data.

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model={"layers": {"Dense": 1}})
>>> model.fit(data=busan_beach())
... # for evaluation on training data
>>> model.evaluate_on_training_data(data=busan_beach())
evaluate_on_validation_data(data, metrics=None, **kwargs)[source]

evaluates the model on validation data.

explain(*args, **kwargs)[source]

Calls the ai4water.postprocessing.explain.explain_model function to explain the model.
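
Example

A short sketch, assuming a tree-based model trained as in the other examples; the available explanation methods depend on the underlying model and the installed packages:

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> model.explain()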

explain_example(data, example_num: int, method='shap')[source]

explains a single example using either shap or lime

Parameters
  • data – the data to use

  • example_num – the example/sample number/index to explain

  • method – either shap or lime

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.explain_example(data=data, example_num=2)
feature_interaction(features: List[str], x: Optional[Union[ndarray, DataFrame]] = None, data=None, data_type: str = 'all', feature_names: Optional[List[str]] = None, plot_type='heatmap', num_grid_points=None, grid_types=None, percentile_ranges=None, grid_ranges=None, cust_grid_points=None, show_percentile: bool = False, show_outliers: bool = False, endpoint: bool = True, which_classes=None, ncols=2, figsize: Optional[Tuple[Union[int, float]]] = None, annotate: bool = False, annotate_counts: bool = True, show: bool = True, save_info: bool = True, annotate_colors=('black', 'white'), annotate_color_threshold: Optional[float] = None, annotate_fmt: Optional[str] = None, annotate_fontsize: int = 7) Tuple[DataFrame, Axes][source]

shows prediction distribution with respect to two input features.

Parameters
  • x – input data to the model.

  • data – raw unprepared data from which x,y pairs for training,validation and test are generated. It must only be given if x is not given.

  • data_type (str, optional (default="all")) – The kind of data to be used. It is only valid if the data argument is used. It should be one of training, validation, test or all.

  • features (list) – two features to investigate

  • feature_names (list) – feature names

  • num_grid_points (list, optional, default=None) – number of grid points for each feature

  • grid_types (list, optional, default=None) – type of grid points for each feature

  • percentile_ranges (list of tuple, optional, default=None) – percentile range to investigate for each feature

  • grid_ranges (list of tuple, optional, default=None) – value range to investigate for each feature

  • cust_grid_points (list of (Series, 1d-array, list), optional, default=None) – customized list of grid points for each feature

  • show_percentile (bool, optional, default=False) – whether to display the percentile buckets for both features

  • show_outliers (bool, optional, default=False) – whether to display the out of range buckets for both features

  • endpoint (bool, optional, default=True) – If True, stop is the last grid point; otherwise, it is not included

  • which_classes (list, optional, default=None) – which classes to plot, only use when it is a multi-class problem

  • figsize (tuple or None, optional, default=None) – size of the figure, (width, height)

  • ncols (integer, optional, default=2) – number of subplot columns, used when it is a multi-class problem

  • annotate (bool, default=False) – whether to annotate the points

  • annotate_counts (bool, default=True) – whether to annotate counts or not.

  • annotate_colors (tuple) – pair of colors

  • annotate_color_threshold (float) – threshold value for annotation

  • annotate_fmt (str) – format string for annotation.

  • annotate_fontsize (int, optional (default=7)) – fontsize for annotation

  • plot_type (str, optional (default="heatmap")) – either circles or heatmap

  • show (bool, optional (default=True)) – whether to show the plot or not

  • save_info (bool, optional, default=True) – whether to save the information as csv or not

Returns

a pandas dataframe and matplotlib Axes

Return type

tuple

Examples

>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
...
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> model.feature_interaction(
...     ['tide_cm', 'sal_psu'],
...     data=busan_beach(),
...     annotate_counts=True,
...     annotate_colors=("black", "black"),
...     annotate_fontsize=10,
...     cust_grid_points=[[-41.4, -20.0, 0.0, 20.0, 42.0],
...                       [33.45, 33.7, 33.9, 34.05, 34.4]],
... )
fit(x=None, y=None, data: Union[ndarray, DataFrame, DataSet, str] = 'training', callbacks: Optional[Union[list, dict]] = None, **kwargs)[source]

Trains the model with data. The data is either passed as x (and y) or prepared from data by feeding it to DataSet.

Parameters
  • x – The input data consisting of input features. It can also be tf.Dataset or TorchDataset.

  • y – Correct labels/observations/true data corresponding to ‘x’.

  • data – Raw data from which x,y pairs are prepared. This will be passed to ai4water.preprocessing.DataSet. It can also be an instance of ai4water.preprocessing.DataSet or ai4water.preprocessing.DataSetPipeline. It can also be the name of a dataset from ai4water.datasets.all_datasets

  • callbacks

    Any callback compatible with keras. If you want to log the output to tensorboard, then just use callbacks={‘tensorboard’:{}} or to provide additional arguments

    >>> callbacks={'tensorboard': {'histogram_freq': 1}}
    

  • kwargs – Any keyword argument for the fit method of the underlying library. If ‘x’ is present in kwargs, it will take precedence over data.

Returns

A keras history object in case of deep learning model with tensorflow as backend or anything returned by fit method of underlying model.

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())

using your own data for training

>>> import numpy as np
>>> new_inputs = np.random.random((100, 10))
>>> new_outputs = np.random.random(100)
>>> model.fit(x=new_inputs, y=new_outputs)
classmethod from_config(config: dict, make_new_path: bool = False, **kwargs)[source]

Loads the model from config dictionary i.e. model.config

Parameters
  • config (dict) – dictionary containing model’s parameters i.e. model.config

  • make_new_path (bool, optional) – whether to make new path or not?

  • **kwargs – any additional keyword arguments to Model class.

Return type

an instance of ai4water.Model

Example

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> old_model = Model(model="XGBRegressor")
>>> old_model.fit(data=data)
... # now construct a new model instance from config dictionary
>>> model = Model.from_config(old_model.config)
>>> import numpy as np
>>> x = np.random.random((100, 14))
>>> prediction = model.predict(x=x)
classmethod from_config_file(config_path: str, make_new_path: bool = False, **kwargs) BaseModel[source]

Loads the model from a config file.

Parameters
  • config_path – complete path of config file

  • make_new_path (bool, optional) – If True, it means we want to use the config file only to build the model, and a new path will be made. We would not normally update the weights in such a case.

  • **kwargs – any additional keyword arguments for the ai4water.Model

Return type

an instance of ai4water.Model class

Example

>>> from ai4water import Model
>>> config_file_path = "../file/to/config.json"
>>> model = Model.from_config_file(config_file_path)
>>> import numpy as np
>>> x = np.random.random((100, 14))
>>> prediction = model.predict(x=x)
interpret(**kwargs)[source]

Interprets the underlying model. Call it after training.

Returns

An instance of ai4water.postprocessing.interpret.Interpret class

Example

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model=...)
>>> model.fit(data=busan_beach())
>>> model.interpret()
optimize_hyperparameters(data: Union[tuple, list, DataFrame, ndarray], algorithm: str = 'bayes', num_iterations: int = 14, process_results: bool = True, update_config: bool = True, **kwargs)[source]

optimizes the hyperparameters of the built model

The parameters that need to be optimized must be given as space, e.g. Integer, Real or Categorical from ai4water.hyperopt, as shown in the examples below.

Parameters
  • data

    It can be one of following

    • raw unprepared data in the form of a numpy array or pandas dataframe

    • a tuple of x,y pairs

    If it is unprepared data, it is passed to ai4water.preprocessing.DataSet, which prepares x,y pairs from it. The DataSet class also splits the data into training, validation and test sets. If it is a tuple of x,y pairs, it is split into training and validation. In both cases, the loss on the validation set is used as the objective function. The loss is calculated using val_metric.

  • algorithm – the algorithm to use for optimization

  • num_iterations – number of iterations for optimization.

  • process_results – whether to perform postprocessing of optimization results or not

  • update_config – whether to update the config of model or not.

Returns

an instance of ai4water.hyperopt.HyperOpt which is used for optimization

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.hyperopt import Integer, Categorical, Real
>>> model_config = {"XGBRegressor": {"n_estimators": Integer(low=10, high=20),
...                 "max_depth": Categorical([10, 20, 30]),
...                 "learning_rate": Real(0.00001, 0.1)}}
>>> model = Model(model=model_config)
>>> optimizer = model.optimize_hyperparameters(data=busan_beach())

The same can be done if the model is defined using neural networks

>>> model_config = {"layers": {
...     "Input": {"input_shape": (15, 13)},
...     "LSTM":  {"config": {"units": Integer(32, 64), "activation": "relu"}},
...      "Dense1": {"units": 1,
...            "activation": Categorical(["relu", "tanh"], name="dense1_act")}}}
>>> model = Model(model=model_config)
>>> optimizer = model.optimize_hyperparameters(data=busan_beach())
optimize_transformations(data: Union[ndarray, DataFrame], transformations: Optional[Union[str, list]] = None, include: Optional[Union[str, dict, list]] = None, exclude: Optional[Union[str, list]] = None, append: Optional[dict] = None, y_transformations: Optional[Union[list, dict]] = None, algorithm: str = 'bayes', num_iterations: int = 12, process_results: bool = True, update_config: bool = True)[source]

optimizes the transformations for the input/output features

The val_metric parameter given as input to the Model is used as the objective function for the optimization problem.

Parameters
  • data

    It can be one of following

    • raw unprepared data in the form of a numpy array or pandas dataframe

    • a tuple of x,y pairs

    If it is unprepared data, it is passed to ai4water.preprocessing.DataSet, which prepares x,y pairs from it. The DataSet class also splits the data into training, validation and test sets. If it is a tuple of x,y pairs, it is split into training and validation. In both cases, the loss on the validation set is used as the objective function. The loss is calculated using val_metric.

  • transformations

    the transformations to consider for the input features. By default, the following transformations are considered for input features

    • minmax rescale from 0 to 1

    • center center the data by subtracting mean from it

    • scale scale the data by dividing it with its standard deviation

    • zscore first performs centering and then scaling

    • box-cox

    • yeo-johnson

    • quantile

    • robust

    • log

    • log2

    • log10

    • sqrt square root

  • include – list, dict, str, optional. The name/names of input features to include. If you don’t want to include any feature, set this to an empty list.

  • exclude – the name/names of input features to exclude

  • append

    the input features with custom candidate transformations. For example, if we want to try only minmax and zscore on the feature tide_cm, then it can be done as follows

    >>> append={"tide_cm": ["minmax", "zscore"]}
    

  • y_transformations

    It can either be a list of transformations to be considered for output features for example

    >>> y_transformations = ['log', 'log10', 'log2', 'sqrt']
    

    would mean that log, log10, log2 and sqrt are to be considered as output transformations during optimization. It can also be a dictionary whose keys are names of output features and whose values are lists of transformations to be considered for those output features. For example

    >>> y_transformations = {'output1': ['log2', 'log10'], 'output2': ['log', 'sqrt']}
    

    Default is None, which means do not optimize transformation for output features.

  • algorithm – str The algorithm to use for optimizing transformations

  • num_iterations – int The number of iterations for the optimization algorithm.

  • process_results – whether to perform postprocessing of optimization results or not

  • update_config – whether to update the config of model or not.

Returns

an instance of the ai4water.hyperopt.HyperOpt class which is used for optimization

Example

>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
>>> model = Model(model="XGBRegressor")
>>> optimizer_ = model.optimize_transformations(data=busan_beach(), exclude="tide_cm")
>>> print(optimizer_.best_paras())  # find the best/optimized transformations
>>> model.fit(data=busan_beach())
>>> model.predict()
partial_dependence_plot(x=None, data=None, data_type='all', feature_name=None, num_points=100)[source]

Shows partial dependence plot for a feature.

Parameters
  • x – the input data to use. If not given, then data must be given.

  • data – raw unprepared data from which x,y pairs are to be made. If given, x must not be given.

  • data_type (str) – the kind of the data to be used. It is only valid when data is given.

  • feature_name (str/list) – name/names of features. If only one feature is given, 1 dimensional partial dependence plot is plotted. You can also provide a list of two feature names, in which case 2d interaction plot will be plotted.

  • num_points (int) – number of points. It is used to define grid.

Return type

an instance of ai4water.postprocessing.PartialDependencePlot

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.partial_dependence_plot(x=data.iloc[:, 0:-1], feature_name="tide_cm")
...
>>> model.partial_dependence_plot(data=data, feature_name="tide_cm")
permutation_importance(data=None, data_type: str = 'test', x=None, y=None, scoring: Union[str, Callable] = 'r2', n_repeats: int = 5, noise: Optional[Union[str, ndarray]] = None, use_noise_only: bool = False, weights=None, plot_type: Optional[str] = None)[source]

Calculates the permutation importance on the given data

Parameters
  • data – Raw unprepared data from which x,y pairs of training and test data are prepared.

  • data_type (str) – one of training, test or validation. By default test data is used based upon recommendations of Christoph Molnar’s book. Only valid if data argument is given.

  • x – inputs for the model. alternative to data

  • y – target/observation data for the model. alternative to data

  • scoring – the scoring to use to calculate importance

  • n_repeats – number of times the permutation for each feature is performed.

  • noise – the noise to add when a feature is permuted. It can be a 1D array of length equal to len(data) or a string defining the distribution

  • use_noise_only – If True, then the feature being perturbed is replaced by the noise instead of adding the noise into the feature. This argument is only valid if noise is not None.

  • weights

  • plot_type – if not None, it must be either heatmap or boxplot or bar_chart

Return type

an instance of ai4water.postprocessing.PermutationImportance

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> perm_imp = model.permutation_importance(data=busan_beach(),
...  data_type="validation", plot_type="boxplot")
>>> perm_imp.importances
predict(x=None, y=None, data: Union[str, DataFrame, ndarray, DataSet] = 'test', process_results: bool = True, metrics: str = 'minimal', return_true: bool = False, plots: Optional[Union[str, list]] = None, **kwargs)[source]

Makes prediction from the trained model.

Parameters
  • x – The data on which to make predictions. If given, it will override data. It can also be tf.Dataset or TorchDataset

  • y – Used for post-processing etc. If given, it will override data

  • data – It can also be unprepared/raw data which will be given to ai4water.preprocessing.DataSet to prepare x,y values.

  • process_results – bool, whether to post-process the results or not

  • metrics – str, only valid if process_results is True. The metrics to calculate. Valid values are minimal, all, hydro_metrics

  • return_true – bool, whether to return the true values along with predicted values or not. Default is False, so that this method behaves like sklearn's predict method.

  • plots – optional (default=None) The kind of plots to draw. Only valid if process_results is True

  • kwargs – any keyword argument for predict method.

Returns

A numpy array of predicted values. If return_true is True then a tuple of arrays, where the first is true and the second is predicted. If x is given but y is not given, then the first returned array is None.

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> pred = model.predict(data=busan_beach())

get true values

>>> true, pred = model.predict(data=busan_beach(), return_true=True)

postprocessing of results

>>> pred = model.predict(data=busan_beach(), process_results=True)

calculate all metrics during postprocessing

>>> pred = model.predict(data=busan_beach(), process_results=True, metrics="all")

using your own data

>>> import numpy as np
>>> new_input = np.random.random((10, 14))
>>> pred = model.predict(x=new_input)
predict_log_proba(x=None, data='test', **kwargs)[source]

Since preprocessing is part of the Model, a trained model with sklearn/xgboost/catboost/lgbm as backend must also be able to apply preprocessing on inputs before calling predict_log_proba from the underlying library. Currently it just calls the predict_log_proba function of the underlying library after first transforming x.

predict_on_all_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]

It makes prediction on training+validation+test data.

Parameters
  • data – raw, unprepared data from which x,y pairs will be generated.

  • process_results (bool, optional) – whether to post-process the results or not

  • return_true (bool, optional) – If True, the returned value will be a tuple; the first array is true and the second is predicted

  • metrics (str, optional) – the metrics to calculate during post-processing

  • plots (optional (default=None)) – The kind of plots to draw. Only valid if process_results is True

  • **kwargs – any keyword argument for the .predict method.

predict_on_test_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]

makes prediction on test data.

Parameters
  • data – raw, unprepared data from which the test data (x,y pairs) will be generated.

  • process_results (bool, optional) – whether to post-process the results or not

  • return_true (bool, optional) – If True, the returned value will be a tuple; the first array is true and the second is predicted

  • metrics (str, optional) – the metrics to calculate during post-processing

  • plots (optional (default=None)) – The kind of plots to draw. Only valid if process_results is True

  • **kwargs – any keyword argument for the .predict method.
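
Example

The usage follows the same pattern as predict; this is a minimal sketch using the busan_beach dataset:

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> pred = model.predict_on_test_data(data=busan_beach())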

predict_on_training_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]

makes prediction on training data.

Parameters
  • data – raw, unprepared data from which the training data (x,y pairs) will be generated.

  • process_results (bool, optional) – whether to post-process the results or not

  • return_true (bool, optional) – If True, the returned value will be a tuple; the first array is true and the second is predicted

  • metrics (str, optional) – the metrics to calculate during post-processing

  • plots (optional (default=None)) – The kind of plots to draw. Only valid if process_results is True

  • **kwargs – any keyword argument for the .predict method.

predict_on_validation_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]

makes prediction on validation data.

Parameters
  • data – raw, unprepared data from which the validation data (x,y pairs) will be generated.

  • process_results (bool, optional) – whether to post-process the results or not

  • return_true (bool, optional) – If True, the returned value will be a tuple; the first array is true and the second is predicted

  • metrics (str, optional) – the metrics to calculate during post-processing

  • plots (optional (default=None)) – The kind of plots to draw. Only valid if process_results is True

  • **kwargs – any keyword argument for the .predict method.

predict_proba(x=None, data='test', **kwargs)[source]

Since preprocessing is part of the Model, a trained model with sklearn/xgboost/catboost/lgbm as backend must also be able to apply preprocessing on inputs before calling predict_proba from the underlying library. Currently it just calls the predict_proba function of the underlying library after first transforming x.
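
Example

A minimal sketch, assuming a classification problem; the inputs and class labels here are random placeholders for the user's own data:

>>> import numpy as np
>>> from ai4water import Model
>>> x = np.random.random((100, 5))
>>> y = np.random.randint(0, 2, 100)
>>> model = Model(model="RandomForestClassifier")
>>> model.fit(x=x, y=y)
>>> proba = model.predict_proba(x=x)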

prediction_distribution(feature: Union[str, list], feature_name: Optional[str] = None, x: Optional[Union[ndarray, DataFrame]] = None, data=None, data_type: str = 'test', num_grid_points=10, grid_type='percentile', percentile_range=None, grid_range=None, cust_grid_points=None, show_percentile: bool = False, show_outliers: bool = False, endpoint: bool = True, figsize: Optional[tuple] = None, ncols: int = 2, save_info: bool = True, show: bool = True, plot_params: Optional[dict] = None) Tuple[Figure, Axes, DataFrame][source]

plots the distribution of predictions from the model against an input feature.

Parameters
  • feature (str or list) – the name of input feature against which the distribution is to be plotted. for one-hot encoding features, this must be a list

  • feature_name (str) – only useful when feature is list i.e. it is a one-hot encoded feature

  • x – input data to the model.

  • data – raw unprepared data from which x,y pairs for training,validation and test are generated. It must only be given if x is not given.

  • data_type (str, optional (default="test")) – The kind of data to be used. It is only valid if data argument is used. It should be one of training, validation, test or all.

  • num_grid_points (integer, optional, default=10) – number of grid points for numeric feature

  • grid_type (string, optional, default='percentile') – ‘percentile’ or ‘equal’ type of grid points for numeric feature

  • percentile_range (tuple or None, optional, default=None) – percentile range to investigate for numeric feature when grid_type=’percentile’

  • grid_range (tuple or None, optional, default=None) – value range to investigate for numeric feature when grid_type=’equal’

  • cust_grid_points (Series, 1d-array, list or None, optional, default=None) – customized list of grid points for numeric feature

  • show_percentile (bool, optional, default=False) – whether to display the percentile buckets for numeric feature when grid_type=’percentile’

  • show_outliers (bool, optional, default=False) – whether to display the out of range buckets for numeric feature when percentile_range or grid_range is not None

  • endpoint (bool, optional, default=True) – If True, stop is the last grid point Otherwise, it is not included

  • figsize (tuple or None, optional, default=None) – size of the figure, (width, height)

  • ncols (integer, optional, default=2) – number of subplot columns, used when it is a multi-class problem

  • save_info (bool, optional, default=True) – whether to save the information as csv or not

  • show (bool, optional, default=True) – whether to show the plot or not

  • plot_params (dict or None, optional, default=None) – parameters for the plot

Return type

a tuple of plt.Figure, plt.Axes and pd.DataFrame

Examples

>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
...
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> model.prediction_distribution(feature="tide_cm",
... data=busan_beach(), show_percentile=True)
score(x=None, y=None, data='test', **kwargs)[source]

Since preprocessing is part of the Model, a trained model with sklearn as backend must also be able to apply preprocessing on inputs before calculating the score from sklearn. Currently it just calls the score function of sklearn after first transforming x and y.
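
Example

A minimal sketch, assuming an sklearn-backed regressor trained as in the other examples:

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> model.score(data=busan_beach())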

seed_everything(seed=None) None[source]

resets the seeds of numpy, os, random, tensorflow, and torch. If any of these modules is not available, the seed for that module is not set.

sensitivity_analysis(data=None, bounds=None, sampler='morris', analyzer: Union[str, list] = 'sobol', sampler_kwds: Optional[dict] = None, analyzer_kwds: Optional[dict] = None, save_plots: bool = True, names: Optional[List[str]] = None) dict[source]

performs sensitivity analysis of the model w.r.t input features in data.

The model and its hyperparameters remain fixed while the input data is changed.

Parameters
  • data – data which will be used to get the bounds/limits of input features. If given, it must be a 2d numpy array. It should be remembered that the given data is not used during sensitivity analysis; instead, new synthetic data is prepared on which the sensitivity analysis is performed.

  • bounds (list) – alternative to data

  • sampler (str, optional) – any sampler from SALib library. For example morris, fast_sampler, ff, finite_diff, latin, saltelli, sobol_sequence

  • analyzer (str, optional) – any analyzer from the SALib library. For example sobol, dgsm, fast, ff, hdmr, morris, pawn, rbd_fast. You can also choose more than one analyzer. This is useful when you want to compare the results of more than one analyzer. It should be noted that having more than one analyzer does not increase computation time, except for the hdmr and delta analyzers, which are computationally heavy. For example >>> analyzer = [“morris”, “sobol”, “rbd_fast”]

  • sampler_kwds (dict) – keyword arguments for sampler

  • analyzer_kwds (dict) – keyword arguments for analyzer

  • save_plots (bool, optional) –

  • names (list, optional) – names of input features. If not given, the model's input feature names will be used.

Returns

a dictionary whose keys are the names of analyzers and whose values are the sensitivity results for that analyzer.

Return type

dict

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> df = busan_beach()
>>> input_features=df.columns.tolist()[0:-1]
>>> output_features = df.columns.tolist()[-1:]
... # build the model
>>> model = Model(model="RandomForestRegressor",
...     input_features=input_features,
...     output_features=output_features)
... # train the model
>>> model.fit(data=df)
... # perform sensitivity analysis
>>> si = model.sensitivity_analysis(data=df[input_features].values,
...                    sampler="morris", analyzer=["morris", "sobol"],
...                        sampler_kwds={'N': 100})
shap_values(data, layer=None) ndarray[source]

returns shap values

Parameters
  • data – raw unprepared data from which training and test data are extracted.

  • layer

Examples

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.shap_values(data=data)
update_weights(weight_file: Optional[str] = None)[source]

Updates the weights of the underlying model.

Parameters

weight_file (str, optional) – complete path of weight file. If not given, the weights are updated from model.w_path directory. For neural network based models, the best weights are updated if more than one weight file is present in model.w_path.

Return type

None

view(layer_name: Optional[Union[str, list]] = None, data: str = 'training', x=None, y=None, examples_to_view=None, show=False)[source]

shows all activations, weights and gradients of the model.

Parameters
  • layer_name – the layer to view. If not given, all the layers will be viewed. This argument is only required when the model consists of layers of neural networks.

  • data – the data to use when making calls to the model for activation calculation or for gradient calculation. It can be either training, validation or test.

  • x – input, alternative to data. If given it will override data argument.

  • y – target/observed/label, alternative to data. If given it will override data argument.

  • examples_to_view – the examples to view.

  • show – whether to show the plot or not

Returns

An instance of the ai4water.postprocessing.visualize.Visualize class.
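
Example

A minimal sketch for a neural network based model that has already been trained; the layer name is an assumption and must exist in the model:

>>> model.view(layer_name="LSTM", data="training")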

Model subclassing

Model subclassing is different from the functional API in the way the model (neural network) is constructed. To understand the difference between the model-subclassing API and the functional API see Model subclassing vs functional API

This class inherits from BaseModel. It is a subclass of keras.Model/torch.nn.Module depending upon the backend used. For scikit-learn/xgboost/catboost type models, this class only inherits from BaseModel. For deep learning/neural network based models, this class directly exposes all the functionalities of the underlying Model. Thus self is now a keras Model or torch.nn.Module. If the user wishes to create his/her own NN architecture, he/she should overwrite the initialize_layers and call/forward methods, as sketched below.
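
The sketch below illustrates this pattern with the tensorflow backend. The layer choices are purely illustrative, and the exact signatures of initialize_layers and call should be checked against the source:

>>> import tensorflow as tf
>>> from ai4water import Model
...
>>> class MyModel(Model):
...     def initialize_layers(self, layers_config: dict, inputs=None):
...         # define layers as attributes; they are used later in call()
...         self.lstm = tf.keras.layers.LSTM(32)
...         self.out = tf.keras.layers.Dense(1)
...     def call(self, inputs, *args, **kwargs):
...         # forward pass using the layers defined in initialize_layers
...         return self.out(self.lstm(inputs))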

ai4water.main.Model.__init__(self, verbosity=1, model=None, path=None, prefix=None, **kwargs)

Initializes the layers of the NN model using the initialize_layers method. All other input arguments go to BaseModel.

ai4water.main.Model.fit_pytorch(self, x, **kwargs)

Trains the pytorch model.

ai4water.main.Model.forward(self, *inputs: Any, **kwargs: Any)

implements the forward pass for pytorch based NN models.

ai4water.main.Model.initialize_layers(self, layers_config: dict, inputs=None)

Initializes the layers/weights/variables which are to be used in forward or call method.

Parameters
  • layers_config (dict) – python dictionary to define the neural network. For details see https://ai4water.readthedocs.io/en/latest/build_dl_models.html

  • inputs – if None, it will be assumed that the Input layer either exists in layers_config or an Input layer will be created within this method before adding any other layer. If not None, then it must be an Input layer and the remaining NN architecture will be built as defined in layers_config. This can be handy when we want to use this method several times to build a complex or parallel NN structure. Avoid Input in layer names.

Model for functional API

class ai4water.functional.Model(*args, **kwargs)[source]

Model class with Functional API and inherits from BaseModel.

For ML/non-neural-network based models, there is no difference between the functional and subclassing APIs. For DL/NN-based models, this class implements the functional API and differs from the subclassing API in the internal implementation of the NN. This class is useful if you want to use the functional API of keras to build your own NN structure. In such a case you can construct your NN structure by overwriting add_layers. Another advantage of this class is that sometimes model subclassing is not possible, for example due to some bugs in tensorflow. In such a case this class can be used. Otherwise all the features of ai4water are available in this class as well.

Example

>>> from ai4water.functional import Model

__init__(*args, **kwargs)[source]

Initializes and builds the NN/ML model.

add_layers(layers_config: dict, inputs=None)[source]

Builds the NN from dictionary.

Parameters
  • layers_config

    a dictionary whose keys are layer names and whose values are dictionaries which can contain the following keys:

    config: dict/lambda. Every layer must contain its initializing arguments as a config dictionary. The config dictionary for every layer can contain a name key and its value must be of str type. If the name key is not provided in the config, the layer key will be used as its name, e.g. in the following case

    layers = {‘LSTM’: {‘config’: {‘units’: 16}}}

    the name of the LSTM layer will be LSTM, while in the following case

    layers = {‘LSTM’: {‘config’: {‘units’: 16, ‘name’: ‘MyLSTM’}}}

    the name of the lstm will be MyLSTM.

    inputs: str/list. The calling arguments for the layer. If the inputs key is missing for a layer, it will be assumed that either this is an Input layer or it uses the previous outputs as inputs.

    outputs: str/list. We can specify the outputs from a layer by using the outputs key. The value of outputs must be a string or list of strings specifying the names of the outputs from the current layer, which can be used later in the model.

    call_args: str/list. We can also specify additional call arguments with the call_args key. The value of call_args must be a string or a list of strings.

  • inputs – if None, it will be assumed that the Input layer either exists in layers_config or an Input layer will be created within this method before adding any other layer. If not None, then it must be an Input layer and the remaining NN architecture will be built as defined in layers_config. This can be handy when we want to use this method several times to build a complex or parallel NN structure. Avoid Input in layer names.

Returns

inputs and outputs of the network built from layers_config
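
Example

A hedged sketch of a layers_config dictionary that uses the config, inputs and outputs keys described above to wire the layers explicitly; the shapes and names are illustrative only:

>>> layers_config = {
...     "Input": {"config": {"shape": (15, 13), "name": "my_inputs"}},
...     "LSTM": {"config": {"units": 32, "name": "my_lstm"},
...              "inputs": "my_inputs",
...              "outputs": "lstm_out"},
...     "Dense": {"config": {"units": 1}, "inputs": "lstm_out"}
... }
>>> model = Model(model={"layers": layers_config})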

classmethod from_config(config: dict, make_new_path: bool = False, **kwargs)

Loads the model from config dictionary i.e. model.config

Parameters
  • config (dict) – dictionary containing model’s parameters i.e. model.config

  • make_new_path (bool, optional) – whether to make new path or not?

  • **kwargs – any additional keyword arguments to Model class.

Return type

an instance of ai4water.Model

Example

>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> old_model = Model(model="XGBRegressor")
>>> old_model.fit(data=data)
... # now construct a new model instance from config dictionary
>>> model = Model.from_config(old_model.config)
>>> import numpy as np
>>> x = np.random.random((100, 14))
>>> prediction = model.predict(x=x)

Pytorch Learner

This module can be used to train models which are built outside AI4Water’s Model class. Thus, this module does not do any pre-processing of data, model building, or post-processing of results.

This module is inspired by fastai’s Learner and keras’s Model class.

ai4water.models.torch.Learner
