Model
BaseModel
The core of AI4Water is the Model class which builds and trains the machine learning model. This class interacts with pre-processing and post-processing modules.
The Model class uses a python dictionary to build layers of neural networks.
To build TensorFlow-based models using a Python dictionary, see the guide for declarative model definition for TensorFlow. To build PyTorch-based neural network models using a Python dictionary, see the guide for declarative model definition for PyTorch.
- class ai4water._main.BaseModel(model: Optional[Union[dict, str, Callable]] = None, x_transformation: Optional[Union[str, dict, list]] = None, y_transformation: Optional[Union[str, dict, list]] = None, lr: float = 0.001, optimizer='Adam', loss: Union[str, Callable] = 'mse', quantiles=None, epochs: int = 14, min_val_loss: float = 0.0001, patience: int = 100, save_model: bool = True, monitor: Optional[Union[str, list]] = None, val_metric: Optional[str] = None, cross_validator: Optional[dict] = None, wandb_config: Optional[dict] = None, seed: int = 313, prefix: Optional[str] = None, path: Optional[str] = None, verbosity: int = 1, accept_additional_args: bool = False, **kwargs)[source]
Model class that implements logic of AI4Water.
- __init__(model: Optional[Union[dict, str, Callable]] = None, x_transformation: Optional[Union[str, dict, list]] = None, y_transformation: Optional[Union[str, dict, list]] = None, lr: float = 0.001, optimizer='Adam', loss: Union[str, Callable] = 'mse', quantiles=None, epochs: int = 14, min_val_loss: float = 0.0001, patience: int = 100, save_model: bool = True, monitor: Optional[Union[str, list]] = None, val_metric: Optional[str] = None, cross_validator: Optional[dict] = None, wandb_config: Optional[dict] = None, seed: int = 313, prefix: Optional[str] = None, path: Optional[str] = None, verbosity: int = 1, accept_additional_args: bool = False, **kwargs)[source]
The Model class can take a large number of possible arguments depending upon the machine learning model/algorithm used. Not all the arguments are applicable in each case. The user needs to define only the relevant/applicable parameters and leave the others as they are.
- Parameters:
model –
a dictionary defining the machine learning model. If you are building a non-neural-network model, then this dictionary must consist of the name of the model as key and the keyword arguments to that model as a dictionary. For example, to build a decision tree based model
>>> model = {'DecisionTreeRegressor': {"max_depth": 3,
...                                    "criterion": "mae"}}
The key ‘DecisionTreeRegressor’ should exactly match the name of the model from one of the following libraries
The value {“max_depth”: 3, “criterion”: “mae”} is another dictionary which can hold any keyword argument that the model (DecisionTreeRegressor in this case) accepts. The user must refer to the documentation of the underlying library (scikit-learn for DecisionTreeRegressor) to find the complete list of keyword arguments applicable to a particular model. See examples to learn how to build machine learning models.
If you are building a deep learning model using TensorFlow, then the key must be ‘layers’ and the value must itself be a dictionary defining the layers of the neural network. For example, we can build an MLP as follows
>>> model = {'layers': {
...     "Dense_0": {'units': 64, 'activation': 'relu'},
...     "Flatten": {},
...     "Dense_3": {'units': 1}
... }}
The MLP in this case consists of dense and flatten layers. The user can define any keyword argument which is accepted by that layer in TensorFlow. For example, the Dense layer in TensorFlow accepts the units and activation keyword arguments, among others. For details on how to build neural networks using such a layered API, see examples
x_transformation –
type of transformation to be applied on x/input data. The transformation can be any transformation name from ai4water.preprocessing.transformations.Transformation. The user can specify more than one transformation. Moreover, the user can also determine which transformation to be applied on which input feature. Default is ‘minmax’. To apply a single transformation on all the data
>>> x_transformation = 'minmax'
To apply different transformations on different input and output features
>>> x_transformation = [{'method': 'minmax', 'features': ['input1', 'input2']},
...                     {'method': 'zscore', 'features': ['input3', 'input4']}
...                     ]
Here input1, input2, input3 and input4 are the columns in the data. For more info see the ai4water.preprocessing.Transformations and ai4water.preprocessing.Transformation classes.
y_transformation – type of transformation to be applied on y/label/output data.
lr (float, default 0.001) – learning rate.
optimizer (str/keras.optimizers like) – the optimizer to be used for neural network training. Default is ‘Adam’
loss (str/callable Default is mse.) – the cost/loss function to be used for training neural networks.
quantiles (list Default is None) – quantiles to be used when the problem is quantile regression.
epochs (int Default is 14) – number of epochs to be used.
min_val_loss (float Default is 0.0001.) – minimum value of validation loss/error to be used for early stopping.
patience (int) – number of epochs to wait before early stopping. Set this value to None if you don’t want to use EarlyStopping.
save_model (bool) – whether to save the model or not. For neural networks, the model will be saved only if an improvement in training/validation loss is observed. Otherwise the model is not saved.
monitor (str/list) – metrics to be monitored. e.g. [‘nse’, ‘pbias’]
val_metric (str) – performance metric to be used for validation/cross_validation. This metric will be used for hyper-parameter optimization and experiment comparison. If not defined, then r2_score will be used for regression and accuracy will be used for classification.
cross_validator (dict) –
selects the type of cross validation to be applied. It can be any cross validator from sklearn.model_selection. Default is None, which means validation will be done using validation_data. To use kfold cross validation,
>>> cross_validator = {'KFold': {'n_splits': 5}}
batches (str) – either 2d or 3d.
wandb_config (dict) –
Only valid if the wandb package is installed. Default value is None, which means wandb will not be utilized. For the simplest case, pass a dictionary with at least two keys, namely project and entity. Otherwise use a dictionary of all the arguments for wandb.init, wandb.log and WandbCallback. For training_data and validation_data in WandbCallback, pass True instead of providing a tuple, as shown below
>>> wandb_config = {'entity': 'entity_name', 'project': 'project_name',
...                 'training_data': True, 'validation_data': True}
seed (int) – random seed for reproducibility. This can be set to None. The seed is set for the os, tf, torch and random modules simultaneously. Please note that this seed is not set for numpy because that would result in constant sampling during hyperparameter optimization. If you want to seed everything, then use the following function
>>> model.seed_everything()
prefix (str) – prefix to be used for the folder in which the results are saved. default is None, which means within ./results/model_path
path (str/path like) – if not given, new model_path path will not be created.
verbosity (int default is 1) – determines the amount of information being printed. 0 means no information is printed. Can be between 0 and 3. Setting this value to 0 will also result in some plots, such as the loss curve or regression plot, not being shown. These plots will only be saved in self.path.
accept_additional_args (bool Default is False) – If you want to pass any additional argument, then this argument must be set to True, otherwise an error will be raised.
**kwargs – keyword arguments for ai4water.preprocessing.DataSet.__init__()
Note
The transformations applied on x and y data using x_transformation and y_transformation are part of the model. See transformation
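As a rough illustration of how a per-feature specification such as the x_transformation list above can be interpreted, the following minimal sketch applies minmax and zscore to named columns using only the standard library. The helper names (minmax, zscore, METHODS, apply_transformations) are hypothetical and are not part of ai4water, and the use of the population standard deviation for zscore is an assumption.

```python
from statistics import mean, pstdev


def minmax(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def zscore(values):
    """Center on the mean, then divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical registry mapping a method name to its implementation.
METHODS = {"minmax": minmax, "zscore": zscore}


def apply_transformations(columns, config):
    """Apply a transformation config to a dict of {feature name: list of floats}.

    `config` is either a single method name (applied to every column) or a
    list of {'method': ..., 'features': [...]} dictionaries.
    """
    if isinstance(config, str):
        config = [{"method": config, "features": list(columns)}]
    out = dict(columns)  # columns not named in the config pass through unchanged
    for entry in config:
        func = METHODS[entry["method"]]
        for name in entry["features"]:
            out[name] = func(columns[name])
    return out


data = {"input1": [0.0, 5.0, 10.0], "input3": [1.0, 2.0, 3.0]}
config = [{"method": "minmax", "features": ["input1"]},
          {"method": "zscore", "features": ["input3"]}]
transformed = apply_transformations(data, config)
```

Because the transformations belong to the model, a real implementation would also need to keep the fitted statistics (minimum, maximum, mean, standard deviation) so that the same scaling can be applied to new inputs and inverted on predictions; the sketch omits that step.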
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> df = busan_beach()
>>> # layer names must be unique dictionary keys, hence the numeric suffixes
>>> ann = Model(input_features=df.columns.tolist()[0:-1],
...             batch_size=16,
...             output_features=df.columns.tolist()[-1:],
...             model={'layers': {'Dense_0': {'units': 64},
...                               'Dense_1': {'units': 1}}},
...             )
>>> history = ann.fit(data=df)
>>> y = ann.predict()
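The single-key dictionary convention described for the model argument (model name as key, keyword arguments as value) can be sketched as follows. This is only an illustration of the idea under stated assumptions, not how ai4water resolves names internally: ToyTreeRegressor, REGISTRY and build_from_dict are hypothetical stand-ins. The last line also shows a plain Python fact that matters when writing layer dictionaries: repeated keys silently collapse to the last value.

```python
class ToyTreeRegressor:
    """Hypothetical stand-in for an estimator class from an external library."""

    def __init__(self, max_depth=None, criterion="mse"):
        self.max_depth = max_depth
        self.criterion = criterion


# Hypothetical table of exact model names; lookups are case-sensitive.
REGISTRY = {"ToyTreeRegressor": ToyTreeRegressor}


def build_from_dict(model):
    """Resolve {'Name': {kwargs}} or a bare 'Name' string into an instance."""
    if isinstance(model, str):
        name, kwargs = model, {}
    else:
        if len(model) != 1:
            raise ValueError("expected exactly one key: the model name")
        (name, kwargs), = model.items()
    if name not in REGISTRY:
        raise KeyError(f"{name!r} does not exactly match a registered model name")
    return REGISTRY[name](**kwargs)


estimator = build_from_dict({"ToyTreeRegressor": {"max_depth": 3}})

# A dict literal with a repeated key keeps only the last value, which is why
# layer names need distinct keys such as "Dense_0" and "Dense_1".
layers = {"Dense": 64, "Dense": 1}
```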
- all_data(x=None, y=None, data=None) tuple [source]
Returns all data, i.e. training+validation+test, after extracting them from data.
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="XGBRegressor")
>>> train_x, train_y = model.training_data(data=data)
>>> print(train_x.shape, train_y.shape)
>>> val_x, val_y = model.validation_data(data=data)
>>> print(val_x.shape, val_y.shape)
>>> # all_data returns the training, validation and test data together
>>> all_x, all_y = model.all_data(data=data)
>>> print(all_x.shape, all_y.shape)
- cross_val_score(x=None, y=None, data: Optional[Union[DataFrame, ndarray, str]] = None, scoring: Optional[Union[str, list]] = None, refit: bool = False, process_results: bool = False) list [source]
computes cross validation score
- Parameters:
x – input data
y – output corresponding to x.
data – raw unprepared data which will be given to ai4water.preprocessing.DataSet to prepare x,y from it.
scoring – performance metric to use for cross validation. If None, it will be taken from config[‘val_metric’]
refit (bool, optional (default=False)) – If True, the model will be trained on the whole training+validation data after calculating cross validation score.
process_results (bool, optional) – whether to process results at each cv iteration or not
- Returns:
cross validation score for each metric in scoring
- Return type:
list
Example
>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
>>> model = Model(model="RandomForestRegressor",
...               cross_validator={"KFold": {"n_splits": 5}})
>>> model.cross_val_score(data=busan_beach())
Note
Currently not working for deep learning models.
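To make the idea of a K-fold cross validation score concrete, here is a minimal standard-library sketch. It assumes contiguous, unshuffled folds and a toy mean-predicting model; kfold_indices, cross_val_scores, fit_mean and mse are hypothetical helpers, not ai4water or scikit-learn functions.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_scores(x, y, n_splits, fit, score):
    """Fit on each training fold and return one score per held-out fold."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(x), n_splits):
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [model(x[i]) for i in test_idx]
        scores.append(score([y[i] for i in test_idx], preds))
    return scores


def fit_mean(x_train, y_train):
    """Toy 'model' that always predicts the mean of the training targets."""
    mu = sum(y_train) / len(y_train)
    return lambda _x: mu


def mse(true, pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


x = list(range(10))
y = [2.0 * v for v in x]
scores = cross_val_scores(x, y, n_splits=5, fit=fit_mean, score=mse)
```

Each sample appears in exactly one held-out fold, so the list holds one score per split.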
- eda(data, freq: Optional[str] = None)[source]
Performs comprehensive Exploratory Data Analysis.
- Parameters:
data –
freq – if specified, small chunks of data will be plotted instead of the whole data at once. The data will NOT be resampled. This is valid only for plot_data and box_plot. Possible values are yearly, weekly, and monthly.
- Return type:
an instance of the EDA class (ai4water.eda.EDA)
- evaluate(x=None, y=None, data=None, metrics=None, **kwargs)[source]
Evaluates the performance of the model on a given data. Calls the evaluate method of the underlying model. If the evaluate method is not available in the underlying model, then predict is called.
- Parameters:
x – inputs
y – outputs/true data corresponding to x
data – Raw unprepared data which will be fed to ai4water.preprocessing.DataSet to prepare x and y. If x and y are given, this argument will have no meaning.
metrics –
the metrics to evaluate. It can be a string indicating the metric to evaluate. It can also be a list of metrics to evaluate. Any metric name from RegressionMetrics or ClassificationMetrics can be given. It can also be the name of a group of metrics to evaluate. The following groups are available
minimal
all
hydro_metrics
If this argument is given, the evaluate function of the underlying class is not called. Rather the model is evaluated manually for given metrics. Otherwise, if this argument is not given, then evaluate method of underlying model is called, if available.
kwargs – any keyword argument for the evaluate method of the underlying model.
- Returns:
If metrics is not given then this method returns whatever is returned by evaluate method of underlying model. Otherwise the model is evaluated for given metric or group of metrics and the result is returned
Examples
>>> import numpy as np
>>> from ai4water import Model
>>> from ai4water.models import MLP
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model=MLP(),
...               input_features=data.columns.tolist()[0:-1],
...               output_features=data.columns.tolist()[-1:])
>>> model.fit(data=data)
for evaluation on test data
>>> model.evaluate(data=data)
evaluate on any metric from SeqMetrics library
>>> model.evaluate(data=data, metrics='pbias')
>>> # to evaluate on custom data, the user can provide their own x and y
>>> new_inputs = np.random.random((10, 13))
>>> new_outputs = np.random.random((10, 1, 1))
>>> model.evaluate(new_inputs, new_outputs)
Backward compatibility: since ai4water’s Model is supposed to behave the same as Keras’ Model, the following expressions are equally valid.
>>> model.evaluate(x, y=y)
>>> model.evaluate(x=x, y=y)
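For readers unfamiliar with the metric names used above, the following sketch computes two of them, nse and pbias, by hand and dispatches on a metric name or a list of names. It is an illustration only: the function names are hypothetical, the sign convention chosen for pbias (predicted minus true) is an assumption because conventions differ between references, and returning a float for one name and a dictionary for several is likewise assumed here.

```python
def nse(true, pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / squared deviations of true from its mean."""
    mean_true = sum(true) / len(true)
    sse = sum((t - p) ** 2 for t, p in zip(true, pred))
    sst = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - sse / sst


def pbias(true, pred):
    """Percent bias, here 100 * sum(pred - true) / sum(true) (assumed sign convention)."""
    return 100.0 * sum(p - t for t, p in zip(true, pred)) / sum(true)


# Hypothetical lookup table from metric name to function.
METRICS = {"nse": nse, "pbias": pbias}


def evaluate_metrics(true, pred, metrics):
    """Return a float for a single metric name, or a dict for a list of names."""
    if isinstance(metrics, str):
        return METRICS[metrics](true, pred)
    return {name: METRICS[name](true, pred) for name in metrics}


true = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 2.2, 3.3, 4.4]  # every prediction is 10 % too high

single = evaluate_metrics(true, pred, "pbias")
several = evaluate_metrics(true, pred, ["nse", "pbias"])
```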
- evaluate_on_all_data(data, metrics=None, **kwargs)[source]
evaluates the model on all data, i.e. training+validation+test data.
Examples
>>> from ai4water import Model
>>> from ai4water.models import MLP
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model=MLP(),
...               input_features=data.columns.tolist()[0:-1],
...               output_features=data.columns.tolist()[-1:])
>>> model.fit(data=data)
>>> # for evaluation on all data
>>> print(model.evaluate_on_all_data(data=data))
>>> print(model.evaluate_on_all_data(data=data, metrics='pbias'))
- evaluate_on_test_data(data, metrics=None, **kwargs)[source]
evaluates the model on test data.
- Parameters:
data – Raw unprepared data which will be fed to ai4water.preprocessing.DataSet to prepare x and y. If x and y are given, this argument will have no meaning.
metrics –
the metrics to evaluate. It can be a string indicating the metric to evaluate. It can also be a list of metrics to evaluate. Any metric name from RegressionMetrics or ClassificationMetrics can be given. It can also be the name of a group of metrics to evaluate. The following groups are available
minimal
all
hydro_metrics
If this argument is given, the evaluate function of the underlying class is not called. Rather the model is evaluated manually for given metrics. Otherwise, if this argument is not given, then evaluate method of underlying model is called, if available.
kwargs – any keyword argument for the evaluate method of the underlying model.
- Returns:
If metrics is not given then this method returns whatever is returned
by evaluate method of underlying model. Otherwise the model is evaluated
for given metric or group of metrics and the result is returned as float
or dictionary
Examples
>>> from ai4water import Model
>>> from ai4water.models import MLP
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model=MLP(),
...               input_features=data.columns.tolist()[0:-1],
...               output_features=data.columns.tolist()[-1:])
>>> model.fit(data=data)
>>> # for evaluation on test data
>>> model.evaluate_on_test_data(data=data)
>>> model.evaluate_on_test_data(data=data, metrics='pbias')
- evaluate_on_training_data(data, metrics=None, **kwargs)[source]
evaluates the model on training data.
- Parameters:
data – Raw unprepared data which will be fed to ai4water.preprocessing.DataSet to prepare x and y. If x and y are given, this argument will have no meaning.
metrics –
the metrics to evaluate. It can be a string indicating the metric to evaluate. It can also be a list of metrics to evaluate. Any metric name from RegressionMetrics or ClassificationMetrics can be given. It can also be the name of a group of metrics to evaluate. The following groups are available
minimal
all
hydro_metrics
If this argument is given, the evaluate function of the underlying class is not called. Rather the model is evaluated manually for given metrics. Otherwise, if this argument is not given, then evaluate method of underlying model is called, if available.
kwargs – any keyword argument for the evaluate method of the underlying model.
- Returns:
If metrics is not given then this method returns whatever is returned
by evaluate method of underlying model. Otherwise the model is evaluated
for given metric or group of metrics and the result is returned as float
or dictionary
Examples
>>> from ai4water import Model
>>> from ai4water.models import MLP
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model=MLP(),
...               input_features=data.columns.tolist()[0:-1],
...               output_features=data.columns.tolist()[-1:])
>>> model.fit(data=data)
>>> # for evaluation on training data
>>> model.evaluate_on_training_data(data=data)
>>> model.evaluate_on_training_data(data=data, metrics='pbias')
- evaluate_on_validation_data(data, metrics=None, **kwargs)[source]
evaluates the model on validation data.
- Parameters:
data – Raw unprepared data which will be fed to ai4water.preprocessing.DataSet to prepare x and y. If x and y are given, this argument will have no meaning.
metrics –
the metrics to evaluate. It can be a string indicating the metric to evaluate. It can also be a list of metrics to evaluate. Any metric name from RegressionMetrics or ClassificationMetrics can be given. It can also be the name of a group of metrics to evaluate. The following groups are available
minimal
all
hydro_metrics
If this argument is given, the evaluate function of the underlying class is not called. Rather the model is evaluated manually for given metrics. Otherwise, if this argument is not given, then evaluate method of underlying model is called, if available.
kwargs – any keyword argument for the evaluate method of the underlying model.
- Returns:
If metrics is not given then this method returns whatever is returned
by evaluate method of underlying model. Otherwise the model is evaluated
for given metric or group of metrics and the result is returned as float
or dictionary
Examples
>>> from ai4water import Model
>>> from ai4water.models import MLP
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model=MLP(),
...               input_features=data.columns.tolist()[0:-1],
...               output_features=data.columns.tolist()[-1:])
>>> model.fit(data=data)
>>> # for evaluation on validation data
>>> model.evaluate_on_validation_data(data=data)
>>> model.evaluate_on_validation_data(data=data, metrics='pbias')
- explain(*args, **kwargs)[source]
Calls the ai4water.postprocessing.explain.explain_model function to explain the model.
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.explain(total_data=data, examples_to_explain=2)
- explain_example(data, example_num: int, method='shap')[source]
explains a single example using either shap or lime
- Parameters:
data – the data to use
example_num – the example/sample number/index to explain
method – either shap or lime
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.explain_example(data=data, example_num=2)
- fit(x=None, y=None, data: Union[ndarray, DataFrame, DataSet, str] = 'training', callbacks: Optional[Union[list, dict]] = None, **kwargs)[source]
Trains the model with data. The data is either x or it is taken from data by feeding it to DataSet.
- Parameters:
x – The input data consisting of input features. It can also be tf.Dataset or TorchDataset.
y – Correct labels/observations/true data corresponding to ‘x’.
data – Raw data from which x, y pairs are prepared. This will be passed to ai4water.preprocessing.DataSet. It can also be an instance of ai4water.preprocessing.DataSet or ai4water.preprocessing.DataSetPipeline. It can also be the name of a dataset from ai4water.datasets.all_datasets
callbacks –
Any callback compatible with keras. If you want to log the output to tensorboard, then just use callbacks={'tensorboard': {}} or, to provide additional arguments,
>>> callbacks={'tensorboard': {'histogram_freq': 1}}
kwargs – Any keyword argument for the fit method of the underlying library. If ‘x’ is present in kwargs, that will take precedence over data.
- Returns:
A keras history object in case of deep learning model with tensorflow as backend or anything returned by fit method of underlying model.
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
using your own data for training
>>> import numpy as np
>>> new_inputs = np.random.random((100, 10))
>>> new_outputs = np.random.random(100)
>>> model.fit(x=new_inputs, y=new_outputs)
- fit_on_all_training_data(x=None, y=None, data=None, **kwargs)[source]
This function trains the model on training + validation data.
- Parameters:
x – x data which is supposed to consist of training and validation data. If not given, then data must be given.
y – label/target data corresponding to x data.
data –
raw data which will be passed to ai4water.preprocessing.DataSet to get training and validation x,y pairs.
The x data from training and validation is concatenated. Similarly, y data from training and validation is concatenated
**kwargs – any keyword arguments for the fit method.
- classmethod from_config(config: dict, make_new_path: bool = False, **kwargs) BaseModel [source]
Loads the model from config dictionary i.e. model.config
- Parameters:
- Return type:
an instance of ai4water.Model
Example
>>> from ai4water import Model
>>> import numpy as np
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> old_model = Model(model="XGBRegressor")
>>> old_model.fit(data=data)
>>> # now construct a new model instance from config dictionary
>>> model = Model.from_config(old_model.config)
>>> model.update_weights()
>>> # 13 input features, matching the other busan_beach examples
>>> x = np.random.random((100, 13))
>>> prediction = model.predict(x=x)
- classmethod from_config_file(config_path: str, make_new_path: bool = False, **kwargs) BaseModel [source]
Loads the model from a config file.
- Parameters:
config_path – complete path of config file
make_new_path (bool, optional) – If true, then it means we want to use the config file, only to build the model and a new path will be made. We would not normally update the weights in such a case.
**kwargs – any additional keyword arguments for the ai4water.Model
- Return type:
an instance of the ai4water.Model class
Example
>>> import numpy as np
>>> from ai4water import Model
>>> config_file_path = "../file/to/config.json"
>>> model = Model.from_config_file(config_file_path)
>>> x = np.random.random((100, 14))
>>> prediction = model.predict(x=x)
- interpret(**kwargs)[source]
Interprets the underlying model. Call it after training.
- Returns:
An instance of the ai4water.postprocessing.interpret.Interpret class
Example
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model=...)
>>> model.fit(data=busan_beach())
>>> model.interpret()
- optimize_hyperparameters(data: Union[tuple, list, DataFrame, ndarray], algorithm: str = 'bayes', num_iterations: int = 14, process_results: bool = True, refit: bool = True, **kwargs)[source]
optimizes the hyperparameters of the built model
The parameters that need to be optimized must be given as a search space (see the examples).
- Parameters:
data –
It can be one of the following
raw unprepared data in the form of a numpy array or pandas dataframe
a tuple of x,y pairs
If it is unprepared data, it is passed to ai4water.preprocessing.DataSet, which prepares x,y pairs from it. The DataSet class also splits the data into training, validation and test sets. If it is a tuple of x,y pairs, it is split into training and validation. In both cases, the loss on the validation set is used as the objective function. The loss is calculated using val_metric.
algorithm – str, optional (default="bayes") the algorithm to use for optimization
num_iterations – int, optional (default=14) number of iterations for optimization.
process_results – bool, optional (default=True) whether to perform postprocessing of optimization results or not
refit – bool, optional (default=True) whether to retrain the model using both training and validation data
- Returns:
an instance of ai4water.hyperopt.HyperOpt which is used for optimization
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> from ai4water.hyperopt import Integer, Categorical, Real
>>> model_config = {"XGBRegressor": {"n_estimators": Integer(low=10, high=20),
...                                  "max_depth": Categorical([10, 20, 30]),
...                                  "learning_rate": Real(0.00001, 0.1)}}
>>> model = Model(model=model_config)
>>> optimizer = model.optimize_hyperparameters(data=busan_beach())
Same can be done if a model is defined using neural networks
>>> lookback = 14
>>> model_config = {"layers": {
...     "Input": {"input_shape": (lookback, 13)},
...     "LSTM": {"config": {"units": Integer(32, 64), "activation": "relu"}},
...     "Dense": {"units": 1,
...               "activation": Categorical(["relu", "tanh"], name="dense1_act")}}}
>>> model = Model(model=model_config, ts_args={"lookback": lookback})
>>> optimizer = model.optimize_hyperparameters(data=busan_beach(),
...                                            refit=False)
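The general shape of such an optimization loop (sample a configuration from the space, compute a validation loss, keep the best) can be sketched with the standard library. Note that this sketch deliberately uses plain random search rather than the Bayesian algorithm that is the default here, and that the tuple-based space encoding, sample, random_search and toy_objective are all hypothetical and unrelated to the ai4water.hyperopt classes.

```python
import random


def sample(space, rng):
    """Draw one configuration from a space of ('int', lo, hi), ('real', lo, hi)
    or ('cat', [choices]) entries."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # both ends inclusive
        elif kind == "real":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "cat":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown parameter kind: {kind}")
    return config


def random_search(objective, space, num_iterations=14, seed=313):
    """Return the sampled configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_iterations):
        config = sample(space, rng)
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


space = {"n_estimators": ("int", 10, 20),
         "max_depth": ("cat", [10, 20, 30]),
         "learning_rate": ("real", 0.00001, 0.1)}


def toy_objective(config):
    """Stand-in for a validation loss; a real objective would train a model."""
    return abs(config["n_estimators"] - 15) + abs(config["learning_rate"] - 0.05)


best_config, best_loss = random_search(toy_objective, space)
```

A Bayesian optimizer differs in how the next configuration is proposed (it uses the losses seen so far), but the surrounding loop and the role of the validation loss as objective are the same.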
- optimize_transformations(data: Union[ndarray, DataFrame], transformations: Optional[Union[str, list]] = None, include: Optional[Union[str, dict, list]] = None, exclude: Optional[Union[str, list]] = None, append: Optional[dict] = None, y_transformations: Optional[Union[list, dict]] = None, algorithm: str = 'bayes', num_iterations: int = 12, process_results: bool = True, update_config: bool = True)[source]
optimizes the transformations for the input/output features
The ‘val_metric’ parameter given as input to the Model is used as the objective function for the optimization problem.
- Parameters:
data –
It can be one of the following
raw unprepared data in the form of a numpy array or pandas dataframe
a tuple of x,y pairs
If it is unprepared data, it is passed to ai4water.preprocessing.DataSet, which prepares x,y pairs from it. The DataSet class also splits the data into training, validation and test sets. If it is a tuple of x,y pairs, it is split into training and validation. In both cases, the loss on the validation set is used as the objective function. The loss is calculated using val_metric.
transformations –
the transformations to consider for input features. By default, the following transformations are considered for input features
minmax: rescale from 0 to 1
center: center the data by subtracting mean from it
scale: scale the data by dividing it with its standard deviation
zscore: first performs centering and then scaling
box-cox
yeo-johnson
quantile
robust
log
log2
log10
sqrt: square root
include – list, dict, str, optional the name/names of input features to include. If you don’t want to include any feature, set this to an empty list
exclude – the name/names of input features to exclude
append –
the input features with custom candidate transformations. For example if we want to try only minmax and zscore on feature tide_cm, then it can be done as following
>>> append={"tide_cm": ["minmax", "zscore"]}
y_transformations –
It can either be a list of transformations to be considered for output features for example
>>> y_transformations = ['log', 'log10', 'log2', 'sqrt']
would mean that log, log10, log2 and sqrt are to be considered for output transformations during optimization. It can also be a dictionary whose keys are names of output features and whose values are lists of transformations to be considered for those output features. For example
>>> y_transformations = {'output1': ['log2', 'log10'], 'output2': ['log', 'sqrt']}
Default is None, which means do not optimize transformation for output features.
algorithm – str The algorithm to use for optimizing transformations
num_iterations – int The number of iterations for the optimization algorithm.
process_results – whether to perform postprocessing of optimization results or not
update_config – whether to update the config of model or not.
- Returns:
an instance of the HyperOpt class (ai4water.hyperopt.HyperOpt) which is used for optimization
Example
>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
>>> model = Model(model="XGBRegressor")
>>> optimizer_ = model.optimize_transformations(data=busan_beach(), exclude="tide_cm")
>>> print(optimizer_.best_paras())  # find the best/optimized transformations
>>> model.fit(data=busan_beach())
>>> model.predict()
- partial_dependence_plot(x=None, data=None, data_type='all', feature_name=None, num_points: int = 100, show: bool = True)[source]
Shows the partial dependence plot for a feature.
- Parameters:
x – the input data to use. If not given, then data must be given.
data – raw unprepared data from which x,y pairs are to be made. If given, x must not be given.
data_type (str) – the kind of the data to be used. It is only valid when data is given.
feature_name (str/list) – name/names of features. If only one feature is given, a 1-dimensional partial dependence plot is plotted. You can also provide a list of two feature names, in which case a 2d interaction plot will be plotted.
num_points (int) – number of points. It is used to define grid.
show (bool) – whether to show the plot or not!
- Return type:
an instance of ai4water.postprocessing.PartialDependencePlot
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.partial_dependence_plot(x=data.iloc[:, 0:-1], feature_name="tide_cm")
>>> model.partial_dependence_plot(data=data, feature_name="tide_cm")
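The quantity behind a one-dimensional partial dependence plot can be sketched in a few lines: sweep one feature over a grid of num_points values, hold all other features at their observed values, and average the model output at each grid value. The helper partial_dependence and the toy model below are hypothetical and only illustrate the computation; the sketch assumes num_points is at least 2.

```python
def partial_dependence(predict, rows, feature_idx, num_points=100):
    """Return (grid, averages) for one feature.

    For every grid value, the chosen feature is overwritten in each row and
    the resulting predictions are averaged over all rows.
    """
    values = [row[feature_idx] for row in rows]
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_points - 1)
    grid = [lo + i * step for i in range(num_points)]
    averages = []
    for g in grid:
        total = 0.0
        for row in rows:
            modified = list(row)        # copy, so the input rows stay untouched
            modified[feature_idx] = g
            total += predict(modified)
        averages.append(total / len(rows))
    return grid, averages


def toy_model(row):
    """Toy predictor: linear in both features."""
    return 3.0 * row[0] + row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid, pd_values = partial_dependence(toy_model, rows, feature_idx=0, num_points=5)
```

For this linear toy model the curve is a straight line with slope 3, offset by the mean of the second feature.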
- permutation_importance(data=None, data_type: str = 'test', x=None, y=None, scoring: Union[str, Callable] = 'r2', n_repeats: int = 5, noise: Optional[Union[str, ndarray]] = None, use_noise_only: bool = False, weights=None, plot_type: Optional[str] = None)[source]
Calculates the permutation importance on the given data
- Parameters:
data – Raw unprepared data from which x,y pairs of training and test data are prepared.
data_type (str) – one of training, test or validation. By default test data is used, based upon recommendations of Christoph Molnar’s book. Only valid if the data argument is given.
x – inputs for the model. Alternative to data
y – target/observation data for the model. Alternative to data
scoring – the scoring to use to calculate importance
n_repeats – number of times the permutation for each feature is performed.
noise – the noise to add when a feature is permuted. It can be a 1D array of length equal to len(data) or a string defining the distribution
use_noise_only – If True, then the feature being perturbed is replaced by the noise instead of adding the noise into the feature. This argument is only valid if noise is not None.
weights –
plot_type – if not None, it must be either heatmap or boxplot or bar_chart
- Return type:
an instance of ai4water.postprocessing.PermutationImportance
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> perm_imp = model.permutation_importance(data=busan_beach(),
...                                         data_type="validation", plot_type="boxplot")
>>> perm_imp.importances
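As a minimal sketch of the underlying idea, permutation importance can be computed by shuffling one feature column at a time, repeating this n_repeats times, and recording how much the score drops relative to the unshuffled baseline. The functions below are hypothetical standard-library helpers, not the ai4water implementation, and they leave out the noise and weights options.

```python
import random


def r2(true, pred):
    """Coefficient of determination."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, x, y, score, n_repeats=5, seed=313):
    """Return, per feature, a list of n_repeats score drops after shuffling it."""
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in x])
    importances = []
    for col in range(len(x[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in x]
            rng.shuffle(shuffled)
            # rebuild every row with only the chosen column replaced
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(x, shuffled)]
            drops.append(baseline - score(y, [predict(row) for row in permuted]))
        importances.append(drops)
    return importances


def toy_model(row):
    """Toy predictor that ignores the second feature entirely."""
    return 2.0 * row[0]


x = [[float(i), float((7 * i) % 10)] for i in range(10)]
y = [toy_model(row) for row in x]
importances = permutation_importance(toy_model, x, y, r2)
```

A feature the model does not use shows no drop in score, while shuffling a feature it relies on degrades the score.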
- predict(x=None, y=None, data: Union[str, DataFrame, ndarray, DataSet] = 'test', process_results: bool = True, metrics: str = 'minimal', return_true: bool = False, plots: Optional[Union[str, list]] = None, **kwargs)[source]
Makes prediction from the trained model.
- Parameters:
x – The data on which to make prediction. If given, it will override data. It can also be tf.Dataset or TorchDataset
y – Used for post-processing etc. If given, it will override data
data – It can also be unprepared/raw data which will be given to ai4water.preprocessing.DataSet to prepare x,y values.
process_results – bool whether to post-process the results or not
metrics – str only valid if process_results is True. The metrics to calculate. Valid values are minimal, all, hydro_metrics
return_true – bool whether to return the true values along with predicted values or not. Default is False, so that this method behaves like sklearn.
plots – optional (default=None) The kind of plots to draw. Only valid if process_results is True
kwargs – any keyword argument for the predict method.
- Returns:
A numpy array of predicted values. If return_true is True then a tuple of arrays. The first is true and the second is predicted. If x is given but y is not given, then the first array which is returned is None.
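The return contract described above can be sketched in plain Python. The helper format_prediction is hypothetical and only mimics the documented shape of the return value; it is not part of AI4Water.

```python
def format_prediction(pred, true=None, return_true=False):
    """Mimic the documented return contract of predict (illustrative only)."""
    if return_true:
        # true stays None when only x was supplied and no targets are known
        return true, pred
    return pred


# default: only the predictions come back, as in sklearn
pred_only = format_prediction([1.0, 2.0])

# return_true=True with x given but y missing: the first element is None
true, pred = format_prediction([1.0, 2.0], true=None, return_true=True)
```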
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=busan_beach())
>>> pred = model.predict(data=busan_beach())
get true values
>>> true, pred = model.predict(data=busan_beach(), return_true=True)
postprocessing of results
>>> pred = model.predict(data=busan_beach(), process_results=True)
calculate all metrics during postprocessing
>>> pred = model.predict(data=busan_beach(), process_results=True, metrics="all")
using your own data
>>> import numpy as np
>>> new_input = np.random.random((10, 13))
>>> pred = model.predict(x=new_input)
- predict_log_proba(x=None, data='test', **kwargs)[source]
Since preprocessing is part of Model, a trained model with sklearn/xgboost/catboost/lgbm as backend must also be able to apply preprocessing on inputs before calling predict_log_proba from the underlying library. Currently it just calls the predict_log_proba function of the underlying library after first transforming x.
- predict_on_all_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]
It makes prediction on training+validation+test data.
- Parameters:
data – raw, unprepared data from which x,y pairs will be generated.
process_results (bool, optional) – whether to post-process the results or not
return_true (bool, optional) – If true, the returned value will be tuple, first is true and second is predicted array
metrics (str, optional) – the metrics to calculate during post-processing
plots (optional (default=None)) –
The kind of plots to draw. Only valid if process_results is True. The following plots are available.
residual
regression
prediction
errors
fdc
murphy
edf
**kwargs – any keyword argument for .predict method.
- predict_on_test_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]
Makes prediction on test data.
- Parameters:
data – raw, unprepared data from which test data (x,y pairs) will be generated.
process_results (bool, optional) – whether to post-process the results or not
return_true (bool, optional) – If true, the returned value will be tuple, first is true and second is predicted array
metrics (str, optional) – the metrics to calculate during post-processing
plots (optional (default=None)) –
The kind of plots to draw. Only valid if process_results is True. The following plots are available.
residual
regression
prediction
errors
fdc
murphy
edf
**kwargs – any keyword argument for .predict method.
- predict_on_training_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]
Makes prediction on training data.
- Parameters:
data – raw, unprepared data from which training data (x,y pairs) will be generated.
process_results (bool, optional) – whether to post-process the results or not
return_true (bool, optional) – If true, the returned value will be tuple, first is true and second is predicted array
metrics (str, optional) – the metrics to calculate during post-processing
plots (optional (default=None)) –
The kind of plots to draw. Only valid if process_results is True. The following plots are available.
residual
regression
prediction
errors
fdc
murphy
edf
**kwargs – any keyword argument for .predict method.
- predict_on_validation_data(data, process_results=True, return_true=False, metrics='minimal', plots: Optional[Union[str, list]] = None, **kwargs)[source]
Makes prediction on validation data.
- Parameters:
data – raw, unprepared data from which validation data (x,y pairs) will be generated.
process_results (bool, optional) – whether to post-process the results or not
return_true (bool, optional) – If true, the returned value will be tuple, first is true and second is predicted array
metrics (str, optional) – the metrics to calculate during post-processing
plots (optional (default=None)) –
The kind of plots to draw. Only valid if process_results is True. The following plots are available.
residual
regression
prediction
errors
fdc
murphy
edf
**kwargs – any keyword argument for .predict method.
- predict_proba(x=None, data='test', **kwargs)[source]
Since preprocessing is part of Model, a trained model with sklearn/xgboost/catboost/lgbm as backend must also be able to apply preprocessing on inputs before calling predict_proba from the underlying library. Currently it just calls the predict_proba function of the underlying library after first transforming x.
- prediction_analysis(features: Union[list, str], x: Optional[Union[ndarray, DataFrame]] = None, y: Optional[ndarray] = None, data=None, data_type: str = 'all', feature_names: Optional[Union[str, list]] = None, num_grid_points: Optional[int] = None, grid_types='percentile', percentile_ranges=None, grid_ranges=None, custom_grid: Optional[list] = None, show_percentile: bool = False, show_outliers: bool = False, end_point: bool = True, which_classes=None, ncols=2, figsize: Optional[tuple] = None, annotate: bool = True, annotate_kws: Optional[dict] = None, cmap='YlGn', border=False, show: bool = True, save_metadata: bool = True) Axes [source]
shows prediction distribution with respect to two input features.
- Parameters:
x – input data to the model.
y – true data corresponding to x.
data – raw unprepared data from which x,y pairs for training, validation and test are generated. It must only be given if x is not given.
data_type (str, optional (default="all")) – The kind of data to be used. It is only valid if data argument is used. It should be one of training, validation, test or all.
features (str/list) – name or names of features to investigate
feature_names (list) – feature names
num_grid_points (list, optional, default=None) – number of grid points for each feature
grid_types (list, optional, default=None) – type of grid points for each feature
percentile_ranges (list of tuple, optional, default=None) – percentile range to investigate for each feature
grid_ranges (list of tuple, optional, default=None) – value range to investigate for each feature
custom_grid (list of (Series, 1d-array, list), optional, default=None) – customized list of grid points for each feature
show_percentile (bool, optional, default=False) – whether to display the percentile buckets for both features
show_outliers (bool, optional, default=False) – whether to display the out of range buckets for both features
end_point (bool, optional, default=True) – If True, stop is the last grid point. Otherwise, it is not included.
which_classes (list, optional, default=None) – which classes to plot, only use when it is a multi-class problem
figsize (tuple or None, optional, default=None) – size of the figure, (width, height)
ncols (integer, optional, default=2) – number of subplot columns, used when it is a multi-class problem
annotate (bool, default=True) – whether to annotate the points
annotate_kws (dict, optional) – a dictionary of keyword arguments with the following keys
- annotate_counts (bool, default=False): whether to annotate counts or not.
- annotate_colors (tuple): pair of colors
- annotate_color_threshold (float): threshold value for annotation
- annotate_fmt (str): format string for annotation.
- annotate_fontsize (int, optional (default=7)): fontsize for annotation
cmap –
border –
show (bool, optional (default=True)) – whether to show the plot or not
save_metadata (bool, optional, default=True) – whether to save the information as csv or not
- Returns:
a pandas dataframe and matplotlib Axes
- Return type:
Examples
>>> from ai4water.datasets import busan_beach
>>> from ai4water import Model
...
>>> model = Model(model="XGBRegressor")
>>> model.fit(data=busan_beach())
>>> model.prediction_analysis(features="tide_cm",
...     data=busan_beach(), show_percentile=True)
... # for multiple features
>>> model.prediction_analysis(
...     ['tide_cm', 'sal_psu'],
...     data=busan_beach(),
...     annotate_kws={"annotate_counts":True,
...     "annotate_colors":("black", "black"),
...     "annotate_fontsize":10},
...     custom_grid=[[-41.4, -20.0, 0.0, 20.0, 42.0],
...                  [33.45, 33.7, 33.9, 34.05, 34.4]],
... )
- score(x=None, y=None, data='test', **kwargs)[source]
Since preprocessing is part of Model, a trained model with sklearn as backend must also be able to apply preprocessing on inputs before calculating the score from sklearn. Currently it just calls the score function of sklearn after first transforming x and y.
- seed_everything(seed=None) None [source]
Resets the seeds of numpy, os, random, tensorflow and torch. If any of these modules is not available, the seed for that module is not set.
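The "skip what is not installed" behaviour can be sketched as follows. This is an illustrative sketch, not AI4Water's code: the function name seed_everything_sketch is hypothetical, and treating the os seed as the PYTHONHASHSEED environment variable is an assumption. torch and tensorflow would follow the same try/except pattern as numpy and are left out to keep the sketch light.

```python
import os
import random


def seed_everything_sketch(seed=313):
    """Seed every available random number source and report which were set."""
    seeded = []

    random.seed(seed)
    seeded.append("random")

    # assumption: the "os" seed is the hash seed; setting it at runtime
    # only affects Python subprocesses started afterwards
    os.environ["PYTHONHASHSEED"] = str(seed)
    seeded.append("os")

    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        # numpy is not installed, so its seed is simply not set
        pass

    return seeded


seeded_modules = seed_everything_sketch(1)
first_draw = random.random()
seed_everything_sketch(1)
second_draw = random.random()
```

Calling the function twice with the same seed makes the two draws identical.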
- sensitivity_analysis(data=None, bounds=None, sampler='morris', analyzer: Union[str, list] = 'sobol', sampler_kwds: Optional[dict] = None, analyzer_kwds: Optional[dict] = None, save_plots: bool = True, names: Optional[List[str]] = None) dict [source]
performs sensitivity analysis of the model w.r.t input features in data.
The model and its hyperparameters remain fixed while the input data is changed.
- Parameters:
data – data which will be used to get the bounds/limits of input features. If given, it must be 2d numpy array. It should be remembered that the given data is not used during sensitivity analysis. But new synthetic data is prepared on which sensitivity analysis is performed.
bounds (list,) – alternative to data
sampler (str, optional) – any sampler from SALib library. For example morris, fast_sampler, ff, finite_diff, latin, saltelli, sobol_sequence
analyzer (str, optional) – any analyzer from SALib library. For example sobol, dgsm, fast, ff, hdmr, morris, pawn, rbd_fast. You can also choose more than one analyzer. This is useful when you want to compare the results of more than one analyzer. It should be noted that having more than one analyzer does not increase computation time, except for the hdmr and delta analyzers, which are computationally heavy. For example
>>> analyzer = ["morris", "sobol", "rbd_fast"]
sampler_kwds (dict) – keyword arguments for sampler
analyzer_kwds (dict) – keyword arguments for analyzer
save_plots (bool, optional) –
names (list, optional) – names of input features. If not given, the names of the input features of the model will be used.
- Returns:
a dictionary whose keys are names of analyzers and values are sensitivity results for that analyzer.
- Return type:
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> df = busan_beach()
>>> input_features = df.columns.tolist()[0:-1]
>>> output_features = df.columns.tolist()[-1:]
... # build the model
>>> model = Model(model="RandomForestRegressor",
...     input_features=input_features,
...     output_features=output_features)
... # train the model
>>> model.fit(data=df)
... # perform sensitivity analysis
>>> si = model.sensitivity_analysis(data=df[input_features].values,
...     sampler="morris", analyzer=["morris", "sobol"],
...     sampler_kwds={'N': 100})
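Since data is only used to obtain the bounds of the input features, the relation between data and bounds can be sketched as below. The helper bounds_from_data and the feature names are hypothetical. The dictionary follows the problem layout that SALib uses (num_vars, names, bounds), but how AI4Water builds it internally is an assumption.

```python
def bounds_from_data(data):
    """Return [min, max] for every column of a 2D, row-major data set."""
    columns = list(zip(*data))
    return [[min(col), max(col)] for col in columns]


# three examples with two input features each (made-up values)
data = [[0.1, 10.0],
        [0.5, 30.0],
        [0.3, 20.0]]
names = ["feature_a", "feature_b"]  # hypothetical feature names

bounds = bounds_from_data(data)

# SALib-style problem definition; synthetic samples are drawn within
# these bounds, so the rows of data themselves are never evaluated
problem = {
    "num_vars": len(names),
    "names": names,
    "bounds": bounds,
}
```

Passing bounds directly instead of data would skip the min/max step and use the given limits as they are.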
- shap_values(data, layer=None) ndarray [source]
returns shap values
- Parameters:
data – raw unprepared data from which training and test data are extracted.
layer –
Examples
>>> from ai4water import Model
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> model = Model(model="RandomForestRegressor")
>>> model.fit(data=data)
>>> model.shap_values(data=data)
- test_data(x=None, y=None, data='test', key='test') tuple [source]
returns the x,y pairs for test. x,y are not used but only given to be used if user overwrites this method for further processing of x, y as shown below.
>>> from ai4water import Model
>>> class MyModel(Model):
...     def test_data(self, *args, **kwargs) -> tuple:
...         test_x, test_y = super().test_data(*args, **kwargs)
...         # further process x, y
...         return test_x, test_y
- training_data(x=None, y=None, data='training', key='train') tuple [source]
returns the x,y pairs for training. x,y are not used but only given to be used if user overwrites this method for further processing of x, y as shown below.
>>> from ai4water import Model
>>> class MyModel(Model):
...     def training_data(self, *args, **kwargs) -> tuple:
...         train_x, train_y = super().training_data(*args, **kwargs)
...         # further process x, y
...         return train_x, train_y
- update_weights(weight_file: Optional[str] = None)[source]
Updates the weights of the underlying model.
- Parameters:
weight_file (str, optional) – complete path of weight file. If not given, the weights are updated from model.w_path directory. For neural network based models, the best weights are updated if more than one weight file is present in model.w_path.
- Return type:
None
- validation_data(x=None, y=None, data='validation', key='val') tuple [source]
returns the x,y pairs for validation. x,y are not used but only given to be used if user overwrites this method for further processing of x, y as shown below.
>>> from ai4water import Model
>>> class MyModel(Model):
...     def validation_data(self, *args, **kwargs) -> tuple:
...         val_x, val_y = super().validation_data(*args, **kwargs)
...         # further process x, y
...         return val_x, val_y
- view(layer_name: Optional[Union[str, list]] = None, data=None, data_type: str = 'training', x=None, y=None, examples_to_view=None, show=False)[source]
shows all activations, weights and gradients of the model.
- Parameters:
layer_name – the layer to view. If not given, all the layers will be viewed. This argument is only required when the model consists of layers of neural networks.
data – the data to use when making calls to model for activation calculation or for gradient calculation.
data_type – str. It can be either training, validation, test or all.
x – input, alternative to data. If given it will override data argument.
y – target/observed/label, alternative to data. If given it will override data argument.
examples_to_view – the examples to view.
show – whether to show the plot or not.
- Returns:
An instance of the ai4water.postprocessing.visualize.Visualize class.
Model subclassing
Model subclassing is different from functional API in the way the model (neural network) is constructed. To understand the difference between model-subclassing API and functional API see Model subclassing vs functional API.
This class inherits from BaseModel. It is a subclass of keras.Model/torch.nn.Module depending upon the backend used. For scikit-learn/xgboost/catboost type models, this class only inherits from BaseModel. For deep learning/neural network based models, this class directly exposes all the functionalities of the underlying Model. Thus self is now a keras Model or torch.nn.Module. If users wish to create their own NN architecture, they should overwrite the initialize_layers and call/forward methods.
- ai4water.main.Model.__init__(self, verbosity=1, model=None, path=None, prefix=None, **kwargs)
Initializes the layers of NN model using initialize_layers method. All other input arguments go to BaseModel.
- ai4water.main.Model.fit_pytorch(self, x, **kwargs)
Trains the pytorch model.
- ai4water.main.Model.forward(self, *inputs: Any, **kwargs: Any)
Implements the forward pass for pytorch based NN models.
- ai4water.main.Model.initialize_layers(self, layers_config: dict, inputs=None)
Initializes the layers/weights/variables which are to be used in forward or call method.
- Parameters:
layers_config – python dictionary to define neural network. For details see https://ai4water.readthedocs.io/en/latest/build_dl_models.html
inputs – if None, it will be supposed that the Input layer either exists in layers_config or an Input layer will be created within this method before adding any other layer. If not None, then it must be an Input layer and the remaining NN architecture will be built as defined in layers_config. This can be handy when we want to use this method several times to build a complex or parallel NN structure. Avoid Input in layer names.
Model for functional API
- class ai4water.functional.Model(*args, **kwargs)[source]
Model class with Functional API and inherits from BaseModel.
For ML/non-Neural Network based models, there is no difference between the functional and sub-classing API. For DL/NN-based models, this class implements the functional API and differs from the subclassing API in the internal implementation of the NN. This class is useful if you want to use the functional API of keras to build your own NN structure. In such a case you can construct your NN structure by overwriting add_layers. Another advantage of this class is that sometimes model subclassing is not possible, for example due to some bugs in tensorflow. In such a case this class can be used. Otherwise all the features of ai4water are available in this class as well.
Example
>>> from ai4water.functional import Model
- add_layers(layers_config: dict, inputs=None)[source]
Builds the NN from dictionary.
- Parameters:
layers_config – a dictionary whose keys (for each layer) can be one of the following:
- config: dict/lambda. Every layer must contain initializing arguments as config dictionary. The config dictionary for every layer can contain name key and its value must be str type. If name key is not provided in the config, the provided layer name will be used as its name, e.g. in the following case
layers = {'LSTM': {'config': {'units': 16}}}
the name of LSTM layer will be LSTM, while in the following case
layers = {'LSTM': {'config': {'units': 16, 'name': 'MyLSTM'}}}
the name of the lstm will be MyLSTM.
- inputs: str/list. The calling arguments for the layer. If inputs key is missing for a layer, it will be supposed that either this is an Input layer or it uses previous outputs as inputs.
- outputs: str/list. We can specify the outputs from a layer by using the outputs key. The value to outputs must be a string or list of strings specifying the name of outputs from current layer which can be used later in the model.
- call_args: str/list. We can also specify additional call arguments by call_args key. The value to call_args must be a string or a list of strings.
inputs – if None, it will be supposed that the Input layer either exists in layers_config or an Input layer will be created within this method before adding any other layer. If not None, then it must be an Input layer and the remaining NN architecture will be built as defined in layers_config. This can be handy when we want to use this method several times to build a complex or parallel NN structure. Avoid Input in layer names.
- Returns:
outputs :
- Return type:
inputs
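The naming rule for layers described under layers_config can be sketched in a few lines. The helper resolve_layer_names is hypothetical and only reproduces the documented rule (use config['name'] if present, otherwise the dictionary key); it does not build any layers and is not AI4Water's code.

```python
def resolve_layer_names(layers):
    """Map each key of a layers dictionary to the name the layer would get."""
    names = {}
    for layer_key, spec in layers.items():
        config = spec.get("config", {})
        # config may also be a lambda; only dictionaries can carry a name
        if isinstance(config, dict) and "name" in config:
            names[layer_key] = config["name"]
        else:
            names[layer_key] = layer_key
    return names


default_names = resolve_layer_names({"LSTM": {"config": {"units": 16}}})
custom_names = resolve_layer_names(
    {"LSTM": {"config": {"units": 16, "name": "MyLSTM"}}}
)
```

Here default_names maps LSTM to LSTM, while custom_names maps LSTM to MyLSTM, matching the two cases in the parameter description.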
Pytorch Learner
This module can be used to train models which are built outside AI4Water’s model class. Thus, this module does not do any pre-processing, model building and post-processing of results.
This module is inspired by fastai's Learner and keras's Model class.
- class ai4water.models._torch.Learner(model, batch_size: int = 32, num_epochs: int = 14, patience: int = 100, shuffle: bool = True, to_monitor: Optional[list] = None, use_cuda: bool = False, path: Optional[str] = None, wandb_config: Optional[dict] = None, verbosity=1, **kwargs)[source]
Bases: AttributeContainer
Trains the pytorch model. Inspired by fastai.
- __init__(model, batch_size: int = 32, num_epochs: int = 14, patience: int = 100, shuffle: bool = True, to_monitor: Optional[list] = None, use_cuda: bool = False, path: Optional[str] = None, wandb_config: Optional[dict] = None, verbosity=1, **kwargs)[source]
Initializes the Learner class
- Parameters:
model –
a pytorch model having following attributes and methods
num_outs
w_path
loss
get_optimizer
batch_size – batch size
num_epochs – Number of epochs for which to train the model
patience – how many epochs to wait before stopping the training in case to_monitor does not improve.
shuffle –
use_cuda – whether to use cuda or not
to_monitor – list of metrics to monitor
path – path to save results/weights
wandb_config – config for wandb
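How num_epochs and patience interact can be sketched with a plain early-stopping loop. This is a generic illustration under stated assumptions, not the Learner's actual training loop: the helper epochs_run is hypothetical, and it assumes that a single monitored quantity (a validation loss, lower is better) decides when to stop.

```python
def epochs_run(val_losses, patience):
    """Return (epochs actually run, best loss) for a sequence of epoch losses."""
    best = float("inf")
    wait = 0  # epochs since the monitored loss last improved
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # no improvement for `patience` epochs: stop early
                return epoch, best
    return len(val_losses), best


losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]  # made-up validation losses
stopped_early = epochs_run(losses, patience=3)
ran_to_end = epochs_run(losses, patience=100)
```

With patience=3 the loop stops after the fifth epoch and never sees the later improvement, which is why a large default such as 100 effectively disables early stopping for short runs.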
Example
>>> from torch import nn
>>> import torch
>>> from ai4water.models._torch import Learner
...
>>> class Net(nn.Module):
...     def __init__(self, D_in, H, D_out):
...         super(Net, self).__init__()
...         # hidden layer
...         self.linear1 = nn.Linear(D_in, H)
...         self.linear2 = nn.Linear(H, D_out)
...     def forward(self, x):
...         l1 = self.linear1(x)
...         a1 = torch.sigmoid(l1)
...         yhat = torch.sigmoid(self.linear2(a1))
...         return yhat
...
>>> learner = Learner(model=Net(1, 2, 1),
...     num_epochs=501,
...     patience=50,
...     batch_size=1,
...     shuffle=False)
...
>>> learner.optimizer = torch.optim.SGD(learner.model.parameters(), lr=0.1)
>>> def criterion_cross(labels, outputs):
...     out = -1 * torch.mean(labels * torch.log(outputs) + (1 - labels) * torch.log(1 - outputs))
...     return out
>>> learner.loss = criterion_cross
...
>>> X = torch.arange(-20, 20, 1).view(-1, 1).type(torch.FloatTensor)
>>> Y = torch.zeros(X.shape[0])
>>> Y[(X[:, 0] > -4) & (X[:, 0] < 4)] = 1.0
...
>>> learner.fit(X, Y)
>>> metrics = learner.evaluate(X, y=Y, metrics=['r2', 'nse', 'mape'])
>>> t = learner.predict(X, y=Y, name='training')
- evaluate(x, y, batch_size: Optional[int] = None, metrics: Union[str, list] = 'r2', **kwargs)[source]
Evaluates the model on the given data.
- Parameters:
x –
data on which to evaluate. It can be
a torch.utils.data.Dataset
a torch.utils.data.DataLoader
a torch.Tensor
a numpy.ndarray
a list of torch tensors or numpy arrays
y – It comprises labels for corresponding x.
batch_size – None means make prediction on whole data in one go
metrics – name of performance metric to measure. It can be a single metric or a list of metrics. Allowed metrics are any of those in ai4water.post_processing.SeqMetrics.RegressionMetrics
kwargs –
- Returns:
If metrics is a string, the returned value is a float; otherwise it is a dictionary.
- fit(x, y=None, validation_data=None, **kwargs)[source]
Runs the training loop for pytorch model.
- Parameters:
x –
Can be one of following
an instance of torch.Dataset, y will be ignored
an instance of torch.DataLoader, y will be ignored
a torch tensor containing input data for each example
a numpy array or pandas DataFrame
a list of torch tensors or numpy arrays
y – if x is torch tensor, then y is the label/target for each corresponding example.
validation_data – can be one of following
an instance of torch.Dataset
an instance of torch.DataLoader
a tuple of x,y pairs where x and y are tensors
Default is None, which means no validation is performed.
kwargs –
can be callbacks. For example, to use a callable as callback use the following
>>> callbacks = [{'after_epochs': 300, 'func': PlotStuff}]
where PlotStuff is a callable. Each callable is provided with the following keyword arguments
epoch : the current epoch at which callable is called.
model : the model
train_data : training data_loader
val_data : validation data_loader
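The callback format above can be sketched with a small dispatcher. This is illustrative only: the helper run_callbacks is hypothetical, and it assumes that a callback fires every after_epochs epochs. The Learner may instead call it only once, so check the source before relying on this.

```python
def run_callbacks(callbacks, epoch, model, train_data, val_data):
    """Call every callback whose after_epochs divides the current epoch."""
    for callback in callbacks:
        # assumption: "after_epochs" means "every N epochs"
        if epoch % callback["after_epochs"] == 0:
            callback["func"](epoch=epoch, model=model,
                             train_data=train_data, val_data=val_data)


calls = []


def plot_stuff(epoch, model, train_data, val_data):
    # stand-in for PlotStuff: just record the epoch it was called at
    calls.append(epoch)


callbacks = [{"after_epochs": 300, "func": plot_stuff}]

# dummy training loop; model and data loaders are placeholders here
for epoch in range(1, 601):
    run_callbacks(callbacks, epoch, model=None, train_data=None, val_data=None)
```

The callable receives exactly the four keyword arguments listed above, so its signature should accept epoch, model, train_data and val_data.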
- plot_model(y=None)[source]
Helper function to plot dot diagram of model using torchviz module.
- Parameters:
y (torch.Tensor) – output tensor
- predict(x, y=None, batch_size: Optional[int] = None, reg_plot: bool = True, name: Optional[str] = None, **kwargs) ndarray [source]
Makes prediction on the given data
- Parameters:
x –
data on which to make prediction. It can be
a torch.utils.data.Dataset
a torch.utils.data.DataLoader
a torch.Tensor
a numpy array
a list of torch tensors or numpy arrays
y – only relevant if x is torch.Tensor. It comprises labels for corresponding x.
batch_size – None means make prediction on whole data in one go
reg_plot – whether to plot regression line or not
name – string to be used for title and name of saved plot
- Returns:
predicted output as numpy array