utility functions

Some utility functions

prepare_data

ai4water.utils.utils.prepare_data(data: ndarray, lookback: int, num_inputs: Optional[int] = None, num_outputs: Optional[int] = None, input_steps: int = 1, forecast_step: int = 0, forecast_len: int = 1, known_future_inputs: bool = False, output_steps: int = 1, mask: Optional[Union[int, float, ndarray]] = None) -> Tuple[ndarray, ndarray, ndarray]

Converts a numpy array into input/output examples for a supervised machine learning problem.

Parameters:
  • data – numpy ndarray whose first dimension represents the number of examples and whose second dimension represents the number of features. Some of those features will be used as inputs and the rest as outputs, depending upon the values of num_inputs and num_outputs.

  • lookback – number of previous steps/values to be used at one step.

  • num_inputs – number of input features in data (default None). If None, it is calculated as features - outputs. The inputs are the leading columns of the second dimension, i.e. all columns before the output columns.

  • num_outputs – number of columns (counted from the last) in data to be used as output. If None, it is calculated as features - inputs.

  • input_steps – strides/number of steps in input data

  • forecast_step – must be greater than or equal to 0. Determines which t+i-th value to use as target, where i is the horizon. For time series prediction, this is the horizon to predict.

  • forecast_len – number of horizons/future values to predict.

  • known_future_inputs – only useful if forecast_len>1. If True, the ‘future inputs’ are known and are used while making predictions at t>0.

  • output_steps – step size in outputs. If 2, every second value from the targets is predicted.

  • mask – If int, the examples with this value in the output will be skipped. If array, it must be a boolean mask indicating which examples to include/exclude, and its length should be equal to the number of generated examples. The number of generated examples is difficult to predict because it depends upon lookback, input_steps and forecast_step. Thus it is better to provide an integer indicating which values in the outputs are to be considered invalid. Default is None, in which case all the generated examples are returned.

Returns:

  • x – numpy array of shape (examples, lookback, ins) consisting of input examples

  • prev_y – numpy array consisting of previous outputs

  • y – numpy array consisting of target values

Given the following data consisting of input/output pairs

input1  input2  output1  output2  output3
1       11      21       31       41
2       12      22       32       42
3       13      23       33       43
4       14      24       34       44
5       15      25       35       45
6       16      26       36       46
7       17      27       37       47

If we use the following 2 time series as input

input1  input2
1       11
2       12
3       13
4       14
5       15
6       16
7       17

then num_inputs=2, lookback=7, input_steps=1

and if we want to predict

output1  output2  output3
27       37       47

then num_outputs=3, forecast_len=1, forecast_step=0,

if we want to predict

output1  output2  output3
28       38       48

then num_outputs=3, forecast_len=1, forecast_step=1,

if we want to predict

output1  output2  output3
27       37       47
28       38       48

then num_outputs=3, forecast_len=2, horizon/forecast_step=0,

if we want to predict

output1  output2  output3
28       38       48
29       39       49
30       40       50

then num_outputs=3, forecast_len=3, forecast_step=1,

if we want to predict

output2
38
39
40

then num_outputs=1, forecast_len=3, forecast_step=1

if we predict

output2
39

then num_outputs=1, forecast_len=1, forecast_step=2

if we predict

output2
39
40
41

then num_outputs=1, forecast_len=3, forecast_step=2

If we use the following two time series as input

input1  input2
1       11
3       13
5       15
7       17

then num_inputs=2, lookback=4, input_steps=2

If the input is

input1  input2
1       11
2       12
3       13
4       14
5       15
6       16
7       17

and the target/output is

output1  output2  output3
25       35       45
26       36       46
27       37       47

This means we make use of known future inputs. This can be achieved using the following configuration: num_inputs=2, num_outputs=3, lookback=4, forecast_len=3, forecast_step=1, known_future_inputs=True

The general shape of output/target/label is (examples, num_outputs, forecast_len)

The general shape of inputs/x is (examples, lookback + forecast_len - 1, ..., num_inputs)

Examples

>>> import numpy as np
>>> from ai4water.utils.utils import prepare_data
>>> num_examples = 50
>>> dataframe = np.arange(int(num_examples*5)).reshape(-1, num_examples).transpose()
>>> dataframe[0:10]
array([[  0,  50, 100, 150, 200],
       [  1,  51, 101, 151, 201],
       [  2,  52, 102, 152, 202],
       [  3,  53, 103, 153, 203],
       [  4,  54, 104, 154, 204],
       [  5,  55, 105, 155, 205],
       [  6,  56, 106, 156, 206],
       [  7,  57, 107, 157, 207],
       [  8,  58, 108, 158, 208],
       [  9,  59, 109, 159, 209]])
>>> x, prevy, y = prepare_data(dataframe, num_outputs=2, lookback=4,
...    input_steps=2, forecast_step=2, forecast_len=4)
>>> x[0]
array([[  0.,  50., 100.],
       [  2.,  52., 102.],
       [  4.,  54., 104.],
       [  6.,  56., 106.]], dtype=float32)
>>> y[0]
array([[158., 159., 160., 161.],
       [208., 209., 210., 211.]], dtype=float32)
>>> x, prevy, y = prepare_data(dataframe, num_outputs=2, lookback=4,
...    forecast_len=3, known_future_inputs=True)
>>> x[0]
array([[  0,  50, 100],
       [  1,  51, 101],
       [  2,  52, 102],
       [  3,  53, 103],
       [  4,  54, 104],
       [  5,  55, 105],
       [  6,  56, 106]])       # (7, 3)
>>> # note that although lookback=4, x[0] has 7 rows because known future inputs are included
>>> y[0]
array([[154., 155., 156.],
       [204., 205., 206.]], dtype=float32)  # (2, 3)
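
The windowing of the first example above can be reproduced with plain NumPy slicing. The following is only a sketch, not the ai4water implementation: sliding_windows is a hypothetical helper which assumes that the last num_outputs columns are the targets and that forecast_step is counted from the last input row. It ignores known_future_inputs, output_steps and mask.

```python
import numpy as np


def sliding_windows(data, lookback, num_outputs, input_steps=1,
                    forecast_step=0, forecast_len=1):
    """Hypothetical helper for illustration, not part of ai4water."""
    num_inputs = data.shape[1] - num_outputs
    span = (lookback - 1) * input_steps  # rows between first and last input row
    x, y = [], []
    first = 0
    # stop as soon as the last target row would fall outside the data
    while first + span + forecast_step + forecast_len <= len(data):
        last = first + span
        # every `input_steps`-th row, `lookback` rows in total
        x.append(data[first:last + 1:input_steps, :num_inputs])
        start = last + forecast_step
        # transpose so that each target has shape (num_outputs, forecast_len)
        y.append(data[start:start + forecast_len, num_inputs:].T)
        first += 1
    return np.stack(x), np.stack(y)


data = np.arange(250).reshape(-1, 50).transpose()  # same array as above
x, y = sliding_windows(data, lookback=4, num_outputs=2,
                       input_steps=2, forecast_step=2, forecast_len=4)
print(x[0])
print(y[0])
```

For this configuration x[0] and y[0] hold the same values as the documented output above, as integers rather than float32.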

get_attributes

tensorflow, torch, numpy, matplotlib, random and other libraries are imported here once and then used all over ai4water. This file does not import anything from other files of ai4water.

ai4water.backend.get_attributes(aus, what: str, retain: Optional[str] = None, case_sensitive: bool = False) -> dict

Gets all callable attributes of aus from what and saves them in a dictionary with their names as keys. If case_sensitive is False, all keys are capitalized so that looking them up becomes case insensitive. It is possible that some attributes of tf.keras.layers are callable but still not valid layers, or that some attributes of tf.keras.losses are callable but still not valid losses. In that case the error will be raised by tensorflow; those errors are not caught at present.

Parameters:
  • aus – parent module

  • what (str) – child module/package

  • retain (str, optional (default=None)) – if duplicates of ‘what’ exist then whether to prefer class or function. For example, fastica and FastICA exist in sklearn.decomposition then if retain is ‘function’ then fastica will be kept, if retain is ‘class’ then FastICA is kept. If retain is None, then what comes later will overwrite the previously kept object.

  • case_sensitive (bool, optional (default=False)) – whether to consider what as case-sensitive or not. If True, fastica and FastICA will both be saved as separate objects.

Example

>>> import tensorflow as tf
>>> from ai4water.backend import get_attributes
>>> get_attributes(tf.keras, 'layers')  # will get all layers from tf.keras.layers
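
The collection step described above can be sketched with the standard library alone. collect_callables is a hypothetical stand-in, not the ai4water code; it ignores retain and assumes that keys are upper-cased when case_sensitive is False.

```python
import os


def collect_callables(parent, child, case_sensitive=False):
    """Hypothetical stand-in for illustration, not ai4water's code."""
    module = getattr(parent, child)
    found = {}
    for name in dir(module):
        attr = getattr(module, name)
        if callable(attr) and not name.startswith('_'):
            # upper-case keys make later look-ups case-insensitive
            key = name if case_sensitive else name.upper()
            found[key] = attr
    return found


attrs = collect_callables(os, 'path')
print(attrs['JOIN'] is os.path.join)  # True
```

With case_sensitive=True the original names are kept, so names differing only in case would remain separate keys.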

murphy_diagram

ai4water.utils.visualizations.murphy_diagram(observed: Union[list, ndarray, Series, DataFrame], predicted: Union[list, ndarray, Series, DataFrame], reference: Optional[Union[list, ndarray, Series, DataFrame]] = None, reference_model: Optional[Union[str, Callable]] = None, inputs=None, plot_type: str = 'scores', xaxis: str = 'theta', ax: Optional[Axes] = None, line_colors: Optional[tuple] = None, fill_color: str = 'lightgray', show: bool = True) -> Axes

Murphy diagram as introduced by Ehm et al., 2015 and illustrated by Rob Hyndman

Parameters:
  • observed – observed or true values

  • predicted – model’s prediction

  • reference – reference prediction

  • reference_model – the model for reference prediction. Only relevant if reference is None and plot_type is diff. It can be a callable or a string. If it is a string, it can be any model name from sklearn.linear_model

  • inputs – inputs for the reference model. Only relevant if reference_model is not None and plot_type is diff

  • plot_type – either scores or diff

  • xaxis – either theta or time

  • ax – the axis to use for plotting

  • line_colors – colors of line

  • fill_color – color to fill confidence interval

  • show – whether to show the plot or not

Returns:

matplotlib axes

Example

>>> import numpy as np
>>> from ai4water.utils.visualizations import murphy_diagram
>>> yy = np.random.randint(1, 1000, 100)
>>> ff1 = np.random.randint(1, 1000, 100)
>>> ff2 = np.random.randint(1, 1000, 100)
>>> murphy_diagram(yy, ff1, ff2)
...
>>> murphy_diagram(yy, ff1, ff2, plot_type="diff")
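
A Murphy diagram plots, for each threshold theta, the average elementary score of a forecast. The sketch below uses the elementary score of the mean functional, which is one common choice; whether murphy_diagram uses exactly this functional is an assumption, and elementary_score_mean is a hypothetical name.

```python
import numpy as np


def elementary_score_mean(forecast, observed, theta):
    """Average elementary score at threshold theta (mean functional).

    Illustrative only; ai4water may compute its curves differently.
    """
    x = np.asarray(forecast, dtype=float)
    y = np.asarray(observed, dtype=float)
    lower, upper = np.minimum(x, y), np.maximum(x, y)
    # a pair is penalised only when theta separates forecast and observation
    between = (lower <= theta) & (theta < upper)
    return float(np.mean(0.5 * np.abs(y - theta) * between))


observed = np.array([1.0, 3.0, 5.0])
forecast = np.array([3.0, 1.0, 5.0])
thetas = np.linspace(0.0, 6.0, 7)
# one point of the Murphy curve per threshold
curve = [elementary_score_mean(forecast, observed, t) for t in thetas]
print(curve)
```

A forecast equal to the observations scores zero at every theta, so a lower curve indicates a better forecast.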

fdc_plot

ai4water.utils.visualizations.fdc_plot(sim: Union[list, ndarray, Series, DataFrame], obs: Union[list, ndarray, Series, DataFrame], ax: Optional[Axes] = None, legend: bool = True, xlabel: str = 'Exceedence [%]', ylabel: str = 'Flow', show: bool = True) -> Axes

Plots flow duration curve

Parameters:
  • sim – simulated flow

  • obs – observed flow

  • ax – axis on which to plot

  • legend – whether to apply legend or not

  • xlabel – label to set on x-axis. set to None for no x-label

  • ylabel – label to set on y-axis

  • show – whether to show the plot or not

Returns:

matplotlib axes

Example

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from ai4water.utils.visualizations import fdc_plot
>>> simulated = np.random.random(100)
>>> observed = np.random.random(100)
>>> fdc_plot(simulated, observed)
>>> plt.show()
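
A flow duration curve pairs each flow value with the percentage of time it is equalled or exceeded. The sketch below shows one way to compute these pairs. flow_duration_curve is a hypothetical helper, and the Weibull plotting position used here is an assumption; fdc_plot may use a different formula.

```python
import numpy as np


def flow_duration_curve(flow):
    """Hypothetical helper: returns (exceedance in %, sorted flow)."""
    flow = np.sort(np.asarray(flow, dtype=float))[::-1]  # largest flow first
    rank = np.arange(1, len(flow) + 1)
    # Weibull plotting position; other formulas exist
    exceedance = 100.0 * rank / (len(flow) + 1)
    return exceedance, flow


exceedance, flow = flow_duration_curve([3.0, 1.0, 2.0])
print(exceedance)  # [25. 50. 75.]
print(flow)        # [3. 2. 1.]
```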

edf_plot

ai4water.utils.visualizations.edf_plot(y: ndarray, num_points: int = 100, xlabel='Objective Value', marker: str = '-', ax: Optional[Axes] = None, show: bool = True, **kwargs) -> Axes

Plots the empirical distribution function.

Parameters:
  • y (np.ndarray) – array of values

  • num_points (int) –

  • xlabel (str) –

  • marker (str) –

  • ax (plt.Axes, optional) –

  • show (bool, optional (default=True)) – whether to show the plot or not

  • **kwargs – key word arguments for plot

Return type:

plt.Axes
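
The empirical distribution function gives, for each value on a grid, the fraction of samples that are less than or equal to it. The sketch below computes it with NumPy. empirical_distribution is a hypothetical helper, and treating num_points as the number of evenly spaced grid points is an assumption, since the parameter is not described above.

```python
import numpy as np


def empirical_distribution(y, num_points=100):
    """Hypothetical helper: fraction of values <= each grid point."""
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.linspace(y[0], y[-1], num_points)
    # searchsorted(..., side='right') counts the values <= each grid point
    fraction = np.searchsorted(y, grid, side='right') / len(y)
    return grid, fraction


grid, fraction = empirical_distribution([1.0, 2.0, 2.0, 4.0], num_points=4)
print(grid)      # [1. 2. 3. 4.]
print(fraction)  # [0.25 0.75 0.75 1.  ]
```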

jsonize

ai4water.utils.utils.jsonize(obj, type_converters: Optional[dict] = None)

Serializes an object to python’s native types so that it can be saved in json file format. If the object is a sequence, then each member of the sequence is serialized. The same goes for nested sequences like lists of lists or lists of dictionaries.

Parameters:
  • obj – any python object that needs to be serialized.

  • type_converters (dict) – a dictionary defining how to serialize any particular type. The keys of the dictionary should be types and the values should be callables that serialize that type.

Return type:

a serialized python object

Examples

>>> import numpy as np
>>> from ai4water.utils import jsonize
>>> a = np.array([2.0])
>>> b = jsonize(a)
>>> type(b)  # int
... # if a data container consists of mix of native and third party types
... # only third party types are converted into native types
>>> print(jsonize({1: [1, None, True, np.array(3)], 'b': np.array([1, 3])}))
{1: [1, None, True, 3], 'b': [1, 3]}

The user can define the methods to serialize particular types, e.g. we can serialize tensorflow’s tensors using the serialize method

>>> from tensorflow.keras.layers import Lambda, serialize
>>> tensor = Lambda(lambda _x: _x[Ellipsis, -1, :])
>>> jsonize({'my_tensor': tensor}, {Lambda: serialize})

TrainTestSplit

class ai4water.utils.utils.TrainTestSplit(test_fraction: float = 0.3, seed: Optional[int] = None, train_indices: Optional[Union[list, ndarray]] = None, test_indices: Optional[Union[list, ndarray]] = None)

Splits data into training and test sets. train_test_split of sklearn cannot be used for a list of arrays, hence this class.

Examples

>>> import numpy as np
>>> from ai4water.utils.utils import TrainTestSplit
>>> x1 = np.random.random((100, 10, 4))
>>> x2 = np.random.random((100, 4))
>>> x = [x1, x2]
>>> y = np.random.random(100)
...
>>> train_x, test_x, train_y, test_y = TrainTestSplit().split_by_random(x, y)
>>> # also works when only x is provided
>>> train_x, test_x, _, _ = TrainTestSplit().split_by_random(x)
... # if we have a time-series like data, where we want to use earlier samples
... # for training and later samples for test then we can do slice based
>>> train_x, test_x, train_y, test_y = TrainTestSplit().split_by_slicing(x, y)

split_by_indices(x: Union[list, ndarray, Series, DataFrame, List[ndarray]], y: Optional[Union[list, ndarray, Series, DataFrame, List[ndarray]]] = None)

Splits x and y by the user-defined train_indices and test_indices.

split_by_random(x: Union[list, ndarray, Series, DataFrame, List[ndarray]], y: Optional[Union[list, ndarray, Series, DataFrame, List[ndarray]]] = None) -> Tuple[Any, Any, Any, Any]

Splits x and y by random splitting.

Parameters:
  • x – arrays to split. Either an array like such as list, numpy array or pandas dataframe/series, or a list of array like objects.

  • y – array like. Either an array like such as list, numpy array or pandas dataframe/series, or a list of array like objects.

split_by_slicing(x: Union[list, ndarray, Series, DataFrame, List[ndarray]], y: Optional[Union[list, ndarray, Series, DataFrame, List[ndarray]]] = None)

Splits x and y by slicing, which is defined by test_fraction.

Parameters:
  • x – arrays to split. Either an array like such as list, numpy array or pandas dataframe/series, or a list of array like objects.

  • y – array like. Either an array like such as list, numpy array or pandas dataframe/series, or a list of array like objects.
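
Splitting a list of arrays consistently comes down to applying the same row indices to every array. The sketch below illustrates this for both random and slice based splitting. split_arrays is a hypothetical helper, not the TrainTestSplit implementation, and the rounding of the test size is an assumption.

```python
import numpy as np


def split_arrays(arrays, test_fraction=0.3, seed=None, shuffle=True):
    """Hypothetical helper: splits every array with the same row indices."""
    n = len(arrays[0])
    indices = np.arange(n)
    if shuffle:
        # random split; with shuffle=False the earlier rows go to training
        indices = np.random.default_rng(seed).permutation(n)
    n_train = n - int(round(n * test_fraction))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = [a[train_idx] for a in arrays]
    test = [a[test_idx] for a in arrays]
    return train, test


x1 = np.random.random((100, 10, 4))
x2 = np.random.random((100, 4))
train_x, test_x = split_arrays([x1, x2], test_fraction=0.3, seed=0)
print([a.shape for a in train_x])  # [(70, 10, 4), (70, 4)]
print([a.shape for a in test_x])   # [(30, 10, 4), (30, 4)]
```

Because both arrays are indexed with the same indices, row i of x1 and row i of x2 always end up in the same subset.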