Data Splitting

This notebook describes how to split data into training, validation and test sets.

[24]:

import numpy as np

from ai4water.datasets import busan_beach
from ai4water.preprocessing import DataSet
from ai4water.utils.utils import get_version_info

[2]:

for lib, ver in get_version_info().items():
    print(lib, ver)

python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:16) [MSC v.1916 64 bit (AMD64)]
os nt
ai4water 1.07
lightgbm 3.3.1
tcn 3.4.0
catboost 0.26
xgboost 1.5.0
easy_mpl 0.21.3
SeqMetrics 1.3.3
tensorflow 2.7.0
keras.api._v2.keras 2.7.0
numpy 1.21.0
pandas 1.3.4
matplotlib 3.4.3
h5py 3.5.0
sklearn 1.0.1
shapefile 2.3.0
fiona 1.8.22
xarray 0.20.1
netCDF4 1.5.7
optuna 2.10.1
skopt 0.9.0
hyperopt 0.2.7
plotly 5.3.1
lime NotDefined
seaborn 0.11.2

[3]:

data = busan_beach()
data.shape

[3]:

(1446, 14)

[4]:

data.dropna().shape

[4]:

(218, 14)

In DataSet class, 70% of the total examples are considered for training while the remaining 30% examples are reserved for test. The validation data is 20% of the training data.

[6]:

ds = DataSet(data=data)

train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

len(train_x) + len(val_x) + len(test_x)


********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (121, 13)
target shape:  (121, 1)

********** Removing Examples with nan in labels  **********

***** Validation *****
input_x shape:  (31, 13)
target shape:  (31, 1)

********** Removing Examples with nan in labels  **********

***** Test *****
input_x shape:  (66, 13)
target shape:  (66, 1)

[6]:

We had 218 valid examples, 70% of which i.e. 152 were considered for training and remaining 30% i.e. 66 examples were reserved for test. Since the validation fraction was 0.2 (by default) that means we need to put 20% of 152 for validation and thus when we called training_data, we got 121 examples.

train fraction

We can confirm that the default train fraction was 0.7.

[7]:

ds.train_fraction

[7]:

0.7

If we don’t want to separate any data for test set, we can set the train_fraction to 1.0. This means we want to consider the whole data for training. Now the validation data (which is 20%) will be taken from the total examples.

[9]:

ds = DataSet(data=data, train_fraction=1.0)

train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

len(train_x) + len(val_x) + len(test_x)


********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (174, 13)
target shape:  (174, 1)

********** Removing Examples with nan in labels  **********

***** Validation *****
input_x shape:  (44, 13)
target shape:  (44, 1)
***** Test *****
input_x shape:  (0,)
target shape:  (0,)

[9]:

20% of examples of 218 examples are 44. We did not get any example for test set, because train fraction was 1.0.

val_fraction

We can also control the number of examples for the valiation set by making use of val_fraction. Remember that the val_fraction is always considered as fraction of training set.

[10]:

ds = DataSet(data=data, train_fraction=1.0, val_fraction=0.5)

train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

len(train_x) + len(val_x) + len(test_x)


********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (109, 13)
target shape:  (109, 1)

********** Removing Examples with nan in labels  **********

***** Validation *****
input_x shape:  (109, 13)
target shape:  (109, 1)
***** Test *****
input_x shape:  (0,)
target shape:  (0,)

[10]:

[11]:

ds = DataSet(data=data, train_fraction=0.7, val_fraction=0.5)

train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

len(train_x) + len(val_x) + len(test_x)


********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (76, 13)
target shape:  (76, 1)

********** Removing Examples with nan in labels  **********

***** Validation *****
input_x shape:  (76, 13)
target shape:  (76, 1)

********** Removing Examples with nan in labels  **********

***** Test *****
input_x shape:  (66, 13)
target shape:  (66, 1)

[11]:

random splitting

The default splitting strategy is sequential. This means, the first examples (determined by train_fraction) are used for training and the later examples are considered for test. We can confirm this by checking the inputs and outputs from the first example. They both correspond to the first non-nan row.

[12]:

train_x[0]

[12]:

array([ -22.245026 ,   19.457182 ,   34.00429  ,   24.28     ,
          0.       ,    0.       ,    0.       ,    6.       ,
        205.00667  ,    1.6533333,  998.61334  , 1002.9133   ,
         75.1      ], dtype=float32)

[13]:

train_y[0]

[13]:

array([444866.9004])

However, in many machine learning problems, where the data is not time-series, one may wishes to split the data randomly into training and test sets. Thi can be acheived by setting the split_random to True. Now the first example from training data is not necessarily from the first row.

[21]:

ds = DataSet(data=data, split_random=True, verbosity=0)

train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

train_y[0]

[21]:

array([1363760.645])

[22]:

train_y[1]

[22]:

array([2356366.032])

reproducibility

If we create examples from our the same data again and again using random splitting, the examples reserved for training (or for validation and test) are same. This is because, random seed is always set to 313 to ensure reproducibility by default. This is carried out to acheive reproducible results.

[18]:

for i in range(10):
    ds = DataSet(data=data, split_random=True, verbosity=0)

    train_x, train_y = ds.training_data()
    val_x, val_y = ds.validation_data()
    test_x, test_y = ds.test_data()

    print(train_y[0])

[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]
[1363760.645]

If however, we do not set a seed, then different examples will be considered for training/validation/test sets. This is shown below by setting the seed to None.

[19]:

for i in range(10):
    ds = DataSet(data=data, split_random=True, seed=None, verbosity=0)

    train_x, train_y = ds.training_data()
    val_x, val_y = ds.validation_data()
    test_x, test_y = ds.test_data()

    print(train_y[0])

[332194.7001]
[275100.6769]
[4881349.328]
[278787.0508]
[844769.5803]
[216767.3381]
[2391848.163]
[141357.2781]
[761979.6799]
[20083984.46]

We can also set the seed of our choice.

[20]:

for i in range(10):
    ds = DataSet(data=data, split_random=True, seed=i, verbosity=0)

    train_x, train_y = ds.training_data()
    val_x, val_y = ds.validation_data()
    test_x, test_y = ds.test_data()

    print(train_y[0])

[774076.752]
[2060160.801]
[202449.1352]
[21289.67181]
[36045668.64]
[14976057.52]
[3291674.776]
[2356366.032]
[836261.1064]
[3256878.75]

spliting using indices

Sometimes, we want to have more control over splitting strategy. We want to specify the examples to be considered for training/validation/test. One way to acheiving this using DataSet class is by specifying the indices for training or for training, validation and test sets.

[25]:

indices = {
    'training': np.arange(50)
}
ds = DataSet(data=data, indices=indices, verbosity=0)


train_x, train_y = ds.training_data()
val_x, val_y = ds.validation_data()
test_x, test_y = ds.test_data()

Above we are specifying the training examples by saying that the first 50 examples are to be considered for training. Since the validation data is taken from training set, 10 examples (20%) are taken for validation.

[28]:

len(train_x), len(val_x), len(test_x)

[28]:

(40, 10, 168)

[29]:

train_y[0]

[29]:

array([444866.9004])

spliting using intervals

[ ]: