
Data Preparation for a Regression Task

This notebook describes how to prepare data for a machine learning regression problem.

[ ]:
try:
    import ai4water
except ModuleNotFoundError:
    # install ai4water if it is not already available
    !pip install ai4water
[104]:
import site
# make a local AI4Water checkout importable (path specific to the author's machine)
site.addsitedir("D:\\mytools\\AI4Water")

from ai4water.datasets import busan_beach
from ai4water.preprocessing import DataSet
from ai4water.utils.utils import get_version_info

[51]:
for lib, ver in get_version_info().items():
    print(lib, ver)
python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:16) [MSC v.1916 64 bit (AMD64)]
os nt
ai4water 1.06
lightgbm 3.3.1
tcn 3.4.0
catboost 0.26
xgboost 1.5.0
easy_mpl 0.21.3
SeqMetrics 1.3.3
tensorflow 2.7.0
keras.api._v2.keras 2.7.0
numpy 1.21.0
pandas 1.3.4
matplotlib 3.4.3
h5py 3.5.0
sklearn 1.0.1
shapefile 2.3.0
fiona 1.8.22
xarray 0.20.1
netCDF4 1.5.7
optuna 2.10.1
skopt 0.9.0
hyperopt 0.2.7
plotly 5.3.1
lime NotDefined
seaborn 0.11.2

When preparing data for a supervised machine learning (prediction) problem, we need to think in terms of input features, output features, examples, and the dimensions of the input and output features. This is because a machine learning model/algorithm is a box which takes something (input features) as input and returns something else (output features) as output. We should, however, keep in mind that the inputs and outputs must be related to each other in one way or another. For example, we should not use chocolate consumption as an input to predict the number of Nobel laureates: the two are apparently correlated, but one does not cause the other.

An example consists of an input-output pair, i.e. one example has one input and the corresponding output. The input can consist of one or more input features. It is usually held that the more examples, the better, and likewise, the more diverse the examples, the better. All examples must have exactly the same input and output features. For example, if we want to predict the next day's temperature, we cannot have some examples with humidity + wind speed as input and others with humidity + sunlight as input. If some input features are not available for some examples, we must either fill them using some rule or discard those examples. Note that the words data-points and samples are also used for what we call examples here; similarly, the words target, label and true output are used for the output.

When it comes to the dimensions of input and output features, we need to consider whether an input feature in an example is a scalar or a vector, and if it is a vector, whether it is one-dimensional or multi-dimensional. Consider the problem of predicting the next day's temperature using climate variables from the previous day. If we use the temperature at a single point in time at one place, the feature is a scalar value. If we use the temperature from the previous five days as one input feature, that feature is one-dimensional. If, however, we take a raster map of a city's temperature as an input feature, that feature is multi-dimensional. If all input and output features are scalar values, the data fits into a table/excel sheet/csv file. Such data is called tabular data. In tabular data, each column corresponds to a feature and each row corresponds to an example. We can, however, also combine scalar and multi-dimensional features as inputs to a single model. Consider, for example, predicting the next day's temperature in a city using the previous day's temperature map (image data) together with the latitude and longitude of the city (scalar data) as input features.
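To make the tabular case concrete, here is a minimal, self-contained sketch; the feature names and numbers are purely hypothetical.

[ ]:
import pandas as pd

# purely hypothetical tabular data: each row is one example,
# each column is one scalar feature
toy = pd.DataFrame({
    "humidity": [0.60, 0.70, 0.55],          # input feature
    "wind_speed_mps": [3.1, 2.4, 5.0],       # input feature
    "next_day_temp_c": [21.5, 19.8, 23.1],   # output feature
})
print(toy.shape)  # (3, 3) -> 3 examples, 3 features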

Now consider the following tabular data in the form of a pandas DataFrame.

[52]:
data = busan_beach()

This data consists of 1446 rows and 14 columns.

[53]:
data.shape
[53]:
(1446, 14)

If we check the names of the columns in data, we find that some are hydrological features while others are climate features.

[54]:
data.columns
[54]:
Index(['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm',
       'pcp6_mm', 'pcp12_mm', 'wind_dir_deg', 'wind_speed_mps', 'air_p_hpa',
       'mslp_hpa', 'rel_hum', 'tetx_coppml'],
      dtype='object')
[55]:
data.head()
[55]:
tide_cm wat_temp_c sal_psu air_temp_c pcp_mm pcp3_mm pcp6_mm pcp12_mm wind_dir_deg wind_speed_mps air_p_hpa mslp_hpa rel_hum tetx_coppml
index
2018-06-19 00:00:00 36.407149 19.321232 33.956058 19.780000 0.0 0.0 0.0 0.0 159.533333 0.960000 1002.856667 1007.256667 95.000000 NaN
2018-06-19 00:30:00 35.562515 19.320124 33.950508 19.093333 0.0 0.0 0.0 0.0 86.596667 0.163333 1002.300000 1006.700000 95.000000 NaN
2018-06-19 01:00:00 34.808016 19.319666 33.942532 18.733333 0.0 0.0 0.0 0.0 2.260000 0.080000 1001.973333 1006.373333 95.000000 NaN
2018-06-19 01:30:00 30.645216 19.320406 33.931263 18.760000 0.0 0.0 0.0 0.0 62.710000 0.193333 1001.776667 1006.120000 95.006667 NaN
2018-06-19 02:00:00 26.608980 19.326729 33.917961 18.633333 0.0 0.0 0.0 0.0 63.446667 0.510000 1001.743333 1006.103333 95.006667 NaN
[56]:
data.tail()
[56]:
tide_cm wat_temp_c sal_psu air_temp_c pcp_mm pcp3_mm pcp6_mm pcp12_mm wind_dir_deg wind_speed_mps air_p_hpa mslp_hpa rel_hum tetx_coppml
index
2019-09-07 22:00:00 -3.989912 20.990612 33.776449 23.700000 0.0 0.0 0.0 0.5 203.760000 6.506667 1003.446667 1007.746667 88.170000 NaN
2019-09-07 22:30:00 -2.807042 21.012014 33.702310 23.620000 0.0 0.0 0.0 0.0 205.353333 5.633333 1003.520000 1007.820000 88.256667 NaN
2019-09-07 23:00:00 -3.471326 20.831739 33.726177 23.666667 0.0 0.0 0.0 0.0 202.540000 4.480000 1003.610000 1007.910000 87.833333 NaN
2019-09-07 23:30:00 0.707771 21.006086 33.716274 23.633333 0.0 0.0 0.0 0.0 207.206667 4.946667 1003.633333 1007.933333 88.370000 NaN
2019-09-08 00:00:00 1.011731 20.896149 33.729773 23.600000 0.0 0.0 0.0 0.0 210.200000 4.400000 1003.700000 1008.000000 87.700000 NaN

We can see several missing values in the last column of the data. To quantify the exact number of missing values in each column, we can do as below.

[57]:
data.isna().sum()
[57]:
tide_cm              0
wat_temp_c           0
sal_psu              0
air_temp_c           0
pcp_mm               0
pcp3_mm              0
pcp6_mm              0
pcp12_mm             0
wind_dir_deg         0
wind_speed_mps       0
air_p_hpa            0
mslp_hpa             0
rel_hum              0
tetx_coppml       1228
dtype: int64
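To express this as a fraction of rows instead of a count, the mean of the boolean mask works as well (output omitted):

[ ]:
data.isna().mean()  # fraction of missing values per column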

So, if we exclude all the rows with any missing value in them, we end up with 218 rows.

[58]:
data.dropna().shape
[58]:
(218, 14)

This means we can make, at most, 218 examples from this data.

[59]:
data.dropna().head()
[59]:
tide_cm wat_temp_c sal_psu air_temp_c pcp_mm pcp3_mm pcp6_mm pcp12_mm wind_dir_deg wind_speed_mps air_p_hpa mslp_hpa rel_hum tetx_coppml
index
2018-06-20 09:00:00 -22.245026 19.457182 34.004292 24.280000 0.0 0.0 0.0 6.0 205.006667 1.653333 998.613333 1002.913333 75.100000 444866.9004
2018-06-20 12:00:00 10.906243 19.511044 34.044975 26.076667 0.0 0.0 0.0 0.0 201.593333 2.993333 998.830000 1003.130000 67.423333 193368.2195
2018-06-20 15:00:00 15.025008 19.582047 34.134964 25.043333 0.0 0.0 0.0 0.0 188.976667 2.010000 998.190000 1002.490000 67.136667 287920.3535
2018-06-20 18:00:00 -7.755828 19.579559 34.106552 22.826667 0.0 0.0 0.0 0.0 209.493333 1.480000 998.416667 1002.716667 77.413333 246005.6510
2018-06-20 21:00:00 -18.817711 19.570045 34.100220 20.910000 0.0 0.0 0.0 0.0 260.616667 1.080000 999.843333 1004.143333 79.093333 273757.5439
[60]:
data.dropna().tail()
[60]:
tide_cm wat_temp_c sal_psu air_temp_c pcp_mm pcp3_mm pcp6_mm pcp12_mm wind_dir_deg wind_speed_mps air_p_hpa mslp_hpa rel_hum tetx_coppml
index
2019-09-06 11:00:00 15.146028 19.247823 33.746046 27.666667 0.0 0.0 0.0 0.0 71.336667 1.666667 1006.450000 1010.750000 75.393333 1.320332e+07
2019-09-06 12:00:00 24.810148 20.357189 33.778996 27.383333 0.0 0.0 0.0 0.0 49.626667 1.386667 1006.106667 1010.406667 75.896667 2.437392e+06
2019-09-06 13:00:00 25.666843 19.362318 33.810041 27.533333 0.0 0.0 0.0 0.0 43.590000 2.076667 1005.316667 1009.616667 76.056667 2.927098e+06
2019-09-06 14:00:00 25.712396 19.317668 33.727930 28.213333 0.0 0.0 0.0 0.0 42.160000 2.603333 1004.246667 1008.546667 71.943333 4.699929e+06
2019-09-06 15:00:00 18.448916 20.592932 33.831501 27.896667 0.0 0.0 0.0 0.0 29.850000 2.743333 1003.846667 1008.146667 72.740000 3.506092e+06

Now, in order to prepare examples from our data, we make use of the DataSet class. The first argument to the DataSet class should be a pandas DataFrame or a numpy array.

[61]:
ds = DataSet(data=data)

Now we can ask the DataSet instance to give us input/output pairs for the training data. It is common practice to denote inputs with x and outputs with y.

[62]:
train_x, train_y = ds.training_data()

********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (121, 13)
target shape:  (121, 1)

The training_data method of the DataSet class returns a tuple of numpy arrays. The first array is the input and the second array is the output.

We see that, with the default settings, 121 examples were chosen for training. We can confirm this by checking the shapes of the train_x and train_y arrays.

[63]:
train_x.shape, train_y.shape
[63]:
((121, 13), (121, 1))

If we check the inputs of the first example, we can see that they correspond to the first row in data after the NaNs were removed.

[64]:
train_x[0]
[64]:
array([ -22.245026 ,   19.457182 ,   34.00429  ,   24.28     ,
          0.       ,    0.       ,    0.       ,    6.       ,
        205.00667  ,    1.6533333,  998.61334  , 1002.9133   ,
         75.1      ], dtype=float32)

The target/output for the first example corresponds to the same row as the inputs.

[65]:
train_y[0]
[65]:
array([444866.9004])
[66]:
train_x[1]
[66]:
array([  10.906243 ,   19.511044 ,   34.044975 ,   26.076666 ,
          0.       ,    0.       ,    0.       ,    0.       ,
        201.59334  ,    2.9933333,  998.83     , 1003.13     ,
         67.42333  ], dtype=float32)
[67]:
train_y[1]
[67]:
array([193368.2195])
[68]:
test_x, test_y = ds.test_data()

********** Removing Examples with nan in labels  **********

***** Test *****
input_x shape:  (66, 13)
target shape:  (66, 1)

The last values in test_x and test_y correspond to the last valid row, i.e. the last row in data without any NaN value.

[69]:
test_x[-1]
[69]:
array([  18.448915 ,   20.592932 ,   33.8315   ,   27.896667 ,
          0.       ,    0.       ,    0.       ,    0.       ,
         29.85     ,    2.7433333, 1003.8467   , 1008.14667  ,
         72.74     ], dtype=float32)
[70]:
test_y[-1]
[70]:
array([3506092.003])

By default, all the columns from the first to the second last are considered input features, and the last column is considered the output. We can, however, specify the input columns explicitly using the input_features keyword argument.

defining inputs

[71]:
ds = DataSet(
    data=data,
    input_features=['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm']
)
[72]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape

********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (121, 6)
target shape:  (121, 8)
[72]:
((121, 6), (121, 8))

Now we see that the input data consists of 6 features and the output data consists of 8 features. This is because, by default, all the columns other than those specified as inputs are considered output features.

[73]:
train_x[0]
[73]:
array([-22.245026,  19.457182,  34.00429 ,  24.28    ,   0.      ,
         0.      ], dtype=float32)
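To see the eight output values for the first example, we could likewise inspect the first row of train_y (output omitted here):

[ ]:
train_y[0]  # values of the eight columns not named in input_features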

We can, however, specify the columns for the output by making use of the output_features keyword argument.

defining outputs

[74]:
ds = DataSet(
    data=data,
    input_features=['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm'],
    output_features=["tetx_coppml"]
)
[75]:
train_x, train_y = ds.training_data()

********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (121, 6)
target shape:  (121, 1)

If we check the input and output values for the first example, they correspond to the first row in data with the NaNs removed.

[76]:
train_x[0]
[76]:
array([-22.245026,  19.457182,  34.00429 ,  24.28    ,   0.      ,
         0.      ], dtype=float32)
[77]:
train_y[0]
[77]:
array([444866.9004])
[78]:
train_x[1]
[78]:
array([10.906243, 19.511044, 34.044975, 26.076666,  0.      ,  0.      ],
      dtype=float32)
[79]:
train_y[1]
[79]:
array([193368.2195])
[80]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape

********** Removing Examples with nan in labels  **********

***** Test *****
input_x shape:  (66, 6)
target shape:  (66, 1)
[80]:
((66, 6), (66, 1))
[81]:
test_x[-1]
[81]:
array([18.448915, 20.592932, 33.8315  , 27.896667,  0.      ,  0.      ],
      dtype=float32)
[82]:
test_y[-1]
[82]:
array([3506092.003])

We can also have more than one output feature. We just need to specify the names of the columns to be used as outputs using the output_features argument.

[83]:
data = busan_beach(target=["blaTEM_coppml", "tetx_coppml"])
data.shape
[83]:
(1446, 15)
[84]:
data.columns
[84]:
Index(['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm',
       'pcp6_mm', 'pcp12_mm', 'wind_dir_deg', 'wind_speed_mps', 'air_p_hpa',
       'mslp_hpa', 'rel_hum', 'blaTEM_coppml', 'tetx_coppml'],
      dtype='object')
[85]:
data.dropna().head()
[85]:
tide_cm wat_temp_c sal_psu air_temp_c pcp_mm pcp3_mm pcp6_mm pcp12_mm wind_dir_deg wind_speed_mps air_p_hpa mslp_hpa rel_hum blaTEM_coppml tetx_coppml
index
2018-06-20 09:00:00 -22.245026 19.457182 34.004292 24.280000 0.0 0.0 0.0 6.0 205.006667 1.653333 998.613333 1002.913333 75.100000 9.665350e+05 444866.9004
2018-06-20 12:00:00 10.906243 19.511044 34.044975 26.076667 0.0 0.0 0.0 0.0 201.593333 2.993333 998.830000 1003.130000 67.423333 3.834816e+05 193368.2195
2018-06-20 15:00:00 15.025008 19.582047 34.134964 25.043333 0.0 0.0 0.0 0.0 188.976667 2.010000 998.190000 1002.490000 67.136667 1.673262e+06 287920.3535
2018-06-20 18:00:00 -7.755828 19.579559 34.106552 22.826667 0.0 0.0 0.0 0.0 209.493333 1.480000 998.416667 1002.716667 77.413333 5.645747e+06 246005.6510
2018-06-20 21:00:00 -18.817711 19.570045 34.100220 20.910000 0.0 0.0 0.0 0.0 260.616667 1.080000 999.843333 1004.143333 79.093333 1.630322e+06 273757.5439
[88]:
ds = DataSet(
    data=data,
    output_features=["blaTEM_coppml", "tetx_coppml"],
    verbosity=0  # setting the verbosity to 0 will not print any information.
)
[90]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape
[90]:
((121, 13), (121, 2))

We can see that all the columns, except the ones we defined as output features, are used as input features.

[91]:
train_x[0]
[91]:
array([ -22.245026 ,   19.457182 ,   34.00429  ,   24.28     ,
          0.       ,    0.       ,    0.       ,    6.       ,
        205.00667  ,    1.6533333,  998.61334  , 1002.9133   ,
         75.1      ], dtype=float32)
[92]:
train_y[0]
[92]:
array([966535.0042, 444866.9004])
[93]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape
[93]:
((66, 13), (66, 2))
[94]:
test_x[-1]
[94]:
array([  18.448915 ,   20.592932 ,   33.8315   ,   27.896667 ,
          0.       ,    0.       ,    0.       ,    0.       ,
         29.85     ,    2.7433333, 1003.8467   , 1008.14667  ,
         72.74     ], dtype=float32)
[95]:
test_y[-1]
[95]:
array([8473063.881, 3506092.003])

You might be curious why there were 121 examples in the training data and only 66 in the test set, and where the remaining examples went. This depends on how we split our data, which will be covered in detail in the next lesson.
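As a quick preview: of the 218 valid examples, 121 went to training and 66 to the test set; the remaining 31 are held out for validation, as the output of a later cell confirms. The sketch below shows how the split could be set explicitly. It assumes the train_fraction and val_fraction keyword arguments of DataSet (and their defaults of 0.7 and 0.2) behave as in the ai4water version used here; check your installed version before relying on them.

[ ]:
# a sketch, assuming `train_fraction` and `val_fraction` keyword arguments
# with defaults 0.7 and 0.2: 218 * 0.7 ~= 152 examples for
# training + validation, of which ~20% (31) are held out for validation,
# leaving 121 for training; the remaining 66 form the test set.
ds = DataSet(
    data=data,
    train_fraction=0.7,
    val_fraction=0.2,
    verbosity=0,
)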

saving prepared data in h5 file

Data preparation, i.e. converting raw data into examples, can be costly, so it is sometimes better to save the prepared data (the input-output pairs) so that we don't have to prepare it again and again. One way of doing this with the DataSet class is to save the input-output pairs into an h5 file. We only need to set the save argument to True.

[99]:
ds = DataSet(
    data=data,
    output_features=["blaTEM_coppml", "tetx_coppml"],
    save=True
)

********** Removing Examples with nan in labels  **********

***** Training *****
input_x shape:  (121, 13)
target shape:  (121, 2)

********** Removing Examples with nan in labels  **********

***** Validation *****
input_x shape:  (31, 13)
target shape:  (31, 2)

********** Removing Examples with nan in labels  **********

***** Test *****
input_x shape:  (66, 13)
target shape:  (66, 2)

Now we have a data.h5 file on our disk. We can look into this file if we have a proper editor/viewer for h5 files; one such viewer is hdf5viewer. Using such a viewer, we can see the input-output pairs for the training, validation and test sets.
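If a GUI viewer is not at hand, we can also inspect the file programmatically with h5py. The snippet below simply lists whatever groups and datasets the file contains, so it makes no assumption about the exact layout DataSet uses.

[ ]:
import h5py

# print the name of every group and dataset stored in data.h5
with h5py.File("data.h5", "r") as f:
    f.visit(print)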

Loading from h5 file

We can use the pre-existing h5 file, i.e. the data.h5 file, to construct the DataSet class, and we can get the training and test data from it as well. We have to use the from_h5 constructor/class method for this purpose. The input to the from_h5 method is the path to the .h5 file.

[100]:
ds = DataSet.from_h5("data.h5")
[102]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape
[102]:
((121, 13), (121, 2))
[103]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape
[103]:
((66, 13), (66, 2))

Multiple Inputs

The examples presented so far had only one kind of input data, i.e. one input source. In machine learning, and especially in deep learning, we can however very often have examples that combine inputs from multiple sources. For example, instead of the input data being a single numpy array, it can be a list of numpy arrays, where each array represents data from a different source, as the sketch below illustrates.
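Here is a minimal sketch of such a list-of-arrays input; all names and shapes are made up for illustration, and a model consuming this input would need one input branch per array.

[ ]:
import numpy as np

# hypothetical two-source input for 100 examples
x_scalar = np.random.random((100, 5))        # 5 scalar features per example
x_raster = np.random.random((100, 32, 32))   # a 32x32 raster per example
y = np.random.random((100, 1))               # one output per example

inputs = [x_scalar, x_raster]  # a list of arrays, one entry per source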

Higher dimensional features