Data Preparation for Regression task
This notebook describes how to prepare data for a single machine learning regression problem.
[ ]:
try:
    import ai4water
except ModuleNotFoundError:
    !pip install ai4water
[104]:
import site
site.addsitedir("D:\\mytools\\AI4Water")
from ai4water.datasets import busan_beach
from ai4water.preprocessing import DataSet
from ai4water.utils.utils import get_version_info
[51]:
for lib, ver in get_version_info().items():
    print(lib, ver)
python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:16) [MSC v.1916 64 bit (AMD64)]
os nt
ai4water 1.06
lightgbm 3.3.1
tcn 3.4.0
catboost 0.26
xgboost 1.5.0
easy_mpl 0.21.3
SeqMetrics 1.3.3
tensorflow 2.7.0
keras.api._v2.keras 2.7.0
numpy 1.21.0
pandas 1.3.4
matplotlib 3.4.3
h5py 3.5.0
sklearn 1.0.1
shapefile 2.3.0
fiona 1.8.22
xarray 0.20.1
netCDF4 1.5.7
optuna 2.10.1
skopt 0.9.0
hyperopt 0.2.7
plotly 5.3.1
lime NotDefined
seaborn 0.11.2
When preparing data for a supervised machine learning (prediction) problem, we need to think in terms of input features, output features, examples and the dimensions of the input and output features. This is because a machine learning model/algorithm is a box which takes something (input features) as input and returns something else (output features) as output. We should, however, keep in mind that the inputs and outputs should be meaningfully related to each other. For example, we should not use chocolate consumption as an input to predict the number of Nobel laureates, even though the two are apparently correlated with each other.
An example consists of an input-output pair, i.e. one input and the corresponding output. The input can consist of one or more input features. Usually, the more examples we have, the better, and the more diverse the examples are, the better. All the examples must have exactly the same inputs and outputs. For example, if we want to predict the next day's temperature, we cannot have some examples with humidity + wind speed as input and other examples with humidity + sunlight as input. If some input features are not available for some examples, we must either fill them using some rule or discard those examples. Note that the words data-points and samples are also used for what we are calling examples. Similarly, the words target, label and true output are also used for the output.
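The two options for examples with missing input features (fill them or discard them) can be sketched with pandas. The table below is a made-up toy example, and filling with the column mean is only one possible rule, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values: two input features and one output.
df = pd.DataFrame({
    "humidity":    [60.0, np.nan, 72.0, 65.0],
    "wind_speed":  [3.0, 4.0, np.nan, 2.0],
    "temperature": [21.0, 23.0, 19.0, 22.0],  # output
})
inputs = ["humidity", "wind_speed"]

# Option 1: discard every example that lacks any input feature.
discarded = df.dropna(subset=inputs)

# Option 2: fill the gaps using a rule, here the mean of each column.
filled = df.copy()
filled[inputs] = filled[inputs].fillna(filled[inputs].mean())

print(len(discarded))             # examples left after discarding
print(filled.isna().sum().sum())  # missing values left after filling
```

Discarding keeps only complete examples at the cost of a smaller data set, while filling keeps every example but introduces values that were never measured.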
When it comes to the dimensions of input and output features, we need to consider whether an input feature in an example is a scalar or a vector, and if it is a vector, whether it is one-dimensional or multi-dimensional. Consider the problem of predicting the next day's temperature using climate variables from the previous day. If we use the value of temperature at a single point in time at a single place, then it is a scalar value. If we want to use the temperature from the previous five days as an input feature, then this feature is a one-dimensional input feature. If, however, we consider the raster map of temperature of a city as an input feature, then this feature is a multi-dimensional input feature. If all the input and output features are scalar values, we can fit our data into a table/excel sheet/csv file. Such data is called tabular data. In tabular data, each column corresponds to a feature and each row corresponds to an example. We can, however, also combine scalar and multi-dimensional features as input in a single model. Consider, for example, predicting the next day's temperature in a city using the previous day's temperature map (image data) and the latitude and longitude of the city (scalar data) as input features.
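As a rough illustration of these three cases, the sketch below builds placeholder numpy arrays for a scalar, a one-dimensional and a multi-dimensional feature. The number of examples and the 32 x 32 raster size are arbitrary assumptions made only for this sketch.

```python
import numpy as np

n_examples = 4  # arbitrary number of examples

# Scalar feature: a single temperature value per example.
scalar_feature = np.zeros((n_examples,))

# One-dimensional feature: temperature of the previous five days.
vector_feature = np.zeros((n_examples, 5))

# Multi-dimensional feature: one raster map per example
# (the 32 x 32 grid size is an assumption for illustration).
raster_feature = np.zeros((n_examples, 32, 32))

for name, arr in [("scalar", scalar_feature),
                  ("vector", vector_feature),
                  ("raster", raster_feature)]:
    # arr.shape[1:] is the shape of the feature within one example
    print(name, "per-example shape:", arr.shape[1:])
```

In every case the first axis counts the examples; only the remaining axes describe the feature itself.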
Now consider the following tabular data in the form of pandas DataFrame.
[52]:
data = busan_beach()
This data consists of 1446 rows and 14 columns.
[53]:
data.shape
[53]:
(1446, 14)
If we check the names of columns in data, we find out some are hydrological features while others are climate features.
[54]:
data.columns
[54]:
Index(['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm',
'pcp6_mm', 'pcp12_mm', 'wind_dir_deg', 'wind_speed_mps', 'air_p_hpa',
'mslp_hpa', 'rel_hum', 'tetx_coppml'],
dtype='object')
[55]:
data.head()
[55]:
index | tide_cm | wat_temp_c | sal_psu | air_temp_c | pcp_mm | pcp3_mm | pcp6_mm | pcp12_mm | wind_dir_deg | wind_speed_mps | air_p_hpa | mslp_hpa | rel_hum | tetx_coppml |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-06-19 00:00:00 | 36.407149 | 19.321232 | 33.956058 | 19.780000 | 0.0 | 0.0 | 0.0 | 0.0 | 159.533333 | 0.960000 | 1002.856667 | 1007.256667 | 95.000000 | NaN |
2018-06-19 00:30:00 | 35.562515 | 19.320124 | 33.950508 | 19.093333 | 0.0 | 0.0 | 0.0 | 0.0 | 86.596667 | 0.163333 | 1002.300000 | 1006.700000 | 95.000000 | NaN |
2018-06-19 01:00:00 | 34.808016 | 19.319666 | 33.942532 | 18.733333 | 0.0 | 0.0 | 0.0 | 0.0 | 2.260000 | 0.080000 | 1001.973333 | 1006.373333 | 95.000000 | NaN |
2018-06-19 01:30:00 | 30.645216 | 19.320406 | 33.931263 | 18.760000 | 0.0 | 0.0 | 0.0 | 0.0 | 62.710000 | 0.193333 | 1001.776667 | 1006.120000 | 95.006667 | NaN |
2018-06-19 02:00:00 | 26.608980 | 19.326729 | 33.917961 | 18.633333 | 0.0 | 0.0 | 0.0 | 0.0 | 63.446667 | 0.510000 | 1001.743333 | 1006.103333 | 95.006667 | NaN |
[56]:
data.tail()
[56]:
index | tide_cm | wat_temp_c | sal_psu | air_temp_c | pcp_mm | pcp3_mm | pcp6_mm | pcp12_mm | wind_dir_deg | wind_speed_mps | air_p_hpa | mslp_hpa | rel_hum | tetx_coppml |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-09-07 22:00:00 | -3.989912 | 20.990612 | 33.776449 | 23.700000 | 0.0 | 0.0 | 0.0 | 0.5 | 203.760000 | 6.506667 | 1003.446667 | 1007.746667 | 88.170000 | NaN |
2019-09-07 22:30:00 | -2.807042 | 21.012014 | 33.702310 | 23.620000 | 0.0 | 0.0 | 0.0 | 0.0 | 205.353333 | 5.633333 | 1003.520000 | 1007.820000 | 88.256667 | NaN |
2019-09-07 23:00:00 | -3.471326 | 20.831739 | 33.726177 | 23.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 202.540000 | 4.480000 | 1003.610000 | 1007.910000 | 87.833333 | NaN |
2019-09-07 23:30:00 | 0.707771 | 21.006086 | 33.716274 | 23.633333 | 0.0 | 0.0 | 0.0 | 0.0 | 207.206667 | 4.946667 | 1003.633333 | 1007.933333 | 88.370000 | NaN |
2019-09-08 00:00:00 | 1.011731 | 20.896149 | 33.729773 | 23.600000 | 0.0 | 0.0 | 0.0 | 0.0 | 210.200000 | 4.400000 | 1003.700000 | 1008.000000 | 87.700000 | NaN |
We see several missing values in the last column of the data. To count the exact number of missing values in each column, we can do the following:
[57]:
data.isna().sum()
[57]:
tide_cm 0
wat_temp_c 0
sal_psu 0
air_temp_c 0
pcp_mm 0
pcp3_mm 0
pcp6_mm 0
pcp12_mm 0
wind_dir_deg 0
wind_speed_mps 0
air_p_hpa 0
mslp_hpa 0
rel_hum 0
tetx_coppml 1228
dtype: int64
So, if we exclude all the rows containing any missing value, we end up with 218 rows.
[58]:
data.dropna().shape
[58]:
(218, 14)
This means we can make at most 218 examples from this data.
[59]:
data.dropna().head()
[59]:
index | tide_cm | wat_temp_c | sal_psu | air_temp_c | pcp_mm | pcp3_mm | pcp6_mm | pcp12_mm | wind_dir_deg | wind_speed_mps | air_p_hpa | mslp_hpa | rel_hum | tetx_coppml |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-06-20 09:00:00 | -22.245026 | 19.457182 | 34.004292 | 24.280000 | 0.0 | 0.0 | 0.0 | 6.0 | 205.006667 | 1.653333 | 998.613333 | 1002.913333 | 75.100000 | 444866.9004 |
2018-06-20 12:00:00 | 10.906243 | 19.511044 | 34.044975 | 26.076667 | 0.0 | 0.0 | 0.0 | 0.0 | 201.593333 | 2.993333 | 998.830000 | 1003.130000 | 67.423333 | 193368.2195 |
2018-06-20 15:00:00 | 15.025008 | 19.582047 | 34.134964 | 25.043333 | 0.0 | 0.0 | 0.0 | 0.0 | 188.976667 | 2.010000 | 998.190000 | 1002.490000 | 67.136667 | 287920.3535 |
2018-06-20 18:00:00 | -7.755828 | 19.579559 | 34.106552 | 22.826667 | 0.0 | 0.0 | 0.0 | 0.0 | 209.493333 | 1.480000 | 998.416667 | 1002.716667 | 77.413333 | 246005.6510 |
2018-06-20 21:00:00 | -18.817711 | 19.570045 | 34.100220 | 20.910000 | 0.0 | 0.0 | 0.0 | 0.0 | 260.616667 | 1.080000 | 999.843333 | 1004.143333 | 79.093333 | 273757.5439 |
[60]:
data.dropna().tail()
[60]:
index | tide_cm | wat_temp_c | sal_psu | air_temp_c | pcp_mm | pcp3_mm | pcp6_mm | pcp12_mm | wind_dir_deg | wind_speed_mps | air_p_hpa | mslp_hpa | rel_hum | tetx_coppml |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-09-06 11:00:00 | 15.146028 | 19.247823 | 33.746046 | 27.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 71.336667 | 1.666667 | 1006.450000 | 1010.750000 | 75.393333 | 1.320332e+07 |
2019-09-06 12:00:00 | 24.810148 | 20.357189 | 33.778996 | 27.383333 | 0.0 | 0.0 | 0.0 | 0.0 | 49.626667 | 1.386667 | 1006.106667 | 1010.406667 | 75.896667 | 2.437392e+06 |
2019-09-06 13:00:00 | 25.666843 | 19.362318 | 33.810041 | 27.533333 | 0.0 | 0.0 | 0.0 | 0.0 | 43.590000 | 2.076667 | 1005.316667 | 1009.616667 | 76.056667 | 2.927098e+06 |
2019-09-06 14:00:00 | 25.712396 | 19.317668 | 33.727930 | 28.213333 | 0.0 | 0.0 | 0.0 | 0.0 | 42.160000 | 2.603333 | 1004.246667 | 1008.546667 | 71.943333 | 4.699929e+06 |
2019-09-06 15:00:00 | 18.448916 | 20.592932 | 33.831501 | 27.896667 | 0.0 | 0.0 | 0.0 | 0.0 | 29.850000 | 2.743333 | 1003.846667 | 1008.146667 | 72.740000 | 3.506092e+06 |
Now, in order to prepare examples from our data, we make use of the DataSet class. The first argument to the DataSet class should be a pandas DataFrame or a numpy array.
[61]:
ds = DataSet(
    data=data
)
Now we can ask the instance of the DataSet class to give us input/output pairs for the training data. It is common practice to denote inputs with x and outputs with y.
[62]:
train_x, train_y = ds.training_data()
********** Removing Examples with nan in labels **********
***** Training *****
input_x shape: (121, 13)
target shape: (121, 1)
The training_data method of the DataSet class returns a tuple of numpy arrays. The first array is the input and the second array is the output.
We see that, by default, 121 examples were chosen for training. We can confirm this by checking the shapes of the train_x and train_y arrays.
[63]:
train_x.shape, train_y.shape
[63]:
((121, 13), (121, 1))
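The message "Removing Examples with nan in labels" in the output above suggests that rows whose output is missing are dropped before the input/output pairs are built. The following is a minimal numpy sketch of that idea on made-up data; it is an illustration only and not the code that ai4water runs internally.

```python
import numpy as np

# Toy data: 6 rows, 2 input columns and 1 output column with gaps.
x = np.arange(12, dtype=np.float32).reshape(6, 2)
y = np.array([[np.nan], [1.5], [np.nan], [2.5], [3.5], [np.nan]])

# Boolean mask that is True for rows whose label is not NaN.
valid = ~np.isnan(y).any(axis=1)

# Apply the same mask to inputs and outputs so the pairs stay aligned.
x_valid, y_valid = x[valid], y[valid]

print(x_valid.shape, y_valid.shape)
```

Using one mask for both arrays is what keeps each input row attached to its own label.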
If we check the inputs from the first example, we can see that these correspond to the first row in data when nans were removed from it.
[64]:
train_x[0]
[64]:
array([ -22.245026 , 19.457182 , 34.00429 , 24.28 ,
0. , 0. , 0. , 6. ,
205.00667 , 1.6533333, 998.61334 , 1002.9133 ,
75.1 ], dtype=float32)
The target/output for the first example corresponds to the same row as the inputs.
[65]:
train_y[0]
[65]:
array([444866.9004])
[66]:
train_x[1]
[66]:
array([ 10.906243 , 19.511044 , 34.044975 , 26.076666 ,
0. , 0. , 0. , 0. ,
201.59334 , 2.9933333, 998.83 , 1003.13 ,
67.42333 ], dtype=float32)
[67]:
train_y[1]
[67]:
array([193368.2195])
[68]:
test_x, test_y = ds.test_data()
********** Removing Examples with nan in labels **********
***** Test *****
input_x shape: (66, 13)
target shape: (66, 1)
The last values in test_x and test_y correspond to the last valid row, i.e. the last row in data without any NaN value.
[69]:
test_x[-1]
[69]:
array([ 18.448915 , 20.592932 , 33.8315 , 27.896667 ,
0. , 0. , 0. , 0. ,
29.85 , 2.7433333, 1003.8467 , 1008.14667 ,
72.74 ], dtype=float32)
[70]:
test_y[-1]
[70]:
array([3506092.003])
By default, all the columns from the first to the second last are considered as input columns or input features, and the last column is considered as the output. We can, however, specify the input columns using the input_features keyword argument.
defining inputs
[71]:
ds = DataSet(
    data=data,
    input_features=['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm']
)
[72]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape
********** Removing Examples with nan in labels **********
***** Training *****
input_x shape: (121, 6)
target shape: (121, 8)
[72]:
((121, 6), (121, 8))
Now we see that the input data consists of 6 features and output data consists of 8 features. This is because by default, all the columns other than those specified for input are considered as output features.
[73]:
train_x[0]
[73]:
array([-22.245026, 19.457182, 34.00429 , 24.28 , 0. ,
0. ], dtype=float32)
We can, however, specify the columns for the output by making use of the output_features keyword argument.
defining outputs
[74]:
ds = DataSet(
    data=data,
    input_features=['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm'],
    output_features=["tetx_coppml"]
)
[75]:
train_x, train_y = ds.training_data()
********** Removing Examples with nan in labels **********
***** Training *****
input_x shape: (121, 6)
target shape: (121, 1)
If we check the values of input and output values for the first example, they correspond to the first row in data with nans removed.
[76]:
train_x[0]
[76]:
array([-22.245026, 19.457182, 34.00429 , 24.28 , 0. ,
0. ], dtype=float32)
[77]:
train_y[0]
[77]:
array([444866.9004])
[78]:
train_x[1]
[78]:
array([10.906243, 19.511044, 34.044975, 26.076666, 0. , 0. ],
dtype=float32)
[79]:
train_y[1]
[79]:
array([193368.2195])
[80]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape
********** Removing Examples with nan in labels **********
***** Test *****
input_x shape: (66, 6)
target shape: (66, 1)
[80]:
((66, 6), (66, 1))
[81]:
test_x[-1]
[81]:
array([18.448915, 20.592932, 33.8315 , 27.896667, 0. , 0. ],
dtype=float32)
[82]:
test_y[-1]
[82]:
array([3506092.003])
We can also have more than one output feature. We just need to specify the names of the columns to be used as outputs using the output_features argument.
[83]:
data = busan_beach(target=["blaTEM_coppml", "tetx_coppml"])
data.shape
[83]:
(1446, 15)
[84]:
data.columns
[84]:
Index(['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm',
'pcp6_mm', 'pcp12_mm', 'wind_dir_deg', 'wind_speed_mps', 'air_p_hpa',
'mslp_hpa', 'rel_hum', 'blaTEM_coppml', 'tetx_coppml'],
dtype='object')
[85]:
data.dropna().head()
[85]:
index | tide_cm | wat_temp_c | sal_psu | air_temp_c | pcp_mm | pcp3_mm | pcp6_mm | pcp12_mm | wind_dir_deg | wind_speed_mps | air_p_hpa | mslp_hpa | rel_hum | blaTEM_coppml | tetx_coppml |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-06-20 09:00:00 | -22.245026 | 19.457182 | 34.004292 | 24.280000 | 0.0 | 0.0 | 0.0 | 6.0 | 205.006667 | 1.653333 | 998.613333 | 1002.913333 | 75.100000 | 9.665350e+05 | 444866.9004 |
2018-06-20 12:00:00 | 10.906243 | 19.511044 | 34.044975 | 26.076667 | 0.0 | 0.0 | 0.0 | 0.0 | 201.593333 | 2.993333 | 998.830000 | 1003.130000 | 67.423333 | 3.834816e+05 | 193368.2195 |
2018-06-20 15:00:00 | 15.025008 | 19.582047 | 34.134964 | 25.043333 | 0.0 | 0.0 | 0.0 | 0.0 | 188.976667 | 2.010000 | 998.190000 | 1002.490000 | 67.136667 | 1.673262e+06 | 287920.3535 |
2018-06-20 18:00:00 | -7.755828 | 19.579559 | 34.106552 | 22.826667 | 0.0 | 0.0 | 0.0 | 0.0 | 209.493333 | 1.480000 | 998.416667 | 1002.716667 | 77.413333 | 5.645747e+06 | 246005.6510 |
2018-06-20 21:00:00 | -18.817711 | 19.570045 | 34.100220 | 20.910000 | 0.0 | 0.0 | 0.0 | 0.0 | 260.616667 | 1.080000 | 999.843333 | 1004.143333 | 79.093333 | 1.630322e+06 | 273757.5439 |
[88]:
ds = DataSet(
    data=data,
    output_features=["blaTEM_coppml", "tetx_coppml"],
    verbosity=0  # setting the verbosity to 0 will not print any information.
)
[90]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape
[90]:
((121, 13), (121, 2))
We can see that all the columns starting from first till the last except the ones we defined as output features, are used as input features.
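The defaults observed so far in this notebook can be summarised in a small helper. The function resolve_features below is hypothetical and written only for illustration; it is not part of ai4water and only mirrors the behaviour seen in the outputs above.

```python
def resolve_features(columns, input_features=None, output_features=None):
    """Hypothetical helper mirroring the defaults described in this notebook."""
    columns = list(columns)
    if input_features is None and output_features is None:
        # Nothing specified: the last column is the output, the rest are inputs.
        return columns[:-1], columns[-1:]
    if output_features is None:
        # Only inputs specified: every other column becomes an output.
        output_features = [c for c in columns if c not in input_features]
    if input_features is None:
        # Only outputs specified: every other column becomes an input.
        input_features = [c for c in columns if c not in output_features]
    return list(input_features), list(output_features)


cols = ["a", "b", "c", "d"]  # made-up column names
print(resolve_features(cols))
print(resolve_features(cols, input_features=["a", "b"]))
print(resolve_features(cols, output_features=["c", "d"]))
```

Applied to the 14 columns of the first data set with 6 input features given, the same rule yields the 8 output features seen earlier.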
[91]:
train_x[0]
[91]:
array([ -22.245026 , 19.457182 , 34.00429 , 24.28 ,
0. , 0. , 0. , 6. ,
205.00667 , 1.6533333, 998.61334 , 1002.9133 ,
75.1 ], dtype=float32)
[92]:
train_y[0]
[92]:
array([966535.0042, 444866.9004])
[93]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape
[93]:
((66, 13), (66, 2))
[94]:
test_x[-1]
[94]:
array([ 18.448915 , 20.592932 , 33.8315 , 27.896667 ,
0. , 0. , 0. , 0. ,
29.85 , 2.7433333, 1003.8467 , 1008.14667 ,
72.74 ], dtype=float32)
[95]:
test_y[-1]
[95]:
array([8473063.881, 3506092.003])
You might wonder why there are 121 examples in the training data and 66 examples in the test set, and where the remaining examples went. This depends on how we split our data, which will be covered in detail in the next lesson.
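As a quick arithmetic check, the counts are consistent with a two-stage split of the 218 valid rows: first into training+validation and test, then the former into training and validation. The fractions 0.7 and 0.2 used below are assumptions chosen because they reproduce the counts; this notebook does not state them, and the function split_sizes is a hypothetical helper.

```python
def split_sizes(n_examples, train_fraction=0.7, val_fraction=0.2):
    """Hypothetical two-stage split; the default fractions are assumptions."""
    # Stage 1: separate the test set from the rest.
    n_train_val = int(train_fraction * n_examples)
    n_test = n_examples - n_train_val
    # Stage 2: separate the validation set from the training set.
    n_train = int((1 - val_fraction) * n_train_val)
    n_val = n_train_val - n_train
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(218)
print(n_train, n_val, n_test)
```

Under these assumptions the examples that are in neither the training nor the test set form the validation set, which matches the 31 validation examples printed later in this notebook when the data is saved.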
saving prepared data in h5 file
Data preparation, i.e. converting data into examples, can be costly, so it is sometimes better to save the prepared data (input-output pairs) so that we don't have to prepare it again and again. One way of doing this with the DataSet class is to save the input-output pairs into an h5 file. You only need to set the save argument to True.
[99]:
ds = DataSet(
    data=data,
    output_features=["blaTEM_coppml", "tetx_coppml"],
    save=True
)
********** Removing Examples with nan in labels **********
***** Training *****
input_x shape: (121, 13)
target shape: (121, 2)
********** Removing Examples with nan in labels **********
***** Validation *****
input_x shape: (31, 13)
target shape: (31, 2)
********** Removing Examples with nan in labels **********
***** Test *****
input_x shape: (66, 13)
target shape: (66, 2)
Now we have a data.h5 file on our disk. We can look into this file if we have a suitable editor/viewer for h5 files. One such viewer is hdf5viewer. Using this viewer, we can see the input-output pairs for the training, validation and test sets.
Loading from h5 file
We can use the pre-existing h5 file, i.e. the data.h5 file, to construct an instance of the DataSet class, and we can get training and test data from it as well. We have to use the from_h5 constructor/class method for this purpose. The input to the from_h5 method is the path of the .h5 file.
[100]:
ds = DataSet.from_h5("data.h5")
[102]:
train_x, train_y = ds.training_data()
train_x.shape, train_y.shape
[102]:
((121, 13), (121, 2))
[103]:
test_x, test_y = ds.test_data()
test_x.shape, test_y.shape
[103]:
((66, 13), (66, 2))
Multiple Inputs
The examples presented so far had only one kind of input data or one input source. In machine learning, and especially in deep learning, we can, however, very often have examples combining inputs from multiple sources. For example, instead of the input data consisting of a single numpy array, our input data can consist of a list of numpy arrays. These numpy arrays can then represent data from different sources.
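A minimal sketch of such a multi-source input is shown below, using random placeholder values. The array names, the number of examples and the 32 x 32 image size are assumptions made for illustration; the only real requirement shown is that every source provides the same number of examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 8  # arbitrary number of examples

# Source 1: tabular data with 13 scalar features per example.
tabular_x = rng.random((n_examples, 13), dtype=np.float32)

# Source 2: one image (e.g. a raster map) per example.
image_x = rng.random((n_examples, 32, 32), dtype=np.float32)

# One output value per example.
y = rng.random((n_examples, 1))

# The input is now a list of arrays instead of a single array.
inputs = [tabular_x, image_x]

# Every source must have one entry per example.
assert all(len(arr) == len(y) for arr in inputs)

# A single example is made of one slice from each source.
first_example = [arr[0] for arr in inputs]
print([part.shape for part in first_example])
```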