DataSet
- class ai4water.preprocessing.DataSet(data, input_features: Optional[Union[str, list]] = None, output_features: Optional[Union[str, list]] = None, dataset_args: Optional[dict] = None, ts_args: Optional[dict] = None, split_random: bool = False, train_fraction: float = 0.7, val_fraction: float = 0.2, indices: Optional[dict] = None, intervals=None, shuffle: bool = True, allow_nan_labels: int = 0, nan_filler: Optional[dict] = None, batch_size: int = 32, drop_remainder: bool = False, teacher_forcing: bool = False, allow_input_nans: bool = False, seed: int = 313, verbosity: int = 1, mode: Optional[str] = None, category: Optional[str] = None, save: bool = False)[source]
Bases:
_DataSet
The purpose of DataSet is to convert raw, unprepared data into prepared data. Prepared data consists of (x, y) pairs, where x is the inputs and y is the outputs. A DataSet contains more than one example, and the inputs and outputs always contain the same number of examples. An example is one input-output pair that can be given to a supervised machine learning algorithm for training. For tabular data, the number of examples does not necessarily match the number of rows. The number of examples depends on several factors, such as the presence of intervals, how nans are handled, and the arguments related to time series data preparation, which are listed in detail in the prepare_data function.
The DataSet class can accept raw, unprepared data in a variety of formats such as .csv, .xlsx, .parquet, .mat, .n5 etc. For details see this. The DataSet class can save the prepared data into an hdf5 file, which can subsequently be used to load the data and save time.
- - training_data: returns training data
- - validation_data: returns validation data
- - test_data: returns test data
- - from_h5:
- - to_disk
- - KFold_splits: creates splits using `KFold` of sklearn
- - LeaveOneOut_splits: creates splits using `LeaveOneOut` of sklearn
- - TimeSeriesSplit_splits: creates splits using `TimeSeriesSplit` of sklearn
- - total_exs
- __init__(data, input_features: Optional[Union[str, list]] = None, output_features: Optional[Union[str, list]] = None, dataset_args: Optional[dict] = None, ts_args: Optional[dict] = None, split_random: bool = False, train_fraction: float = 0.7, val_fraction: float = 0.2, indices: Optional[dict] = None, intervals=None, shuffle: bool = True, allow_nan_labels: int = 0, nan_filler: Optional[dict] = None, batch_size: int = 32, drop_remainder: bool = False, teacher_forcing: bool = False, allow_input_nans: bool = False, seed: int = 313, verbosity: int = 1, mode: Optional[str] = None, category: Optional[str] = None, save: bool = False)[source]
Initializes the DataSet class
- Parameters:
data –
source from which to make the data. It can be one of the following:
- pandas dataframe: each column is a feature and each row is an example
- numpy array
- xarray dataset
- path like: if the path points to a file, the file can be a csv/xlsx/nc/npz/mat/parquet/feather file. A .nc file is read using xarray. If the path refers to a directory, each file in the directory is assumed to represent one example.
- ai4water dataset: name of any dataset from ai4water.datasets
- name of an .h5 file
input_features (Union[list, dict, str, None]) – features to use as input. If data is pandas dataframe then this is list of column names from data to be used as input.
output_features (Union[list, dict, str, None]) – features to use as output. When data is a dataframe, this is a list of column names from data to be used as output. If data is dict, then it must be consistent with data. Default is None, which means the last column of data will be used as output. In case of multi-class classification, the output column is not supposed to be one-hot-encoded but rather in the form of [0,1,2,0,1,2,1,2,0] for 3 classes. One-hot-encoding is done inside the model.
dataset_args (dict) – additional arguments for AI4Water’s datasets (ai4water.datasets)
ts_args (dict, optional) –
This argument should only be used if the data is time series data. It must be a dictionary, which is then passed to ai4water.utils.prepare_data() for data preparation. Possible keys in the dictionary are:
- lookback
- forecast_len
- forecast_step
- input_steps
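The effect of lookback on the number of examples can be illustrated with a minimal sketch. The helper `make_windows` below is hypothetical and is not the implementation of ai4water.utils.prepare_data(); it assumes that each example uses `lookback` consecutive rows of inputs and the output of the last row in that window.

```python
from typing import List, Sequence, Tuple


def make_windows(rows: Sequence[Sequence[float]], lookback: int) -> Tuple[List[list], List[float]]:
    """Hypothetical helper: build (x, y) examples from tabular rows.

    Each row is [input_1, ..., input_n, output]. One example consists of
    `lookback` consecutive rows of inputs and the output of the last row.
    """
    xs, ys = [], []
    for end in range(lookback, len(rows) + 1):
        window = rows[end - lookback:end]
        xs.append([list(r[:-1]) for r in window])  # inputs of every row in the window
        ys.append(window[-1][-1])                  # output of the last row only
    return xs, ys


# 50 rows with one input column and one output column
rows = [[float(i), float(i * 10)] for i in range(50)]
x, y = make_windows(rows, lookback=5)
print(len(rows), len(x))  # 50 rows give 46 examples under this assumption
```

Under this assumption the first `lookback - 1` rows cannot end a complete window, which is one reason the number of examples differs from the number of rows.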
split_random (bool, optional) – whether to split the data into training and test randomly or not.
train_fraction (float) – Fraction of the complete data to be used for training purpose. Must be greater than 0.0.
val_fraction (float) – The fraction of the training data to be used for validation. Set to 0.0 if no validation data is to be used.
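Because val_fraction is a fraction of the training data rather than of the complete data, the resulting sizes are easy to misjudge. The following sketch only illustrates the arithmetic; `split_sizes` is a hypothetical helper, and the exact rounding used inside DataSet is an assumption.

```python
def split_sizes(n_examples: int, train_fraction: float = 0.7, val_fraction: float = 0.2):
    """Hypothetical helper: sizes of the training, validation and test sets."""
    # train_fraction is applied to the complete data
    n_train_val = int(round(n_examples * train_fraction))
    n_test = n_examples - n_train_val
    # val_fraction is applied to the training portion, not to the complete data
    n_val = int(round(n_train_val * val_fraction))
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(100)
print(n_train, n_val, n_test)  # 56 14 30
```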
indices (dict, optional) –
A dictionary with two possible keys, ‘training’ and ‘validation’. It determines the indices to be used to select training, validation and test data. If indices are given for training, then train_fraction must not be given. If indices are given for validation, then indices for training must also be given and val_fraction must not be given. Therefore, the possible keys in the indices dictionary are the following:
- training
- training and validation
intervals – tuple of tuples where each tuple consists of two integers marking the start and end of an interval. An interval here means indices from the data. Only rows within those indices will be used when preparing data/batches for NN. This is handy when the input data contains chunks of missing values or when we don’t want to consider several rows in the input data during data preparation. For further usage see examples/using_intervals
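As a rough illustration of how intervals restrict the rows that are used, consider the sketch below. The helper `rows_in_intervals` is hypothetical, and treating the end of each interval as exclusive is an assumption made here for simplicity; the library may treat the end index differently.

```python
def rows_in_intervals(n_rows: int, intervals) -> list:
    """Hypothetical helper: row indices that fall inside any of the intervals.

    Assumption: each interval is (start, end) with `end` exclusive.
    """
    selected = []
    for start, end in intervals:
        selected.extend(range(start, min(end, n_rows)))
    return selected


# skip rows 30..49 and 80..99, e.g. because they contain chunks of missing values
idx = rows_in_intervals(100, ((0, 30), (50, 80)))
print(len(idx))  # 60
```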
shuffle (bool) – whether to shuffle the samples or not
allow_nan_labels (int) – whether to allow examples with nan labels or not. If it is > 0 and the target values contain nans, those examples will not be ignored and will be used as they are. In such a case a customized training and evaluation step is performed where the loss is not calculated for predictions corresponding to nan observations. This option can therefore be useful when we are predicting more than one target and some of the examples have some of their labels missing. In such a scenario, if we set this option to > 0, we don’t need to ignore those samples during data preparation. This option should be set to > 0 only when using tensorflow for deep learning models. If == 1, an example with label [nan, 1] will not be removed, while an example with label [nan, nan] will be ignored/removed. If == 2, both of these examples will be kept. This means that for multi-outputs we can end up with examples whose labels are all nans. If there is just one output, this must be set to 2 in order to use samples with nan labels.
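The three settings of allow_nan_labels described above can be summarized with a small sketch. The function `keep_example` is hypothetical and only mirrors the rules stated in the text; it is not the library's code.

```python
import math


def keep_example(label, allow_nan_labels: int = 0) -> bool:
    """Hypothetical helper: decide whether an example is kept, given its label."""
    n_nan = sum(math.isnan(v) for v in label)
    if allow_nan_labels == 0:
        return n_nan == 0          # any nan in the label removes the example
    if allow_nan_labels == 1:
        return n_nan < len(label)  # removed only if all labels are nan
    return True                    # allow_nan_labels == 2: always kept


nan = float("nan")
print(keep_example([nan, 1.0], 0))  # False
print(keep_example([nan, 1.0], 1))  # True
print(keep_example([nan, nan], 1))  # False
print(keep_example([nan, nan], 2))  # True
print(keep_example([nan], 1))       # False: a single output needs allow_nan_labels=2
```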
nan_filler (dict) –
This argument determines the imputation technique used to fill the nans in the data. The imputation is performed by the ai4water.preprocessing.Imputation class, so this argument determines the interaction with the Imputation class. The default value is None, which will raise an error if missing/nan values are encountered in the input data. The user can however specify a dictionary, one key of which must be method. The value of the ‘method’ key can be fillna or interpolate. For example, to do forward filling, the user can do the following:
>>> {'method': 'fillna', 'imputer_args': {'method': 'ffill'}}
For details about fillna keyword options see fillna
For interpolate, the user can specify the type of interpolation for example
>>> {'method': 'interpolate', 'imputer_args': {'method': 'spline', 'order': 2}}
will perform spline interpolation with 2nd order. For other possible options/keyword arguments, see the documentation of interpolate. The filling or interpolation is done columnwise; however, the user can specify how to do it for each column by providing the above mentioned arguments as a dictionary or list. The sklearn based imputation methods can also be used in a similar fashion. For KNN
>>> {'method': 'KNNImputer', 'imputer_args': {'n_neighbors': 3}}
or for iterative imputation
>>> {'method': 'IterativeImputer', 'imputer_args': {'n_nearest_features': 2}}
To pass additional arguments one can make use of imputer_args keyword argument
>>> {'method': 'KNNImputer', 'features': ['b'], 'imputer_args': {'n_neighbors': 4}}
For more on sklearn based imputation methods see this blog
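To make the effect of forward filling concrete, here is a minimal pure-Python sketch of what filling a single column with 'ffill' amounts to. The function `forward_fill` is hypothetical; in the library the work is delegated to the Imputation class, whose internals are not shown here.

```python
import math


def forward_fill(column):
    """Hypothetical helper: replace each nan with the last non-nan value before it.

    A nan at the start of the column has nothing to copy from and stays nan.
    """
    filled, last = [], None
    for value in column:
        if math.isnan(value):
            filled.append(value if last is None else last)
        else:
            filled.append(value)
            last = value
    return filled


nan = float("nan")
print(forward_fill([1.0, nan, nan, 4.0, nan]))  # [1.0, 1.0, 1.0, 4.0, 4.0]
```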
batch_size (int) – size of one batch. Only relevant if drop_remainder is True.
drop_remainder (bool) – whether to drop the remainder if len(data) % batch_size != 0.
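The interaction between batch_size and drop_remainder comes down to simple modular arithmetic, sketched below. The helper `usable_examples` is hypothetical, and which examples are dropped (assumed here to be the trailing ones) is not stated in this document.

```python
def usable_examples(n_examples: int, batch_size: int = 32, drop_remainder: bool = False) -> int:
    """Hypothetical helper: number of examples left after dropping the remainder."""
    if drop_remainder:
        # keep only as many examples as fill complete batches
        return n_examples - n_examples % batch_size
    return n_examples


print(usable_examples(100, batch_size=32, drop_remainder=True))   # 96
print(usable_examples(100, batch_size=32, drop_remainder=False))  # 100
```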
teacher_forcing (bool) – whether to return previous output/target/ground truth or not. This is useful when the user wants to feed output at t-1 as input at timestep t. For details about this technique see this article
allow_input_nans (bool, optional) – If False, the examples containing nans in inputs will be removed. Setting this to True will result in feeding nan containing data to your algorithm unless nans are filled with nan_filler.
seed (int) – random seed for reproducibility
verbosity (int) –
mode (str) – either regression or classification
category (str) –
save (bool) – whether to save the data in an h5 file or not.
Example
>>> import pandas as pd
>>> import numpy as np
>>> from ai4water.preprocessing import DataSet
>>> data_ = pd.DataFrame(np.random.randint(0, 1000, (50, 2)), columns=['input', 'output'])
>>> data_set = DataSet(data=data_, ts_args={'lookback':5})
>>> x,y = data_set.training_data()
Note
The word ‘index’ is not allowed as column name, input_features or output_features
- KFold_splits(n_splits=5)[source]
Returns an iterator for k-fold cross validation.
The iterator yields two tuples of training and test x,y pairs. On every iteration it returns (train_x, train_y), (test_x, test_y). Note: only training_data and validation_data are used to make kfolds.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet
>>> data = pd.DataFrame(np.random.randint(0, 10, (20, 3)), columns=['a', 'b', 'c'])
>>> data_set = DataSet(data=data)
>>> kfold_splits = data_set.KFold_splits()
>>> for (train_x, train_y), (test_x, test_y) in kfold_splits:
...     print(train_x, train_y, test_x, test_y)
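For readers unfamiliar with k-fold splitting, the sketch below shows how example indices can be partitioned into contiguous folds, in the manner of an unshuffled `KFold` from sklearn. The generator `kfold_indices` is hypothetical, and whether DataSet shuffles the examples before splitting is not stated in this document.

```python
def kfold_indices(n_examples: int, n_splits: int = 5):
    """Hypothetical helper: yield (train_indices, test_indices) for each fold."""
    # the first n_examples % n_splits folds get one extra example
    fold_sizes = [n_examples // n_splits] * n_splits
    for i in range(n_examples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_examples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(kfold_indices(12, n_splits=5))
for train, test in folds:
    print(len(train), test)
```

Every example appears in exactly one test fold, so each fold trains on all the remaining examples.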
- LeaveOneOut_splits()[source]
Yields leave-one-out splits. On every iteration the iterator returns (train_x, train_y), (test_x, test_y).
- ShuffleSplit_splits(**kwargs)[source]
Yields ShuffleSplit splits. On every iteration the iterator returns (train_x, train_y), (test_x, test_y).
- TimeSeriesSplit_splits(n_splits=5, **kwargs)[source]
Returns an iterator for TimeSeriesSplit. On every iteration the iterator returns (train_x, train_y), (test_x, test_y).
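Unlike k-fold, a time series split never trains on examples that come after the test examples. The sketch below reproduces the default index scheme of sklearn's `TimeSeriesSplit` (no gap, default test size) in plain Python; `time_series_split_indices` is a hypothetical name, and it assumes n_examples is at least n_splits + 1.

```python
def time_series_split_indices(n_examples: int, n_splits: int = 5):
    """Hypothetical helper: yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_examples // (n_splits + 1)
    first_test_start = n_examples - n_splits * test_size
    for test_start in range(first_test_start, n_examples, test_size):
        train = list(range(test_start))                       # everything before the test block
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(time_series_split_indices(12, n_splits=3))
for train, test in splits:
    print(train, test)
```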
- property batch_dim
- check_nans(data, input_x, input_y, label_y)[source]
Checks whether nans are present or not and checks shapes of arrays being prepared.
- property classes
- get_indices()[source]
If the data is to be divided into train/test based upon indices, here we create train_indices and test_indices. The train_indices contain indices for both training and validation data.
- property input_features
- property lookback
- property num_classes
- property num_ins
- property num_outs
- property output_features
for external use
- plot_LeaveOneOut_splits(show=True, **kwargs)[source]
Plots the indices obtained from LeaveOneOut strategy
- plot_TimeSeriesSplit_splits(n_splits=5, show=True, **kwargs)[source]
Plots the indices obtained from TimeSeriesSplit strategy
- property teacher_forcing
- property ts_args
DataSetUnion
- class ai4water.preprocessing.DataSetUnion(*datasets, stack_y: bool = False, verbosity: int = 1, **named_datasets)[source]
Bases:
_DataSet
A Union of datasets concatenated in parallel. A DataSetUnion of four DataSets will be as follows:
DataSet1 | DataSet2 | DataSet3 | DataSet4
- __init__(*datasets, stack_y: bool = False, verbosity: int = 1, **named_datasets) None [source]
DataSets must be passed either as positional arguments or as keyword arguments but not both.
- Parameters:
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet, DataSetUnion
>>> df1 = pd.DataFrame(np.random.random((100, 10)),
...                    columns=[f"Feat_{i}" for i in range(10)])
>>> df2 = pd.DataFrame(np.random.random((200, 10)),
...                    columns=[f"Feat_{i}" for i in range(10)])
>>> ds1 = DataSet(df1)
>>> ds2 = DataSet(df2)
>>> ds = DataSetUnion(ds1, ds2)
>>> train_x, train_y = ds.training_data()
>>> val_x, val_y = ds.validation_data()
>>> test_x, test_y = ds.test_data()
Note
DataSets must be provided either as positional arguments or as keyword arguments using named_datasets and not both.
- property indexes
- property input_features
- property is_binary
- property is_multiclass
- property is_multilabel
- property mode
- property output_features
- property teacher_forcing
DataSetPipeline
- class ai4water.preprocessing.DataSetPipeline(*datasets: _DataSet, verbosity=1)[source]
Bases:
_DataSet
A collection of DataSets concatenated one after the other. A DataSetPipeline of four DataSets will be as follows:
DataSet1
DataSet2
DataSet3
DataSet4
The only condition for different datasets is that they have the same output dimension.
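The idea of concatenating datasets one after the other can be pictured with the following sketch. It is an interpretation of the description above and not the class's implementation; `concatenate_examples` is a hypothetical helper, and the actual structure returned by DataSetPipeline may differ.

```python
def concatenate_examples(*datasets):
    """Hypothetical helper: stack the examples of several (x, y) datasets.

    Each dataset is a tuple (x, y) of equally long lists of examples.
    The inputs may differ between datasets, but every y must have the
    same output dimension.
    """
    output_dims = {len(y[0]) for _, y in datasets}
    if len(output_dims) != 1:
        raise ValueError(f"datasets have different output dimensions: {sorted(output_dims)}")

    all_x, all_y = [], []
    for x, y in datasets:
        all_x.extend(x)
        all_y.extend(y)
    return all_x, all_y


ds_a = ([[1.0, 2.0], [3.0, 4.0]], [[0.1], [0.2]])  # 2 examples, 1 output each
ds_b = ([[5.0, 6.0]], [[0.3]])                     # 1 example, 1 output
x, y = concatenate_examples(ds_a, ds_b)
print(len(x), len(y))  # 3 3
```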
- __init__(*datasets: _DataSet, verbosity=1) None [source]
- Parameters:
*datasets – the datasets to be combined
verbosity – controls the output information being printed.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet, DataSetPipeline
>>> df1 = pd.DataFrame(np.random.random((100, 10)),
...                    columns=[f"Feat_{i}" for i in range(10)])
>>> df2 = pd.DataFrame(np.random.random((200, 10)),
...                    columns=[f"Feat_{i}" for i in range(10)])
>>> ds1 = DataSet(df1)
>>> ds2 = DataSet(df2)
>>> ds = DataSetPipeline(ds1, ds2)
>>> train_x, train_y = ds.training_data()
>>> val_x, val_y = ds.validation_data()
>>> test_x, test_y = ds.test_data()
- property input_features
- property is_binary
- property mode
- property output_features
- property teacher_forcing