DataSet

DataSet

class ai4water.preprocessing.DataSet(data, input_features: Optional[Union[str, list]] = None, output_features: Optional[Union[str, list]] = None, dataset_args: Optional[dict] = None, ts_args: Optional[dict] = None, split_random: bool = False, train_fraction: float = 0.7, val_fraction: float = 0.2, indices: Optional[dict] = None, intervals=None, shuffle: bool = True, allow_nan_labels: int = 0, nan_filler: Optional[dict] = None, batch_size: int = 32, drop_remainder: bool = False, teacher_forcing: bool = False, allow_input_nans: bool = False, seed: int = 313, verbosity: int = 1, mode: Optional[str] = None, category: Optional[str] = None, save: bool = False)[source]

Bases: _DataSet

The purpose of DataSet is to convert unprepared/raw data into prepared data. Prepared data consists of x,y pairs where x is the input and y is the output. A DataSet contains more than one example, and the inputs and outputs consist of the same number of examples. An example consists of one input,output pair which can be given to a supervised machine learning algorithm for training. For tabular data, the number of examples does not necessarily match the number of rows. The number of examples depends upon multiple factors such as the presence of intervals, how nans are handled and the arguments related to time series data preparation, which are listed in detail in the prepare_data function.

The DataSet class can accept the raw, unprepared data in a variety of formats such as .csv, .xlsx, .parquet, .mat, .n5 etc. For details see this. The DataSet class can save the prepared data into an hdf5 file which can subsequently be used to load the data and save time.

- training_data: returns training data
- validation_data: returns validation data
- test_data: returns test data
- from_h5: creates a DataSet instance from a previously saved .h5 file
- to_disk: saves the prepared data to disk
- KFold_splits: creates splits using `KFold` of sklearn
- LeaveOneOut_splits: creates splits using `LeaveOneOut` of sklearn
- TimeSeriesSplit_splits: creates splits using `TimeSeriesSplit` of sklearn
- total_exs: calculates the total number of examples
__init__(data, input_features: Optional[Union[str, list]] = None, output_features: Optional[Union[str, list]] = None, dataset_args: Optional[dict] = None, ts_args: Optional[dict] = None, split_random: bool = False, train_fraction: float = 0.7, val_fraction: float = 0.2, indices: Optional[dict] = None, intervals=None, shuffle: bool = True, allow_nan_labels: int = 0, nan_filler: Optional[dict] = None, batch_size: int = 32, drop_remainder: bool = False, teacher_forcing: bool = False, allow_input_nans: bool = False, seed: int = 313, verbosity: int = 1, mode: Optional[str] = None, category: Optional[str] = None, save: bool = False)[source]

Initializes the DataSet class

Parameters:
  • data

    source from which to make the data. It can be one of the following:

    • pandas dataframe: each column is a feature and each row is an example

    • numpy array

    • xarray dataset: an xarray Dataset

    • path like: if the path is the path of a file, then this file can be a csv/xlsx/nc/npz/mat/parquet/feather file. The .nc file will be read using xarray to load datasets. If the path refers to a directory, it is assumed that each file in the directory refers to one example.

    • ai4water dataset: name of any dataset from ai4water.datasets

    • name of .h5 file

  • input_features (Union[list, dict, str, None]) – features to use as input. If data is a pandas dataframe, then this is a list of column names from data to be used as input.

  • output_features (Union[list, dict, str, None]) – features to use as output. When data is a dataframe, this is a list of column names from data to be used as output. If data is a dict, then it must be consistent with data. Default is None, which means the last column of data will be used as output. In case of multi-class classification, the output column should not be one-hot-encoded; rather it should be in the form of [0, 1, 2, 0, 1, 2, 1, 2, 0] for 3 classes. One-hot-encoding is done inside the model.

  • dataset_args (dict) – additional arguments for AI4Water's datasets (ai4water.datasets)

  • ts_args (dict, optional) –

    This argument should only be used if the data is time series data. It must be a dictionary which is then passed to ai4water.utils.prepare_data() for data preparation. Possible keys in the dictionary are the following; an example is given after the list:

    • lookback

    • forecast_len

    • forecast_step

    • input_steps
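
    For example, to use the previous 5 time steps as input for each example and to predict one step ahead, a dictionary such as the following could be passed (a sketch; the exact meaning of each key is described in prepare_data):

    >>> {'lookback': 5, 'forecast_len': 1, 'forecast_step': 0}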

  • split_random (bool, optional) – whether to split the data into training and test randomly or not.

  • train_fraction (float) – Fraction of the complete data to be used for training purpose. Must be greater than 0.0.

  • val_fraction (float) – The fraction of the training data to be used for validation. Set to 0.0 if no validation data is to be used.

  • indices (dict, optional) –

    A dictionary with two possible keys, 'training' and 'validation'. It determines the indices to be used to select training, validation and test data. If indices are given for training, then train_fraction must not be given. If indices are given for validation, then indices for training must also be given and val_fraction must not be given. Therefore, the possible key combinations in the indices dictionary are the following (an example is given after the list):

    • training

    • training and validation
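
    For example, to explicitly use the first 100 rows for training and the next 30 rows for validation, a dictionary such as the following could be passed (a sketch; per the description above, the remaining rows are then available as test data):

    >>> {'training': list(range(100)), 'validation': list(range(100, 130))}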

  • intervals – tuple of tuples where each tuple consists of two integers, marking the start and end of an interval. An interval here means indices of the data. Only rows within those indices will be used when preparing data/batches for the NN. This is handy when the input data contains chunks of missing values or when we don't want to consider several rows of the input data during data preparation. For further usage see examples/using_intervals

  • shuffle (bool) – whether to shuffle the samples or not

  • allow_nan_labels (int) – whether to allow examples with nan labels or not. If it is > 0 and the target values contain NaNs, those examples will not be ignored and will be used as they are. In such a case a customized training and evaluation step is performed where the loss is not calculated for predictions corresponding to nan observations. Thus this option can be useful when we are predicting more than one target and some examples have some of their labels missing. In such a scenario, if we set this option to > 0, we don't need to ignore those samples during data preparation. This option should be set to > 0 only when using tensorflow for deep learning models. If == 1, an example with label [nan, 1] will not be removed, while an example with label [nan, nan] will be ignored/removed. If == 2, both of the examples mentioned before will be kept/will not be removed. This means that for multiple outputs, we can end up having examples whose labels are all nans. If there is just one output, this must be set to 2 in order to use samples with nan labels.

  • nan_filler (dict) –

    This argument determines the imputation technique used to fill the nans in the data. The imputation is actually performed by the ai4water.preprocessing.Imputation class; therefore this argument determines the interaction with the Imputation class. The default value is None, which will raise an error if missing/nan values are encountered in the input data. The user can however specify a dictionary, one of whose keys must be 'method'. The value of the 'method' key can be fillna or interpolate. For example, to do forward filling, the user can do the following

    >>> {'method': 'fillna', 'imputer_args': {'method': 'ffill'}}
    

    For details about fillna keyword options see fillna

    For interpolate, the user can specify the type of interpolation for example

    >>> {'method': 'interpolate', 'imputer_args': {'method': 'spline', 'order': 2}}
    

    will perform spline interpolation of 2nd order. For other possible options/keyword arguments of interpolate, see the pandas documentation. The filling or interpolation is done column-wise; however, the user can specify how to do it for each column by providing the above mentioned arguments as a dictionary or list. The sklearn based imputation methods can also be used in a similar fashion. For KNN

    >>> {'method': 'KNNImputer', 'imputer_args': {'n_neighbors': 3}}
    

    or for iterative imputation

    >>> {'method': 'IterativeImputer', 'imputer_args': {'n_nearest_features': 2}}
    

    To pass additional arguments one can make use of the imputer_args keyword argument

    >>> {'method': 'KNNImputer', 'features': ['b'], 'imputer_args': {'n_neighbors': 4}}
    

    For more on sklearn based imputation methods see this blog
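
    Below is a minimal, self-contained sketch of using nan_filler with forward filling; the data and column names are made up for illustration.

    >>> import numpy as np
    >>> import pandas as pd
    >>> from ai4water.preprocessing import DataSet
    >>> df = pd.DataFrame(np.random.random((40, 2)), columns=['a', 'b'])
    >>> df.iloc[5:8, 0] = np.nan   # introduce some missing values in input column 'a'
    >>> ds = DataSet(data=df, ts_args={'lookback': 3},
    ...              nan_filler={'method': 'fillna', 'imputer_args': {'method': 'ffill'}})
    >>> x, y = ds.training_data()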

  • batch_size (int) – size of one batch. Only relevant if drop_remainder is True.

  • drop_remainder (bool) – whether to drop the remainder when len(data) % batch_size != 0 or not.

  • teacher_forcing (bool) – whether to return previous output/target/ground truth or not. This is useful when the user wants to feed output at t-1 as input at timestep t. For details about this technique see this article

  • allow_input_nans (bool, optional) – If False, the examples containing nans in inputs will be removed. Setting this to True will result in feeding nan containing data to your algorithm unless nans are filled with nan_filler.

  • seed (int) – random seed for reproducibility

  • verbosity (int) – controls the amount of information being printed

  • mode (str) – either regression or classification

  • category (str) –

  • save (bool) – whether to save the data in an h5 file or not.

Example

>>> import pandas as pd
>>> import numpy as np
>>> from ai4water.preprocessing import DataSet
>>> data_ = pd.DataFrame(np.random.randint(0, 1000, (50, 2)), columns=['input', 'output'])
>>> data_set = DataSet(data=data_, ts_args={'lookback':5})
>>> x,y = data_set.training_data()

Note

The word 'index' is not allowed as a column name, nor in input_features or output_features.

KFold_splits(n_splits=5)[source]

returns an iterator for kfold cross validation.

The iterator yields two tuples of x,y pairs for training and test. On every iteration, the iterator returns (train_x, train_y), (test_x, test_y). Note: only training_data and validation_data are used to make the kfolds.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet
>>> data = pd.DataFrame(np.random.randint(0, 10, (20, 3)), columns=['a', 'b', 'c'])
>>> data_set = DataSet(data=data)
>>> kfold_splits = data_set.KFold_splits()
>>> for (train_x, train_y), (test_x, test_y) in kfold_splits:
...     print(train_x, train_y, test_x, test_y)
LeaveOneOut_splits()[source]

Yields leave-one-out splits. On every iteration, the iterator returns (train_x, train_y), (test_x, test_y).
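
Example

A usage sketch, analogous to KFold_splits; a small dataset is used because leave-one-out creates one split per example:

>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet
>>> data = pd.DataFrame(np.random.randint(0, 10, (10, 3)), columns=['a', 'b', 'c'])
>>> data_set = DataSet(data=data)
>>> for (train_x, train_y), (test_x, test_y) in data_set.LeaveOneOut_splits():
...     print(train_x.shape, test_x.shape)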

ShuffleSplit_splits(**kwargs)[source]

Yields ShuffleSplit splits. On every iteration, the iterator returns (train_x, train_y), (test_x, test_y).

TimeSeriesSplit_splits(n_splits=5, **kwargs)[source]

returns an iterator for TimeSeriesSplit. On every iteration, the iterator returns (train_x, train_y), (test_x, test_y).
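
Example

A usage sketch; the number of splits here is arbitrary:

>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet
>>> data = pd.DataFrame(np.random.randint(0, 10, (30, 3)), columns=['a', 'b', 'c'])
>>> data_set = DataSet(data=data)
>>> for (train_x, train_y), (test_x, test_y) in data_set.TimeSeriesSplit_splits(n_splits=4):
...     print(train_x.shape, test_x.shape)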

property batch_dim
check_for_batch_size(x, prev_y=None, y=None)[source]
check_nans(data, input_x, input_y, label_y)[source]

Checks whether nans are present or not and checks the shapes of the arrays being prepared.

property classes
deindexify(data: ndarray, key)[source]
deindexify_nparray(data, key)[source]
classmethod from_h5(path)[source]

Creates an instance of DataSet from .h5 file.
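
Example

A usage sketch, assuming an .h5 file previously produced by this class (e.g. via save=True or to_disk); the file name below is illustrative:

>>> from ai4water.preprocessing import DataSet
>>> ds = DataSet.from_h5('data.h5')
>>> x, y = ds.training_data()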

get_2d_batches(data)[source]
get_batches(data)[source]
get_indices()[source]

If the data is to be divided into train/test based upon indices, here we create train_indices and test_indices. The train_indices contain indices for both training and validation data.

impute(data)[source]

Imputes the missing values in the data using Imputation module

indexify(data: DataFrame, key)[source]
init_paras() dict[source]

Returns the initializing parameters of this class

property input_features
property is_binary: bool

Returns True if the problem is binary classification

property is_multiclass: bool

Returns True if the problem is multiclass classification

property is_multilabel: bool

Returns True if the problem is multilabel classification

property lookback
property num_classes
property num_ins
property num_outs
property output_features

for external use

plot_KFold_splits(n_splits=5, show=True, **kwargs)[source]

Plots the indices of kfold splits

plot_LeaveOneOut_splits(show=True, **kwargs)[source]

Plots the indices obtained from LeaveOneOut strategy

plot_TimeSeriesSplit_splits(n_splits=5, show=True, **kwargs)[source]

Plots the indices obtained from TimeSeriesSplit strategy

property teacher_forcing
test_data(key='test', **kwargs)[source]

test data

to_disk(path: Optional[str] = None)[source]
total_exs(lookback, forecast_step=0, forecast_len=1, **ts_args)[source]
training_data(key='train', **kwargs)[source]

training data excluding validation data

property ts_args
validation_data(key='val', **kwargs)[source]

validation data

DataSetUnion

class ai4water.preprocessing.DataSetUnion(*datasets, stack_y: bool = False, verbosity: int = 1, **named_datasets)[source]

Bases: _DataSet

A union of datasets concatenated in parallel. A DataSetUnion of four DataSets can be depicted as follows:

DataSet1 | DataSet2 | DataSet3 | DataSet4

__init__(*datasets, stack_y: bool = False, verbosity: int = 1, **named_datasets) None[source]

DataSets must be passed either as positional arguments or as keyword arguments but not both.

Parameters:
  • datasets – DataSets to be concatenated in parallel.

  • stack_y (bool) – whether to stack y/outputs of individual datasets as one array or not

  • verbosity (int) – controls the amount of information being printed

  • named_datasets – DataSets to be concatenated in parallel.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet, DataSetUnion
>>> df1 = pd.DataFrame(np.random.random((100, 10)),
...              columns=[f"Feat_{i}" for i in range(10)])
>>> df2 = pd.DataFrame(np.random.random((200, 10)),
...              columns=[f"Feat_{i}" for i in range(10)])
>>> ds1 = DataSet(df1)
>>> ds2 = DataSet(df2)
>>> ds = DataSetUnion(ds1, ds2)
>>> train_x, train_y = ds.training_data()
>>> val_x, val_y = ds.validation_data()
>>> test_x, test_y = ds.test_data()

Note

DataSets must be provided either as positional arguments or as keyword arguments using named_datasets and not both.
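
A sketch of the keyword-argument form, reusing ds1 and ds2 from the example above; the names cont and categ are made up for illustration:

>>> ds_union = DataSetUnion(cont=ds1, categ=ds2)
>>> train_x, train_y = ds_union.training_data()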

property indexes
property input_features
property is_binary
property is_multiclass
property is_multilabel
property mode
property num_datasets: int
property output_features
property teacher_forcing
test_data(key='test', **kwargs)[source]
training_data(key='train', **kwargs) Union[list, dict][source]
property ts_args: dict
validation_data(key='val', **kwargs)[source]

DataSetPipeline

class ai4water.preprocessing.DataSetPipeline(*datasets: _DataSet, verbosity=1)[source]

Bases: _DataSet

A collection of DataSets concatenated one after the other. A DataSetPipeline of four DataSets will be as follows:

DataSet1

DataSet2

DataSet3

DataSet4

The only condition for different datasets is that they have the same output dimension.

__init__(*datasets: _DataSet, verbosity=1) None[source]
Parameters:
  • *datasets – the datasets to be combined

  • verbosity – controls the output information being printed.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from ai4water.preprocessing import DataSet, DataSetPipeline
>>> df1 = pd.DataFrame(np.random.random((100, 10)),
...              columns=[f"Feat_{i}" for i in range(10)])
>>> df2 = pd.DataFrame(np.random.random((200, 10)),
...              columns=[f"Feat_{i}" for i in range(10)])
>>> ds1 = DataSet(df1)
>>> ds2 = DataSet(df2)
>>> ds = DataSetPipeline(ds1, ds2)
>>> train_x, train_y = ds.training_data()
>>> val_x, val_y = ds.validation_data()
>>> test_x, test_y = ds.test_data()
property input_features
property is_binary
property mode
property num_datasets: int
property output_features
property teacher_forcing
test_data(key='test', **kwargs)[source]
training_data(key='train', **kwargs)[source]
validation_data(key='val', **kwargs)[source]