Data Transformations

Transformations

class ai4water.preprocessing.transformations.Transformation(method: str = 'minmax', features: Optional[list] = None, replace_zeros: bool = False, replace_zeros_with: Union[str, int, float] = 1, treat_negatives: bool = False, **kwargs)[source]

Bases: TransformationsContainer

Applies a transformation to tabular data. The transformation can also be restricted to selected features/columns of the data. This class also performs some optional pre-processing on the data before transforming it. Any new transformation should define two methods, one whose name starts with transform_with_ and one whose name starts with inverse_transform_with_.
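The naming convention above can be pictured as a name-based lookup. The following is only a minimal sketch of that idea, not the library's implementation: the class TinyTransformer and its sqrt methods are hypothetical, and the use of getattr for dispatch is an assumption.

```python
import math


class TinyTransformer:
    """Hypothetical stand-in that dispatches on the method name via getattr."""

    def __init__(self, method):
        self.method = method

    def transform(self, values):
        # Look up the forward method by its conventional prefix.
        return getattr(self, f"transform_with_{self.method}")(values)

    def inverse_transform(self, values):
        # Look up the matching inverse method.
        return getattr(self, f"inverse_transform_with_{self.method}")(values)

    # Supporting a new method "sqrt" means defining exactly these two methods.
    def transform_with_sqrt(self, values):
        return [math.sqrt(v) for v in values]

    def inverse_transform_with_sqrt(self, values):
        return [v ** 2 for v in values]


t = TinyTransformer("sqrt")
roots = t.transform([1.0, 4.0, 9.0])
restored = t.inverse_transform(roots)
```

With this layout, an unknown method name fails with an AttributeError at lookup time, which is one reason a registry such as available_transformers is useful for validation.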

Currently, the following methods are available for transformation and inverse transformation:

Transformation methods

  • minmax

  • maxabs

  • robust

  • power same as yeo-johnson

  • yeo-johnson power transformation using Yeo-Johnson method

  • box-cox power transformation using box-cox method

  • zscore also known as standard scalers

  • scale division by standard deviation

  • center by subtracting mean

  • quantile

  • quantile_normal quantile with normal distribution as target

  • log natural logarithm

  • log10 log with base 10

  • log2 log with base 2

  • sqrt square root

  • tan tangent

  • cumsum cumulative sum

  • mmad median and median absolute deviation

  • pareto

  • vast Variable Stability Scaling

  • sigmoid logistic sigmoid

  • tanh hyperbolic tangent
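As a worked illustration of four of the simpler methods in this list, the sketch below applies the textbook formulas for minmax, zscore, center and scale to a small list. It uses only the standard library and assumes the population standard deviation (ddof=0); whether the library uses exactly these conventions is an assumption.

```python
import statistics

values = [1.0, 2.0, 3.0, 5.0]
mean = statistics.fmean(values)
std = statistics.pstdev(values)      # population standard deviation (assumed)
lo, hi = min(values), max(values)

minmax = [(v - lo) / (hi - lo) for v in values]   # rescale to [0, 1]
zscore = [(v - mean) / std for v in values]       # zero mean, unit variance
center = [v - mean for v in values]               # subtract the mean only
scale = [v / std for v in values]                 # divide by the std only
```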

To transform a dataframe using any of the above methods, use

Examples

>>> from ai4water.preprocessing import Transformation
>>> transformer = Transformation(method='zscore')
>>> transformer.fit_transform(data=[1,2,3,5])

or

>>> import pandas as pd
>>> transformer = Transformation(method='minmax')
>>> normalized_df = transformer.fit_transform(data=pd.DataFrame([1,2,3]))
>>> transformer = Transformation(method='log', replace_zeros=True)
>>> trans_df, proc = transformer.fit_transform(data=pd.DataFrame([1,0,2,3]),
...                                            return_proc=True)
>>> detransformed_df = transformer.inverse_transform(trans_df, postprocessor=proc)

or using one liner

>>> normalized_df = Transformation(method='minmax',
...                       features=['a'])(data=pd.DataFrame([[1,2],[3,4], [5,6]],
...                                       columns=['a', 'b']))

where method can be any of the above mentioned methods.

Note

tan, tanh, sigmoid and cumsum do not return original data upon inverse transformation.
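For tan, one mathematical reason is easy to demonstrate: the tangent is periodic, so its inverse can only recover values inside (-pi/2, pi/2). The sketch below uses the standard library only; that the library inverts tan with the arctangent is an assumption, and the snippet says nothing about tanh, sigmoid or cumsum.

```python
import math

x = 2.0                                   # lies outside (-pi/2, pi/2)
recovered = math.atan(math.tan(x))        # lands at x - pi instead of x

inside = 0.5                              # lies inside (-pi/2, pi/2)
recovered_inside = math.atan(math.tan(inside))
```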

__init__(method: str = 'minmax', features: Optional[list] = None, replace_zeros: bool = False, replace_zeros_with: Union[str, int, float] = 1, treat_negatives: bool = False, **kwargs)[source]
Parameters:
  • method – method by which to transform and consequently inversely transform the data. Default is ‘minmax’. See Transformation.available_transformers for the full list.

  • features – string or list of strings. Only applicable if data is a dataframe. It defines the columns to which the transformation is applied. The remaining columns are left unchanged.

  • replace_zeros – If True, the zero values in data are replaced with a fixed value, replace_zeros_with, before transformation. The zeros are put back in their places after transformation, so this replacement is done only to avoid errors during transformation, for example during Box-Cox.

  • replace_zeros_with – if replace_zeros is True, this value is used to replace zeros in the dataframe before transformation. It can also name the method with which to replace zeros; for example, setting this argument to ‘mean’ replaces zeros with the mean of the array/column which contains them. Allowed string values are ‘mean’, ‘max’, ‘min’. see

  • treat_negatives – If True, and if data contains negative values, then the absolute values of these negative values are used for transformation. For inverse transformation, the -ve sign is restored, to return the original data. This option is necessary for log, sqrt and box-cox transformations with -ve values in data.

  • kwargs – any arguments which are to be provided to the transformer on initialization and not during transform or inverse transform
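The idea behind treat_negatives can be sketched as follows: transform the absolute values, remember which entries were negative, and put the sign back on inversion. This is a standalone illustration using the square root; the function names are hypothetical and the bookkeeping the library actually uses may differ.

```python
import math


def sqrt_with_negatives(values):
    """Transform |v| and remember which entries were negative."""
    negative_mask = [v < 0 for v in values]
    transformed = [math.sqrt(abs(v)) for v in values]
    return transformed, negative_mask


def inverse_sqrt_with_negatives(transformed, negative_mask):
    """Square the values, then put the sign back where it was."""
    restored = [t ** 2 for t in transformed]
    return [-r if neg else r for r, neg in zip(restored, negative_mask)]


data = [4.0, -9.0, 16.0]
trans, mask = sqrt_with_negatives(data)
back = inverse_sqrt_with_negatives(trans, mask)
```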

Example

>>> from ai4water.preprocessing.transformations import Transformation
>>> from ai4water.datasets import busan_beach
>>> df = busan_beach()
>>> inputs = ['tide_cm', 'wat_temp_c', 'sal_psu', 'air_temp_c', 'pcp_mm', 'pcp3_mm']
>>> transformer = Transformation(method='minmax', features=['sal_psu', 'air_temp_c'])
>>> new_data = transformer.fit_transform(df[inputs])

The following shows how to apply a log transformation to an array containing zeros by using the replace_zeros argument. The zeros in the input array are replaced internally and inserted back afterwards.

>>> import numpy as np
>>> from ai4water.preprocessing.transformations import Transformation
>>> transformer = Transformation(method='log', replace_zeros=True)
>>> transformed_data = transformer.fit_transform([1,2,3,0.0, 5, np.nan, 7])
... # [0.0, 0.6931, 1.0986, 0.0, 1.609, None, 1.9459]
>>> original_data = transformer.inverse_transform(data=transformed_data)
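The zero handling described above can be reproduced by hand. The sketch below is an assumption about the mechanics, not the library's code: zeros are replaced with 1 before taking the log, their positions are remembered, and zeros are written back after both the forward and the inverse step. The function names are hypothetical and NaN handling is left out.

```python
import math


def log_with_zero_replacement(values, replace_with=1.0):
    zero_positions = [i for i, v in enumerate(values) if v == 0]
    filled = [replace_with if v == 0 else v for v in values]
    transformed = [math.log(v) for v in filled]
    for i in zero_positions:
        transformed[i] = 0.0          # put the zeros back
    return transformed, zero_positions


def inverse_log_with_zero_replacement(transformed, zero_positions):
    restored = [math.exp(v) for v in transformed]
    for i in zero_positions:
        restored[i] = 0.0             # exp(0) would give 1, so restore the zero
    return restored


data = [1, 2, 3, 0.0, 5, 7]
transformed, zeros_at = log_with_zero_replacement(data)
restored = inverse_log_with_zero_replacement(transformed, zeros_at)
```

The positions must be carried along (here as zeros_at) because a transformed value of 0.0 is ambiguous: it can come from an original 1 or from a replaced zero.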
available_transformers = {
    'box-cox': <class 'ai4water.preprocessing.transformations._transformations.PowerTransformer'>,
    'center': <class 'ai4water.preprocessing.transformations._transformations.Center'>,
    'cumsum': <class 'ai4water.preprocessing.transformations._transformations.CumsumScaler'>,
    'log': <class 'ai4water.preprocessing.transformations._transformations.LogScaler'>,
    'log10': <class 'ai4water.preprocessing.transformations._transformations.Log10Scaler'>,
    'log2': <class 'ai4water.preprocessing.transformations._transformations.Log2Scaler'>,
    'maxabs': <class 'ai4water.preprocessing.transformations._transformations.MaxAbsScaler'>,
    'minmax': <class 'ai4water.preprocessing.transformations._transformations.MinMaxScaler'>,
    'mmad': <class 'ai4water.preprocessing.transformations._transformations.MmadTransformer'>,
    'pareto': <class 'ai4water.preprocessing.transformations._transformations.ParetoTransformer'>,
    'power': <class 'ai4water.preprocessing.transformations._transformations.PowerTransformer'>,
    'quantile': <class 'ai4water.preprocessing.transformations._transformations.QuantileTransformer'>,
    'quantile_normal': <class 'ai4water.preprocessing.transformations._transformations.QuantileTransformer'>,
    'robust': <class 'ai4water.preprocessing.transformations._transformations.RobustScaler'>,
    'scale': <class 'ai4water.preprocessing.transformations._transformations.StandardScaler'>,
    'sigmoid': <class 'ai4water.preprocessing.transformations._transformations.LogisticSigmoidTransformer'>,
    'sqrt': <class 'ai4water.preprocessing.transformations._transformations.SqrtScaler'>,
    'tan': <class 'ai4water.preprocessing.transformations._transformations.TanScaler'>,
    'tanh': <class 'ai4water.preprocessing.transformations._transformations.HyperbolicTangentTransformer'>,
    'vast': <class 'ai4water.preprocessing.transformations._transformations.VastTransformer'>,
    'yeo-johnson': <class 'ai4water.preprocessing.transformations._transformations.PowerTransformer'>,
    'zscore': <class 'ai4water.preprocessing.transformations._transformations.StandardScaler'>,
}
config() dict[source]

Returns a dictionary which can be used to reconstruct the Transformation class using from_config.

property features
fit(data, **kwargs)[source]

Fits the transformer to the data according to the transformation method.

fit_transform(data, return_proc=False, **kwargs)[source]

Transforms the data

Parameters:
  • data – a dataframe or numpy ndarray or array like. The transformed or inversely transformed value will have the same type as data and will have the same index as data (in case data is dataframe). The shape of data is supposed to be (num_examples, num_features).

  • return_proc – whether to return the processor or not. If True, a tuple is returned whose first element is the transformed data and whose second element is the preprocessor.

  • kwargs
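To make the (num_examples, num_features) layout concrete, the sketch below scales only selected columns of a row-major table and leaves the others untouched, in the spirit of the features argument. It is a hand-written illustration under the assumption that each column is scaled independently; minmax_selected is a hypothetical helper and constant columns are not handled.

```python
def minmax_selected(rows, columns, features):
    """Min-max scale only the named columns of a row-major table."""
    selected = [columns.index(name) for name in features]
    out = [list(row) for row in rows]          # copy so the input is untouched
    for j in selected:
        col = [row[j] for row in rows]
        lo, hi = min(col), max(col)
        for i in range(len(rows)):
            out[i][j] = (rows[i][j] - lo) / (hi - lo)
    return out


rows = [[1, 2], [3, 4], [5, 6]]                # 3 examples, 2 features
scaled = minmax_selected(rows, columns=["a", "b"], features=["a"])
```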

classmethod from_config(config: dict) Transformation[source]

constructs the Transformation class from config which has already been fitted/transformed.

Parameters:

config – a dictionary which is the output of config() method.

Returns:

an instance of Transformation class.
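The config()/from_config() pair amounts to a round trip through a plain dictionary. The sketch below shows that pattern with a hypothetical TinyMinMax class and the standard json module; the keys stored in the real config are not shown in this page, so the ones used here are assumptions.

```python
import json


class TinyMinMax:
    """Hypothetical scaler that can be rebuilt from its config dictionary."""

    def __init__(self):
        self.min_ = None
        self.max_ = None

    def fit_transform(self, values):
        self.min_, self.max_ = min(values), max(values)
        return [(v - self.min_) / (self.max_ - self.min_) for v in values]

    def inverse_transform(self, values):
        return [v * (self.max_ - self.min_) + self.min_ for v in values]

    def config(self):
        # Everything needed to rebuild the fitted state, as JSON-friendly types.
        return {"method": "minmax", "min_": self.min_, "max_": self.max_}

    @classmethod
    def from_config(cls, config):
        obj = cls()
        obj.min_, obj.max_ = config["min_"], config["max_"]
        return obj


fitted = TinyMinMax()
scaled = fitted.fit_transform([10.0, 20.0, 30.0])
saved = json.dumps(fitted.config())                 # e.g. written to disk
rebuilt = TinyMinMax.from_config(json.loads(saved))
restored = rebuilt.inverse_transform(scaled)
```

The rebuilt object never saw the original data, yet it can invert the transformation because the fitted parameters travel in the config.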

get_features(data) DataFrame[source]
get_transformer()[source]
get_transformer_from_dict(**kwargs)[source]
inverse_transform(data, postprocessor: Optional[_Processor] = None, without_fit=False, **kwargs)[source]

Inverse transforms the data.

Parameters:
  • data –

  • postprocessor

  • without_fit (bool) –

  • kwargs – any of the following keyword arguments:

    • data

    • key

    • transformer – the available transformer is used.

maybe_insert_features(original_df, trans_df)[source]
property num_features
plot_comparison(data, plot_type: str = 'hist', show: bool = True, figsize: Optional[tuple] = None, **kwargs) Figure[source]

compares original and transformed data

Parameters:
  • data – the data on which to apply the transformation. It can be a list, numpy array or pandas dataframe

  • plot_type (str, optional (default="hist")) – either hist, probplot or line

  • show (bool, optional (default=True)) – whether to show the plot or not

  • figsize (tuple, optional (default=None)) – figure size (width, height)

  • **kwargs – any keyword arguments for easy_mpl.hist or easy_mpl.plot when plot_type is “hist” or “probplot” respectively.

Return type:

plt.Figure

Examples

>>> from ai4water.preprocessing import Transformation
>>> import numpy as np
>>> t = Transformation()
>>> t.plot_comparison(np.random.randint(1, 100, (100, 2)))
...  # compare using probability plot
>>> t.plot_comparison(np.random.randint(1, 100, (100, 2)), "probplot")
... # or a simple line plot
>>> t.plot_comparison(np.random.randint(1, 100, (100, 2)), "line", figsize=(14, 6))
serialize_transformer(transformer)[source]
transform(data, return_proc=False, **kwargs)[source]

transforms the data according to fitted transformers.

property transformed_features
class ai4water.preprocessing.transformations.Transformations(feature_names: Union[list, dict], config: Optional[Union[str, dict, list]] = None)[source]

Bases: object

While the Transformation class is useful for applying a single transformation to a single data source, this class is helpful for applying multiple transformations to a single data source or to multiple data sources. This class is especially designed to be applied as part of a model, inside the fit, predict or evaluate methods. The fit_transform method should be applied before feeding the data to the algorithm, and the inverse_transform method should be called after the algorithm has worked with the data.

Examples

>>> import numpy as np
>>> from ai4water.preprocessing.transformations import Transformations
>>> x = np.arange(50).reshape(25, 2)
>>> transformer = Transformations(['a', 'b'], config=['minmax', 'zscore'])
>>> x_ = transformer.fit_transform(x)
>>> _x = transformer.inverse_transform(x_)
...
... # Apply multiple transformations on multiple arrays which are passed as list
>>> transformer = Transformations([['a', 'b'], ['a', 'b']],
...                              config=['minmax', 'zscore'])
>>> x1 = np.arange(50).reshape(25, 2)
>>> x2 = np.arange(50, 100).reshape(25, 2)
>>> x1_transformed = transformer.fit_transform([x1, x2])
>>> _x1 = transformer.inverse_transform(x1_transformed)

More complex configurations are also possible, for example a separate list of transformations for each named data source:

>>> transformer = Transformations({'x1': ['a', 'b'], 'x2': ['a', 'b']},
...        config={'x1': ['minmax', 'zscore'],
...                'x2': [{'method': 'log', 'features': ['a', 'b']},
...                       {'method': 'robust', 'features': ['a', 'b']}]
...                                      })
>>> x1 = np.arange(20).reshape(10, 2)
>>> x2 = np.arange(100, 120).reshape(10, 2)
>>> x = {'x1': x1, 'x2': x2}
>>> x_transformed = transformer.fit_transform(x)
>>> _x = transformer.inverse_transform(x_transformed)

In the above example, we apply minmax and zscore transformations to the x1 array, and log and robust transformations to the x2 array.
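When several transformations are chained, the inverses have to be applied in reverse order to get the input back. The sketch below shows that general principle with two hand-written steps; it assumes the configured transformations are applied one after another, and the step definitions are illustrative rather than the library's.

```python
import math

# Each step is a (forward, inverse) pair of plain functions.
steps = [
    (math.log, math.exp),                                     # log step
    (lambda v: (v - 1.0) / 2.0, lambda v: v * 2.0 + 1.0),     # fixed affine scaling
]


def apply_chain(value):
    for forward, _ in steps:
        value = forward(value)
    return value


def invert_chain(value):
    # Undo the last step first, then work back to the first one.
    for _, inverse in reversed(steps):
        value = inverse(value)
    return value


round_trip = invert_chain(apply_chain(5.0))
```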

__init__(feature_names: Union[list, dict], config: Optional[Union[str, dict, list]] = None)[source]
Parameters:
  • feature_names – names of features in data

  • config

    Determines the type of transformation to be applied on data. It can be one of the following types

    • string when you want to apply single transformation

    >>> config='minmax'
    
    • dict: to pass additional arguments to the ai4water.preprocessing.Transformation class

    >>> config = {"method": 'log', 'treat_negatives': True, 'features': ['features']}
    
    • list when we want to apply multiple transformations

    >>> ['minmax', 'zscore']
    

    or

    >>> [{"method": 'log', 'treat_negatives': True, 'features': ['features']},
    ...  {'method': 'sqrt', 'treat_negatives': True}]
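One way to think about the three forms listed above is that each can be reduced to a list of keyword dictionaries, one per transformation. The helper below is a hypothetical sketch of such a normalization, not the library's parsing code, and it does not cover configs keyed by data-source name.

```python
def normalize_config(config):
    """Turn a str, dict, or list config into a list of keyword dictionaries."""
    if isinstance(config, str):
        return [{"method": config}]            # bare method name
    if isinstance(config, dict):
        return [dict(config)]                  # keyword arguments for one step
    if isinstance(config, list):
        out = []
        for item in config:
            out.extend(normalize_config(item)) # each item is a str or dict
        return out
    raise TypeError(f"unsupported config type: {type(config).__name__}")


single = normalize_config("minmax")
several = normalize_config(["minmax", {"method": "sqrt", "treat_negatives": True}])
```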
    

config() dict[source]

Returns a python dictionary from which the Transformations class can be constructed in fitted form, i.e. as if the fit_transform method has already been applied.

fit_transform(data: Union[ndarray, List, Dict])[source]

Transforms the data according to the config.

Parameters:

data

The data on which to apply transformations. It can be one of following

  • a (2d or 3d) numpy array

  • a list of numpy arrays

  • a dictionary of numpy arrays

Returns:

The transformed data which has same type and dimensions as the input data

classmethod from_config(config: dict) Transformations[source]

Constructs the Transformations class, which may have already been fitted.

inverse_transform(data, postprocess=True)[source]

inverse transforms data where data can be dictionary, list or numpy array.

Parameters:
  • data – the data which is to be inverse transformed. The output of fit_transform method.

  • postprocess – bool

Returns:

The original data which was given to fit_transform method.

inverse_transform_without_fit(data, postprocess=True) ndarray[source]
transform(data: Union[ndarray, List, Dict])[source]

Transforms the data according to the config.

Parameters:

data

The data on which to apply transformations. It can be one of following

  • a (2d or 3d) numpy array

  • a list of numpy arrays

  • a dictionary of numpy arrays

Returns:

The transformed data which has same type and dimensions as the input data

class ai4water.preprocessing.transformations.ScalerWithConfig[source]

Bases: object

Extends sklearn’s scalers in such a way that they can be saved to a json file and loaded from a json file

- config
- from_config
__init__()
config() dict[source]

Returns all the parameters in scaler/transformer in a dictionary

property config_paras: list
classmethod from_config(config: dict)[source]

Build the scaler/transformer from config

Parameters:

config – dictionary of parameters which can be used to build transformer/scaler.

Returns:

An instance of scaler/transformer

get_params()[source]
class ai4water.preprocessing.transformations.PowerTransformer(method='yeo-johnson', *, rescale=False, pre_center: bool = False, standardize=True, copy=True, lambdas=None)[source]

Bases: PowerTransformer, ScalerWithConfig

This transformation enhances scikit-learn’s PowerTransformer by allowing the user to define lambdas parameter for each input feature. The default behaviour of this transformer is same as that of scikit-learn’s.

__init__(method='yeo-johnson', *, rescale=False, pre_center: bool = False, standardize=True, copy=True, lambdas=None)[source]
Parameters:
  • lambdas – float or 1d array like for each feature. If not given, it is calculated from scipy.stats.boxcox(X, lmbda=None). Only available if method is box-cox.

  • pre_center – center the data before applying power transformation. See github [1] for more discussion.

  • rescale

For complete documentation see scikit-learn’s documentation [2].

property config_paras
classmethod from_config(config: dict)[source]

Build the scaler/transformer from config

Parameters:

config – dictionary of parameters which can be used to build transformer/scaler.

Returns:

An instance of scaler/transformer

inverse_transform(X)[source]

Apply the inverse power transformation using the fitted lambdas.

The inverse of the Box-Cox transformation is given by:

if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)

The inverse of the Yeo-Johnson transformation is given by:

if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)

Parameters:

X (array-like of shape (n_samples, n_features)) – The transformed data.

Returns:

X – The original data.

Return type:

ndarray of shape (n_samples, n_features)
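The inverse formulas above can be checked numerically against the standard forward definitions of the two transformations. The sketch below works on scalars with the standard library only; the function names are hypothetical, and choosing the Yeo-Johnson branch from the sign of the transformed value relies on the fact that the forward transformation preserves sign.

```python
import math


def box_cox(x, lmbda):
    """Forward Box-Cox for x > 0."""
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda


def inverse_box_cox(y, lmbda):
    return math.exp(y) if lmbda == 0 else (y * lmbda + 1) ** (1 / lmbda)


def yeo_johnson(x, lmbda):
    """Forward Yeo-Johnson for any real x."""
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)


def inverse_yeo_johnson(y, lmbda):
    # The forward map preserves sign, so y >= 0 corresponds to x >= 0.
    if y >= 0:
        return math.expm1(y) if lmbda == 0 else (y * lmbda + 1) ** (1 / lmbda) - 1
    if lmbda == 2:
        return 1 - math.exp(-y)
    return 1 - (-(2 - lmbda) * y + 1) ** (1 / (2 - lmbda))


# Largest round-trip error over a small grid of inputs and lambdas.
yj_error = max(
    abs(inverse_yeo_johnson(yeo_johnson(x, lam), lam) - x)
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0)
    for lam in (-1.0, 0.0, 0.5, 2.0)
)
bc_error = max(
    abs(inverse_box_cox(box_cox(x, lam), lam) - x)
    for x in (0.5, 1.0, 3.0)
    for lam in (-1.0, 0.0, 0.5, 2.0)
)
```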

class ai4water.preprocessing.transformations.FunctionTransformer(func=None, inverse_func=None, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)[source]

Bases: FunctionTransformer

Serializing a custom func/inverse_func is difficult. Therefore we expect the func/inverse_func to be either numpy function or the code as a string.

from_config()[source]
inverse_func_ser

Example

>>> import numpy as np
>>> from ai4water.preprocessing.transformations import FunctionTransformer
>>> array = np.random.randint(1, 100, (20, 2))
>>> transformer = FunctionTransformer(func=np.log2,
...                inverse_func="lambda _x: 2**_x", validate=True)
>>> t_array = transformer.fit_transform(array)
>>> transformer.config()
>>> new_transformer = FunctionTransformer.from_config(transformer.config())
>>> original_array = new_transformer.inverse_transform(t_array)
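The restriction to named library functions or source strings makes the functions representable as plain text. The sketch below illustrates that idea with the standard math module standing in for numpy; func_to_spec and spec_to_func are hypothetical helpers, and the format the library actually stores in its config is not shown in this page.

```python
import math


def func_to_spec(func):
    """Return a JSON-friendly description of func."""
    if isinstance(func, str):
        return {"kind": "source", "value": func}          # code given as a string
    if getattr(math, getattr(func, "__name__", ""), None) is func:
        return {"kind": "math", "value": func.__name__}   # named library function
    raise ValueError("only math functions or source strings are supported")


def spec_to_func(spec):
    if spec["kind"] == "math":
        return getattr(math, spec["value"])
    # eval executes arbitrary code, so only use it on trusted configs.
    return eval(spec["value"])


forward = spec_to_func(func_to_spec(math.log2))
inverse = spec_to_func(func_to_spec("lambda _x: 2 ** _x"))
```

A lambda object passed directly is rejected by func_to_spec, because its source cannot be recovered reliably; passing the same lambda as a string works.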
__init__(func=None, inverse_func=None, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)[source]
config() dict[source]

Returns all the parameters in scaler in a dictionary

static deserialize(**kwargs)[source]
static deserialize_func(func)[source]
classmethod from_config(config: dict)[source]

Build the estimator from config file

property inverse_func
property inverse_func_ser
static serialize_func(func)[source]