Exploratory Data Analysis

The purpose of this module is to quickly explore the data with as many plots as possible.

class ai4water.eda.EDA(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)

Bases: Plot

Performs a comprehensive exploratory data analysis on tabular/structured data. It is meant to be a one-stop shop for EDA.

- heatmap
- box_plot
- plot_missing
- plot_histograms
- plot_index
- plot_data
- plot_pcs
- grouped_scatter
- correlation
- stats
- autocorrelation
- partial_autocorrelation
- probability_plots
- lag_plot
- plot_ecdf
- normality_test
- parallel_coordinates
- show_unique_vals

Example:
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda()  # draw all available plots with a single call

__init__(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)
Parameters:
  • data (DataFrame, array, dict, list) – either a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array

  • in_cols (str, list, optional) – columns to consider as input features

  • out_cols (str, optional) – columns to consider as output features

  • path (str, optional) – the path where the figures are saved. If not given, plots are saved in a ‘data’ folder in the current working directory.

  • save (bool, optional) – whether to save the plots or not

  • show (bool, optional) – whether to show the plots or not

  • dpi (int, optional) – the resolution with which to save the image

__call__(methods: Union[str, list] = 'all', cols=None)

Shortcut to draw as many plots as possible.

Parameters:
  • methods (str, list, optional) – the methods to call. If ‘all’, all available methods will be called.

  • cols (str, list, optional) – columns to use for plotting. If None, all columns will be used.

autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None, figsize: Optional[tuple] = None)

Plots the autocorrelation of individual features of the data.

Parameters:
  • n_lags (int, optional) – number of lag steps to consider

  • cols (str, list, optional) – columns to use. If not defined then all the columns are used

  • figsize (tuple, optional) – figure size
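
As a rough illustration of what an autocorrelation plot shows, the sketch below computes sample autocorrelation coefficients for lags 0 to n_lags in plain Python. It is not the library's implementation: the helper name `autocorr` and the sample series are made up for this example, and the estimator used by ai4water may differ.

```python
from statistics import fmean


def autocorr(values, n_lags=10):
    """Return sample autocorrelation coefficients for lags 0..n_lags.

    Hypothetical helper for illustration; assumes n_lags < len(values).
    """
    n = len(values)
    mean = fmean(values)
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    coeffs = []
    for lag in range(n_lags + 1):
        # Sum of products between the series and itself shifted by `lag`.
        num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        coeffs.append(num / denom)
    return coeffs


series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
acf = autocorr(series, n_lags=3)
print(acf)
```

The coefficient at lag 0 is always 1, and values near zero at larger lags suggest little linear dependence on past values.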

box_plot(st=None, en=None, cols: Optional[Union[str, list]] = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)

Plots a box-and-whisker or violin plot of the data.

Parameters:
  • st (optional) – starting row/index in data to be used for plotting

  • en (optional) – end row/index in data to be used for plotting

  • cols (str, list) – the names of the columns from data to be plotted.

  • normalize – If True, then each feature/column is rescaled between 0 and 1.

  • figsize – figure size

  • freq (str) – one of ‘weekly’, ‘monthly’, ‘yearly’. If given, a box plot is drawn for each of these intervals.

  • max_features (int) – maximum number of features to appear in one plot.

  • violen (bool) – if True, a violin plot is drawn, otherwise a box-and-whisker plot

  • show_datapoints (bool) – if True, the data points are also drawn with sns.swarmplot(). This can be time consuming for large data.

  • **kwargs – any keyword arguments for seaborn.boxplot, seaborn.violinplot or seaborn.swarmplot.
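
With normalize=True each column is rescaled between 0 and 1, so features with very different ranges can share one axis. A minimal sketch of that rescaling (min-max scaling) in plain Python follows; the helper `min_max_scale` and the column names are hypothetical, and the library's own scaling code may handle edge cases differently.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0 (an assumption).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Made-up columns with different units and ranges.
data = {
    "rainfall_mm": [0.0, 5.0, 20.0, 10.0],
    "temp_c": [12.0, 18.0, 30.0, 24.0],
}
scaled = {name: min_max_scale(vals) for name, vals in data.items()}
print(scaled["rainfall_mm"])  # [0.0, 0.25, 1.0, 0.5]
```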

correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: Optional[str] = None, **kwargs)

Plots correlation between features.

Parameters:
  • remove_targets (bool, optional) – whether to remove the output/target column or not

  • st – starting row/index in data to be used for plotting

  • en – end row/index in data to be used for plotting

  • cols – columns to use

  • method (str, optional) – {“pearson”, “spearman”, “kendall”, “covariance”}, by default “pearson”

  • split (str) – To plot only positive correlations, set it to “pos” or to plot only negative correlations, set it to “neg”.

  • **kwargs (keyword Args) – Any additional keyword arguments for seaborn.heatmap

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> vis = EDA(busan_beach())
>>> vis.correlation()
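
To make the method and split options more concrete, here is a small plain-Python sketch that computes Pearson correlations between columns and blanks out the entries excluded by split. The function names, the use of None for masked entries and the sample columns are assumptions for illustration only; how ai4water masks the heatmap is not stated in this reference.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns, split=None):
    """Pairwise correlations; entries excluded by `split` become None."""
    matrix = {}
    for a in columns:
        for b in columns:
            r = pearson(columns[a], columns[b])
            if split == "pos" and r < 0:
                r = None  # keep only positive correlations
            elif split == "neg" and r > 0:
                r = None  # keep only negative correlations
            matrix[(a, b)] = r
    return matrix


columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 3, 2, 1]}
full = correlation_matrix(columns)
only_pos = correlation_matrix(columns, split="pos")
only_neg = correlation_matrix(columns, split="neg")
```
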
grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)

Makes a scatter plot for each feature in the data.

Parameters:
  • st – starting row/index in data to be used for plotting

  • en – end row/index in data to be used for plotting

  • cols – columns to use

  • max_subplots (int, optional) – it can be set to a large number to show all the scatter plots on one axis.

  • kwargs – keyword arguments for sns.pairplot

heatmap(st=None, en=None, cols=None, figsize: Optional[tuple] = None, **kwargs)

Plots the data as a heatmap that depicts missing values.

Parameters:
  • st (int, str, optional) – starting row/index in data to be used for plotting

  • en (int, str, optional) – end row/index in data to be used for plotting

  • cols (str, list) – columns to use to draw heatmap

  • figsize (tuple, optional) – figure size

  • **kwargs – Keyword arguments for sns.heatmap

Return type:

None

Example

>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.heatmap()
property in_cols
lag_plot(n_lags: Union[int, list] = 1, cols=None, figsize=None, **kwargs)

Draws a lag plot between an array and its lags.

Parameters:
  • n_lags – lag step against which to plot the data, it can be integer or a list of integers

  • cols – columns to use

  • figsize – figure size

  • kwargs – any keyword arguments for axis.scatter
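
A lag plot scatters each value against the value a fixed number of steps later. The sketch below builds those point pairs for one or several lags in plain Python. The helper `lag_pairs` is hypothetical, and which axis the library uses for the lagged value is an assumption here (value at t on x, value at t + lag on y).

```python
def lag_pairs(values, lag=1):
    """Return (value at t, value at t + lag) pairs for a lag plot."""
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    return list(zip(values[:-lag], values[lag:]))


series = [3, 5, 8, 13, 21]
# n_lags may be an integer or a list of integers; a list gives one plot per lag.
pairs_by_lag = {lag: lag_pairs(series, lag) for lag in [1, 2]}
print(pairs_by_lag[2])  # [(3, 8), (5, 13), (8, 21)]
```

Points that fall close to a line indicate that the series depends strongly on its value at that lag.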

normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: Optional[tuple] = None)

Plots the statistics of a normality test as bar charts. The statistic for each feature is calculated with either the Shapiro-Wilk test, the Anderson-Darling test or the Kolmogorov-Smirnov test. The first two use the scipy.stats.shapiro and scipy.stats.anderson functions respectively.

Parameters:
  • method – either “shapiro”, “anderson” or “kolmogorov”; default is “shapiro”

  • cols – columns to use

  • st (optional) – start of data

  • en (optional) – end of data to use

  • orientation (optional) – orientation of bars

  • color – color to use

  • figsize (tuple, optional) – figure size (width, height)

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda.normality_test()
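
For intuition about what such a test statistic measures, the sketch below computes a Kolmogorov-Smirnov style statistic, the largest gap between the empirical distribution of a sample and a normal distribution fitted to it, using only the standard library. This is an illustrative stand-in, not what the library calls: the function name and the two synthetic samples are assumptions, and scipy's tests return their own statistics and p-values.

```python
from math import log
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(sample):
    """Largest gap between the sample ECDF and a normal CDF fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


quantiles = [(i + 0.5) / 50 for i in range(50)]
bell_shaped = [NormalDist().inv_cdf(q) for q in quantiles]  # close to normal
right_skewed = [-log(1.0 - q) for q in quantiles]           # exponential-like

stat_bell = ks_statistic_normal(bell_shaped)
stat_skewed = ks_statistic_normal(right_skewed)
print(stat_bell < stat_skewed)  # True
```

A larger statistic means the feature departs further from a normal distribution, which is what taller bars would indicate for this kind of test.
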
property out_cols
parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)

Plots data as parallel coordinates.

Parameters:
  • st – start of data to be considered

  • en – end of data to be considered

  • cols – columns from data to be considered.

  • color – color or colormap to be used.

  • **kwargs – any additional keyword arguments to be passed to easy_mpl.parallel_coordinates

partial_autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None)

Partial autocorrelation of individual features of data

Parameters:
  • n_lags (int, optional) – number of lag steps to consider

  • cols (str, list, optional) – columns to use. If not defined then all the columns are used

plot_data(st=None, en=None, freq: Optional[str] = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)

Plots the data.

Parameters:
  • st (int, str, optional) – starting row/index in data to be used for plotting

  • en (int, str, optional) – end row/index in data to be used for plotting

  • cols (str, list, optional) – columns in data to consider for plotting

  • max_cols_in_plot (int, optional) – Maximum number of columns in one plot. Maximum number of plots depends upon this value and number of columns in data.

  • freq (str, optional) – one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’, determines interval of plot of data. It is valid for only time-series data.

  • ignore_datetime_index (bool, optional) – only valid if dataframe’s index is pd.DateTimeIndex. In such a case, if you want to ignore time index on x-axis, set this to True.

  • **kwargs – any keyword arguments for the pandas plot method

Example

>>> from ai4water.datasets import busan_beach
>>> eda = EDA(busan_beach())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)
plot_ecdf(cols=None, figsize=None, **kwargs)

Plots the empirical cumulative distribution function.

Parameters:
  • cols – columns to use

  • figsize – figure size

  • kwargs – any keyword argument for axis.plot
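
The empirical cumulative distribution function assigns to each observed value the fraction of observations less than or equal to it. A minimal plain-Python sketch of the points such a plot would draw is given below; the helper name `ecdf` is made up for this example and the library's plotting details may differ.

```python
def ecdf(values):
    """Return sorted values and the cumulative fraction at each of them."""
    xs = sorted(values)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps


x, p = ecdf([7, 2, 9, 4])
print(list(zip(x, p)))  # [(2, 0.25), (4, 0.5), (7, 0.75), (9, 1.0)]
```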

plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)

Plots distribution of data as histogram.

Parameters:
  • st – starting index of data to use

  • en – end index of data to use

  • cols – columns to use

  • max_subplots (int, optional) – maximum number of subplots in one figure

  • figsize – figure size

  • **kwargs – any keyword arguments for the pandas.DataFrame.hist function

plot_index(st=None, en=None, **kwargs)

Plots the datetime index of the dataframe.

plot_missing(st=None, en=None, cols=None, **kwargs)

Plots the data to indicate missingness in it.

Parameters:
  • cols (list, str, optional) – columns to be used.

  • st (int, str, optional) – starting row/index in data to be used for plotting

  • en (int, str, optional) – end row/index in data to be used for plotting

  • **kwargs – Keyword Args such as figsize

Example

>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.plot_missing()
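
The quantity behind a missingness plot is simply how much of each column is absent. The sketch below computes that share per column in plain Python, treating both None and NaN as missing. The helper names and the sample columns are hypothetical, and the exact figure ai4water draws from these numbers is not described in this reference.

```python
from math import isnan


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and isnan(value))


def missing_share(columns):
    """Fraction of missing entries in each column."""
    return {
        name: sum(is_missing(v) for v in vals) / len(vals)
        for name, vals in columns.items()
    }


# Made-up columns; the first one has two gaps.
data = {
    "tide_cm": [1.0, float("nan"), 3.0, None],
    "wind_speed": [2.0, 2.5, 3.0, 3.5],
}
shares = missing_share(data)
print(shares)  # {'tide_cm': 0.5, 'wind_speed': 0.0}
```
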
plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)

Plots principal components.

Parameters:
  • num_pcs

  • st – starting row/index in data to be used for plotting

  • en – end row/index in data to be used for plotting

  • save_as_csv

  • figsize

  • kwargs – keyword arguments passed to sns.pairplot

probability_plots(cols: Optional[Union[str, list]] = None)

Draws a probability plot using scipy.stats.probplot. See scipy distributions.

show_unique_vals(threshold: int = 10, st=None, en=None, cols=None, max_subplots: int = 9, figsize: Optional[tuple] = None, **kwargs)

Shows percentage of unique/categorical values in data. Only those columns are used in which unique values are below threshold.

Parameters:
  • threshold (int, optional) –

  • st (int, str, optional) –

  • en (int, str, optional) –

  • cols (str, list, optional) –

  • max_subplots (int, optional) –

  • figsize (tuple, optional) –

  • **kwargs – Any keyword arguments for easy_mpl.pie
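
The selection rule described above (only columns whose number of unique values is below threshold) can be sketched in plain Python as follows. The helper name and sample data are made up, and treating "below" as a strict comparison is an assumption; the resulting percentages are the kind of values a pie chart per column would show.

```python
from collections import Counter


def unique_value_shares(columns, threshold=10):
    """Percentage of each unique value, for columns with few unique values."""
    shares = {}
    for name, values in columns.items():
        counts = Counter(values)
        # Assumption: "below threshold" means strictly fewer unique values.
        if len(counts) < threshold:
            total = len(values)
            shares[name] = {val: 100 * c / total for val, c in counts.items()}
    return shares


data = {
    "season": ["wet", "dry", "wet", "wet"],  # 2 unique values, kept
    "flow": [1.2, 3.4, 5.6, 7.8],            # 4 unique values, skipped
}
shares = unique_value_shares(data, threshold=3)
print(shares)  # {'season': {'wet': 75.0, 'dry': 25.0}}
```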

stats(precision=3, inputs=True, outputs=True, st=None, en=None, out_fmt='csv')

Finds the stats of inputs and outputs and saves them to a csv or json file.

Parameters:
  • inputs (bool) –

  • fpath (str, path like) –

  • out_fmt (str) – the format in which to save, either ‘csv’ or ‘json’
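
As a sketch of the idea of summarising each feature and writing the result as csv or json, the plain-Python example below computes a few statistics per column and serialises them to either format. Which statistics ai4water actually reports, and the names `summarize` and `serialize`, are assumptions made for this example; the sample columns are invented.

```python
import csv
import io
import json
from statistics import fmean, stdev


def summarize(columns, precision=3):
    """Min, max, mean and sample standard deviation of each column, rounded."""
    return {
        name: {
            "min": round(min(vals), precision),
            "max": round(max(vals), precision),
            "mean": round(fmean(vals), precision),
            "std": round(stdev(vals), precision),
        }
        for name, vals in columns.items()
    }


def serialize(summary, out_fmt="csv"):
    """Return the summary as csv or json text."""
    if out_fmt == "json":
        return json.dumps(summary, indent=2)
    if out_fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["feature", "min", "max", "mean", "std"])
        for name, s in summary.items():
            writer.writerow([name, s["min"], s["max"], s["mean"], s["std"]])
        return buffer.getvalue()
    raise ValueError("out_fmt must be 'csv' or 'json'")


data = {"sal_psu": [30.0, 32.0, 34.0], "tide_cm": [10.0, 20.0, 60.0]}
summary = summarize(data)
as_csv = serialize(summary, "csv")
as_json = serialize(summary, "json")
print(as_csv)
```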