Exploratory Data Analysis

The purpose of this module is to quickly explore the data with as many plots as possible.

class ai4water.eda.EDA(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)[source]

Bases: Plot

Performs a comprehensive exploratory data analysis on tabular/structured data. It is meant to be a one-stop shop for EDA.

- heatmap

- box_plot

- plot_missing

- plot_histograms

- plot_index

- plot_data

- plot_pcs

- grouped_scatter

- correlation

- stats

- autocorrelation

- partial_autocorrelation

- probability_plots

- lag_plot

- plot_ecdf

- normality_test

- parallel_coordinates

- show_unique_vals

Example:

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda()  # draw all available plots with a single call

__init__(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)[source]

Parameters:

data (DataFrame, array, dict, list) – either a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array
in_cols (str, list, optional) – columns to consider as input features
out_cols (str, optional) – columns to consider as output features
path (str, optional) – the path where the figures are saved. If not given, plots are saved in the ‘data’ folder in the current working directory.
save (bool, optional) – whether to save the plots or not
show (bool, optional) – whether to show the plots or not
dpi (int, optional) – the resolution with which to save the image

__call__(methods: Union[str, list] = 'all', cols=None)[source]

Shortcut to draw maximum possible plots.

Parameters:

methods (str, list, optional) – the methods to call. If ‘all’, all available methods will be called.
cols (str, list, optional) – columns to use for plotting. If None, all columns will be used.

autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None, figsize: Optional[tuple] = None)[source]

Autocorrelation of individual features of data

Parameters:

n_lags (int, optional) – number of lag steps to consider
cols (str, list, optional) – columns to use. If not defined then all the columns are used
figsize (tuple, optional) – figure size
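
As a rough illustration of what n_lags controls, the sketch below computes a sample autocorrelation for lags 0 to n_lags using only the standard library. The helper autocorr is hypothetical and is not part of ai4water; the estimator that EDA uses internally may differ.

```python
def autocorr(values, lag):
    """Sample autocorrelation of `values` at `lag` (hypothetical helper, not ai4water API)."""
    n = len(values)
    mean = sum(values) / n
    # Denominator: total sum of squared deviations from the mean
    denom = sum((v - mean) ** 2 for v in values)
    # Numerator: products of deviations that are `lag` steps apart
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom


series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0]
n_lags = 3
# One coefficient per lag, starting at lag 0 (which is always 1)
acf = [autocorr(series, k) for k in range(n_lags + 1)]
print(acf)
```

Each value in acf corresponds to one lag step, so a larger n_lags simply extends the list.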

box_plot(st=None, en=None, cols: Optional[Union[str, list]] = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)[source]

Plots a box-whisker or violin plot of data.

Parameters:

st (optional) – starting row/index in data to be used for plotting
en (optional) – end row/index in data to be used for plotting
cols (list) – the names of columns from data to be plotted.
normalize – If True, then each feature/column is rescaled between 0 and 1.
figsize – figure size
freq (str) – one of ‘weekly’, ‘monthly’, ‘yearly’. If given, box plot will be plotted for these intervals.
max_features (int) – maximum number of features to appear in one plot.
violen (bool) – if True, a violin plot is drawn, otherwise a box-whisker plot
show_datapoints (bool) – if True, sns.swarmplot() will be plotted. This can be time consuming for large data.
**kwargs – any args for seaborn.boxplot/seaborn.violinplot or seaborn.swarmplot.
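
The effect of normalize=True can be pictured as a per-column min-max rescaling. The sketch below is a minimal standard-library illustration; min_max_scale and the sample columns are hypothetical, and how ai4water handles details such as constant columns or missing values is not documented here.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0 (an assumption)
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Two made-up features on very different scales
data = {
    "temperature": [12.0, 18.0, 24.0, 30.0],
    "rainfall": [0.0, 50.0, 200.0, 100.0],
}
scaled = {name: min_max_scale(values) for name, values in data.items()}
print(scaled["rainfall"])
```

After rescaling, features with very different units share one axis, which is why the option is useful when several columns appear in the same box plot.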

correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: Optional[str] = None, **kwargs)[source]

Plots correlation between features.

Parameters:

remove_targets (bool, optional) – whether to remove the output/target column or not
st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
cols – columns to use
method (str, optional) – {“pearson”, “spearman”, “kendall”, “covariance”}, by default “pearson”
split (str) – To plot only positive correlations, set it to “pos” or to plot only negative correlations, set it to “neg”.
**kwargs (keyword Args) – Any additional keyword arguments for seaborn.heatmap

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> vis = EDA(busan_beach())
>>> vis.correlation()
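
To make the method and split options concrete, the following standard-library sketch computes Pearson coefficients for three made-up features and then masks entries by sign. The helpers pearson and split_matrix are hypothetical; using None as the mask value is an assumption, and the masking that ai4water applies before drawing the heatmap may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def split_matrix(corr, split=None):
    """Keep r > 0 for "pos", r < 0 for "neg"; other entries become None (hypothetical helper)."""
    def keep(r):
        if split == "pos":
            return r > 0
        if split == "neg":
            return r < 0
        return True
    return {pair: (r if keep(r) else None) for pair, r in corr.items()}


features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with "a"
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as "a" rises
}
names = list(features)
corr = {(p, q): pearson(features[p], features[q]) for p in names for q in names}
positive_only = split_matrix(corr, split="pos")
negative_only = split_matrix(corr, split="neg")
print(positive_only[("a", "c")], negative_only[("a", "c")])
```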

grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)[source]

Makes a scatter plot for each feature in data.

Parameters:

st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
cols – columns to use
max_subplots (int, optional) – it can be set to a large number to show all the scatter plots on one axis.
kwargs – keyword arguments for sns.pairplot

heatmap(st=None, en=None, cols=None, figsize: Optional[tuple] = None, **kwargs)[source]

Plots data as heatmap which depicts missing values.

Parameters:

st (int, str, optional) – starting row/index in data to be used for plotting
en (int, str, optional) – end row/index in data to be used for plotting
cols (str, list) – columns to use to draw heatmap
figsize (tuple, optional) – figure size
**kwargs – Keyword arguments for sns.heatmap

Return type:

None

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.heatmap()

property in_cols

lag_plot(n_lags: Union[int, list] = 1, cols=None, figsize=None, **kwargs)[source]

Lag plot between an array and its lags

Parameters:

n_lags – lag step against which to plot the data; it can be an integer or a list of integers
cols – columns to use
figsize – figure size
kwargs – any keyword arguments for axis.scatter
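
A lag plot pairs each value with the value a fixed number of steps later. The sketch below builds those pairs for an integer or a list of lags, mirroring the two accepted forms of n_lags. The helper lag_pairs is hypothetical, and which member of each pair ai4water places on the x-axis is an assumption.

```python
def lag_pairs(values, lag):
    """Return (x_t, x_{t+lag}) pairs for a positive integer lag (hypothetical helper)."""
    return list(zip(values[:-lag], values[lag:]))


values = [10, 20, 30, 40, 50]
n_lags = [1, 2]
# Accept either a single integer or a list of integers, as n_lags does
lags = [n_lags] if isinstance(n_lags, int) else list(n_lags)
pairs = {k: lag_pairs(values, k) for k in lags}
print(pairs[2])
```

Each list of pairs would be drawn as one scatter plot, so a list of lags yields one plot per lag.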

normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: Optional[tuple] = None)[source]

Plots the statistics of a normality test as bar charts. The statistic for each feature is calculated with the Shapiro-Wilk test, the Anderson-Darling test or the Kolmogorov-Smirnov test. The first two use the scipy.stats.shapiro and scipy.stats.anderson functions respectively.

Parameters:

method – one of “shapiro”, “anderson” or “kolmogorov”; default is “shapiro”
cols – columns to use
st (optional) – start of data
en (optional) – end of data to use
orientation (optional) – orientation of bars
color – color to use
figsize (tuple, optional) – figure size (width, height)

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda.normality_test()

property out_cols

parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)[source]

Plots data as parallel coordinates.

Parameters:

st – start of data to be considered
en – end of data to be considered
cols – columns from data to be considered.
color – color or colormap to be used.
**kwargs – any additional keyword arguments to be passed to easy_mpl.parallel_coordinates

partial_autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None)[source]

Partial autocorrelation of individual features of data

Parameters:

n_lags (int, optional) – number of lag steps to consider
cols (str, list, optional) – columns to use. If not defined then all the columns are used

plot_data(st=None, en=None, freq: Optional[str] = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)[source]

Plots the data.

Parameters:

st (int, str, optional) – starting row/index in data to be used for plotting
en (int, str, optional) – end row/index in data to be used for plotting
cols (str, list, optional) – columns in data to consider for plotting
max_cols_in_plot (int, optional) – Maximum number of columns in one plot. Maximum number of plots depends upon this value and number of columns in data.
freq (str, optional) – one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’, determines interval of plot of data. It is valid for only time-series data.
ignore_datetime_index (bool, optional) – only valid if dataframe’s index is pd.DatetimeIndex. In such a case, if you want to ignore time index on x-axis, set this to True.
**kwargs – any arguments for pandas plot method

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(busan_beach())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)

plot_ecdf(cols=None, figsize=None, **kwargs)[source]

Plots the empirical cumulative distribution function

Parameters:

cols – columns to use
figsize – figure size
kwargs – any keyword argument for axis.plot
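
For readers unfamiliar with the ECDF, the sketch below computes the points that such a plot would draw for one column: the sorted values on the x-axis and the fraction of observations less than or equal to each value on the y-axis. The helper ecdf is hypothetical and not part of ai4water.

```python
def ecdf(values):
    """Return sorted values and their cumulative fractions (hypothetical helper)."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (1-based) gets cumulative fraction i / n
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


xs, ys = ecdf([3.0, 1.0, 2.0, 2.0])
print(list(zip(xs, ys)))
```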

plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)[source]

Plots distribution of data as histogram.

Parameters:

st – starting index of data to use
en – end index of data to use
cols – columns to use
max_subplots (int, optional) – maximum number of subplots in one figure
figsize – figure size
**kwargs – any keyword argument for pandas.DataFrame.hist function

plot_index(st=None, en=None, **kwargs)[source]: plots the datetime index of dataframe

plot_missing(st=None, en=None, cols=None, **kwargs)[source]

Plots data to indicate missingness in data

Parameters:

cols (list, str, optional) – columns to be used.
st (int, str, optional) – starting row/index in data to be used for plotting
en (int, str, optional) – end row/index in data to be used for plotting
**kwargs – Keyword Args such as figsize

Example

>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.plot_missing()

plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)[source]

Plots principal components.

Parameters:

num_pcs –
st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
save_as_csv –
figsize – figure size
kwargs – will go to sns.pairplot.

probability_plots(cols: Optional[Union[str, list]] = None)[source]: draws probability plots using scipy.stats.probplot. See scipy distributions

show_unique_vals(threshold: int = 10, st=None, en=None, cols=None, max_subplots: int = 9, figsize: Optional[tuple] = None, **kwargs)[source]

Shows percentage of unique/categorical values in data. Only those columns are used in which unique values are below threshold.

Parameters:

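threshold (int, optional) – only columns with fewer unique values than this are used
st (int, str, optional) – starting row/index in data to be used for plotting
en (int, str, optional) – end row/index in data to be used for plotting
cols (str, list, optional) – columns to use
max_subplots (int, optional) – maximum number of subplots in one figure
figsize (tuple, optional) – figure size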
**kwargs – Any keyword arguments for easy_mpl.pie
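
The role of threshold can be sketched as follows: columns with fewer unique values than the threshold are kept, and the share of each value is expressed as a percentage, which is what a pie chart would display. The helper unique_value_percentages and the sample columns are hypothetical, and the strict "fewer than" comparison is an assumption based on the description above.

```python
from collections import Counter


def unique_value_percentages(columns, threshold=10):
    """Percent share of each unique value, for columns with fewer than `threshold` unique values."""
    result = {}
    for name, values in columns.items():
        counts = Counter(values)
        if len(counts) < threshold:
            total = len(values)
            result[name] = {val: 100.0 * c / total for val, c in counts.items()}
    return result


columns = {
    "season": ["wet", "dry", "dry", "wet", "dry"],  # 2 unique values: kept
    "flow": [float(i) for i in range(20)],           # 20 unique values: skipped
}
shares = unique_value_percentages(columns, threshold=10)
print(shares)
```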

stats(precision=3, inputs=True, outputs=True, st=None, en=None, out_fmt='csv')[source]

Computes the statistics of inputs and outputs and saves them to a file, in csv or json format depending on out_fmt.

Parameters:

inputs (bool) –
fpath (str, path like) –
out_fmt (str) – in which format to save, csv or json