Exploratory Data Analysis
The purpose of this module is to quickly explore the data with as many plots as possible.
- class ai4water.eda.EDA(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)[source]
Bases: Plot
Performs a comprehensive exploratory data analysis on tabular/structured data. It is meant to be a one-stop shop for EDA.
- - heatmap
- - box_plot
- - plot_missing
- - plot_histograms
- - plot_index
- - plot_data
- - plot_pcs
- - grouped_scatter
- - correlation
- - stats
- - autocorrelation
- - partial_autocorrelation
- - probability_plots
- - lag_plot
- - plot_ecdf
- - normality_test
- - parallel_coordinates
- - show_unique_vals
- Example:
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda()  # to plot all available plots with a single line
- __init__(data: Union[DataFrame, List[DataFrame], Dict, ndarray], in_cols=None, out_cols=None, path=None, dpi=300, save=True, show=True)[source]
- Parameters:
data (DataFrame, array, dict, list) – either a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array
in_cols (str, list, optional) – columns to consider as input features
out_cols (str, optional) – columns to consider as output features
path (str, optional) – the path where the figures are saved. If not given, plots will be saved in the ‘data’ folder in the current working directory.
save (bool, optional) – whether to save the plots or not
show (bool, optional) – whether to show the plots or not
dpi (int, optional) – the resolution with which to save the image
- __call__(methods: Union[str, list] = 'all', cols=None)[source]
Shortcut to draw as many plots as possible.
- autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None, figsize: Optional[tuple] = None)[source]
Plots the autocorrelation of individual features of the data.
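To make the quantity being plotted concrete, here is a minimal sketch of the sample autocorrelation of one feature for lags 0 to n_lags. The helper name `sample_autocorrelation` and the synthetic series are assumptions for illustration; this is not necessarily how EDA.autocorrelation computes its values internally.

```python
import numpy as np


def sample_autocorrelation(x, n_lags=10):
    """Return the sample autocorrelation of a 1-D series for lags 0..n_lags.

    Hypothetical helper for illustration only; not part of ai4water.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the mean before correlating
    denom = np.dot(x, x)                  # lag-0 sum of squares
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / denom for k in range(n_lags + 1)])


rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))  # random walk, so nearby values are related
acf = sample_autocorrelation(series, n_lags=10)
```

The value at lag 0 is always 1, and the remaining values show how strongly the series resembles a shifted copy of itself.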
- box_plot(st=None, en=None, cols: Optional[Union[str, list]] = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)[source]
Plots a box-and-whisker or violin plot of the data.
- Parameters:
st (optional) – starting row/index in data to be used for plotting
en (optional) – end row/index in data to be used for plotting
cols (list) – the names of columns from data to be plotted.
normalize – If True, then each feature/column is rescaled between 0 and 1.
figsize – figure size
freq (str) – one of ‘weekly’, ‘monthly’, ‘yearly’. If given, box plot will be plotted for these intervals.
max_features (int) – maximum number of features to appear in one plot.
violen (bool) – if True, then a violin plot will be plotted, otherwise a box-and-whisker plot
show_datapoints (bool) – if True, sns.swarmplot() will be plotted. This can be time consuming for bigger data.
**kwargs – any args for seaborn.boxplot/seaborn.violinplot or seaborn.swarmplot.
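The effect of normalize=True can be illustrated with plain min-max scaling, which maps every column onto the range 0 to 1 so that features with very different units can share one axis. This is a sketch under the assumption that the rescaling is ordinary min-max scaling; the small dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical data: two features on very different scales.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# Min-max scaling, column by column: the minimum maps to 0, the maximum to 1.
scaled = (df - df.min()) / (df.max() - df.min())
```

A column whose values are all identical has zero range, so this formula yields NaN for it; such columns need separate handling.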
- correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: Optional[str] = None, **kwargs)[source]
Plots correlation between features.
- Parameters:
remove_targets (bool, optional) – whether to remove the output/target column or not
st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
cols – columns to use
method (str, optional) – {“pearson”, “spearman”, “kendall”, “covariance”}, by default “pearson”
split (str) – To plot only positive correlations, set it to “pos” or to plot only negative correlations, set it to “neg”.
**kwargs (keyword Args) – Any additional keyword arguments for seaborn.heatmap
Example
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> vis = EDA(busan_beach())
>>> vis.correlation()
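The method and split options can be pictured with pandas alone. The sketch below computes a Spearman correlation matrix and then keeps only the positive or only the negative entries. The column names and synthetic data are assumptions, and masking with `DataFrame.where` is one plausible way to get the “pos”/“neg” views, not a statement of how EDA.correlation is implemented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical features: "up" rises with x, "down" falls with x.
df = pd.DataFrame({
    "x": x,
    "up": x + 0.1 * rng.normal(size=200),
    "down": -x + 0.1 * rng.normal(size=200),
})

corr = df.corr(method="spearman")   # "pearson" and "kendall" are also accepted by pandas

pos_only = corr.where(corr > 0)     # negative entries become NaN
neg_only = corr.where(corr < 0)     # positive entries become NaN
```

Passing either masked matrix to seaborn.heatmap leaves the NaN cells blank, which gives a view restricted to one sign of correlation.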
- grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)[source]
Makes a scatter plot for each feature in the data.
- Parameters:
st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
cols –
max_subplots (int, optional) – it can be set to large number to show all the scatter plots on one axis.
kwargs – keyword arguments for sns.pairplot
- heatmap(st=None, en=None, cols=None, figsize: Optional[tuple] = None, **kwargs)[source]
Plots data as heatmap which depicts missing values.
- Parameters:
- Return type:
None
Example
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.heatmap()
- property in_cols
- lag_plot(n_lags: Union[int, list] = 1, cols=None, figsize=None, **kwargs)[source]
lag plot between an array and its lags
- Parameters:
n_lags – lag step against which to plot the data, it can be integer or a list of integers
cols – columns to use
figsize – figure size
kwargs – any keyword arguments for axis.scatter
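A lag plot scatters each value of a series against the value that occurs a fixed number of steps later. The sketch below builds those pairs with numpy; the helper `lag_pairs` is a hypothetical name used only for illustration.

```python
import numpy as np


def lag_pairs(x, lag=1):
    """Return (x[t], x[t + lag]) pairs as two aligned arrays.

    Hypothetical helper for illustration only; not part of ai4water.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    x = np.asarray(x)
    return x[:-lag], x[lag:]


y_t, y_lagged = lag_pairs([1, 2, 3, 4, 5], lag=2)
```

Plotting `y_t` against `y_lagged` with a scatter call gives the lag plot for that lag; points that fall along a line indicate serial dependence.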
- normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: Optional[tuple] = None)[source]
Plots the statistics of a normality test as bar charts. The statistic for each feature is calculated with the Shapiro-Wilk test, the Anderson-Darling test or the Kolmogorov-Smirnov test. The Shapiro-Wilk and Anderson-Darling statistics are computed with scipy.stats.shapiro and scipy.stats.anderson respectively.
- Parameters:
method – either “shapiro”, “anderson” or “kolmogorov”; default is “shapiro”
cols – columns to use
st (optional) – start of data
en (optional) – end of data to use
orientation (optional) – orientation of bars
color – color to use
figsize (tuple, optional) – figure size (width, height)
Example
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(data=busan_beach())
>>> eda.normality_test()
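For readers who want to see the underlying statistics, the sketch below calls the two scipy functions named above directly on synthetic samples. The sample names and distributions are assumptions for illustration; the bar chart drawn by normality_test is not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=300)        # drawn from a normal distribution
skewed_sample = rng.exponential(size=300)   # clearly non-normal

# Shapiro-Wilk: W close to 1 is consistent with normality.
w_normal, p_normal = stats.shapiro(normal_sample)
w_skewed, p_skewed = stats.shapiro(skewed_sample)

# Anderson-Darling: a larger statistic means a stronger departure from normality.
ad_result = stats.anderson(skewed_sample, dist="norm")
ad_statistic = ad_result.statistic
```

Comparing `w_normal` with `w_skewed` shows the kind of per-feature contrast that a bar chart of test statistics makes visible.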
- property out_cols
- parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)[source]
Plots data as parallel coordinates.
- Parameters:
st – start of data to be considered
en – end of data to be considered
cols – columns from data to be considered.
color – color or colormap to be used.
**kwargs – any additional keyword arguments to be passed to easy_mpl.parallel_coordinates
- partial_autocorrelation(n_lags: int = 10, cols: Optional[Union[str, list]] = None)[source]
Partial autocorrelation of individual features of data
- plot_data(st=None, en=None, freq: Optional[str] = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)[source]
Plots the data.
- Parameters:
st (int, str, optional) – starting row/index in data to be used for plotting
en (int, str, optional) – end row/index in data to be used for plotting
cols (str, list, optional) – columns in data to consider for plotting
max_cols_in_plot (int, optional) – Maximum number of columns in one plot. Maximum number of plots depends upon this value and number of columns in data.
freq (str, optional) – one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’, determines interval of plot of data. It is valid for only time-series data.
ignore_datetime_index (bool, optional) – only valid if dataframe’s index is pd.DateTimeIndex. In such a case, if you want to ignore time index on x-axis, set this to True.
**kwargs – any arguments for the pandas plot method
Example
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> eda = EDA(busan_beach())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)
- plot_ecdf(cols=None, figsize=None, **kwargs)[source]
Plots the empirical cumulative distribution function.
- Parameters:
cols – columns to use
figsize –
kwargs – any keyword argument for axis.plot
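The empirical cumulative distribution function assigns to each observed value the fraction of observations less than or equal to it. A minimal numpy sketch follows; the helper name `ecdf` and the choice to drop NaN values are assumptions for illustration.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and their empirical cumulative probabilities.

    Hypothetical helper for illustration only; NaN values are dropped.
    """
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])             # drop missing values, then sort
    y = np.arange(1, len(x) + 1) / len(x)    # fraction of points <= each x
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, float("nan")])
```

Plotting `y` against `x` as a step or line plot gives the ECDF curve for one column.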
- plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)[source]
Plots distribution of data as histogram.
- Parameters:
st – starting index of data to use
en – end index of data to use
cols – columns to use
max_subplots (int, optional) – maximum number of subplots in one figure
figsize – figure size
**kwargs – any keyword argument for the pandas.DataFrame.hist function
- plot_missing(st=None, en=None, cols=None, **kwargs)[source]
plot data to indicate missingness in data
- Parameters:
Example
>>> from ai4water.eda import EDA
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> vis = EDA(data)
>>> vis.plot_missing()
- plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)[source]
Plots principal components.
- Parameters:
num_pcs –
st – starting row/index in data to be used for plotting
en – end row/index in data to be used for plotting
save_as_csv –
figsize –
kwargs – will go to sns.pairplot.
- probability_plots(cols: Optional[Union[str, list]] = None)[source]
Draws a probability plot using scipy.stats.probplot. See scipy distributions.
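As a rough illustration of what scipy.stats.probplot returns, the sketch below calls it on a synthetic normal sample without drawing anything. The sample and the variable names are assumptions; how probability_plots arranges the resulting figures is not shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=300)   # hypothetical feature values

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line fitted through the resulting points.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(sample, dist="norm")
```

When the data follow the chosen distribution, the points lie close to the fitted line and `r` is close to 1.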