Water Quality

SWatCh

class ai4water.datasets.Swatch(remove_csv_after_download=False, path=None, **kwargs)[source]

Bases: Datasets

The Surface Water Chemistry (SWatCh) database as introduced in Franz and Lobke, 2022.

__init__(remove_csv_after_download=False, path=None, **kwargs)[source]
Parameters:

remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.

property csv_name: str
fetch(parameters: Optional[Union[str, list]] = None, station_id: Optional[Union[str, list]] = None, station_names: Optional[Union[str, list]] = None) DataFrame[source]
Parameters:
  • parameters (str/list (default=None)) – names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.

  • station_id (str/list (default=None)) – id/ids of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.

  • station_names (str/list (default=None)) – name/names of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Swatch
>>> ds = Swatch()
>>> df = ds.fetch()
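
The rule that station_id and station_names should not be given together can be sketched as a small argument check. This is a minimal illustration only: resolve_station_filter is a hypothetical helper, not part of ai4water, and whether the library raises an error in this case is an assumption.

```python
from typing import List, Optional, Tuple, Union

StationArg = Optional[Union[str, List[str]]]


def resolve_station_filter(station_id: StationArg = None,
                           station_names: StationArg = None) -> Tuple[str, List[str]]:
    """Hypothetical helper: decide which kind of station filter was requested."""
    if station_id is not None and station_names is not None:
        # Assumption: passing both selectors is treated as an error.
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        values = [station_id] if isinstance(station_id, str) else list(station_id)
        return "id", values
    if station_names is not None:
        values = [station_names] if isinstance(station_names, str) else list(station_names)
        return "name", values
    # Default: no filter, so data for all stations would be used.
    return "all", []


print(resolve_station_filter(station_id="S1"))
print(resolve_station_filter(station_names=["A", "B"]))
print(resolve_station_filter())
```

A single string is wrapped in a list so that downstream code only has to handle one type.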
property names: dict

a Python dictionary that maps the parameter names used in this class to their original names in the SWatCh dataset

property npy_files: list
num_samples(parameter, station_id=None) int[source]
Parameters:
  • parameter (str) – name of the water quality parameter whose samples are to be quantified.

  • station_id – if given, samples of parameter are counted only for this site/these sites; otherwise they are counted for all sites
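
The counting described above can be illustrated on synthetic records. This is a sketch under stated assumptions, not the library's implementation: count_samples, the record layout, and the station ids and parameter names ("st1", "pH", "Ca") are all made up for the example.

```python
from typing import Iterable, List, Optional, Tuple, Union

# Synthetic (station_id, parameter, value) records; not real SWatCh data.
records = [
    ("st1", "pH", 7.1),
    ("st1", "pH", 6.9),
    ("st2", "pH", 7.4),
    ("st2", "Ca", 12.0),
]


def count_samples(rows: Iterable[Tuple[str, str, float]],
                  parameter: str,
                  station_id: Optional[Union[str, List[str]]] = None) -> int:
    """Hypothetical helper: count samples of one parameter, optionally per site."""
    if isinstance(station_id, str):
        station_id = [station_id]
    wanted = None if station_id is None else set(station_id)
    return sum(
        1
        for sid, par, _ in rows
        if par == parameter and (wanted is None or sid in wanted)
    )


print(count_samples(records, "pH"))                    # all sites
print(count_samples(records, "pH", station_id="st1"))  # one site
```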

property parameters: list

list of water quality parameters available

property site_names: list

list of site names

property sites: list

list of site names

url = 'https://zenodo.org/record/6484939'

GRQA

class ai4water.datasets.GRQA(download_source: bool = False, path=None, **kwargs)[source]

Bases: Datasets

Global River Water Quality Archive following the work of Virro et al., 2021 [21].

__init__(download_source: bool = False, path=None, **kwargs)[source]
Parameters:

download_source (bool) – whether to download source data or not

fetch_parameter(parameter: str = 'COD', site_name: Optional[Union[str, List[str]]] = None, country: Optional[Union[str, List[str]]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]
Parameters:
  • parameter (str, optional) – name of parameter

  • site_name (str/list, optional) – location for which data is to be fetched.

  • country (str/list, optional (default=None)) – name/names of the country/countries for which data is to be fetched.

  • st (str) – starting date or index

  • en (str) – end date or index

Returns:

a pandas dataframe

Return type:

pd.DataFrame

Example

>>> from ai4water.datasets import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
we can find out the number of data points and sites available for a specific country as below
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")
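
The per-country summary above (rows and distinct sites per parameter) can be reproduced on synthetic rows without downloading the archive. The sketch below is only an illustration: the rows, the site names and the summarize helper are hypothetical, and only the site_name column name is taken from the example above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Synthetic rows; the real GRQA tables have many more columns.
rows = [
    {"parameter": "COD", "country": "Germany", "site_name": "site A"},
    {"parameter": "COD", "country": "Germany", "site_name": "site A"},
    {"parameter": "COD", "country": "Germany", "site_name": "site B"},
    {"parameter": "DO", "country": "Germany", "site_name": "site A"},
    {"parameter": "DO", "country": "Pakistan", "site_name": "site C"},
]


def summarize(table: List[dict], country: str) -> Dict[str, Tuple[int, int]]:
    """Return {parameter: (number of rows, number of unique sites)} for one country."""
    counts: Dict[str, int] = defaultdict(int)
    sites: Dict[str, set] = defaultdict(set)
    for row in table:
        if row["country"] != country:
            continue
        counts[row["parameter"]] += 1
        sites[row["parameter"]].add(row["site_name"])
    return {par: (counts[par], len(sites[par])) for par in counts}


summary = summarize(rows, "Germany")
for para, (n_rows, n_sites) in summary.items():
    print(f"{para}, {n_rows}, {n_sites}")
```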
property files
property parameters
url = 'https://zenodo.org/record/7056647#.YzBzDHZByUk'

Quadica

class ai4water.datasets.Quadica(path=None, **kwargs)[source]

Bases: Datasets

This is a dataset of water quality parameters for Germany from 1386 stations, covering 1950 to 2018, following the work of Ebeling et al., 2022. Monthly and annual time steps are available, but the monthly time series are not continuous.

__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

annual_medians() DataFrame[source]

Annual medians over the whole time series of water quality variables and discharge

Returns:

a dataframe of shape (24393, 18)

Return type:

pd.DataFrame

avg_temp(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]

monthly median average temperatures starting from 1950-01 to 2018-09

Parameters:
  • stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1950-01

  • en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp() # -> (828, 1386)
catchment_attributes(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) DataFrame[source]

Returns static physical catchment attributes in the form of dataframe.

Parameters:
  • features (list/str, optional, (default=None)) – name/names of static attributes to fetch

  • stations (list/int, optional (default=None)) – name/names of stations whose static/physical features are to be read

Returns:

a pandas dataframe of shape (stations, features). With default input arguments, shape is (1386, 113)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])
property features: list

names of water quality parameters available in this dataset

fetch_annual()[source]
fetch_monthly(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None, median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: Optional[int] = 0) Tuple[DataFrame, DataFrame][source]

Fetches monthly concentrations of water quality parameters.

Parameters:
  • features (str/list, optional (default=None)) –

    name or names of water quality parameters to fetch. By default following parameters are considered

    • NO3

    • NO3N

    • TN

    • Nmin

    • PO4

    • PO4P

    • TP

    • DOC

    • TOC

  • stations (int/list, optional (default=None)) – name or names of stations whose data is to be fetched

  • median (bool, optional (default=True)) – whether to fetch median concentration values or not

  • fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not

  • fluxes (bool, optional (default=True)) – Setting this to true will add two features i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE

  • precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not

  • avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not

  • pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not

  • only_continuous (bool, optional (default=True)) – If true, data is returned only for those stations that have continuous monthly timeseries data from 1993-01-01 to 2013-01-01.

  • cat_features (bool, optional (default=True)) – whether to fetch catchment features or not.

  • max_nan_tol (int, optional (default=0)) – setting this value to 0 removes every time series that contains any missing value. If None, no time series is removed because of NaN values.

Returns:

two dataframes of the same length but with different columns
  • a pandas dataframe of timeseries of parameters (stations*timesteps, dynamic_features)

  • a pandas dataframe of static features (stations*timesteps, catchment_features)

Return type:

tuple

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters and many of which have
... # large number of nans. If we want to fetch data only related to TN without any
... # missing value, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(features="TN", max_nan_tol=0)
... # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
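
The effect of max_nan_tol can be sketched on synthetic series. The documentation above only specifies the cases 0 and None; treating a positive value as the largest number of NaNs a series may contain is an assumption of this sketch, and filter_by_nan_tolerance is a hypothetical helper, not part of ai4water.

```python
import math
from typing import Dict, List, Optional


def filter_by_nan_tolerance(series_by_station: Dict[int, List[float]],
                            max_nan_tol: Optional[int] = 0) -> Dict[int, List[float]]:
    """Keep only the stations whose series has at most max_nan_tol NaN values."""
    if max_nan_tol is None:
        # None: nothing is dropped because of missing values.
        return dict(series_by_station)
    kept = {}
    for station, values in series_by_station.items():
        n_missing = sum(1 for v in values if math.isnan(v))
        if n_missing <= max_nan_tol:
            kept[station] = values
    return kept


nan = float("nan")
# Synthetic monthly values for three made-up station ids.
data = {1: [0.5, 0.7, 0.6], 2: [0.4, nan, 0.3], 3: [nan, nan, 0.2]}

print(sorted(filter_by_nan_tolerance(data, max_nan_tol=0)))     # only complete series
print(sorted(filter_by_nan_tolerance(data, max_nan_tol=None)))  # every series
```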
metadata() DataFrame[source]

Fetches the metadata about the stations as a pandas dataframe. Each row represents metadata about one station and each column represents one feature. R2 and pbias are the regression coefficients and percent bias of the WRTDS models for each parameter.

Returns:

a dataframe of shape (1386, 60)

Return type:

pd.DataFrame

monthly_medians(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) DataFrame[source]

This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge

Parameters:
  • features (list/str, optional, (default=None)) – name/names of features

  • stations (list/int, optional (default=None)) – stations for which data is to be fetched

Returns:

a dataframe of shape (16629, 18). 15 of the 18 columns each represent a water chemistry parameter. The row count of 16629 is close to 1386*12 = 16632, where 1386 is the number of stations and 12 is the number of months.

Return type:

pd.DataFrame

pet(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]

average monthly potential evapotranspiration starting from 1950-01 to 2018-09

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet() # -> (828, 1386)
precipitation(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]

sums of precipitation starting from 1950-01 to 2018-09

Parameters:
  • stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1950-01

  • en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a dataframe of shape (828, 1388)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation() # -> (828, 1388)
property station_names

names of stations

property stattions: list

IDs of stations for which data is available

to_DataSet(target: str = 'TP', input_features: Optional[list] = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]

This function prepares data for a machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline, which can be passed to model.fit or model.predict

Parameters:
  • target (str, optional (default="TP")) – parameter to consider as target

  • input_features (list, optional) – names of input features

  • split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, the training, validation and test sets are assigned by station.

  • lookback (int) –

  • **ds_args – key word arguments

Returns:

an instance of DataSetPipeline

Return type:

ai4water.preprocessing.DataSet

Example

>>> from ai4water.datasets import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input features
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
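
The difference between the temporal and spatial options of split can be sketched on synthetic series. This is a minimal illustration, not the code used by to_DataSet: the helper names, the 20% validation/test fractions, the choice to hold out the end of each series, and the station ids are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, float]  # (station id, value)


def temporal_split(series: Dict[int, List[float]],
                   val_frac: float = 0.2,
                   test_frac: float = 0.2):
    """Hold out the tail of every station's series, then concatenate across stations."""
    train: List[Sample] = []
    val: List[Sample] = []
    test: List[Sample] = []
    for station, values in series.items():
        n_val = int(len(values) * val_frac)
        n_test = int(len(values) * test_frac)
        n_train = len(values) - n_val - n_test
        train += [(station, v) for v in values[:n_train]]
        val += [(station, v) for v in values[n_train:n_train + n_val]]
        test += [(station, v) for v in values[n_train + n_val:]]
    return train, val, test


def spatial_split(series: Dict[int, List[float]],
                  val_stations: Sequence[int],
                  test_stations: Sequence[int]):
    """Assign whole stations to the training, validation or test set."""
    train: List[Sample] = []
    val: List[Sample] = []
    test: List[Sample] = []
    for station, values in series.items():
        samples = [(station, v) for v in values]
        if station in test_stations:
            test += samples
        elif station in val_stations:
            val += samples
        else:
            train += samples
    return train, val, test


# Three made-up stations with ten time steps each.
series = {s: [float(s * 100 + t) for t in range(10)] for s in (1, 2, 3)}

t_train, t_val, t_test = temporal_split(series)
s_train, s_val, s_test = spatial_split(series, val_stations=[2], test_stations=[3])
print(len(t_train), len(t_val), len(t_test))
print(len(s_train), len(s_val), len(s_test))
```

With a temporal split every station contributes to all three sets; with a spatial split each station appears in exactly one set.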
url = {'catchment_attributes.csv': 'https://www.hydroshare.org/resource/88254bd930d1466c85992a7dea6947a4/data/contents/catchment_attributes.csv', 'metadata.pdf': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/Metadata_QUADICA.pdf', 'quadica.zip': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/QUADICA.zip'}
wrtds_annual(features: Optional[Union[str, list]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]

Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.

Parameters:
  • features (optional) –

  • st (optional) – starting point of data. By default, the data starts from 1992

  • en (optional) – end point of data. By default, the data ends at 2013

Returns:

a dataframe of shape (4213, 46)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()
wrtds_monthly(features: Optional[Union[str, list]] = None, stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame[source]

Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The stations do not all cover the same period, so the number of data points differs between stations: the maximum for a single station is 576 and the minimum is 244.

Parameters:
  • features (str/list, optional) –

  • stations (int/list, optional (default=None)) – name/names of stations whose data is to be retrieved.

  • st (optional) – starting point of data. By default, the data starts from 1992-09

  • en (optional) – end point of data. By default, the data ends at 2013-12

Returns:

a dataframe of shape (50186, 47)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()
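
Because the stations cover different periods, it can be useful to check how many data points each station has. The sketch below does this on synthetic long-format records; the record layout, station ids and dates are made up and do not reflect the actual columns returned by wrtds_monthly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Synthetic (station id, "YYYY-MM") records; not real QUADICA data.
records = [
    (101, "1992-09"), (101, "1992-10"), (101, "1992-11"),
    (202, "1993-01"), (202, "1993-02"),
]


def coverage(rows: List[Tuple[int, str]]) -> Dict[int, Tuple[str, str, int]]:
    """Return {station: (first month, last month, number of data points)}."""
    months: Dict[int, List[str]] = defaultdict(list)
    for station, month in rows:
        months[station].append(month)
    # "YYYY-MM" strings sort chronologically, so min/max give the period covered.
    return {s: (min(m), max(m), len(m)) for s, m in months.items()}


summary = coverage(records)
longest = max(n for _, _, n in summary.values())
shortest = min(n for _, _, n in summary.values())
print(summary)
print(longest, shortest)
```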

RC4USCoast

class ai4water.datasets.RC4USCoast(path=None, *args, **kwargs)[source]

Bases: Datasets

Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature from 140 monitoring sites along US coasts from 1950 to 2020, following the work of Gomez et al., 2022.

Examples

>>> from ai4water.datasets import RC4USCoast
>>> dataset = RC4USCoast()
__init__(path=None, *args, **kwargs)[source]
Parameters:

path – path where the data is already downloaded. If None, the data will be downloaded to disk.

property chem_fname: str
property end: Timestamp
fetch_chem(parameter, stations: Union[List[int], int, str] = 'all', as_dataframe: bool = False, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]

Returns water chemistry parameters from one or more stations.

Parameters:
  • parameter (list, str) – name/names of parameters to fetch

  • stations (list, str) – name/names of stations from which the parameters are to be fetched

  • as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset

  • st – start time of data to be fetched. The default starting date is 19500101

  • en – end time of data to be fetched. The default end date is 20201201

Return type:

pandas DataFrame or xarray Dataset

Examples

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")
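
The row count of 119280 reported above is consistent with one row per (station, month) pair over the default period. That reading of the multi-index is an assumption; the sketch below only checks the arithmetic, using the default dates and the 140 stations stated in this section. n_months is a hypothetical helper.

```python
from datetime import datetime


def n_months(start: str, end: str, fmt: str = "%Y%m%d") -> int:
    """Number of calendar months from start to end, counting both end months."""
    s = datetime.strptime(start, fmt)
    e = datetime.strptime(end, fmt)
    return (e.year - s.year) * 12 + (e.month - s.month) + 1


months = n_months("19500101", "20201201")  # default start and end dates
n_stations = 140                           # number of stations in the dataset
rows = months * n_stations

print(months)  # 852
print(rows)    # 119280
```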
fetch_q(stations: Union[int, List[int], str, ndarray] = 'all', as_dataframe: bool = True, nv=0, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]

returns discharge data

Parameters:
  • stations – stations for which q is to be fetched

  • as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset

  • nv (int (default=0)) –

  • st – start time of data to be fetched. The default starting date is 19500101

  • en – end time of data to be fetched. The default end date is 20201201

Examples

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
# get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
# get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
# get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
# getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")
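
How st and en narrow the time axis can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: it assumes one timestamp per month placed on the first day of the month, and that both bounds are inclusive. month_starts and select_window are hypothetical helpers.

```python
from datetime import datetime
from typing import List, Optional


def month_starts(start_year: int, end_year: int) -> List[datetime]:
    """First day of every month from January of start_year to December of end_year."""
    return [datetime(y, m, 1)
            for y in range(start_year, end_year + 1)
            for m in range(1, 13)]


def select_window(stamps: List[datetime],
                  st: Optional[str] = None,
                  en: Optional[str] = None,
                  fmt: str = "%Y%m%d") -> List[datetime]:
    """Keep the timestamps between st and en; both bounds are treated as inclusive."""
    lo = datetime.strptime(st, fmt) if st else None
    hi = datetime.strptime(en, fmt) if en else None
    return [t for t in stamps
            if (lo is None or t >= lo) and (hi is None or t <= hi)]


stamps = month_starts(1950, 2020)
window = select_window(stamps, st="20000101", en="20181230")

print(len(stamps))  # 852
print(len(window))  # 228
print(window[0].date(), window[-1].date())
```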
property info_fname: str
property parameters: List[str]
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27
property q_fname: str
property start: Timestamp
property stations: ndarray
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast(path=r'F:\data\RC4USCoast')
>>> len(ds.stations)
140
url = {'RC4USCoast.zip': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/RC4USCoast.zip', 'info.xlsx': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/supplemental/dataset_info.xlsx'}