Water Quality

SWatCh

class ai4water.datasets.Swatch(remove_csv_after_download=False, path=None, **kwargs)[source]

Bases: Datasets

The Surface Water Chemistry (SWatCh) database as introduced in Franz and Lobke, 2022.

__init__(remove_csv_after_download=False, path=None, **kwargs)[source]

Parameters: remove_csv_after_download (bool (default=False)) – if True, the csv file will be removed after downloading and processing.

property csv_name: str

fetch(parameters: Optional[Union[str, list]] = None, station_id: Optional[Union[str, list]] = None, station_names: Optional[Union[str, list]] = None) → DataFrame[source]

Parameters:

parameters (str/list (default=None)) –

Names of parameters to fetch. By default, name, value, val_unit, location,
lat, and long are read.
station_id (str/list (default=None)) – ID/IDs of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.
station_names (str/list (default=None)) – name/names of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Swatch
>>> ds = Swatch()
>>> df = ds.fetch()
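
The station_id and station_names arguments are described above as mutually exclusive. The following is a minimal sketch of how such a check could look in plain Python. The helper name resolve_station_filter and its return format are hypothetical and are not part of ai4water; the library may handle this case differently.

```python
from typing import List, Optional, Tuple, Union

StrOrList = Optional[Union[str, List[str]]]


def resolve_station_filter(station_id: StrOrList = None,
                           station_names: StrOrList = None
                           ) -> Tuple[Optional[str], Optional[List[str]]]:
    """Hypothetical helper (not part of ai4water): decide which single
    station filter applies and reject calls that pass both."""
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        key, value = "station_id", station_id
    elif station_names is not None:
        key, value = "station_names", station_names
    else:
        return None, None  # no filter: data for all stations
    # normalise a single string to a one-element list
    values = [value] if isinstance(value, str) else list(value)
    return key, values


print(resolve_station_filter(station_id="A1"))
print(resolve_station_filter())
```

With no arguments the sketch returns (None, None), which mirrors the documented default of fetching data for all stations.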

property names: dict: tells the names of parameters in this class and their original names in SWatCh dataset in the form of a python dictionary

property npy_files: list

num_samples(parameter, station_id=None) → int[source]

Parameters:

parameter (str) – name of the water quality parameter whose samples are to be quantified.
station_id – if given, the samples of parameter are counted for only this site/these sites; otherwise they are counted for all sites

property parameters: list: list of water quality parameters available

property site_names: list: list of site names

property sites: list: list of site names

url = 'https://zenodo.org/record/6484939'

GRQA

class ai4water.datasets.GRQA(download_source: bool = False, path=None, **kwargs)[source]

Bases: Datasets

Global River Water Quality Archive following the work of Virro et al., 2021 [21].

__init__(download_source: bool = False, path=None, **kwargs)[source]

Parameters: download_source (bool) – whether to download source data or not

fetch_parameter(parameter: str = 'COD', site_name: Optional[Union[str, List[str]]] = None, country: Optional[Union[str, List[str]]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

Parameters:

parameter (str, optional) – name of parameter
site_name (str/list, optional) – location for which data is to be fetched.
country (str/list, optional (default=None)) – country/countries for which data is to be fetched.
st (str) – starting date or index
en (str) – end date or index

Returns:

a pandas dataframe

Return type:

pd.DataFrame

Example

>>> from ai4water.datasets import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
... # fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
... # fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
... # find the number of data points and sites available for a specific country
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")

property files

property parameters

url = 'https://zenodo.org/record/7056647#.YzBzDHZByUk'

Quadica

class ai4water.datasets.Quadica(path=None, **kwargs)[source]

Bases: Datasets

This is a dataset of water quality parameters of Germany from 1386 stations from 1950 to 2018, following the work of Ebeling et al., 2022. The time step is monthly and annual, but the monthly time series data is not continuous.

__init__(path=None, **kwargs)[source]

Parameters:

name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

annual_medians() → DataFrame[source]

Annual medians over the whole time series of water quality variables and discharge

Returns: a dataframe of shape (24393, 18)
Return type: pd.DataFrame

avg_temp(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

monthly median average temperatures starting from 1950-01 to 2018-09

Parameters:

stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp() # -> (828, 1388)

catchment_attributes(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) → DataFrame[source]

Returns static physical catchment attributes in the form of dataframe.

Parameters:

features (list/str, optional, (default=None)) – name/names of static attributes to fetch
stations (list/int, optional (default=None)) – name/names of stations whose static/physical features are to be read

Returns:

a pandas dataframe of shape (stations, features). With default input arguments, shape is (1386, 113)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])

property features: list: names of water quality parameters available in this dataset

fetch_annual()[source]

fetch_monthly(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None, median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: Optional[int] = 0) → Tuple[DataFrame, DataFrame][source]

Fetches monthly concentrations of water quality parameters.

Parameters:

features (str/list, optional (default=None)) –
name or names of water quality parameters to fetch. By default following parameters are considered
- NO3
- NO3N
- TN
- Nmin
- PO4
- PO4P
- TP
- DOC
- TOC
stations (int/list, optional (default=None)) – name or names of stations whose data is to be fetched
median (bool, optional (default=True)) – whether to fetch median concentration values or not
fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not
fluxes (bool, optional (default=True)) – Setting this to true will add two features i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE
precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not
avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not
pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not
only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly time series data from 1993-01-01 to 2013-01-01.
cat_features (bool, optional (default=True)) – whether to fetch catchment features or not.
max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove every time series that has any missing values. If None, no time series with NaN values will be removed.

Returns:

two dataframes whose length is same but the columns are different

a pandas dataframe of timeseries of parameters (stations*timesteps, dynamic_features)
a pandas dataframe of static features (stations*timesteps, catchment_features)

Return type:

tuple

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters and many of which have
... # large number of nans. If we want to fetch data only related to TN without any
... # missing value, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(features="TN", max_nan_tol=0)
... # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
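
The effect of max_nan_tol can be illustrated without the dataset. The sketch below is a plain-Python illustration on toy data, not the ai4water implementation. The helper filter_by_nan_tolerance is hypothetical, and treating values greater than 0 as "maximum number of NaNs allowed per time series" is an assumption; the documentation above only describes the cases 0 and None.

```python
import math
from typing import Dict, List, Optional


def filter_by_nan_tolerance(series_by_station: Dict[int, List[float]],
                            max_nan_tol: Optional[int] = 0
                            ) -> Dict[int, List[float]]:
    """Hypothetical helper (not part of ai4water).

    Keeps a station's time series only if its number of NaN values does
    not exceed max_nan_tol. None disables the filter entirely.
    """
    if max_nan_tol is None:
        return dict(series_by_station)
    kept = {}
    for station, values in series_by_station.items():
        n_nan = sum(1 for v in values if math.isnan(v))
        if n_nan <= max_nan_tol:
            kept[station] = values
    return kept


# toy data: station 1 is complete, stations 2 and 3 contain NaNs
toy = {
    1: [1.0, 2.0, 3.0],
    2: [1.0, float("nan"), 3.0],
    3: [float("nan"), float("nan"), 3.0],
}
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=0)))     # [1]
print(sorted(filter_by_nan_tolerance(toy, max_nan_tol=None)))  # [1, 2, 3]
```

This matches the pattern in the example above, where max_nan_tol=None keeps every station and max_nan_tol=0 keeps only the stations without missing values.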

metadata() → DataFrame[source]

fetches the metadata about the stations as pandas’ dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias are regression coefficients and percent bias of WRTDS models for each parameter.

Returns: a dataframe of shape (1386, 60)
Return type: pd.DataFrame

monthly_medians(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) → DataFrame[source]

This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge

Parameters:

features (list/str, optional, (default=None)) – name/names of features
stations (list/int, optional (default=None)) – stations for which the data is to be fetched

Returns:

a dataframe of shape (16629, 18). 15 of the 18 columns represent a water chemistry parameter. The row count of 16629 roughly corresponds to 1386 stations times 12 months (1386*12 = 16632).

Return type:

pd.DataFrame

pet(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

average monthly potential evapotranspiration starting from 1950-01 to 2018-09

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet() # -> (828, 1386)

precipitation(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

sums of precipitation starting from 1950-01 to 2018-09

Parameters:

stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09

Returns:

a dataframe of shape (828, 1388)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation() # -> (828, 1388)

property station_names: names of stations

property stattions: list: IDs of stations for which data is available

to_DataSet(target: str = 'TP', input_features: Optional[list] = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]

This function prepares data for a machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline which can be given to model.fit or model.predict

Parameters:

target (str, optional (default="TP")) – parameter to consider as target
input_features (list, optional) – names of input features
split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, training validation and test is decided based upon stations.
lookback (int) –
**ds_args – key word arguments

Returns:

an instance of DataSetPipeline

Return type:

ai4water.preprocessing.DataSet

Example

>>> from ai4water.datasets import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input features
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
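
The difference between the temporal and spatial values of split can be sketched on toy data. The code below is a hypothetical plain-Python illustration and not ai4water code. The function split_rows, its train_fraction argument, and the simplification to a train/test split without a validation set are all assumptions made for brevity.

```python
from typing import Dict, List, Tuple

Rows = List[Tuple[int, int]]  # (station, time_step) pairs


def split_rows(steps_per_station: Dict[int, int],
               split: str = "temporal",
               train_fraction: float = 0.5) -> Tuple[Rows, Rows]:
    """Hypothetical illustration (not ai4water code) of two split styles.

    temporal: the last part of every station's series goes to the test set,
              and the per-station pieces are concatenated.
    spatial:  whole stations are assigned to either train or test.
    """
    train: Rows = []
    test: Rows = []
    stations = sorted(steps_per_station)
    if split == "temporal":
        for station in stations:
            n = steps_per_station[station]
            cut = int(n * train_fraction)
            train += [(station, t) for t in range(cut)]
            test += [(station, t) for t in range(cut, n)]
    elif split == "spatial":
        cut = int(len(stations) * train_fraction)
        for i, station in enumerate(stations):
            rows = [(station, t) for t in range(steps_per_station[station])]
            (train if i < cut else test).extend(rows)
    else:
        raise ValueError(f"unknown split: {split}")
    return train, test


toy = {1: 10, 2: 10, 3: 10, 4: 10}  # four stations, ten time steps each
_, test_temporal = split_rows(toy, "temporal")
_, test_spatial = split_rows(toy, "spatial")
print(sorted({s for s, _ in test_temporal}))  # [1, 2, 3, 4]
print(sorted({s for s, _ in test_spatial}))   # [3, 4]
```

In the temporal case every station appears in the test set, while in the spatial case the test set contains only stations that were never seen during training.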

url = {'catchment_attributes.csv': 'https://www.hydroshare.org/resource/88254bd930d1466c85992a7dea6947a4/data/contents/catchment_attributes.csv', 'metadata.pdf': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/Metadata_QUADICA.pdf', 'quadica.zip': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/QUADICA.zip'}

wrtds_annual(features: Optional[Union[str, list]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.

Parameters:

features (optional) –
st (optional) – starting point of data. By default, the data starts from 1992
en (optional) – end point of data. By default, the data ends at 2013

Returns:

a dataframe of shape (4213, 46)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()

wrtds_monthly(features: Optional[Union[str, list]] = None, stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

Monthly median concentrations, flow-normalized concentrations, and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The data from all stations does not start and end at the same time, so some stations have more data points than others. The maximum number of data points for a station is 576 and the minimum is 244.

Parameters:

features (str/list, optional) –
stations (int/list, optional (default=None)) – name/names of stations whose data is to be retrieved.
st (optional) – starting point of data. By default, the data starts from 1992-09
en (optional) – end point of data. By default, the data ends at 2013-12

Returns:

a dataframe of shape (50186, 47)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()

RC4USCoast

class ai4water.datasets.RC4USCoast(path=None, *args, **kwargs)[source]

Bases: Datasets

Monthly river water chemistry (N, P, SIO2, DO, etc.), discharge, and temperature at 140 monitoring sites along the US coasts from 1950 to 2020, following the work of Gomez et al., 2022.

Examples

>>> from ai4water.datasets import RC4USCoast
>>> dataset = RC4USCoast()

__init__(path=None, *args, **kwargs)[source]

Parameters: path – path where the data is already downloaded. If None, the data will be downloaded to disk.

property chem_fname: str

property end: Timestamp

fetch_chem(parameter, stations: Union[List[int], int, str] = 'all', as_dataframe: bool = False, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]

Returns water chemistry parameters from one or more stations.

Parameters:

parameter (list, str) – name/names of parameters to fetch
stations (list, str) – name/names of stations from which the parameters are to be fetched
as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201

Return type:

pandas DataFrame or xarray Dataset

Examples

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")

fetch_q(stations: Union[int, List[int], str, ndarray] = 'all', as_dataframe: bool = True, nv=0, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]

returns discharge data

Parameters:

stations – stations for which q is to be fetched
as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset
nv (int (default=0)) –
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201

Examples

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
... # get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
... # get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
... # get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
... # get data between specific dates
>>> data = ds.fetch_q("all", st="20000101", en="20181230")

property info_fname: str

property parameters: List[str]

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27

property q_fname: str

property start: Timestamp

property stations: ndarray

>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast(path=r'F:\data\RC4USCoast')
>>> len(ds.stations)
140

url = {'RC4USCoast.zip': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/RC4USCoast.zip', 'info.xlsx': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/supplemental/dataset_info.xlsx'}