Water Quality
SWatCh
- class ai4water.datasets.Swatch(remove_csv_after_download=False, path=None, **kwargs)[source]
Bases:
Datasets
The Surface Water Chemistry (SWatCh) database as introduced in Franz and Lobke, 2022.
- __init__(remove_csv_after_download=False, path=None, **kwargs)[source]
- Parameters:
remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.
- fetch(parameters: Optional[Union[str, list]] = None, station_id: Optional[Union[str, list]] = None, station_names: Optional[Union[str, list]] = None) DataFrame [source]
- Parameters:
parameters (str/list (default=None)) – Names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.
station_id (str/list (default=None)) – id/ids of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.
station_names (str/list (default=None)) – name/names of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Swatch
>>> ds = Swatch()
>>> df = ds.fetch()
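The rule that station_id and station_names should not be given together can be pictured with the minimal sketch below. The helper resolve_station_filter is hypothetical and written only for illustration; it is not part of ai4water, and the library may handle this case differently.

```python
from typing import List, Optional, Tuple, Union

StrOrList = Optional[Union[str, List[str]]]


def resolve_station_filter(station_id: StrOrList = None,
                           station_names: StrOrList = None
                           ) -> Tuple[str, List[str]]:
    """Return (kind, values) describing which stations to select.

    Hypothetical helper, not part of ai4water.
    """
    # the two arguments are mutually exclusive
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")
    if station_id is not None:
        values = [station_id] if isinstance(station_id, str) else list(station_id)
        return "id", values
    if station_names is not None:
        values = ([station_names] if isinstance(station_names, str)
                  else list(station_names))
        return "name", values
    # neither argument given: data for all stations is selected
    return "all", []
```

A single string and a list of strings are both accepted, mirroring the str/list type of the two arguments.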
- property names: dict
Maps the names of parameters used in this class to their original names in the SWatCh dataset, in the form of a Python dictionary.
- num_samples(parameter, station_id=None) int [source]
- Parameters:
parameter (str) – name of the water quality parameter whose samples are to be quantified.
station_id – if given, samples of parameter will be returned for only this site/sites otherwise for all sites
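As an illustration of what num_samples counts, the sketch below tallies the samples of one parameter, optionally restricted to one or more stations. The record layout (dictionaries with the hypothetical keys parameter and station) and the helper name count_samples are assumptions made for this example and are not taken from the ai4water source.

```python
from typing import Dict, Iterable, List, Optional, Union


def count_samples(records: Iterable[Dict[str, object]],
                  parameter: str,
                  station_id: Optional[Union[str, List[str]]] = None) -> int:
    """Count records of `parameter`, for all stations or only the given ones.

    Hypothetical helper; the keys 'parameter' and 'station' are assumed.
    """
    if isinstance(station_id, str):
        station_id = [station_id]
    wanted = None if station_id is None else set(station_id)
    return sum(
        1 for rec in records
        if rec["parameter"] == parameter
        and (wanted is None or rec["station"] in wanted)
    )
```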
- url = 'https://zenodo.org/record/6484939'
GRQA
- class ai4water.datasets.GRQA(download_source: bool = False, path=None, **kwargs)[source]
Bases:
Datasets
Global River Water Quality Archive following the work of Virro et al., 2021 [21].
- __init__(download_source: bool = False, path=None, **kwargs)[source]
- Parameters:
download_source (bool) – whether to download source data or not
- fetch_parameter(parameter: str = 'COD', site_name: Optional[Union[str, List[str]]] = None, country: Optional[Union[str, List[str]]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
- Parameters:
- Returns:
a pandas dataframe
- Return type:
pd.DataFrame
Example
>>> from ai4water.datasets import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
we can find out the number of data points and sites available for a specific country as below
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")
- property files
- property parameters
- url = 'https://zenodo.org/record/7056647#.YzBzDHZByUk'
Quadica
- class ai4water.datasets.Quadica(path=None, **kwargs)[source]
Bases:
Datasets
This is a dataset of water quality parameters of Germany from 1386 stations from 1950 to 2018 following the work of Ebeling et al., 2022. The time-step is monthly and annual but the monthly timeseries data is not continuous.
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
- annual_medians() DataFrame [source]
Annual medians over the whole time series of water quality variables and discharge
- Returns:
a dataframe of shape (24393, 18)
- Return type:
pd.DataFrame
- avg_temp(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
monthly median average temperatures starting from 1950-01 to 2018-09
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a pandas dataframe of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp()  # -> (828, 1388)
- catchment_attributes(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) DataFrame [source]
Returns static physical catchment attributes in the form of dataframe.
- Parameters:
features (list/str, optional, (default=None)) – name/names of static attributes to fetch
stations (list/int, optional (default=None)) – name/names of stations whose static/physical features are to be read
- Returns:
a pandas dataframe of shape (stations, features). With default input arguments, shape is (1386, 113)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
... # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])
- fetch_monthly(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None, median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: Optional[int] = 0) Tuple[DataFrame, DataFrame] [source]
Fetches monthly concentrations of water quality parameters.
- Parameters:
features (str/list, optional (default=None)) –
name or names of water quality parameters to fetch. By default following parameters are considered
NO3
NO3N
TN
Nmin
PO4
PO4P
TP
DOC
TOC
stations (int/list, optional (default=None)) – name or names of stations whose data is to be fetched
median (bool, optional (default=True)) – whether to fetch median concentration values or not
fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not
fluxes (bool, optional (default=True)) – Setting this to true will add two features i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE
precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not
avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not
pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not
only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly timeseries data from 1993-01-01 to 2013-01-01.
cat_features (bool, optional (default=True)) – whether to fetch catchment features or not.
max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove every time-series that contains any missing values. If None, no time-series will be removed because of NaN values.
- Returns:
- two dataframes of the same length but with different columns
a pandas dataframe of timeseries of parameters (stations*timesteps, dynamic_features)
a pandas dataframe of static features (stations*timesteps, catchment_features)
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
... # However, mon_dyn contains data for all parameters and many of which have
... # large number of nans. If we want to fetch data only related to TN without any
... # missing value, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(features="TN", max_nan_tol=0)
... # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
... # 25
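The effect of max_nan_tol can be pictured with the plain-Python sketch below. The helper drop_series_with_nans and the dictionary layout are hypothetical. The behaviour for 0 and None follows the parameter description above, while the behaviour for values greater than 0 (keep a series with at most that many NaNs) is an assumption of this sketch.

```python
import math
from typing import Dict, List, Optional


def drop_series_with_nans(series_by_station: Dict[int, List[float]],
                          max_nan_tol: Optional[int] = 0
                          ) -> Dict[int, List[float]]:
    """Keep only the time-series whose NaN count is within the tolerance.

    Hypothetical helper, not the ai4water implementation.
    """
    if max_nan_tol is None:
        # no time-series is removed because of NaNs
        return dict(series_by_station)
    kept = {}
    for station, values in series_by_station.items():
        n_nan = sum(1 for v in values if math.isnan(v))
        # with max_nan_tol=0 a single NaN removes the whole series
        if n_nan <= max_nan_tol:
            kept[station] = values
    return kept
```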
- metadata() DataFrame [source]
Fetches the metadata about the stations as a pandas dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias are regression coefficients and percent bias of WRTDS models for each parameter.
- Returns:
a dataframe of shape (1386, 60)
- Return type:
pd.DataFrame
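For readers unfamiliar with percent bias, the sketch below shows one common way to compute it. Sign conventions differ between sources; this version uses simulated minus observed, so a positive value means overestimation. Whether the pbias columns of the metadata follow the same convention is not stated here, and the helper percent_bias is hypothetical.

```python
from typing import Sequence


def percent_bias(observed: Sequence[float],
                 simulated: Sequence[float]) -> float:
    """Percent bias as 100 * sum(sim - obs) / sum(obs).

    One common definition; the sign convention is an assumption.
    """
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    total_obs = sum(observed)
    if total_obs == 0:
        raise ValueError("sum of observed values must not be zero")
    residual = sum(s - o for o, s in zip(observed, simulated))
    return 100.0 * residual / total_obs
```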
- monthly_medians(features: Optional[Union[str, List[str]]] = None, stations: Optional[Union[List[int], int]] = None) DataFrame [source]
This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge
- Parameters:
features (list/str, optional, (default=None)) – name/names of features
stations (list/int, optional (default=None)) – stations for which the data is to be fetched
- Returns:
a dataframe of shape (16629, 18). 15 of the 18 columns represent a water chemistry parameter. The number of rows is close to 1386*12 = 16632, where 1386 is the number of stations and 12 is the number of months.
- Return type:
pd.DataFrame
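The idea of a monthly median over the whole time series can be sketched for a single station and a single variable as below: all values observed in the same calendar month, regardless of year, are pooled and their median is taken, which gives at most 12 values per station. The function name and the (year, month, value) tuple layout are assumptions for illustration and do not describe how c_months.csv was produced.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def monthly_medians_of(observations: Iterable[Tuple[int, int, float]]
                       ) -> Dict[int, float]:
    """Median per calendar month, pooled over all years.

    Hypothetical helper operating on (year, month, value) tuples.
    """
    by_month = defaultdict(list)
    for _year, month, value in observations:
        by_month[month].append(value)
    return {month: median(values) for month, values in sorted(by_month.items())}
```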
- pet(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
average monthly potential evapotranspiration starting from 1950-01 to 2018-09
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet()  # -> (828, 1386)
- precipitation(stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
sums of precipitation starting from 1950-01 to 2018-09
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a dataframe of shape (828, 1388)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation()  # -> (828, 1388)
- property station_names
names of stations
- to_DataSet(target: str = 'TP', input_features: Optional[list] = None, split: str = 'temporal', lookback: int = 24, **ds_args)[source]
This function prepares data for machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline which can be given to model.fit or model.predict
- Parameters:
target (str, optional (default="TP")) – parameter to consider as target
input_features (list, optional) – names of input features
split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, the training, validation and test sets are decided based upon stations.
lookback (int) –
**ds_args – key word arguments
- Returns:
an instance of DataSetPipeline
- Return type:
Example
>>> from ai4water.datasets import Quadica
... # initialize the Quadica class
>>> dataset = Quadica()
... # define the input features
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
... # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
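The difference between the temporal and spatial values of split can be illustrated with the plain-Python sketch below. The function names, the dictionary layout and the 60/20/20 percentages are assumptions chosen for the example; they are not the defaults or the implementation of to_DataSet.

```python
from typing import Dict, List, Sequence, Tuple


def _cut_points(n: int, train_pct: int, val_pct: int) -> Tuple[int, int]:
    """Indices that end the training and validation parts of n items."""
    return n * train_pct // 100, n * (train_pct + val_pct) // 100


def temporal_split(series: Dict[str, Sequence[float]],
                   train_pct: int = 60, val_pct: int = 20):
    """Split every station's series in time, then concatenate the parts."""
    train, val, test = [], [], []
    for station, values in series.items():
        i, j = _cut_points(len(values), train_pct, val_pct)
        train += [(station, v) for v in values[:i]]
        val += [(station, v) for v in values[i:j]]
        test += [(station, v) for v in values[j:]]
    return train, val, test


def spatial_split(series: Dict[str, Sequence[float]],
                  train_pct: int = 60, val_pct: int = 20
                  ) -> Tuple[List[str], List[str], List[str]]:
    """Assign whole stations to the training, validation or test set."""
    stations = list(series)
    i, j = _cut_points(len(stations), train_pct, val_pct)
    return stations[:i], stations[i:j], stations[j:]
```

With a temporal split every station contributes to all three sets, whereas with a spatial split the test stations are never seen during training.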
- url = {'catchment_attributes.csv': 'https://www.hydroshare.org/resource/88254bd930d1466c85992a7dea6947a4/data/contents/catchment_attributes.csv', 'metadata.pdf': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/Metadata_QUADICA.pdf', 'quadica.zip': 'https://www.hydroshare.org/resource/26e8238f0be14fa1a49641cd8a455e29/data/contents/QUADICA.zip'}
- wrtds_annual(features: Optional[Union[str, list]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.
- Parameters:
features (optional) –
st (optional) – starting point of data. By default, the data starts from 1992
en (optional) – end point of data. By default, the data ends at 2013
- Returns:
a dataframe of shape (4213, 46)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()
- wrtds_monthly(features: Optional[Union[str, list]] = None, stations: Optional[Union[List[int], int]] = None, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The data from all stations does not start and end at the same time, so some stations have more data points than others. The maximum number of data points for a station is 576, while the minimum is 244.
- Parameters:
features (str/list, optional) –
stations (int/list, optional (default=None)) – name/names of stations whose data is to be retrieved.
st (optional) – starting point of data. By default, the data starts from 1992-09
en (optional) – end point of data. By default, the data ends at 2013-12
- Returns:
a dataframe of shape (50186, 47)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()
RC4USCoast
- class ai4water.datasets.RC4USCoast(path=None, *args, **kwargs)[source]
Bases:
Datasets
Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature of 140 monitoring sites of US coasts from 1950 to 2020 following the work of Gomez et al., 2022.
Examples
>>> from ai4water.datasets import RC4USCoast
>>> dataset = RC4USCoast()
- __init__(path=None, *args, **kwargs)[source]
- Parameters:
path – path where the data is already downloaded. If None, the data will be downloaded to disk.
- fetch_chem(parameter, stations: Union[List[int], int, str] = 'all', as_dataframe: bool = False, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]
Returns water chemistry parameters from one or more stations.
- Parameters:
stations (list, str) – name/names of stations from which the parameters are to be fetched
as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
- Return type:
pandas DataFrame or xarray Dataset
Examples
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")
- fetch_q(stations: Union[int, List[int], str, ndarray] = 'all', as_dataframe: bool = True, nv=0, st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None)[source]
returns discharge data
- Parameters:
stations – stations for which q is to be fetched
as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset
nv (int (default=0)) –
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
Examples
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
... # get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape
(852, 140)  # where 140 is the number of stations
... # get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
... # get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
... # getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")
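The shapes shown in the examples of fetch_chem and fetch_q are consistent with the default period and the number of stations, as the small calculation below shows. The helper n_months is hypothetical and only parses YYYYMMDD strings of the kind accepted by st and en. Reading the 119280 rows as one row per month and station is an interpretation of the multi-indexed dataframe, not something taken from the source code.

```python
from datetime import datetime


def n_months(st: str, en: str) -> int:
    """Number of monthly steps between two YYYYMMDD strings, both inclusive.

    Hypothetical helper, not part of ai4water.
    """
    start = datetime.strptime(st, "%Y%m%d")
    end = datetime.strptime(en, "%Y%m%d")
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


steps = n_months("19500101", "20201201")
print(steps)        # 852, the number of rows returned by fetch_q
print(steps * 140)  # 119280, the number of rows of the multi-indexed dataframe
```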
- property parameters: List[str]
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27
- property stations: ndarray
>>> from ai4water.datasets import RC4USCoast
>>> ds = RC4USCoast(path=r'F:\data\RC4USCoast')
>>> len(ds.stations)
140
- url = {'RC4USCoast.zip': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/RC4USCoast.zip', 'info.xlsx': 'https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260455/supplemental/dataset_info.xlsx'}