Miscellaneous

Busan Beach data

ai4water.datasets.busan_beach(inputs: Optional[list] = None, target: Union[list, str] = 'tetx_coppml') → DataFrame[source]

Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.

The data is in the form of a multivariate time series and was collected over a period of two years during several precipitation events. The frequency of the environmental data is 30 minutes, while that of the ARG data is discontinuous. The data and its pre-processing are described in detail in Jang et al., 2021.

Parameters:
  • inputs

    features to use as input. By default, all environmental data is used, which consists of the following parameters:

    • tide_cm

    • wat_temp_c

    • sal_psu

    • air_temp_c

    • pcp_mm

    • pcp3_mm

    • pcp6_mm

    • pcp12_mm

    • wind_dir_deg

    • wind_speed_mps

    • air_p_hpa

    • mslp_hpa

    • rel_hum

  • target

    feature(s) to use as target/output. By default, tetx_coppml is used as target. Logically, one or more of the following can be considered as target:

    • ecoli

    • 16s

    • inti1

    • Total_args

    • tetx_coppml

    • sul1_coppml

    • blaTEM_coppml

    • aac_coppml

    • Total_otus

    • otu_5575

    • otu_273

    • otu_94

Returns:

a pandas DataFrame with inputs and target, indexed with a pandas.DatetimeIndex

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)
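Because the ARG observations are discontinuous while the environmental data has a 30-minute frequency, the returned frame contains NaNs in the target column between sampling events. A minimal sketch of keeping only the observed target rows before training (the values below are synthetic stand-ins, not actual busan_beach data):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the busan_beach frame: 30-min environmental data with
# a sparse, discontinuous target column (NaN where ARG was not sampled).
idx = pd.date_range("2018-06-01", periods=8, freq="30min")
df = pd.DataFrame({
    "tide_cm": np.linspace(10.0, 45.0, 8),
    "tetx_coppml": [np.nan, 1.2e4, np.nan, np.nan, 3.4e4, np.nan, np.nan, 5.6e4],
}, index=idx)

# Keep only the rows where the target was actually observed,
# a common preparation step before supervised training.
observed = df.dropna(subset=["tetx_coppml"])
print(observed.shape)  # (3, 2)
```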

Photodegradation of Malachite Green

This data is about the photocatalytic degradation of malachite green dye using noble-metal-doped BiFeO3. For further description of this data see Jafari et al., 2023; the data has also been used for removal efficiency prediction. This dataset consists of 1200 points collected during ~135 experiments.

param inputs:

features to use as input. By default, the following features are used as input:

  • Catalyst_type

  • Surface area

  • Pore Volume

  • Catalyst_loading (g/L)

  • Light_intensity (W)

  • time (min)

  • solution_pH

  • HA (mg/L)

  • Anions

  • Ci (mg/L)

  • Cf (mg/L)

type inputs:

list, optional

param target:

features to use as target. By default, Efficiency (%) is used, which is the photodegradation removal efficiency of dye from wastewater. The following are valid target names:

  • Efficiency (%)

  • k_first

  • k_2nd

type target:

str, optional, default="Efficiency (%)"

param encoding:

type of encoding to use for the two categorical features, i.e., Catalyst_type and Anions, to convert them into numerical form. Available options are ohe, le and None. If ohe is selected, the original input columns are replaced with one-hot encoded columns. This will result in 6 columns for Anions and 15 columns for Catalyst_type.

type encoding:

str, default=None

returns:
  • data (pd.DataFrame) – a pandas dataframe consisting of input and output features. The default setting will result in a dataframe of shape (1200, 12)

  • cat_encoder – catalyst encoder

  • an_encoder – encoder for anions

Examples

>>> from ai4water.datasets import mg_photodegradation
>>> mg_data, catalyst_encoder, anion_encoder = mg_photodegradation()
>>> mg_data.shape
(1200, 12)
... # the default encoding is None, but if we want to use one hot encoder
>>> mg_data_ohe, cat_enc, an_enc = mg_photodegradation(encoding="ohe")
>>> mg_data_ohe.shape
(1200, 31)
>>> cat_enc.inverse_transform(mg_data_ohe.iloc[:, 9:24].values)
>>> an_enc.inverse_transform(mg_data_ohe.iloc[:, 24:30].values)
... # if we want to use label encoder
>>> mg_data_le, cat_enc, an_enc = mg_photodegradation(encoding="le")
>>> mg_data_le.shape
(1200, 12)
>>> cat_enc.inverse_transform(mg_data_le.iloc[:, 9].values.astype(int))
>>> an_enc.inverse_transform(mg_data_le.iloc[:, 10].values.astype(int))
... # By default the target is efficiency but if we want
... # to use first order k as target
>>> mg_data_k, _, _ = mg_photodegradation(target="k_first")
... # if we want to use 2nd order k as target
>>> mg_data_k2, _, _ = mg_photodegradation(target="k_2nd")
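The le and ohe options correspond to label encoding and one-hot encoding of the two categorical columns. A minimal illustration of the difference using plain pandas on hypothetical catalyst labels (the label names here are made up, not the dataset's actual Catalyst_type values):

```python
import pandas as pd

# Hypothetical catalyst labels standing in for the Catalyst_type column.
cats = pd.Series(["no catalyst", "BiFeO3", "Ag-BiFeO3", "BiFeO3"], name="Catalyst_type")

# Label encoding: one integer code per category (what encoding="le" produces).
le_codes, le_classes = pd.factorize(cats)
print(le_codes)   # [0 1 2 1]

# One-hot encoding: one 0/1 column per category (what encoding="ohe" produces).
ohe = pd.get_dummies(cats)
print(ohe.shape)  # (4, 3)

# Either encoding is invertible back to the original labels.
assert list(le_classes[le_codes]) == list(cats)
```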

Groundwater of Punjab region

Groundwater level (metres below ground level) dataset from the Punjab region (Pakistan and north-west India), following the study of MacAllister et al., 2022.

param data_type:

either full or LTS. full contains the complete dataset: 68783 rows of observed groundwater level data from 4028 individual sites. LTS contains 7547 rows of groundwater level observations from 130 individual sites, which have water level data available for a period of more than 40 years and for which at least two thirds of the annual observations are available.

type data_type:

str (default=”full”)

param country:

the country for which to retrieve data. Either PAK or IND.

type country:

str (default=None)

returns:

a pandas DataFrame with datetime index

rtype:

pd.DataFrame

Examples

>>> from ai4water.datasets import gw_punjab
>>> full_data = gw_punjab()
>>> full_data.shape
(68782, 4)
... # find out the earliest observation
>>> print(full_data.sort_index().head(1))
>>> lts_data = gw_punjab(data_type="LTS")
>>> df_pak = gw_punjab(country="PAK")
>>> df_pak.sort_index().dropna().head(1)
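Since the returned frame has a datetime index, standard pandas resampling applies directly, e.g. when screening sites for long-term trends. A sketch on synthetic data (the column name DTWL and the values are hypothetical, not taken from gw_punjab):

```python
import pandas as pd

# Toy stand-in for a gw_punjab frame: datetime-indexed depth-to-water
# readings (metres below ground level) at irregular dates.
idx = pd.to_datetime(["2000-03-01", "2000-09-01", "2001-03-01", "2001-09-01"])
gwl = pd.DataFrame({"DTWL": [5.0, 6.0, 6.5, 7.5]}, index=idx)

# Annual mean water level, a typical first step when looking
# for long-term trends at a site.
annual = gwl["DTWL"].resample("YS").mean()
print(annual.tolist())  # [5.5, 7.0]
```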

Weisssee

class ai4water.datasets.Weisssee(path=None, overwrite=False, **kwargs)[source]

Bases: Datasets

__init__(path=None, overwrite=False, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

dynamic_attributes = ['Precipitation_measurements', 'long_wave_upward_radiation', 'snow_density_at_30cm', 'long_wave_downward_radiation']
fetch(**kwargs)[source]

Examples

>>> from ai4water.datasets import Weisssee
>>> dataset = Weisssee()
>>> data = dataset.fetch()
url = '10.1594/PANGAEA.898217'

WeatherJena

class ai4water.datasets.WeatherJena(path=None, obs_loc='roof')[source]

Bases: Datasets

10-minute weather dataset of Jena, Germany, hosted at https://www.bgc-jena.mpg.de/wetter/index.html from 2002 onwards.

>>> from ai4water.datasets import WeatherJena
>>> dataset = WeatherJena()
>>> data = dataset.fetch()
>>> data.sum()
__init__(path=None, obs_loc='roof')[source]

The ETP data is collected at three different locations, i.e., roof, soil and saale (hall).

Parameters:

obs_loc (str, optional (default=roof)) –

location of observation. It can be one of the following:
  • roof

  • soil

  • saale

property dynamic_features: list

returns names of features available

fetch(st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) → DataFrame[source]

Fetches the time series data between the given period as a pandas DataFrame.

Parameters:
  • st (Optional) – start of the data to be fetched. If None, data will be returned from the start (2003-01-01).

  • en (Optional) – end of the data to be fetched. If None, data will be returned till the end (2021-12-31).

Returns:

a pandas dataframe of shape (972111, 21)

Return type:

pd.DataFrame

Examples

>>> from ai4water.datasets import WeatherJena
>>> dataset = WeatherJena()
>>> data = dataset.fetch()
>>> data.shape
(972111, 21)
... # get data between specific period
>>> data = dataset.fetch("20110101", "20201231")
>>> data.shape
(525622, 21)
url = 'https://www.bgc-jena.mpg.de/wetter/weather_data.html'
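The st/en arguments of fetch accept compact date strings such as "20110101". The same selection can be reproduced on any DatetimeIndex with pandas partial-string slicing; a sketch on a synthetic 10-minute series (the column name and values are made up):

```python
import pandas as pd
import numpy as np

# Toy 10-minute series standing in for the WeatherJena frame.
idx = pd.date_range("2010-12-31", "2011-01-02", freq="10min")
data = pd.DataFrame({"T (degC)": np.zeros(len(idx))}, index=idx)

# Compact YYYYMMDD strings select a closed interval on a DatetimeIndex,
# mirroring fetch("20110101", "20110101").
subset = data.loc["20110101":"20110101"]
print(subset.shape[0])  # 144 rows: one full day at 10-minute frequency
```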

SWECanada

class ai4water.datasets.SWECanada(path=None, **kwargs)[source]

Bases: Datasets

Daily Canadian historical Snow Water Equivalent dataset from 1928 to 2020 from Brown et al., 2019.

Examples

>>> from ai4water.datasets import SWECanada
>>> swe = SWECanada()
... # get names of all available stations
>>> stns = swe.stations()
>>> len(stns)
2607
... # get data of one station
>>> df1 = swe.fetch('SCD-NS010')
>>> df1['SCD-NS010'].shape
(33816, 3)
... # get data of 5 stations
>>> df5 = swe.fetch(5, st='20110101')
>>> df5.keys()
['YT-10AA-SC01', 'ALE-05CA805', 'SCD-NF078', 'SCD-NF086', 'INA-07RA01B']
>>> [v.shape for v in df5.values()]
[(3500, 3), (3500, 3), (3500, 3), (3500, 3), (3500, 3)]
... # get data of 0.1% of stations
>>> df2 = swe.fetch(0.001, st='20110101')
... # get data of one station starting from 2011
>>> df3 = swe.fetch('ALE-05AE810', st='20110101')
>>> df3.keys()
['ALE-05AE810']
... # get data of the first 10 stations
>>> df4 = swe.fetch(stns[0:10], st='20110101')
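Since fetch returns a dictionary of per-station dataframes, a common follow-up is stacking them into one frame. A sketch with synthetic stand-ins (the station names and values here are hypothetical, not actual SWECanada data):

```python
import pandas as pd
import numpy as np

# fetch() returns a dict of per-station frames; toy stand-ins with the
# three SWE features for two hypothetical stations.
idx = pd.date_range("2011-01-01", periods=4, freq="D")
stations = {
    "STN-A": pd.DataFrame(np.ones((4, 3)), index=idx, columns=["snw", "snd", "den"]),
    "STN-B": pd.DataFrame(np.zeros((4, 3)), index=idx, columns=["snw", "snd", "den"]),
}

# Stack into one long frame with a (station, time) MultiIndex.
combined = pd.concat(stations, names=["station"])
print(combined.shape)  # (8, 3)
```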
__init__(path=None, **kwargs)[source]
Parameters:
  • name – str (default=None) name of dataset

  • units – str, (default=None) the unit system being used

  • path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded

property end
features = ['snw', 'snd', 'den']
fetch(station_id: Union[None, str, float, int, list] = None, features: Union[None, str, list] = None, q_flags: Union[None, str, list] = None, st=None, en=None) → dict[source]

Fetches time series data from selected stations.

Parameters:
  • station_id – station/stations to be retrieved. If None, then data from all stations will be returned.

  • features

    Names of features to be retrieved. The following features are allowed:

    • snw snow water equivalent kg/m3

    • snd snow depth m

    • den snowpack bulk density kg/m3

    If None, then all three features will be retrieved.

  • q_flags

    If None, then no q_flags will be returned. The following q_flag values are available.

    • data_flag_snw

    • data_flag_snd

    • qc_flag_snw

    • qc_flag_snd

  • st – start of data to be retrieved

  • en – end of data to be retrieved.

Returns:

a dictionary of dataframes, each of shape (st:en, features + q_flags), whose length equals the number of stations being considered.

Return type:

dict

fetch_station_attributes(stn, features_to_fetch, st=None, en=None) → DataFrame[source]

Fetches the attributes of one station.

q_flags = ['data_flag_snw', 'data_flag_snd', 'qc_flag_snw', 'qc_flag_snd']
property start
stations() list[source]
url = 'https://doi.org/10.5194/essd-2021-160'