Miscellaneous
Busan Beach data
- ai4water.datasets.busan_beach(inputs: Optional[list] = None, target: Union[list, str] = 'tetx_coppml') DataFrame [source]
Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.
The data is in the form of a multivariate time series and was collected over a period of 2 years during several precipitation events. The frequency of the environmental data is 30 minutes, while that of the ARG data is discontinuous. The data and its pre-processing are described in detail in Jang et al., 2021
- Parameters:
inputs –
features to use as input. By default all environmental data is used, which consists of the following parameters
tide_cm
wat_temp_c
sal_psu
air_temp_c
pcp_mm
pcp3_mm
pcp6_mm
pcp12_mm
wind_dir_deg
wind_speed_mps
air_p_hpa
mslp_hpa
rel_hum
target –
feature/features to use as target/output. By default tetx_coppml is used as target. Logically, one or more of the following can be considered as target
ecoli
16s
inti1
Total_args
tetx_coppml
sul1_coppml
blaTEM_coppml
aac_coppml
Total_otus
otu_5575
otu_273
otu_94
- Returns:
a pandas dataframe with inputs and target, indexed with pandas.DatetimeIndex
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)
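Since the environmental inputs are sampled every 30 minutes while the ARG target is discontinuous, aligning the two amounts to a join on the datetime index. A minimal self-contained sketch of that alignment with toy data (the values are made up; only the column names `wat_temp_c` and `tetx_coppml` are taken from the lists above):

```python
import pandas as pd
import numpy as np

# 30-minute environmental series over one day
idx = pd.date_range("2019-06-01", periods=48, freq="30min")
env = pd.DataFrame({"wat_temp_c": np.linspace(18.0, 22.0, 48)}, index=idx)

# discontinuous ARG observations at a few of those timestamps
arg = pd.Series([1.2e4, 3.4e4], index=[idx[4], idx[20]], name="tetx_coppml")

# left-join keeps the full 30-minute grid; the target is NaN where unobserved
df = env.join(arg)
print(df["tetx_coppml"].notna().sum())  # 2 observed target values
```

The returned frame keeps every 30-minute input row, so models that need a continuous target must drop or impute the NaN rows.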
Photodegradation of Malachite Green
This data is about the photocatalytic degradation of malachite green dye using noble-metal-doped BiFeO3. For a further description of this data see Jafari et al., 2023, and for the use of this data for removal efficiency prediction see . This dataset consists of 1200 points collected during ~135 experiments.
- param inputs:
features to use as input. By default the following features are used as input
Catalyst_type
Surface area
Pore Volume
Catalyst_loading (g/L)
Light_intensity (W)
time (min)
solution_pH
HA (mg/L)
Anions
Ci (mg/L)
Cf (mg/L)
- type inputs:
list, optional
- param target:
features to use as target. By default
Efficiency (%)
is used as target, which is the photodegradation removal efficiency of the dye from wastewater. The following are valid target names:
Efficiency (%)
k_first
k_2nd
- type target:
str, optional, default=”Efficiency (%)”
- param encoding:
type of encoding to use for the two categorical features, i.e. Catalyst_type and Anions, to convert them into numerical form. Available options are ohe, le and None. If ohe is selected, the original input columns are replaced with one-hot encoded columns. This will result in 6 columns for Anions and 15 columns for Catalyst_type.
- type encoding:
str, default=None
- returns:
data (pd.DataFrame) – a pandas dataframe consisting of input and output features. The default settings will result in a dataframe of shape (1200, 12)
cat_encoder – catalyst encoder
an_encoder – encoder for anions
Examples
>>> from ai4water.datasets import mg_photodegradation
>>> mg_data, catalyst_encoder, anion_encoder = mg_photodegradation()
>>> mg_data.shape
(1200, 12)
... # the default encoding is None, but if we want to use one hot encoder
>>> mg_data_ohe, cat_enc, an_enc = mg_photodegradation(encoding="ohe")
>>> mg_data_ohe.shape
(1200, 31)
>>> cat_enc.inverse_transform(mg_data_ohe.iloc[:, 9:24].values)
>>> an_enc.inverse_transform(mg_data_ohe.iloc[:, 24:30].values)
... # if we want to use label encoder
>>> mg_data_le, cat_enc, an_enc = mg_photodegradation(encoding="le")
>>> mg_data_le.shape
(1200, 12)
>>> cat_enc.inverse_transform(mg_data_le.iloc[:, 9].values.astype(int))
>>> an_enc.inverse_transform(mg_data_le.iloc[:, 10].values.astype(int))
... # By default the target is efficiency but if we want
... # to use first order k as target
>>> mg_data_k, _, _ = mg_photodegradation(target="k_first")
... # if we want to use 2nd order k as target
>>> mg_data_k2, _, _ = mg_photodegradation(target="k_2nd")
Groundwater of Punjab region
Groundwater level (meters below ground level) dataset from the Punjab region (Pakistan and north-west India), following the study of MacAllister et al., 2022.
- param data_type:
either full or LTS. full contains the full dataset: 68783 rows of observed groundwater level data from 4028 individual sites. LTS contains 7547 rows of groundwater level observations from 130 individual sites, which have water level data available for a period of more than 40 years and for which at least two thirds of the annual observations are available.
- type data_type:
str (default=”full”)
- param country:
the country for which to retrieve data. Either PAK or IND.
- type country:
str (default=None)
- returns:
a pandas DataFrame with datetime index
- rtype:
pd.DataFrame
Examples
>>> from ai4water.datasets import gw_punjab
>>> full_data = gw_punjab()
... # find out the earliest observation
>>> print(full_data.sort_index().head(1))
>>> lts_data = gw_punjab(data_type="LTS")
>>> lts_data.shape
>>> df_pak = gw_punjab(country="PAK")
>>> df_pak.sort_index().dropna().head(1)
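The LTS subset described above (sites with more than 40 years of record) can be approximated with a pandas groupby; a minimal self-contained sketch with toy long-format records (site names and values are made up):

```python
import pandas as pd
import numpy as np

# toy long-format water-level records: one row per (site, date) observation
rng = np.random.default_rng(0)
dates = pd.date_range("1960-01-01", "2020-12-31", freq="YS")  # annual observations
records = pd.DataFrame({
    "site": ["A"] * len(dates) + ["B"] * 10,
    "date": list(dates) + list(dates[:10]),
    "gwl_m_bgl": rng.uniform(2, 15, len(dates) + 10),
})

# keep only sites whose record spans more than 40 years
span = records.groupby("site")["date"].agg(lambda s: s.max().year - s.min().year)
lts_sites = span[span > 40].index
lts = records[records["site"].isin(lts_sites)]
print(sorted(lts_sites))  # ['A']
```

The actual LTS subset additionally requires at least two thirds of annual observations to be present, which would be a second groupby filter on observation counts.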
Weisssee
- class ai4water.datasets.Weisssee(path=None, overwrite=False, **kwargs)[source]
Bases:
Datasets
- __init__(path=None, overwrite=False, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
- dynamic_attributes = ['Precipitation_measurements', 'long_wave_upward_radiation', 'snow_density_at_30cm', 'long_wave_downward_radiation']
- fetch(**kwargs)[source]
Examples
>>> from ai4water.datasets import Weisssee
>>> dataset = Weisssee()
>>> data = dataset.fetch()
- url = '10.1594/PANGAEA.898217'
WeatherJena
- class ai4water.datasets.WeatherJena(path=None, obs_loc='roof')[source]
Bases:
Datasets
10 minute weather dataset of Jena, Germany hosted at https://www.bgc-jena.mpg.de/wetter/index.html from 2002 onwards.
>>> from ai4water.datasets import WeatherJena
>>> dataset = WeatherJena()
>>> data = dataset.fetch()
>>> data.sum()
- __init__(path=None, obs_loc='roof')[source]
The ETP data is collected at three different locations, i.e. roof, soil and saale (hall).
- Parameters:
obs_loc (str, optional (default=roof)) –
- location of observation. It can be one of the following
roof
soil
saale
- fetch(st: Optional[Union[str, int, DatetimeIndex]] = None, en: Optional[Union[str, int, DatetimeIndex]] = None) DataFrame [source]
Fetches the time series data for the given period as a pandas dataframe.
- Parameters:
st (Optional) – start of data to be fetched. If None, the data from the start (2003-01-01) will be returned.
en (Optional) – end of data to be fetched. If None, data till the end (2021-12-31) will be returned.
- Returns:
a pandas dataframe of shape (972111, 21)
- Return type:
pd.DataFrame
Examples
>>> from ai4water.datasets import WeatherJena
>>> dataset = WeatherJena()
>>> data = dataset.fetch()
>>> data.shape
(972111, 21)
... # get data between specific period
>>> data = dataset.fetch("20110101", "20201231")
>>> data.shape
(525622, 21)
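The st/en arguments behave like label-based time slicing on a pandas DatetimeIndex, where date strings are inclusive of both endpoints; a minimal self-contained sketch with a toy 10-minute series (the column name is made up):

```python
import pandas as pd
import numpy as np

# toy 10-minute series spanning two years
idx = pd.date_range("2020-01-01", "2021-12-31 23:50", freq="10min")
data = pd.DataFrame({"T_degC": np.zeros(len(idx))}, index=idx)

# slicing with day-resolution strings includes both endpoint days entirely
subset = data.loc["2021-01-01":"2021-12-31"]
print(len(subset))  # 365 days * 144 rows per day = 52560
```

pandas also parses compact forms such as "20210101" for such slices, which is the string format the examples above pass to fetch.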
- url = 'https://www.bgc-jena.mpg.de/wetter/weather_data.html'
SWECanada
- class ai4water.datasets.SWECanada(path=None, **kwargs)[source]
Bases:
Datasets
Daily Canadian historical Snow Water Equivalent dataset from 1928 to 2020, from Brown et al., 2019.
Examples
>>> from ai4water.datasets import SWECanada
>>> swe = SWECanada()
... # get names of all available stations
>>> stns = swe.stations()
>>> len(stns)
2607
... # get data of one station
>>> df1 = swe.fetch('SCD-NS010')
>>> df1['SCD-NS010'].shape
(33816, 3)
... # get data of 5 stations
>>> df5 = swe.fetch(5, st='20110101')
>>> df5.keys()
['YT-10AA-SC01', 'ALE-05CA805', 'SCD-NF078', 'SCD-NF086', 'INA-07RA01B']
>>> [v.shape for v in df5.values()]
[(3500, 3), (3500, 3), (3500, 3), (3500, 3), (3500, 3)]
... # get data of 0.1% of stations
>>> df2 = swe.fetch(0.001, st='20110101')
... # get data of one station starting from 2011
>>> df3 = swe.fetch('ALE-05AE810', st='20110101')
>>> df3.keys()
['ALE-05AE810']
>>> df4 = swe.fetch(stns[0:10], st='20110101')
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
- property end
- features = ['snw', 'snd', 'den']
- fetch(station_id: Union[None, str, float, int, list] = None, features: Union[None, str, list] = None, q_flags: Union[None, str, list] = None, st=None, en=None) dict [source]
Fetches time series data from selected stations.
- Parameters:
station_id – station/stations to be retrieved. If None, then data from all stations will be returned.
features –
Names of features to be retrieved. The following features are allowed:
snw: snow water equivalent kg/m3
snd: snow depth m
den: snowpack bulk density kg/m3
If None, then all three features will be retrieved.
q_flags –
If None, then no qflags will be returned. Following q_flag values are available.
data_flag_snw
data_flag_snd
qc_flag_snw
qc_flag_snd
st – start of data to be retrieved
en – end of data to be retrieved.
- Returns:
a dictionary of dataframes of shape (st:en, features + q_flags) whose length is equal to the number of stations being considered.
- Return type:
dict
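Because fetch returns a dict of per-station DataFrames rather than one frame, a common follow-up is stacking them into a single frame; a minimal self-contained sketch with toy station data (station names and values are made up, only the column names snw/snd/den come from features above):

```python
import pandas as pd
import numpy as np

# toy per-station frames, mimicking the dict returned by fetch()
idx = pd.date_range("2011-01-01", periods=4, freq="D")
stations = {
    "STN-A": pd.DataFrame({"snw": np.ones(4), "snd": np.ones(4), "den": np.ones(4)}, index=idx),
    "STN-B": pd.DataFrame({"snw": np.zeros(4), "snd": np.zeros(4), "den": np.zeros(4)}, index=idx),
}

# stack into one frame with a (station, date) MultiIndex
combined = pd.concat(stations, names=["station"])
print(combined.shape)  # (8, 3)
```

Keeping the station name as the outer index level makes per-station selection a simple `combined.loc["STN-A"]`.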
- fetch_station_attributes(stn, features_to_fetch, st=None, en=None) DataFrame [source]
Fetches attributes of one station.
- q_flags = ['data_flag_snw', 'data_flag_snd', 'qc_flag_snw', 'qc_flag_snd']
- property start
- url = 'https://doi.org/10.5194/essd-2021-160'