lsapy.stats.stats_summary

lsapy.stats.stats_summary(data, on_vars=None, on_dims=None, on_dim_values=None, bins=None, bins_labels=None, all_bins=False, cell_area=None, dropna=False, **kwargs)

Generate a descriptive statistics summary of the data.

Returns a pandas DataFrame summarizing the data according to the given parameters. The statistics include count, mean, std, min, max, and the 25%, 50%, and 75% percentiles. Bins can be provided to further group the data into intervals.

Parameters:

data (Dataset) – The input data.
on_vars (list | None) – Variables for which the statistics are calculated. If None (default), all variables are kept.
on_dims (list | None) – Dimensions for which the statistics are calculated. If None (default), all dimensions except spatial ones (i.e., lon or x and lat or y) are kept.
on_dim_values (dict[str, Any] | None) – Values of dimensions to be kept in the summary. If None (default), all values are kept.
bins (list | ndarray | None) – Bins defining data intervals. If None (default), no binning is performed.
bins_labels (list | None) – Labels for the bins. If None (default), bin values are used as labels. The length of the list must be equal to the number of bins. Ignored if bins is None.
all_bins (bool | None) – If True, an additional bin spanning the overall range of bins (from the lowest to the highest bound) is added. Default is False. Ignored if bins is None.
cell_area (tuple[float | int, str] | None) – If given, adds a column to the summary with the associated area, calculated from the count statistic. The tuple must contain the area of a single cell and the unit of that area, e.g. (5, "ha"). If None (default), no area column is added.
dropna (bool | None) – If True, dimensions with NaN values are removed. Default is False.
**kwargs (dict, optional) – Additional keyword arguments passed to pd.cut used to bin the data.

Returns:

A DataFrame with the statistics for the defined dimensions and variables, including: count, mean, std, min, max, and 25%, 50%, and 75% percentiles.

Return type:

DataFrame

Examples

>>> from lsapy.utils import open_data
>>> from lsapy.stats import stats_summary
>>> from lsapy import SuitabilityCriteria, LandSuitabilityAnalysis
>>> import lsapy.standardize as lstd
>>> from xclim.indicators.atmos import growing_degree_days

Let’s first define a Land Suitability Analysis (LSA):

>>> drainage = open_data("land", variables="drainage")
>>> tas = open_data("climate", variables="tas")
>>> sc = {
...     "drainage_class": SuitabilityCriteria(
...         name="drainage_class",
...         long_name="Drainage Class Suitability",
...         weight=3,
...         category="soilTerrain",
...         indicator=drainage,
...         func=lstd.discrete,
...         fparams={"rules": {0: 0, 1: 0.1, 2: 0.5, 3: 0.9, 4: 1}},
...     ),
...     "growing_degree_days": SuitabilityCriteria(
...         name="growing_degree_days",
...         long_name="Growing Degree Days Suitability",
...         weight=1,
...         category="climate",
...         indicator=growing_degree_days(tas, thresh="10 degC", freq="YS-JUL"),
...         func=lstd.vetharaniam2022_eq5,
...         fparams={"a": -1.41, "b": 801},
...     ),
... }
>>> lsa = LandSuitabilityAnalysis("land_use", sc)
>>> lsa.run(inplace=True)

We can then compute the statistics summary for the data:

>>> stats = stats_summary(lsa.data)
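
The statistics in the summary (count, mean, std, min, max, and the three percentiles) are the same set that pandas reports with describe. The following is a minimal pandas-only sketch on made-up values, meant only to show what each statistic looks like; whether stats_summary relies on describe internally is an assumption, not something stated here.

```python
import pandas as pd

# Made-up values, for illustration only (not lsapy output).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max.
desc = values.describe()

print(desc)
```
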

The on_vars, on_dims, and on_dim_values parameters can be used to filter the data. To get the statistics summary for only ‘growing_degree_days’ and ‘suitability’ over the first three years, we can do:

>>> stats = stats_summary(
...     lsa.data,
...     on_vars=["growing_degree_days", "suitability"],  # select variables
...     on_dim_values={"time": slice("2000", "2002")},  # select values of the time dimension
... )

This will compute the statistics for the two variables and for each year of the 2000-2002 period. We can also provide bins to group the data into intervals. For example, if we want to get the statistics for four bins (0-0.25, 0.25-0.5, 0.5-0.75, 0.75-1), we can do:

>>> stats = stats_summary(
...     lsa.data,
...     bins=[0, 0.25, 0.5, 0.75, 1],  # define bins
...     bins_labels=["unsuitable", "poorly suitable", "moderately suitable", "highly suitable"],  # define labels
...     all_bins=True,  # add an additional bin for the overall range (i.e., 0-1)
... )
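
To see how the five bin edges turn into four labelled intervals, here is a minimal pandas-only sketch that is independent of lsapy. It assumes the binning behaves like a plain pd.cut call (the **kwargs parameter is documented as being passed to pd.cut); the sample values are made up, and any defaults that stats_summary may override are not covered.

```python
import pandas as pd

# Made-up suitability values, for illustration only.
values = pd.Series([0.05, 0.30, 0.45, 0.60, 0.80, 0.95])

bins = [0, 0.25, 0.5, 0.75, 1]
labels = ["unsuitable", "poorly suitable", "moderately suitable", "highly suitable"]

# Five edges define four intervals, hence four labels.
# By default pd.cut uses right-closed intervals: (0, 0.25], (0.25, 0.5], ...
binned = pd.cut(values, bins=bins, labels=labels)

# Number of values falling into each bin, in bin order.
counts = values.groupby(binned, observed=False).count()

# Edge case: with pd.cut defaults, a value equal to the lowest edge is not binned.
edges_default = pd.cut(pd.Series([0.0, 1.0]), bins=bins, labels=labels)
edges_inclusive = pd.cut(pd.Series([0.0, 1.0]), bins=bins, labels=labels, include_lowest=True)
```

With plain pd.cut, a value of exactly 0 ends up as NaN unless include_lowest=True is set. If stats_summary forwards **kwargs unchanged, passing include_lowest=True would presumably address this, but that is worth checking against the actual output.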

Finally, we can get the area associated with each bin by providing the area of each cell in the data. Assuming that each cell has an area of 5 hectares (ha), we can do:

>>> stats = stats_summary(
...     lsa.data,
...     bins=[0, 0.25, 0.5, 0.75, 1],
...     bins_labels=["unsuitable", "poorly suitable", "moderately suitable", "highly suitable"],
...     all_bins=True,
...     cell_area=(5, "ha"),  # define the area of each cell
... )
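
The area column is derived from the count statistic. The sketch below shows the arithmetic under the assumption that the area is simply the number of cells multiplied by the area of one cell; the counts are made up and the column name "area_ha" is hypothetical, so the actual column name and layout produced by stats_summary may differ.

```python
import pandas as pd

# Hypothetical per-bin cell counts, for illustration only.
summary = pd.DataFrame(
    {"count": [120, 340, 410, 130]},
    index=["unsuitable", "poorly suitable", "moderately suitable", "highly suitable"],
)

# Same shape as the cell_area argument: (area of one cell, unit).
cell_area = (5, "ha")
area_value, area_unit = cell_area

# Assumed relation: area = number of cells * area of one cell.
summary[f"area_{area_unit}"] = summary["count"] * area_value

print(summary)
```
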