py_madaclim main package

API docs for both info and raster_manipulation modules under the main py-madaclim package.

You can find the API docs for the py-madaclim.utils sub-module here

py_madaclim.info module

class py_madaclim.info.MadaclimLayers(clim_raster: Path | None = None, env_raster: Path | None = None)[source]

Bases: object

A class that represents all of the information and data from the climate and environmental variable layers that can be found from the rasters of the Madaclim database.

The main metadata retrieval tool for the Madaclim database. Access information on all layers with the all_layers attribute. Also provides methods to filter layers and generate unique labels from the all_layers attribute. Access the CRS and band numbers of the climate and environmental rasters when they are provided in the constructor. Categorical data can be explored in detail with the categorical_layers attribute, and the value:category pairs with the get_categorical_combinations method.

clim_raster

The path to the Madaclim climate raster GeoTiff file. Defaults to None if not specified.

Type:

pathlib.Path

env_raster

The path to the Madaclim environmental raster GeoTiff file. Defaults to None if not specified.

Type:

pathlib.Path

all_layers

A DataFrame containing a complete and formatted version of all Madaclim layers.

Type:

pd.DataFrame

categorical_layers

A DataFrame containing the in depth information about the layers with categorical data.

Type:

pd.DataFrame

property all_layers: DataFrame

Retrieves the ‘all_layers’ Dataframe using the private ‘_get_madaclim_layers’ method.

Contains all information about all the raster layers in the Madaclim db.

Returns:

A DataFrame containing a complete and formatted version of all Madaclim layers.

Return type:

pd.DataFrame

property categorical_layers: DataFrame

Retrieves the ‘categorical_layers’ Dataframe using the private ‘_get_categorical_df’ method.

Contains detailed information about the categorical layers from the rasters in the Madaclim db.

Returns:

A DataFrame containing information for each categorical value in each layer

Return type:

pd.DataFrame

property clim_crs: CRS

Retrieves the Coordinate Reference System (CRS) from the Madaclim climate raster.

This property first validates the clim_raster attribute, ensuring its integrity and existence.

It then opens the raster file and retrieves the CRS in EPSG format. The EPSG code is used to create and return a pyproj CRS object.

Returns:

The CRS object derived from the EPSG code of the climate raster.

Return type:

pyproj.crs.crs.CRS

Example

The ‘clim_raster’ attribute must be valid before accessing the ‘clim_crs’ attribute.

>>> mada_info = MadaclimLayers()
>>> mada_info.clim_crs
Traceback (most recent call last):
    raise AttributeError(f"Undefined attribute: '{raster_attr_name}'. You need to assign a valid pathlib.Path to the related raster attribute first.")
AttributeError: Undefined attribute: 'clim_raster'. You need to assign a valid pathlib.Path to the related raster attribute first.
>>> mada_info.clim_raster = Path("./madaclim_current.tif")
>>> print(mada_info.clim_crs)
EPSG:32738

property clim_raster: Path

Get or set the path to the climate raster file.

This property allows you to get the current path to the climate raster file, or set a new path. If setting a new path, the value must be a pathlib.Path object or a str. If the value is a str, it will be converted to a pathlib.Path object. The path must exist, otherwise a FileNotFoundError will be raised.

Returns:

The current path to the climate raster file.

Raises:
  • TypeError – If the new path is not a pathlib.Path object or str.

  • ValueError – If the new path cannot be converted to a pathlib.Path object.

  • FileNotFoundError – If the new path does not exist.

Type:

pathlib.Path
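
The get/set behaviour described above can be sketched with a plain Python property. This is a minimal illustration under stated assumptions, not the package's implementation: the class name `RasterPathHolder` is hypothetical, and the documented ValueError branch is left out.

```python
from pathlib import Path
from typing import Optional, Union


class RasterPathHolder:
    """Hypothetical stand-in that mimics the documented clim_raster setter."""

    def __init__(self, clim_raster: Optional[Union[str, Path]] = None) -> None:
        self._clim_raster: Optional[Path] = None
        if clim_raster is not None:
            self.clim_raster = clim_raster  # reuse the setter's validation

    @property
    def clim_raster(self) -> Optional[Path]:
        return self._clim_raster

    @clim_raster.setter
    def clim_raster(self, value: Union[str, Path]) -> None:
        if not isinstance(value, (str, Path)):
            raise TypeError("clim_raster must be a pathlib.Path or a str.")
        path = Path(value)  # a str is converted to a pathlib.Path
        if not path.exists():
            raise FileNotFoundError(f"No such file: {path}")
        self._clim_raster = path
```

Validating in the setter means that a path assigned after construction goes through the same checks as one passed to the constructor.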

download_data(save_dir: Path | None = None)[source]

Downloads climate and environment raster files from the Madaclim website.

This method downloads the climate and environment raster data from the Madaclim website and saves them to the specified directory. If no directory is specified, the data is saved to the current working directory.

Parameters:

save_dir (Optional[pathlib.Path]) – The directory where the data should be saved. If not specified, the data is saved to the current working directory.

Raises:

ValueError – If save_dir is not a directory.
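
The handling of save_dir described above (default to the current working directory, reject anything that is not a directory) might look like the sketch below. The helper name `resolve_save_dir` is hypothetical and the download step itself is deliberately omitted.

```python
from pathlib import Path
from typing import Optional


def resolve_save_dir(save_dir: Optional[Path] = None) -> Path:
    """Hypothetical helper: pick the target directory as documented."""
    if save_dir is None:
        return Path.cwd()  # default: current working directory
    save_dir = Path(save_dir)
    if not save_dir.is_dir():
        raise ValueError(f"{save_dir} is not a directory.")
    return save_dir
```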

property env_crs: CRS

Retrieves the Coordinate Reference System (CRS) from the Madaclim environmental raster.

This property first validates the env_raster attribute, ensuring its integrity and existence. It then opens the raster file and retrieves the CRS in EPSG format. The EPSG code is used to create and return a pyproj CRS object.

Returns:

The CRS object derived from the EPSG code of the environmental raster.

Return type:

pyproj.crs.crs.CRS

Example

The ‘env_raster’ attribute must be valid before accessing the ‘env_crs’ attribute.

>>> mada_info = MadaclimLayers()
>>> mada_info.env_crs
Traceback (most recent call last):
    raise AttributeError(f"Undefined attribute: '{raster_attr_name}'. You need to assign a valid pathlib.Path to the related raster attribute first.")
AttributeError: Undefined attribute: 'env_raster'. You need to assign a valid pathlib.Path to the related raster attribute first.
>>> mada_info.env_raster = Path("./madaclim_enviro.tif")
>>> print(mada_info.env_crs)
EPSG:32738

property env_raster: Path

Get or set the path to the environment raster file.

This property allows you to get the current path to the environment raster file, or set a new path. If setting a new path, the value must be a pathlib.Path object or a str. If the value is a str, it will be converted to a pathlib.Path object. The path must exist, otherwise a FileNotFoundError will be raised.

Returns:

The current path to the environment raster file.

Raises:
  • TypeError – If the new path is not a pathlib.Path object or str.

  • ValueError – If the new path cannot be converted to a pathlib.Path object.

  • FileNotFoundError – If the new path does not exist.

Type:

pathlib.Path

fetch_specific_layers(layers_labels: int | str | List[int | str], *args: str) dict | DataFrame[source]

Fetches specific layers from the all_layers DataFrame based on the given input and returns either the entire rows or certain columns as a dictionary.

Parameters:
  • layers_labels (Union[int, str, List[Union[int, str]]]) – The layer labels to fetch. Can be a single int or str value, or a list of int or str values. The input can also be in the format “layer_{num}” or “{geotype}_{num}_{name}_({description})” (output from get_layers_labels(as_descriptive_labels=True) method).

  • *args (str) – Optional. One or more column names in all_layers DataFrame. If specified, only these columns will be returned as a dictionary.

Returns:

If args is specified, returns a nested dictionary with the format:

{
    layer_<num>: {
        <arg1>: <value>,
        <arg2>: <value>,
        …
    }
}

Otherwise, returns a DataFrame with the specified layers.

Return type:

Union[dict, pd.DataFrame]

Raises:
  • TypeError – If any value in layers_labels cannot be converted to an int or is not in the “layer_{num}” format.

  • ValueError – If any layer_number does not fall between the minimum and maximum layer numbers.

  • KeyError – If any value in args is not a column in all_layers DataFrame.

Examples

Using a list of layer numbers

>>> mada_info = MadaclimLayers()
>>> mada_info.fetch_specific_layers([1, 15, 55, 71])
   geoclim_type  layer_number layer_name                      layer_description  is_categorical         units
0          clim             1      tmin1  Monthly minimum temperature - January           False       °C x 10
14         clim            15      tmax3    Monthly maximum temperature - March           False       °C x 10
54         clim            55      bio19       Precipitation of coldest quarter           False  mm.3months-1
70          env            71        alt                               Altitude           False        meters

Using the output from get_layers_labels method

>>> bioclim_labels = [label for label in mada_info.get_layers_labels(as_descriptive_labels=True) if "bio" in label]
>>> bio1_bio2_labels = bioclim_labels[0:2]
>>> mada_info.fetch_specific_layers(bio1_bio2_labels)
   geoclim_type  layer_number layer_name                 layer_description  is_categorical                                       units
36         clim            37       bio1           Annual mean temperature           False                                     degrees
37         clim            38       bio2                Mean diurnal range           False  mean of monthly max temp - monthy min temp
>>> # Or from descriptive_labels as well
>>> pet_layers = [label for label in mada_info.get_layers_labels(as_descriptive_labels=True) if "pet" in label]
>>> len(pet_layers)
13
>>> mada_info.fetch_specific_layers(pet_layers[-1])
   geoclim_type  layer_number layer_name                                  layer_description  is_categorical units
67         clim            68        pet  Annual potential evapotranspiration from the T...           False    mm

Fetch as a dict with layer_<num> keys and the values of your choice by passing column names as additional arguments

>>> mada_info.fetch_specific_layers([55, 75], "geoclim_type", "layer_name", "is_categorical")
{
    'layer_55': {
        'geoclim_type': 'clim',
        'layer_name': 'bio19',
        'is_categorical': False
    },
    'layer_75': {
        'geoclim_type': 'env',
        'layer_name': 'geo',
        'is_categorical': True
    }
}
>>> # Only col names will be accepted as additional args
>>> bio1 = next((layer for layer in mada_info.get_layers_labels(as_descriptive_labels=True) if "bio1" in layer), None)
>>> mada_info.fetch_specific_layers(bio1, "band_number")
Traceback (most recent call last):
    if not min_layer <= layer_number <= max_layer:
KeyError: "Invalid args: ['band_number']. Args must be one of a key of ['geoclim_type', 'layer_number', 'layer_name', 'layer_description', 'is_categorical', 'units'] or 'all'"
>>> # Get all keys with the all argument
>>> mada_info.fetch_specific_layers(bio1, "all")
{
    'layer_37': {
        'geoclim_type': 'clim',
        'layer_number': 37,
        'layer_name': 'bio1',
        'layer_description': 'Annual mean temperature',
        'is_categorical': False,
        'units': 'degrees'
    }
}
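
The nested dictionary format returned when column names are passed can be reproduced with plain Python. The sketch below is only an illustration of that output shape: `rows_to_nested_dict` is a hypothetical helper, and `ROWS` holds a reduced set of columns for the two layers used in the example above rather than the real all_layers table.

```python
from typing import Dict, List

# Two rows based on the examples above; the real table has more rows and columns.
ROWS = [
    {"geoclim_type": "clim", "layer_number": 55, "layer_name": "bio19", "is_categorical": False},
    {"geoclim_type": "env", "layer_number": 75, "layer_name": "geo", "is_categorical": True},
]


def rows_to_nested_dict(rows: List[dict], *columns: str) -> Dict[str, dict]:
    """Hypothetical helper: build {'layer_<num>': {column: value}} from rows."""
    valid = list(rows[0].keys())
    if columns == ("all",):
        columns = tuple(valid)  # 'all' expands to every available column
    invalid = [col for col in columns if col not in valid]
    if invalid:
        raise KeyError(f"Invalid args: {invalid}. Args must be one of {valid} or 'all'")
    return {
        f"layer_{row['layer_number']}": {col: row[col] for col in columns}
        for row in rows
    }


selected = rows_to_nested_dict(ROWS, "geoclim_type", "layer_name", "is_categorical")
```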

get_bandnums_from_layers(layers_labels: int | str | List[int | str]) List[int][source]

Retrieves band numbers corresponding to the provided layers’ labels.

This method accepts labels for a subset of layers (specified as either layer numbers, “layer_<num>” format, or descriptive labels) and returns the corresponding band numbers from the all_layers dataframe. If the input is in the descriptive label format or “layer_<num>” format, it should match the output of the get_layers_labels method.

Parameters:

layers_labels (Union[int, str, List[Union[int, str]]]) – A list of layer labels in various formats, or a single layer label.

Raises:
  • TypeError – If elements of layers_labels cannot be converted to int or if they do not match the format produced by the get_layers_labels method.

  • ValueError – If the derived layer numbers do not fall within the valid range of layer numbers in the all_layers dataframe.

Returns:

A list of band numbers corresponding to the provided layer labels.

Return type:

List[int]

Example

>>> mada_info = MadaclimLayers(clim_raster="madaclim_current.tif", env_raster="madaclim_enviro.tif")
>>> last_20 = mada_info.get_layers_labels()[-20:]
>>> band_nums = mada_info.get_bandnums_from_layers(last_20)
>>> band_nums
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 1, 2, 3, 4, 5, 6, 7, 8, 9]
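
The output above suggests how layer numbers relate to band numbers. The sketch below assumes, based only on that output and on the 79 labels returned by get_layers_labels, that layers 1 to 70 are bands 1 to 70 of the climate raster and layers 71 to 79 are bands 1 to 9 of the environmental raster. The helper name and constants are hypothetical.

```python
from typing import List

# Assumption inferred from the example output above, not from the package source.
N_CLIM_LAYERS = 70
N_TOTAL_LAYERS = 79


def band_number_for_layer(layer_number: int) -> int:
    """Hypothetical helper: map a layer number to a band number in its raster."""
    if not 1 <= layer_number <= N_TOTAL_LAYERS:
        raise ValueError(f"layer_number must be between 1 and {N_TOTAL_LAYERS}.")
    if layer_number <= N_CLIM_LAYERS:
        return layer_number  # climate layers keep their number
    return layer_number - N_CLIM_LAYERS  # environmental bands restart at 1


# Layers 60 to 79, i.e. the last 20 layers.
band_nums: List[int] = [band_number_for_layer(n) for n in range(60, 80)]
```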

get_categorical_combinations(layers_labels: int | str | List[int | str] | None = None, as_descriptive_keys: bool = False) dict | Dict[str, Dict[int, str]][source]

Returns a dictionary representation of the specified categorical layers corresponding to the categorical value encoding.

Parameters:
  • layers_labels (Optional[Union[int, str, List[Union[int, str]]]]) – The layer labels to fetch. Can be a single integer or string value, or a list of integer or string values. The input can also be in the format “layer_{num}” or “{geotype}_{num}_{name}_({unit})” (output from get_layers_labels(as_descriptive_labels=True) method). If layers_labels is None, all categorical layers are fetched.

  • as_descriptive_keys (bool) – If True, returns the descriptive layer labels. Otherwise, returns the “layer_<num>” format. Defaults to False

Raises:
  • TypeError – If layers_labels is not a list of integers or strings, a single integer or a string that can be converted to an integer, or in the output format from the 'get_layers_labels' method.

  • ValueError – If a layer number in layers_labels is not a valid categorical layer number.

Returns:

A dictionary of the specified categorical layers. If multiple layers were specified, the dictionary keys are ‘layer_{num}’, and the values are dictionaries with layer values as keys and their corresponding categories as values. If a single layer was specified, the dictionary keys are the categorical values, and the values are the categories themselves.

Return type:

Union[dict, Dict[str, Dict[int, str]]]

Examples

If multiple layers specified, it returns:

>>> madaclim_info = MadaclimLayers()
>>> madaclim_info.get_categorical_combinations([75, 76])
{
    'layer_75': {
        1: 'N-Bemarivo',
        2: 'S-Bemarivo,_N-Mangoro',
        ...
    },
    'layer_76': {
        1: 'Bare_Rocks',
        2: 'Raw_Lithic_Mineral_Soils',
        ...
    },
    ...
}

If a single layer is specified, it returns:

>>> madaclim_info.get_categorical_combinations("layer_76")
{
    'layer_76': {
        1: 'Bare_Rocks',
        2: 'Raw_Lithic_Mineral_Soils',
    ...
    }
}

For more descriptive keys (same output from as_descriptive_labels)

>>> madaclim_info.get_categorical_combinations("layer_76", as_descriptive_keys=True)
{
    'env_76_soi_Soil types (categ_vals: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)': {
        1: 'Bare_Rocks',
        2: 'Raw_Lithic_Mineral_Soils',
    ...
    }
}
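
A typical use of the returned mapping is to translate a value sampled from a categorical raster layer back into its category. The sketch below works on a hand-written dictionary in the documented shape; it only contains the two categories shown in the example above, and `decode_category` is a hypothetical helper.

```python
from typing import Dict

# Same shape as the documented output; the real layer has more categories.
combinations: Dict[str, Dict[int, str]] = {
    "layer_76": {1: "Bare_Rocks", 2: "Raw_Lithic_Mineral_Soils"},
}


def decode_category(combos: Dict[str, Dict[int, str]], layer_label: str, value: int) -> str:
    """Hypothetical helper: translate a sampled raster value into its category."""
    try:
        return combos[layer_label][value]
    except KeyError as err:
        raise ValueError(f"No category for {layer_label!r} with value {value!r}") from err
```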

get_layers_labels(layers_subset: str | List[int] | None = None, as_descriptive_labels: bool = False) list[source]

Retrieves unique layer labels based on the provided subset of layers.

This method fetches the unique labels from the all_layers dataframe, given a subset of layers (specified as either layer numbers, a geoclim_type, or a single layer number). The layer labels can be returned in a descriptive format if as_descriptive_labels is set to True.

Parameters:
  • layers_subset (Optional[Union[str, List[int]]], optional) – A list of layer numbers or a geoclim_type string to subset the labels from, or a single layer number as a string or int. Defaults to None, which will select all layers (no subset).

  • as_descriptive_labels (bool, optional) – If True, returns the descriptive layer labels. Otherwise, returns the “layer_<num>” format. Defaults to False.

Raises:
  • TypeError – If elements of layers_subset cannot be converted to int.

  • ValueError – If layers_subset is a string not in possible_geoclim_types, cannot be converted to int, or if the ‘layer_number’ and ‘layer_name’ columns in the all_layers dataframe have non-unique entries.

Returns:

A list of unique layer labels. These labels are either in the “layer_<num>” format or the descriptive format, based on as_descriptive_labels.

Return type:

list

Examples

Get labels for all layers

>>> mada_info = MadaclimLayers()
>>> all_layers = mada_info.get_layers_labels()
>>> len(all_layers)
79
>>> # Basic format 'layer_<num>'
>>> all_layers[:5]
['layer_1', 'layer_2', 'layer_3', 'layer_4', 'layer_5']

Specify a geoclim subset

>>> env_layers = mada_info.get_layers_labels(layers_subset="env")
>>> env_layers
['layer_71', 'layer_72', 'layer_73', 'layer_74', 'layer_75', 'layer_76', 'layer_77', 'layer_78', 'layer_79']

Extract more information

>>> informative_labels = mada_info.get_layers_labels(as_descriptive_labels=True)
>>> informative_labels[:2]
['clim_1_tmin1_Monthly minimum temperature - January (°C x 10)', 'clim_2_tmin2_Monthly minimum temperature - February (°C x 10)']

Specify a single layer or a subset of layers

>>> mada_info.get_layers_labels(37, as_descriptive_labels=True)
['clim_37_bio1_Annual mean temperature (degrees)']
>>> mada_info.get_layers_labels([68, 75], as_descriptive_labels=True)
['clim_68_pet_Annual potential evapotranspiration from the Thornthwaite equation (mm)', 'env_75_geo_Rock types (categ_vals: 1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13)']
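
Several methods accept a layer number, a "layer_<num>" label, or a descriptive label interchangeably. The sketch below shows one way such inputs could be normalized to a layer number. It is an assumption about the approach, not the package's code: `parse_layer_number` is hypothetical, the regular expression is derived from the label examples above, and the valid range of 1 to 79 comes from the number of labels shown.

```python
import re
from typing import Union

MIN_LAYER, MAX_LAYER = 1, 79  # 79 labels are returned in the example above

# Matches "layer_<num>" and descriptive labels such as "clim_37_bio1_...".
_LABEL_PATTERN = re.compile(r"^(?:layer|clim|env)_(\d+)(?:_.*)?$")


def parse_layer_number(label: Union[int, str]) -> int:
    """Hypothetical helper: extract the layer number from any label format."""
    if isinstance(label, bool) or not isinstance(label, (int, str)):
        raise TypeError("label must be an int or a str.")
    if isinstance(label, int):
        number = label
    elif label.isdigit():
        number = int(label)  # a numeric string such as "12"
    else:
        match = _LABEL_PATTERN.match(label)
        if match is None:
            raise TypeError(f"Unrecognized label format: {label!r}")
        number = int(match.group(1))
    if not MIN_LAYER <= number <= MAX_LAYER:
        raise ValueError(f"Layer number {number} is out of range.")
    return number
```

Format errors raise TypeError and out-of-range numbers raise ValueError, which mirrors the split between the two exceptions in the Raises sections.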

select_geoclim_type_layers(geoclim_type: str) DataFrame[source]

Method that selects the desired geoclimatic type layers as a dataframe.

Parameters:

geoclim_type (str) – The desired geoclimatic layers type to extract.

Returns:

A slice of the all_layers dataframe containing the desired geoclimatic type layers.

Return type:

pd.DataFrame

Raises:
  • TypeError – If geoclim_type is not a string.

  • ValueError – If geoclim_type does not correspond to a valid geoclim type.

py_madaclim.raster_manipulation module

class py_madaclim.raster_manipulation.MadaclimCollection(madaclim_points: MadaclimPoint | List[MadaclimPoint] | None = None)[source]

Bases: object

add_points(madaclim_points: MadaclimPoint | List[MadaclimPoint]) None[source]

Adds one or more MadaclimPoint objects to the MadaclimCollection.

Parameters:

madaclim_points (Union[MadaclimPoint, List[MadaclimPoint]]) – A single MadaclimPoint object or a list of MadaclimPoint objects to be added to the MadaclimCollection.

Raises:
  • TypeError – If the input is not a MadaclimPoint object or a list of MadaclimPoint objects.

  • ValueError – If the input MadaclimPoint(s) is/are already in the MadaclimCollection or if their specimen_id(s) are not unique.

Examples

Add a single point

>>> from py_madaclim.raster_manipulation import MadaclimPoint, MadaclimCollection
>>> specimen_1 = MadaclimPoint(specimen_id="spe1", latitude=-23.574583, longitude=46.419806, source_crs="epsg:4326")
>>> collection = MadaclimCollection()
>>> collection
No MadaclimPoint inside the collection.
>>> collection.add_points(specimen_1)
>>> collection
MadaclimCollection = [
    MadaclimPoint(specimen_id=spe1, mada_geom_point=POINT (644890.8921103649 7392153.658976035), sampled=False)
]

Add multiple points

>>> specimen_2 = MadaclimPoint(specimen_id="spe2", latitude=-20.138470, longitude=46.054688, family="Rubiaceae", has_sequencing=True, num_samples=1)
>>> other_collection = MadaclimCollection()
>>> other_collection.add_points([specimen_1, specimen_2])
>>> print(other_collection)
MadaclimCollection = [
    MadaclimPoint(specimen_id=spe1, mada_geom_point=POINT (644890.8921103649 7392153.658976035), sampled=False),
    MadaclimPoint(specimen_id=spe2, mada_geom_point=POINT (610233.867750987 7772846.143786541), sampled=False)
]

No duplicates allowed

>>> other_collection.add_points(specimen_1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "…/src/py_madaclim/geoclim/raster_manipulation.py", line 1013, in add_points
ValueError: MadaclimPoint(
    specimen_id = spe1,
    source_crs = 4326,
    latitude = -23.574583,
    longitude = 46.419806,
    mada_geom_point = POINT (644890.8921103649 7392153.658976035),
    sampled_layers = None,
    nodata_layers = None
) is already in the current MadaclimCollection instance.
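
The uniqueness rule enforced by add_points can be illustrated without the package. In the sketch below, `Point` and `PointCollection` are hypothetical, simplified stand-ins for MadaclimPoint and MadaclimCollection. They only model the specimen_id check and the accepted input types.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Point:
    """Hypothetical, simplified stand-in for MadaclimPoint."""
    specimen_id: str
    latitude: float
    longitude: float


class PointCollection:
    """Hypothetical, simplified stand-in for MadaclimCollection."""

    def __init__(self) -> None:
        self.all_points: List[Point] = []

    def add_points(self, points: Union[Point, List[Point]]) -> None:
        candidates = [points] if isinstance(points, Point) else points
        if not isinstance(candidates, list) or not all(isinstance(p, Point) for p in candidates):
            raise TypeError("Expected a Point or a list of Point objects.")
        known_ids = {p.specimen_id for p in self.all_points}
        for point in candidates:
            if point.specimen_id in known_ids:
                raise ValueError(f"{point.specimen_id} is already in the collection.")
            known_ids.add(point.specimen_id)
        self.all_points.extend(candidates)  # only reached when every point is new
```

Checking every candidate before extending the list keeps the collection unchanged when any point in a batch is rejected.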

property all_points: list

Get the all_points attribute.

Corresponds to a list of each object in the collection.

Returns:

A list of all the MadaclimPoint objects in the MadaclimCollection.

Return type:

list

Examples

>>> # MadaclimPoints are stored in the .all_points attributes in a list
>>> collection.all_points[0]
MadaclimPoint(
    specimen_id = sample_A,
    source_crs = 4326,
    latitude = -18.9333,
    longitude = 48.2,
    mada_geom_point = POINT (837072.9150244407 7903496.320897499),
    sampled_layers = None,
    nodata_layers = None
)

binary_encode_categorical() None[source]

Binary encodes the categorical layers contained in the sampled_layers attribute.

This function performs binary encoding of categorical layers found in the raster data for each of the Point in the collection. After the encoding, the function updates the Collection instance’s attributes for the categorical encoding status, the encoded layers and the gdf replacing the categorical columns by the binary encoded features.

Raises:

ValueError – If the MadaclimCollection doesn’t contain any MadaclimPoints or if the raster data has not been sampled yet.

Returns:

None

Notes

See the MadaclimPoint.binary_encode_categorical method for the encoding logic and the exceptions it may raise.

property encoded_categ_labels: List[str] | None

Get the labels from the binary encoded categorical layers for the whole collection.

Returns:

A list containing the set of the labels from the binary encoding of categorical features from the whole collection. None if the categorical layers have not been encoded yet.

Return type:

Optional[List[str]]

property encoded_categ_layers: Dict[str, Dict[str, int]] | None

Get the binary encoded categorical layers values.

Returns:

A nested dictionary containing the set of binary encoded categorical features.

The outer dictionary uses the MadaclimPoint.specimen_id as keys. The corresponding value for each key is another dictionary where the keys are layer number (or a more descriptive label) with the categorical feature. Values correspond to the binary encoded value for that given category

Return type:

Optional[Dict[str, Dict[str, int]]]
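
To make the idea of binary encoded categorical features concrete, the sketch below turns one sampled categorical value into one 0/1 feature per category. The helper `binary_encode` and the "<layer>_<category>" key format are assumptions made for illustration; the labels produced by the package may differ.

```python
from typing import Dict


def binary_encode(layer_label: str, sampled_value: int, categories: Dict[int, str]) -> Dict[str, int]:
    """Hypothetical helper: one binary feature per category of a layer."""
    if sampled_value not in categories:
        raise ValueError(f"{sampled_value} is not a category value of {layer_label}.")
    return {
        f"{layer_label}_{name}": int(value == sampled_value)
        for value, name in categories.items()
    }


# Two categories taken from the layer_76 example in the MadaclimLayers docs.
soil_categories = {1: "Bare_Rocks", 2: "Raw_Lithic_Mineral_Soils"}
encoded = binary_encode("layer_76", 2, soil_categories)
```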

property gdf: GeoDataFrame

Get the GeoPandas DataFrame of the collection (concat of all points’ gdfs).

Returns:

A Geopandas GeoDataFrame generated from instance attributes and Point geometry.

Return type:

gpd.GeoDataFrame

property is_categorical_encoded: bool

Get the state of the binary encoding of the categorical layers of the collection.

Returns:

The state of the binary_encode_categorical method. If True, the method has been called and a new set of binary features has been generated for the whole collection. Otherwise, either the layers have not been sampled or the categorical features have not been encoded.

Return type:

bool

property nodata_layers: Dict[str, str | List[str]] | None

Get the nodata_layers attribute of the collection generated from the sample_from_rasters method.

This attribute is a dictionary that contains the MadaclimPoint.specimen_id as keys and the values as the ‘nodata_layers’ as str or list of str.

Returns:

A dictionary with MadaclimPoint.specimen_id as keys and values of str or list of str of the layers_name with nodata values.

None if Collection has not been sampled yet or all layers sampled contained valid data.

Return type:

Optional[Dict[str, Union[str, List[str]]]]
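
The shape of nodata_layers (a str for one affected layer, a list for several, None when nothing is missing) can be sketched as follows. `find_nodata_layers` is a hypothetical helper, and `NODATA_VALUE` is a placeholder sentinel chosen for the example, not the nodata value actually stored in the Madaclim rasters.

```python
from typing import Dict, List, Optional, Union

NODATA_VALUE = -32768  # assumption: placeholder sentinel for illustration only


def find_nodata_layers(
    sampled: Dict[str, Dict[str, int]], nodata_value: int = NODATA_VALUE
) -> Optional[Dict[str, Union[str, List[str]]]]:
    """Hypothetical helper: collect, per specimen, the layers holding nodata."""
    result: Dict[str, Union[str, List[str]]] = {}
    for specimen_id, layers in sampled.items():
        missing = [name for name, value in layers.items() if value == nodata_value]
        if len(missing) == 1:
            result[specimen_id] = missing[0]  # a single layer is stored as a str
        elif missing:
            result[specimen_id] = missing  # several layers are stored as a list
    return result or None  # None when every sampled value is valid
```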

plot_on_layer(layer: str | int, **kwargs) None[source]

Plot a layer as a raster map and distribution plot with a focus on the MadaclimPoint object.

Based on the Madaclim’s CRS, the MadaclimPoint’s geometry (mada_geom_point) is plotted on the raster map and the sampled value is displayed against a distribution of all possible values for that layer in the Madaclim db. Pass additional kwargs to customize each subplot (see **kwargs, _LayerPlotter and _LayerConfig for more details).

Parameters:
  • layer (Union[str, int]) – Layer to plot. Accepts layer numbers as integers, or layer labels in descriptive or layer_<num> format.

  • **kwargs – Additional arguments to customize the subplots, imshow, colorbar, histplot and Point objects from matplotlib. Use the “subplots_<arg>”, “imshow_<arg>”, “cax_<arg>”, and “barplot_<arg>” formats to pass matplotlib/seaborn arguments to the corresponding parts of the raster map and distribution plot. Use “point_<arg>” to customize the Point objects on the raster.

Returns:

None

Raises:
  • ValueError – If the specified layer to plot has not been sampled yet.

  • ValueError – If the layer label cannot be found within the sampled layers.

classmethod populate_from_csv(csv_file: str | Path) MadaclimCollection[source]

Creates a new MadaclimCollection from a CSV file.

Each row of the CSV file should represent a MadaclimPoint. The CSV file must have columns that correspond to the arguments of the MadaclimPoint constructor. If a ‘source_crs’ column is not provided, the method uses the default CRS value.

Parameters:

csv_file (Union[str, pathlib.Path]) – The path to the CSV file.

Returns:

A new MadaclimCollection instance with MadaclimPoint objects created from the rows of the CSV file.

Return type:

MadaclimCollection

Raises:
  • TypeError – If ‘csv_file’ is not a str or pathlib.Path object.

  • FileNotFoundError – If the file specified by ‘csv_file’ does not exist.

  • ValueError – If the CSV file headers are missing required arguments for constructing MadaclimPoint objects.

Examples

CSV requirements for construction

>>> # header must contain req. positional args for MadaclimPoint
>>> # When no source_crs header is found, defaults to EPSG:4326
specimen_id,latitude,longitude
sample_A,-18.9333,48.2
sample_B,-16.295741,46.826763
sample_C,-21.223,47.5204
sample_D,-17.9869,49.2966
sample_E,-21.5166,47.4833
>>> collection = MadaclimCollection.populate_from_csv("some_samples.csv")
Warning! No source_crs column in the csv. Using the default value of EPSG:4326...
Created new MadaclimCollection with 5 samples.
>>> # Can accept other non-required data for MadaclimPoint instantiation
specimen_id,latitude,longitude,source_crs,has_sequencing,specie
sample_F,-19.9333,47.2,4326,True,bojeri
sample_G,-18.295741,45.826763,4326,False,periwinkle
sample_H,-21.223,44.5204,4326,False,spectabilis
>>> other_collection = MadaclimCollection.populate_from_csv("other_samples.csv")
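
The header requirements above can be checked with the standard csv module before any point is built. The sketch below is a simplified assumption of that validation step: `read_point_rows` is a hypothetical helper that reads from a string instead of a file, and it only fills in the documented default of EPSG:4326 when no source_crs column is present.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("specimen_id", "latitude", "longitude")
DEFAULT_CRS = "4326"  # EPSG:4326, the documented default


def read_point_rows(csv_text: str) -> List[Dict[str, str]]:
    """Hypothetical helper: validate headers and fill in a default source_crs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in headers]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    rows = []
    for row in reader:
        row.setdefault("source_crs", DEFAULT_CRS)  # only added when the column is absent
        rows.append(dict(row))
    return rows


rows = read_point_rows(
    "specimen_id,latitude,longitude\n"
    "sample_A,-18.9333,48.2\n"
    "sample_B,-16.295741,46.826763\n"
)
```

Extra columns such as has_sequencing pass through untouched, which matches the note that non-required data is accepted.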

classmethod populate_from_df(df: DataFrame) MadaclimCollection[source]

Class method to populate a MadaclimCollection from a pandas DataFrame.

This method takes a DataFrame where each row represents a MadaclimPoint and its attributes. If the ‘source_crs’ column is not provided in the DataFrame, the default CRS will be used.

Parameters:

df (pd.DataFrame) – DataFrame where each row represents a MadaclimPoint. Expected columns are the same as the required arguments for the MadaclimPoint constructor.

Returns:

A new MadaclimCollection instance populated with MadaclimPoints created from the DataFrame.

Return type:

MadaclimCollection

Raises:
  • TypeError – If ‘df’ is not a pd.DataFrame.

  • ValueError – If the DataFrame is missing any of the required arguments to construct a MadaclimPoint.

Example

Respect requirements for MadaclimPoint construction in df columns

>>> import pandas as pd
>>> sample_df
specimen_id    latitude  longitude
0    sample_W  -16.295741  46.826763
1    sample_X    -17.9869    49.2966
2    sample_Y    -18.9333    48.2166
3    sample_Z      -13.28      49.95
>>> collection = MadaclimCollection.populate_from_df(sample_df)
Warning! No source_crs column in the df. Using the default value of EPSG:4326...
Creating MadaclimPoint(specimen_id=sample_W...)
Creating MadaclimPoint(specimen_id=sample_X...)
Creating MadaclimPoint(specimen_id=sample_Y...)
Creating MadaclimPoint(specimen_id=sample_Z...)
Created new MadaclimCollection with 4 samples.

remove_points(*, madaclim_points: MadaclimPoint | List[MadaclimPoint] | None = None, indices: int | List[int] | None = None, clear: bool = False) None[source]

Removes MadaclimPoint objects from the MadaclimCollection based on specified criteria.

This method allows removing MadaclimPoint objects from the collection by providing either MadaclimPoint instance(s), index/indices, or by clearing the whole collection.

Parameters:
  • madaclim_points (Optional[Union[MadaclimPoint, List[MadaclimPoint]]], optional) – A single MadaclimPoint object or a list of MadaclimPoint objects to be removed from the collection. Defaults to None.

  • indices (Optional[Union[int, List[int]]], optional) – A single index or a list of indices of the MadaclimPoint objects to be removed from the collection. Defaults to None.

  • clear (bool, optional) – If set to True, removes all MadaclimPoint objects from the collection. When using this option, ‘madaclim_points’ and ‘indices’ must not be provided. Defaults to False.

Raises:
  • ValueError – If the MadaclimCollection is empty or if none of the input options are provided.

  • ValueError – If ‘madaclim_points’ and ‘indices’ are both provided.

  • ValueError – If ‘clear’ is set to True and either ‘madaclim_points’ or ‘indices’ are provided.

  • TypeError – If an invalid type is provided for ‘madaclim_points’ or ‘indices’.

  • ValueError – If a provided MadaclimPoint object is not in the collection or if an index is out of bounds.

  • IndexError – If an index is out of range.

Examples

Remove points by passing in the ‘MadaclimPoint’ instances, the index or the ‘specimen_id’

>>> sample_W = collection.all_points[0]
>>> collection.remove_points(madaclim_points=sample_W)
>>> # Using the position index of the instance
>>> collection.remove_points(indices=-1)    # Removes last point of the collection
>>> # Using the specimen.id attribute
>>> collection.remove_points(madaclim_points="sample_Y")

Remove multiple points

>>> # A list of str or MadaclimPoint or mixed types are accepted for the madaclim_points argument.
>>> sample_w = collection.all_points[0]
>>> to_remove = [sample_w, "sample_X"]
>>> collection.remove_points(madaclim_points=to_remove)
>>> # Or pass in a list of indices to the indices argument.
>>> collection.remove_points(indices=[0, -1])    # Remove first and last point
>>> # Finally we can clear the collection of all instances.
>>> collection.remove_points(clear=True)
No MadaclimPoint inside the collection.
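
The three removal options are mutually exclusive, which is why the method takes keyword-only arguments. The sketch below reproduces that argument validation on a plain list. `remove_items` is a hypothetical helper, and the order in which the checks run is an assumption.

```python
from typing import List, Optional, Union


def remove_items(
    items: list,
    *,
    targets: Optional[list] = None,
    indices: Optional[Union[int, List[int]]] = None,
    clear: bool = False,
) -> None:
    """Hypothetical helper: remove entries from `items` in place."""
    if not items:
        raise ValueError("The collection is empty.")
    if targets is None and indices is None and not clear:
        raise ValueError("Provide targets, indices, or clear=True.")
    if clear:
        if targets is not None or indices is not None:
            raise ValueError("clear=True cannot be combined with other options.")
        items.clear()
        return
    if targets is not None and indices is not None:
        raise ValueError("Provide either targets or indices, not both.")
    if indices is not None:
        index_list = [indices] if isinstance(indices, int) else indices
        # Normalize negative indices, then delete from the end so that
        # earlier deletions do not shift the positions still to be removed.
        positions = {range(len(items))[i] for i in index_list}  # IndexError if out of range
        for position in sorted(positions, reverse=True):
            del items[position]
        return
    for target in targets:
        if target not in items:
            raise ValueError(f"{target!r} is not in the collection.")
        items.remove(target)
```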

sample_from_rasters(clim_raster: Path, env_raster: Path, layers_to_sample: int | str | List[int | str] = 'all', layer_info: bool = False) None[source]

Samples geoclimatic data from raster files for specified layers at the location of each point belonging to the MadaclimCollection’s instance.

Calling this method will also update the sampled_layers attribute with the data extracted from the layers_to_sample for every point in the collection. If the sampled data contains ‘nodata’ values, the nodata_layers attribute will be updated with the names of the affected layers. The gdf attribute GeoDataFrame will also be updated with the sampled_layers.

Parameters:
  • clim_raster (pathlib.Path) – Path to the climate raster file.

  • env_raster (pathlib.Path) – Path to the environment raster file.

  • layers_to_sample (Union[int, str, List[Union[int, str]]], optional) – The layer number(s) to sample from the raster files. Can be a single int, a single string in the format ‘layer_<num>’, or a list of ints or such strings. Defaults to ‘all’.

  • layer_info (bool, optional) – Whether to use descriptive labels for the returned dictionary keys. Defaults to False.

Returns:

None

Raises:

ValueError – If the MadaclimCollection doesn’t contain any MadaclimPoints.

Notes

This method also updates the ‘sampled_layers’ and ‘nodata_layers’ attributes of the MadaclimCollection instance.

Examples

Sample the value for each point in the collection according to their location

>>> from py_madaclim.raster_manipulation import MadaclimPoint, MadaclimCollection
>>> specimen_1 = MadaclimPoint(specimen_id="spe1_aren", latitude=-18.9333, longitude=48.2, genus="Coffea", species="arenesiana", has_sequencing=True)
>>> specimen_2 = MadaclimPoint(specimen_id="spe2_humb", latitude=-12.716667, longitude=45.066667, source_crs=4326, genus="Coffea", species="humblotiana", has_sequencing=True)
>>> collection = MadaclimCollection()
>>> collection.add_points([specimen_1, specimen_2])
>>> from py_madaclim.info import MadaclimLayers
>>> madaclim_info = MadaclimLayers()
>>> bioclim_labels = [label for label in madaclim_info.get_layers_labels(as_descriptive_labels=True) if "bio" in label]
>>> # Validating the rasters
>>> mada_rasters = MadaclimRasters(clim_raster="madaclim_current.tif", env_raster="madaclim_enviro.tif")
>>> collection.sample_from_rasters(
        mada_rasters.clim_raster,
        mada_rasters.env_raster,
        layers_to_sample=bioclim_labels
    )

Attribute state updating

>>> collection    # sampled status updated
MadaclimCollection = [
        MadaclimPoint(specimen_id=spe1_aren, mada_geom_point=POINT (837072.9150244407 7903496.320897499),sampled=True),
        MadaclimPoint(specimen_id=spe2_humb, mada_geom_point=POINT (507237.57495924993 8594195.741515966),sampled=True)
]
>>> # Results also stored in the `sampled_layers` attribute
>>> collection.sampled_layers["spe2_humb"]["layer_55"]
66
>>> # layers_to_sample also accepts a single layer, or multiple layers as the output from the `get_layers_labels` method in MadaclimLayers
>>> collection.sample_from_rasters(mada_rasters.clim_raster, mada_rasters.env_raster, layers_to_sample=37)
>>> collection.sampled_layers
{'spe1_aren': {'layer_37': 196}, 'spe2_humb': {'layer_37': 238}}

property sampled_layers: Dict[str, Dict[str, int]] | None

Get the sampled_layers attribute of the collection generated from the sample_from_rasters method.

This attribute is a nested dictionary. The outer dictionary uses the MadaclimPoint.specimen_id as keys. The corresponding value for each key is another dictionary, which uses layer_names as keys and sampled values from rasters as values.

Returns:

A dictionary with MadaclimPoint.specimen_id as keys and a dictionary of layer_names (str) and sampled values (int) as values.

None if Collection has not been sampled yet.

Return type:

Optional[Dict[str, Dict[str, int]]]
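
A minimal sketch of how this nested structure might be traversed, using a plain dictionary in place of a real MadaclimCollection. The values mirror the layer_37 output shown in the collection example; flatten_sampled_layers is a hypothetical helper and not part of py_madaclim.

```python
from typing import Dict, List, Optional, Tuple

# Stand-in for `collection.sampled_layers` (values mirror the documented example output).
sampled_layers: Optional[Dict[str, Dict[str, int]]] = {
    "spe1_aren": {"layer_37": 196},
    "spe2_humb": {"layer_37": 238},
}


def flatten_sampled_layers(
    sampled: Optional[Dict[str, Dict[str, int]]]
) -> List[Tuple[str, str, int]]:
    """Hypothetical helper: turn the nested dict into (specimen_id, layer, value) rows."""
    if sampled is None:  # the collection has not been sampled yet
        return []
    return [
        (specimen_id, layer_name, value)
        for specimen_id, layers in sampled.items()
        for layer_name, value in layers.items()
    ]


rows = flatten_sampled_layers(sampled_layers)
for row in rows:
    print(row)
```

Checking for None first matters because the property returns None until the collection has been sampled.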

static sanitize_attr_name(attribute: str)[source]

Strips incompatible characters from attribute names.

Parameters:

attribute (str) – The attribute header from the csv file

Returns:

The compatible attribute name.

Return type:

str
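
The exact rules applied by sanitize_attr_name are not spelled out here. As a rough illustration only, the sketch below shows one way a CSV header could be turned into a valid Python attribute name; the function name and the replacement rules are assumptions, not the library's implementation.

```python
import re


def sanitize_attr_name_sketch(attribute: str) -> str:
    """Hypothetical stand-in: make a CSV header usable as a Python attribute name."""
    # Replace every run of characters that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W+", "_", attribute.strip())
    # Drop underscores left over at either end.
    cleaned = cleaned.strip("_")
    # Identifiers cannot start with a digit, so prefix those with an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


print(sanitize_attr_name_sketch("has sequencing?"))
print(sanitize_attr_name_sketch("Collection-Date (UTC)"))
```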

class py_madaclim.raster_manipulation.MadaclimPoint(specimen_id: str, longitude: float, latitude: float, source_crs: pyproj.crs.crs.CRS = <Geographic 2D CRS: EPSG:4326>, **kwargs)[source]

Bases: object

A class representing a specimen as a geographic point with a specific coordinate reference system (CRS) and additional attributes. The class provides methods for validating the point’s coordinates and CRS, as well as sampling values from climate and environmental rasters of the Madaclim database.

specimen_id

An identifier for the point.

Type:

str

latitude

The latitude of the point.

Type:

float

longitude

The longitude of the point.

Type:

float

source_crs

The coordinate reference system of the point.

Type:

pyproj.crs.crs.CRS

mada_geom_point

A Shapely Point object representing the point projected in the Madaclim rasters’ CRS.

Type:

shapely.geometry.point.Point

sampled_layers

A dictionary containing the layers labels as keys and their values from the sampled raster at the Point’s position as int. None if data has not been sampled yet.

Type:

Optional[Dict[str, int]]

nodata_layers

A list containing the labels of the layers that returned nodata values. None if data has not been sampled yet or if none of the sampled layers contained nodata values.

Type:

Optional[List[str]]

is_categorical_encoded

The state of the binary_encode_categorical method. If True, the method has been called and a new set of binary features has been generated. Otherwise, either the layers have not been sampled or the categorical features have not been encoded.

Type:

bool

encoded_categ_layers

A dictionary containing the set of binary encoded categorical features.

Type:

Optional[Dict[str, int]]

gdf

A Geopandas GeoDataFrame generated from instance attributes and mada_geom_point geometry. It is updated whenever the instance’s attributes change.

Type:

gpd.GeoDataFrame

property base_attr: dict

Get the base attributes when constructing the instance.

Returns:

A dictionary containing the base attributes names as keys and their values as values.

Return type:

dict

binary_encode_categorical() None[source]

Binary encodes the categorical layers contained in the sampled_layers attribute.

This function performs binary encoding of categorical layers found in the raster data. It uses the MadaclimLayers object to get information about possible categorical layers. If no categorical layers are found in the data, a ValueError is raised. After the encoding, the function updates the respective instance attributes for the categorical encoding status, the encoded layers and the gdf replacing the categorical columns by the binary encoded features.

Raises:

ValueError – If no categorical layers are found in the raster data or if the raster data has not been sampled yet.

Returns:

None
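
To make the idea concrete, here is a small sketch that assumes "binary encoding" means one 0/1 feature per category. The category subset is taken from the geology label quoted in this document; the function name and the key format are hypothetical and may differ from what the library produces.

```python
from typing import Dict

# A few value:category pairs for the geology layer (subset of the documented label).
GEOLOGY_CATEGORIES: Dict[int, str] = {
    1: "Alluvial_&_Lake_deposits",
    2: "Unconsolidated_Sands",
    4: "Mangrove_Swamp",
    6: "Sandstones",
}


def binary_encode(layer: str, value: int, categories: Dict[int, str]) -> Dict[str, int]:
    """Return one 0/1 feature per category; the key format is an assumption."""
    return {
        f"{layer}_{name}": int(raw_value == value)
        for raw_value, name in categories.items()
    }


encoded = binary_encode("layer_75", 6, GEOLOGY_CATEGORIES)
for key, flag in encoded.items():
    print(key, flag)
```

Each categorical column is replaced by as many binary columns as it has categories, which is why the gdf changes shape after encoding.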

property encoded_categ_labels: List[str] | None

Get the labels from the binary encoded categorical layers of the instance.

Returns:

A list containing the set of the labels from the binary encoding of categorical features. None if the categorical layers have not been encoded yet.

Return type:

Optional[List[str]]

property encoded_categ_layers: Dict[str, int] | None

Get the binary encoded categorical layers values.

Returns:

A dictionary containing the set of binary encoded categorical features.

Keys contain the layer number (or a more descriptive label) and the categorical feature. Values are the binary encoded value for that given category.

Return type:

Optional[Dict[str, int]]

property gdf: GeoDataFrame

Get the GeoPandas DataFrame using mada_geom_point as geometry.

Returns:

A Geopandas GeoDataFrame generated from instance attributes and Point geometry.

Return type:

gpd.GeoDataFrame

static get_args_names() Tuple[list, list][source]

Gets the names of the required and default arguments of the MadaclimPoint constructor.

This method uses the inspect module to introspect the MadaclimPoint constructor and extract the names of its arguments. It then separates these into required arguments (those that don’t have default values) and default arguments (those that do).

Returns:

A tuple containing two lists:
  • The first list contains the names of the required arguments.

  • The second list contains the names of the default arguments.

Return type:

Tuple[list, list]

Note

  • ‘self’ is excluded from the returned lists.
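
The introspection described above can be sketched with the standard inspect module. PointSketch is a hypothetical stand-in whose constructor mirrors the MadaclimPoint argument names (with an int in place of the pyproj CRS default); skipping **kwargs is a choice made for this sketch and may not match the library.

```python
import inspect
from typing import List, Tuple


class PointSketch:
    """Hypothetical stand-in whose constructor mirrors MadaclimPoint's argument names."""

    def __init__(self, specimen_id: str, longitude: float, latitude: float,
                 source_crs: int = 4326, **kwargs) -> None:
        self.specimen_id = specimen_id


def get_args_names(cls: type) -> Tuple[List[str], List[str]]:
    """Split constructor arguments into required and default ones."""
    required: List[str] = []
    default: List[str] = []
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == "self" or param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # skip 'self' and *args/**kwargs
        (required if param.default is param.empty else default).append(name)
    return required, default


required_args, default_args = get_args_names(PointSketch)
print(required_args)
print(default_args)
```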

static get_default_source_crs(as_epsg: bool = True) CRS | int[source]

Extracts the default value of the source_crs attribute. By default, it will return the crs as the EPSG code.

Parameters:

as_epsg (bool, optional) – Whether to return the default source_crs as its EPSG code. Defaults to True.

Returns:

The default value for the source_crs attribute. If as_epsg is True, it is returned as the EPSG code of the CRS.

Return type:

Union[pyproj.crs.crs.CRS, int]

property is_categorical_encoded: bool

Get the state of the binary encoding of the categorical layers.

Returns:

The state of the binary_encode_categorical method. If True, the method has been called and a new set of binary features has been generated.

Otherwise, either the layers have not been sampled or the categorical features have not been encoded.

Return type:

bool

property latitude: float

Gets or sets the latitude attribute.

Parameters:

value (float) – The latitude value of the point.

Returns:

The current latitude value of the point.

Return type:

float

property longitude: float

Gets or sets the longitude attribute.

Parameters:

value (float) – The longitude value of the point.

Returns:

The current longitude value of the point.

Return type:

float

property mada_geom_point: Point

property nodata_layers: List[str] | None

Get the layers labels containing nodata values when calling the sample_from_rasters method.

Returns:

A list containing the layers labels (either layer_<num> or more descriptive).

Returns None if data has not been sampled yet or if none of the sampled layers contained nodata values.

Return type:

Optional[List[str]]
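
As a rough illustration of how nodata layers could be identified from sampled values, the sketch below compares each value with a nodata marker. The value -32768 is the one reported for layer_75 in this document's example output; in practice the marker would come from the rasters themselves, and find_nodata_layers is a hypothetical helper.

```python
from typing import Dict, List, Optional

NODATA_VALUE = -32768  # value reported for 'layer_75' in the documented example output


def find_nodata_layers(sampled: Optional[Dict[str, int]],
                       nodata_value: int = NODATA_VALUE) -> Optional[List[str]]:
    """Return labels whose sampled value equals the nodata value, or None if there are none."""
    if not sampled:
        return None  # nothing sampled yet
    nodata = [label for label, value in sampled.items() if value == nodata_value]
    return nodata or None


print(find_nodata_layers({"layer_37": 238, "layer_75": -32768}))
print(find_nodata_layers({"layer_37": 196}))
```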

plot_on_layer(layer: str | int, **kwargs) None[source]

Plot a layer as a raster map and distribution plot with a focus on the MadaclimPoint object.

Based on the Madaclim’s CRS, the MadaclimPoint’s geometry (mada_geom_point) is plotted on the raster map and the sampled value is displayed against a distribution of all possible values for that layer in the Madaclim db. Pass additional kwargs to customize each subplot (see **kwargs, _LayerPlotter and _LayerConfig for more details).

Parameters:
  • layer (Union[str, int]) – Layer to plot. Accepts layer numbers as integers, or layer labels in descriptive or layer_<num> format.

  • **kwargs – Additional arguments to customize the subplots, imshow, colorbar, histplot and Point objects from matplotlib. Use “subplots_<arg>”, “imshow_<arg>”, “cax_<arg>”, and “histplot_<arg>” formats to pass matplotlib/sns arguments to the corresponding base raster and histogram plots. Use “point_<arg>” to customize the Point objects on the raster. Use “vline_<arg>” to customize the vertical line on the distribution plot.

Returns:

None

Raises:
  • ValueError – If the specified layer to plot has not been sampled yet.

  • ValueError – If the layer label cannot be found within the sampled layers.
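
The prefixed-kwargs convention can be pictured as a simple grouping step. The sketch below is an assumption about how such routing could work, not the code behind the private plotting helpers; split_prefixed_kwargs is hypothetical, and raising ValueError on an unknown prefix is a choice made for this sketch.

```python
from typing import Any, Dict

# Prefixes named in the **kwargs description above.
PREFIXES = ("subplots", "imshow", "cax", "histplot", "point", "vline")


def split_prefixed_kwargs(**kwargs: Any) -> Dict[str, Dict[str, Any]]:
    """Group 'prefix_arg=value' keyword arguments by their prefix."""
    grouped: Dict[str, Dict[str, Any]] = {prefix: {} for prefix in PREFIXES}
    for key, value in kwargs.items():
        prefix, _, arg = key.partition("_")
        if prefix not in grouped or not arg:
            raise ValueError(f"Unrecognized keyword argument: {key!r}")
        grouped[prefix][arg] = value
    return grouped


groups = split_prefixed_kwargs(imshow_cmap="terrain", histplot_binwidth=100, point_color="red")
print(groups["imshow"])
print(groups["histplot"])
print(groups["point"])
```

Each group can then be forwarded to the matching plotting call, so a single method signature can configure several matplotlib and seaborn objects.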

sample_from_rasters(clim_raster: Path, env_raster: Path, layers_to_sample: int | str | List[int | str] = 'all', layer_info: bool = False) None[source]

Samples geoclimatic data from raster files for specified layers at the location of the instance’s lat/lon coordinates from the mada_geom_point attribute.

Calling this method also updates the sampled_layers attribute with the data extracted from layers_to_sample. If the sampled data contains ‘nodata’ values, the nodata_layers attribute is updated with the names of the affected layers. The gdf attribute GeoDataFrame is also updated with the sampled_layers.

Parameters:
  • clim_raster (pathlib.Path) – Path to the climate raster file.

  • env_raster (pathlib.Path) – Path to the environment raster file.

  • layers_to_sample (Union[int, str, List[Union[int, str]]], optional) – The layer number(s) to sample from the raster files. Can be a single int, a single string in the format ‘layer_<num>’, a descriptive label, or a list of ints or such strings. Defaults to ‘all’.

  • layer_info (bool, optional) – Whether to use descriptive labels for the returned dictionary keys. Defaults to False.

Raises:
  • TypeError – If the layers_to_sample is not valid, or if the mada_geom_point attribute is not a Point object.

  • ValueError – If the layer_number is out of range or if the mada_geom_point object is empty.

Returns:

None
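
The accepted forms of layers_to_sample, and the TypeError/ValueError split described above, can be sketched as a normalization step. Everything here is an assumption for illustration: the helper names, the total of 79 layers, and the rule that a label starts with a prefix followed by the layer number.

```python
import re
from typing import List, Union

N_LAYERS = 79  # assumed total number of Madaclim layers; adjust if your rasters differ

LayerSpec = Union[int, str]


def to_layer_number(layer: LayerSpec, n_layers: int = N_LAYERS) -> int:
    """Convert one layer spec to its number, validating type and range."""
    if isinstance(layer, bool) or not isinstance(layer, (int, str)):
        raise TypeError(f"layer must be an int or a str, got {type(layer).__name__}")
    if isinstance(layer, str):
        # Accept 'layer_<num>' and labels starting with '<prefix>_<num>' (assumed format).
        match = re.match(r"^[A-Za-z]+_(\d+)", layer)
        if match is None:
            raise TypeError(f"cannot read a layer number from {layer!r}")
        layer = int(match.group(1))
    if not 1 <= layer <= n_layers:
        raise ValueError(f"layer {layer} is out of range (1-{n_layers})")
    return layer


def normalize_layers(layers: Union[LayerSpec, List[LayerSpec]] = "all",
                     n_layers: int = N_LAYERS) -> List[int]:
    """Return the layer numbers described by `layers` ('all' selects every layer)."""
    if layers == "all":
        return list(range(1, n_layers + 1))
    if not isinstance(layers, list):
        layers = [layers]
    return [to_layer_number(layer, n_layers) for layer in layers]


print(normalize_layers(37))
print(normalize_layers(["layer_37", 75]))
print(normalize_layers("clim_37_bio1 (Annual mean temperature)"))
```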

Examples

Sample a set of layers

>>> from py_madaclim.info import MadaclimLayers
>>> madaclim_info = MadaclimLayers()
>>> bioclim_labels = [label for label in madaclim_info.get_layers_labels(as_descriptive_labels=True) if "bio" in label]
>>> specimen_1 = MadaclimPoint(specimen_id="spe1_aren", latitude=-18.9333, longitude=48.2, genus="Coffea", species="arenesiana", has_sequencing=True)
>>> specimen_1.sample_from_rasters(
...     clim_raster="madaclim_current.tif",
...     env_raster="madaclim_enviro.tif",
...     layers_to_sample=bioclim_labels
... )
>>> specimen_1.sampled_layers["layer_37"]
196
>>> # layer_info key as more descriptive and informative
>>> specimen_1.sample_from_rasters(
...     clim_raster="madaclim_current.tif",
...     env_raster="madaclim_enviro.tif",
...     layers_to_sample=bioclim_labels,
...     layer_info=True
... )
>>> bio1_label = bioclim_labels[0]
>>> bio1_label
'clim_37_bio1 (Annual mean temperature)'
>>> specimen_1.sampled_layers[bio1_label]
196

Warning message for NaN in the data extracted

>>> # We can easily access the nodata layers (still sampled with the method regardless)
>>> specimen_2 = MadaclimPoint(specimen_id="spe2_humb", latitude=-12.716667, longitude=45.066667, source_crs=4326, genus="Coffea", species="humblotiana", has_sequencing=True)
>>> specimen_2.sample_from_rasters(
...     clim_raster="madaclim_current.tif",
...     env_raster="madaclim_enviro.tif",
...     layer_info=True
... )
>>> len(specimen_2.nodata_layers)
5
>>> specimen_2.nodata_layers[0]    # Example of a categorical feature description with raster-value/description associations
'env_75_geology (1=Alluvial_&_Lake_deposits, 2=Unconsolidated_Sands, 4=Mangrove_Swamp, 5=Tertiary_Limestones_+_Marls_&_Chalks, 6=Sandstones, 7=Mesozoic_Limestones_+_Marls_(inc._"Tsingy"), 9=Lavas_(including_Basalts_&_Gabbros), 10=Basement_Rocks_(Ign_&_Met), 11=Ultrabasics, 12=Quartzites, 13=Marble_(Cipolin))'

Updated attributes post-sampling

>>> specimen_2.sample_from_rasters(
...     clim_raster="madaclim_current.tif",
...     env_raster="madaclim_enviro.tif",
...     layers_to_sample=[37, 75]
... )
>>> specimen_2
MadaclimPoint(
    specimen_id = spe2_humb,
    source_crs = 4326,
    latitude = -12.716667,
    longitude = 45.066667,
    mada_geom_point = POINT (507237.57495924993 8594195.741515966),
    len(sampled_layers) = 2 layer(s),
    len(nodata_layers) = 1 layer(s),
    is_categorical_encoded = False
    genus = Coffea,
    species = humblotiana,
    has_sequencing = True,
    gdf.shape = (1, 10)
)
>>> specimen_2.sampled_layers
{'layer_37': 238, 'layer_75': -32768}
>>> specimen_2.nodata_layers
['layer_75']

property sampled_layers: Dict[str, int] | None

Get the instance’s data obtained from the sample_from_rasters method.

Returns:

A dictionary containing the layers labels as keys and their values as int.

Returns None if data has not been sampled yet.

Return type:

Optional[Dict[str, int]]

property source_crs: CRS

Gets or sets the source_crs attribute.

Parameters:

value (pyproj.crs.CRS) – The coordinate reference system for the point.

Returns:

The coordinate reference system of the point.

Return type:

pyproj.crs.CRS

property specimen_id: str

Gets or sets the specimen_id attribute.

Parameters:

value (str) – The new identifier for the MadaclimPoint.

Returns:

The identifier for the MadaclimPoint.

Return type:

str

class py_madaclim.raster_manipulation.MadaclimRasters(clim_raster: Path, env_raster: Path)[source]

Bases: object

Handles operations on Madaclim climate and environmental raster files. Also provides a method to visualize the raster layers (map) and distribution of the raster values.

clim_raster

Path to the climate raster file.

Type:

pathlib.Path

clim_crs

The CRS derived from the climate raster file.

Type:

pyproj.crs.crs.CRS

clim_nodata_val

The nodata value from the climate raster file

Type:

float

clim_bounds

The bounds of the climate raster in order of (left, bottom, right, top)

Type:

tuple

env_raster

Path to the environmental raster file.

Type:

pathlib.Path

env_crs

The CRS derived from the environmental raster file.

Type:

pyproj.crs.crs.CRS

env_nodata_val

The nodata value from the environmental raster file

Type:

float

env_bounds

The bounds of the environmental raster in order of (left, bottom, right, top)

Type:

tuple

property clim_bounds: Tuple[float]

Retrieves the bounds of the climate raster.

This property opens the raster file and retrieves bounds values (left, bottom, right, top) from the raster.

Returns:

The bounds values in order (left, bottom, right, top)

Return type:

Tuple[float]
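
One common use of the (left, bottom, right, top) ordering is checking whether a projected point falls inside the raster extent. The sketch below is a hypothetical helper with placeholder bounds; read the real values from the clim_bounds property. The point coordinates are those shown for spe1_aren in the collection example.

```python
from typing import Tuple

Bounds = Tuple[float, float, float, float]  # (left, bottom, right, top)


def point_in_bounds(x: float, y: float, bounds: Bounds) -> bool:
    """Return True if (x, y) lies inside the rectangular extent, edges included."""
    left, bottom, right, top = bounds
    return left <= x <= right and bottom <= y <= top


# Placeholder extent for illustration only, not read from an actual raster.
example_bounds: Bounds = (298000.0, 7155000.0, 1101000.0, 8683000.0)

print(point_in_bounds(837072.9150244407, 7903496.320897499, example_bounds))
print(point_in_bounds(0.0, 0.0, example_bounds))
```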

property clim_crs: CRS

Retrieves the Coordinate Reference System (CRS) from the Madaclim climate raster.

This property opens the raster file and retrieves the CRS in EPSG format. The EPSG code is used to create and return a pyproj CRS object.

Returns:

The CRS derived from the climate raster file.

Return type:

pyproj.crs.crs.CRS

property clim_nodata_val: float

Retrieves the nodata value from the Madaclim climate raster.

This property opens the raster file and retrieves nodata value from the raster.

Returns:

The nodata value from the climate raster file

Return type:

float

property clim_raster: Path

Retrieves or sets the climate raster file path.

Parameters:

value (pathlib.Path) – The new climate raster file path.

Returns:

The climate raster file path.

Return type:

pathlib.Path

property env_bounds: Tuple[float]

Retrieves the bounds of the environmental raster.

This property opens the raster file and retrieves bounds values (left, bottom, right, top) from the raster.

Returns:

The bounds values in order (left, bottom, right, top)

Return type:

Tuple[float]

property env_crs: CRS

Retrieves the Coordinate Reference System (CRS) from the Madaclim environmental raster.

This property opens the raster file and retrieves the CRS in EPSG format. The EPSG code is used to create and return a pyproj CRS object.

Returns:

The CRS derived from the environmental raster file.

Return type:

pyproj.crs.crs.CRS

property env_nodata_val: float

Retrieves the nodata value from the Madaclim environmental raster.

This property opens the raster file and retrieves nodata value from the raster.

Returns:

The nodata value from the environmental raster file

Return type:

float

property env_raster: Path

Retrieves or sets the environmental raster file path.

Parameters:

value (pathlib.Path) – The new environmental raster file path.

Returns:

The environmental raster file path.

Return type:

pathlib.Path

plot_layer(layer: str | int, **kwargs) Tuple[Figure, List[Axes]][source]

Method to plot a specific layer from the Madagascan climate/environmental raster datasets. The layer is displayed as a raster map and its distribution is plotted in a histogram.

It accepts layer labels in the following formats: layer_<num> (e.g. “layer_1”) and <descriptive_layer_label> (e.g. “annual_mean_temperature”). Alternatively, the layer number can be supplied directly as an integer.

Depending on whether the layer is categorical or continuous, the visualization will be different. For categorical layers, it will display a map using different colors for each category and a legend mapping categories to colors. For continuous layers, it will display a color gradient map with a color bar.

Parameters:
  • layer (Union[str, int]) – Layer to plot. Accepts layer numbers as integers, or layer labels in descriptive or layer_<num> format.

  • **kwargs – Additional arguments to customize the subplots, imshow, colorbar and histplot from matplotlib. Use “subplots_<arg>”, “imshow_<arg>”, “cax_<arg>”, and “histplot_<arg>” formats to customize corresponding matplotlib/sns arguments.

Returns:

A tuple (fig, axes). fig (matplotlib.figure.Figure) is the top-level container for all plot elements. axes (List[matplotlib.axes.Axes]) is an array containing the Axes objects of the subplots.

Return type:

Tuple[matplotlib.figure.Figure, List[matplotlib.axes.Axes]]

Raises:
  • TypeError – If ‘layer’ is not a str or an int.

  • ValueError – If ‘layer’ is not found within the range of layers.

Note

This method returns the fig and axes objects for further customization when used by other classes. It uses the private _PlotConfig and _LayerPlotter utility classes for the checks and visualization.

Example

Visualization of the raster maps

>>> from py_madaclim.info import MadaclimLayers
>>> # Extract environmental layers labels
>>> mada_info = MadaclimLayers(clim_raster="madaclim_current.tif", env_raster="madaclim_enviro.tif")
>>> env_labels = mada_info.get_layers_labels("env", as_descriptive_labels=True)
>>> # Default visualization of the raster map
>>> from py_madaclim.raster_manipulation import MadaclimRasters
>>> mada_rasters = MadaclimRasters(clim_raster=mada_info.clim_raster, env_raster=mada_info.env_raster)    # Using common attr btw the instances
>>> mada_rasters.plot_layer(env_labels[0])
>>> # Pass in any number of kwargs to the imshow or cax (raster + colorbarax) or histplot for customization
>>> mada_rasters.plot_layer(env_labels[0], imshow_cmap="terrain", histplot_binwidth=100, histplot_stat="count")
>>> # Some layers are categorical data so the figure formatting will change (no cbar)
>>> geo_rock_label = next(label for label in env_labels if "geo" in label)
>>> mada_rasters.plot_layer(geo_rock_label, subplots_figsize=(12, 8))
>>> # For numerical features with highly skewed distribution, specify vmin or vmax for the raster map
>>> mada_rasters.plot_layer(env_labels[3], imshow_vmin=6000)
>>> # To know which are the categorical data, use the MadaclimLayers utilities
>>> mada_info.categorical_layers    # as df
>>> mada_info.get_categorical_combinations()    # As dict, default selects all possibilities