Feature Selector

The FeatureSelector is intended to be used in a stepwise manner, starting with data normalization, followed by the removal of low-variance features, handling of multicollinearity, and the elimination of highly correlated features. It supports various strategies for feature selection, allowing users to customize the preprocessing pipeline according to their QSAR modeling needs.

The module also provides two functions, display_data_cluster and display_elbow, for visualizing feature clusters and for choosing the number of clusters. Both help when interpreting the correlation structure of a QSAR dataset.

class qsar.preprocessing.feature_selector.FeatureSelector(df: DataFrame, y: str = 'Log_MP_RATIO', cols_to_ignore=None)

Bases: object

Implements methods for feature selection within QSAR modeling.

The recommended order of use is normalization followed by feature selection.

Parameters:
  • df (pd.DataFrame) – DataFrame with continuous data describing the observations and features.

  • y (str, optional) – Name of the column for the dependent variable. Defaults to ‘Log_MP_RATIO’.

  • cols_to_ignore (list, optional) – List of column names to be ignored during processing.

get_correlation(df: DataFrame | None = None, y: str = '', cols_to_ignore=None, method: str = 'kendall') DataFrame

Calculates a correlation score for all features within a DataFrame.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame with only continuous data describing the observations and features. If None, uses the internal DataFrame. Defaults to None.

  • y (str, optional) – Name of the column for the dependent variable to ignore in the collinearity test. Defaults to an empty string, indicating no dependent variable.

  • cols_to_ignore (list, optional) – List of column names to ignore during correlation computation. Defaults to an empty list.

  • method (str, optional) – Method used to calculate the correlation between the features. Options are “pearson”, “kendall”, or “spearman”. Defaults to “kendall”.

Returns:

A DataFrame containing the correlation coefficients of the features.

Return type:

pd.DataFrame
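
As a rough illustration of what this method computes, the sketch below builds a feature-by-feature correlation matrix with plain pandas. It is not the FeatureSelector implementation: the helper name correlation_matrix and the toy columns are hypothetical. It passes method="pearson" because pandas needs SciPy installed to compute "kendall".

```python
import pandas as pd

# Toy descriptor table; column names other than Log_MP_RATIO are made up.
data = pd.DataFrame({
    "mol_weight": [100.0, 150.0, 200.0, 250.0, 300.0],
    "logp": [1.0, 1.6, 1.9, 2.7, 3.1],
    "tpsa": [90.0, 40.0, 75.0, 20.0, 60.0],
    "Log_MP_RATIO": [0.2, 0.4, 0.5, 0.9, 1.0],
})


def correlation_matrix(df, y="", cols_to_ignore=None, method="pearson"):
    """Pairwise feature correlations, leaving out y and any ignored columns."""
    excluded = list(cols_to_ignore or [])
    if y:
        excluded.append(y)
    features = df.drop(columns=excluded, errors="ignore")
    return features.corr(method=method)


corr = correlation_matrix(data, y="Log_MP_RATIO")
print(corr.round(2))
```

The result is a square, symmetric DataFrame indexed by feature name on both axes, with ones on the diagonal.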

get_correlation_to_y(df: DataFrame | None = None, y: str = '', cols_to_ignore=None, method: str = 'kendall') DataFrame

Calculates a correlation score for each feature in relation to the specified dependent variable.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame with continuous data describing observations and features. Defaults to None.

  • y (str, optional) – The dependent variable to compare for correlation. Defaults to an empty string.

  • cols_to_ignore (list, optional) – List of columns where the function should not be executed. Defaults to an empty list.

  • method (str, optional) – Method used to calculate correlation. Options include “pearson”, “kendall”, or “spearman”. Defaults to “kendall”.

Returns:

A Series object with the correlation score for each feature relative to the y variable.

Return type:

pd.Series
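
The idea of scoring each feature against the dependent variable can be sketched with pandas' DataFrame.corrwith. This is only an approximation of the behavior described above, not the library's code: the helper name correlation_to_y and the toy data are hypothetical, and "pearson" is used so that the example runs without SciPy.

```python
import pandas as pd

# Toy descriptor table; column names other than Log_MP_RATIO are made up.
data = pd.DataFrame({
    "mol_weight": [100.0, 150.0, 200.0, 250.0, 300.0],
    "logp": [1.0, 1.6, 1.9, 2.7, 3.1],
    "tpsa": [90.0, 40.0, 75.0, 20.0, 60.0],
    "Log_MP_RATIO": [0.2, 0.4, 0.5, 0.9, 1.0],
})


def correlation_to_y(df, y, cols_to_ignore=None, method="pearson"):
    """Correlation of every remaining feature with the column named y."""
    excluded = list(cols_to_ignore or []) + [y]
    features = df.drop(columns=excluded, errors="ignore")
    return features.corrwith(df[y], method=method)


corr_y = correlation_to_y(data, y="Log_MP_RATIO")
print(corr_y.round(3))
```

The output is a Series indexed by feature name, which is the shape that a later step can use to decide which of two correlated features to keep.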

remove_highly_correlated(df_correlation: DataFrame | None = None, df_corr_y: DataFrame | None = None, df: DataFrame | None = None, threshold: float = 0.9, verbose: bool = False, inplace=False, graph: bool = False) DataFrame

Removes all the features from the DataFrame that have a correlation coefficient above the specified threshold.

Parameters:
  • df_correlation (pd.DataFrame, optional) – A DataFrame representing the correlation matrix. Defaults to None, which will use the internal DataFrame’s correlation matrix.

  • df_corr_y (pd.DataFrame, optional) – A DataFrame of correlation values correlating all features to the dependent variable. Defaults to None, which will calculate it from the internal DataFrame.

  • df (pd.DataFrame, optional) – A DataFrame with only continuous data describing the observations and features. Defaults to None, which will use the internal DataFrame.

  • threshold (float, optional) – Correlation threshold above which features are dropped. Defaults to 0.9.

  • verbose (bool, optional) – If True, displays additional information during processing. Defaults to False.

  • inplace (bool, optional) – If True, replaces the internal DataFrame with the result. Defaults to False.

  • graph (bool, optional) – If True, draws a heatmap of all dropped features and their respective correlation to each other. Defaults to False.

Returns:

A DataFrame with highly correlated features removed.

Return type:

pd.DataFrame
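
The documentation above does not say which feature of a correlated pair is dropped. One plausible strategy, given that the method accepts correlations to the dependent variable, is to keep the feature that tracks y more closely. The sketch below implements that assumption with plain pandas. The helper name drop_highly_correlated, the tie-breaking rule, and the toy data are all hypothetical.

```python
import pandas as pd

# x2 is nearly a rescaled copy of x1; x3 is unrelated to both.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "Log_MP_RATIO": [1.1, 2.0, 2.9, 4.2, 5.0],
})


def drop_highly_correlated(df, y, threshold=0.9, method="pearson"):
    """Drop one feature from each pair whose |correlation| exceeds threshold."""
    features = df.drop(columns=[y])
    corr = features.corr(method=method).abs()
    corr_y = features.corrwith(df[y], method=method).abs()
    dropped = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue  # this pair was already resolved by an earlier drop
            if corr.loc[a, b] > threshold:
                # Assumed rule: drop the feature less correlated with y.
                dropped.append(a if corr_y[a] < corr_y[b] else b)
    return df.drop(columns=dropped), dropped


reduced, dropped = drop_highly_correlated(data, y="Log_MP_RATIO")
print(dropped)
print(list(reduced.columns))
```

Exactly one of x1 and x2 is removed here, while x3 and the dependent variable are left untouched.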

remove_low_variance(y: str = '', variance_threshold: float = 0, cols_to_ignore=None, verbose: bool = False, inplace: bool = False) tuple[DataFrame, list]

Removes features from the DataFrame with variance below the specified threshold.

Parameters:
  • df (pd.DataFrame) – DataFrame with continuous data describing observations and features.

  • y (str, optional) – The dependent variable which will be ignored in the feature removal process. Defaults to an empty string.

  • variance_threshold (float, optional) – The threshold for variance below which features will be removed. Defaults to 0.

  • cols_to_ignore (list, optional) – List of columns to be ignored during processing. Defaults to an empty list.

  • verbose (bool, optional) – If True, displays descriptive text to help visualize changes. Defaults to False.

  • inplace (bool, optional) – If True, updates the attribute df of the FeatureSelector object with the resultant DataFrame. Defaults to False.

Returns:

A tuple containing the DataFrame with low variance features removed and a list of the removed column names.

Return type:

Tuple[pd.DataFrame, list]
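
A minimal sketch of variance filtering, written with plain pandas rather than the FeatureSelector class. The helper name drop_low_variance and the toy data are hypothetical. The sketch assumes that a feature whose variance equals the threshold is also removed, so that the default threshold of 0 removes constant columns; the source does not state how the boundary case is handled.

```python
import pandas as pd

data = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],          # zero variance
    "varied": [0.1, 0.5, 0.9, 0.3],
    "Log_MP_RATIO": [0.2, 0.2, 0.2, 0.2],   # constant, but protected as y
})


def drop_low_variance(df, y="", variance_threshold=0.0, cols_to_ignore=None):
    """Return (filtered DataFrame, list of removed column names)."""
    protected = set(cols_to_ignore or [])
    if y:
        protected.add(y)
    candidates = [c for c in df.columns if c not in protected]
    variances = df[candidates].var()
    # Assumption: variance equal to the threshold also counts as "low".
    removed = [c for c in candidates if variances[c] <= variance_threshold]
    return df.drop(columns=removed), removed


filtered, removed = drop_low_variance(data, y="Log_MP_RATIO")
print(removed)
print(list(filtered.columns))
```

The dependent variable survives even though it is constant in this toy table, because it is excluded from the check.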

remove_multicollinearity(df: DataFrame | None = None, y: str = 'Log_MP_RATIO')

Removes multicollinearity from the DataFrame.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame with continuous data describing the observations and features. Defaults to None.

  • y (str, optional) – Name of the column for the dependent variable. Defaults to ‘Log_MP_RATIO’.
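
The documentation does not describe how multicollinearity is detected. A common diagnostic is the variance inflation factor (VIF), which for standardized features equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below computes VIFs under that assumption; it is not necessarily what remove_multicollinearity does. The helper name variance_inflation_factors and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# x2 is nearly a rescaled copy of x1; x3 is only weakly related to them.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "Log_MP_RATIO": [1.1, 2.0, 2.9, 4.2, 5.0],
})


def variance_inflation_factors(df, y="Log_MP_RATIO"):
    """VIF per feature, taken from the diagonal of the inverse correlation matrix."""
    features = df.drop(columns=[y])
    corr = features.corr(method="pearson").to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns)


vif = variance_inflation_factors(data)
print(vif.round(1))
```

Features with a large VIF (a cutoff of 5 or 10 is a frequent rule of thumb) are candidates for removal. In this toy table x1 and x2 inflate each other, while x3 stays low.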

scale_data(y: str = 'Log_MP_RATIO', verbose: bool = False, inplace=False) DataFrame

Normalizes the data within the DataFrame.

Parameters:
  • y (str, optional) – The dependent variable (ignored in feature scaling). Defaults to ‘Log_MP_RATIO’.

  • verbose (bool, optional) – If True, displays text to help visualize changes. Defaults to False.

  • inplace (bool, optional) – If True, replaces the internal DataFrame with the normalized one. Defaults to False.

Returns:

The normalized DataFrame.

Return type:

pd.DataFrame
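
The source does not say which normalization scale_data applies. The sketch below assumes z-score standardization (zero mean, unit variance per feature) and leaves the dependent variable untouched, as the y parameter describes. The helper name standardize and the toy data are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [10.0, 30.0, 20.0, 50.0, 40.0],
    "Log_MP_RATIO": [1.1, 2.0, 2.9, 4.2, 5.0],
})


def standardize(df, y="Log_MP_RATIO"):
    """Return a copy with every feature except y scaled to mean 0, std 1."""
    features = [c for c in df.columns if c != y]
    scaled = df.copy()
    # ddof=0 uses the population standard deviation.
    scaled[features] = (df[features] - df[features].mean()) / df[features].std(ddof=0)
    return scaled


scaled = standardize(data)
print(scaled.round(3))
```

Returning a copy mirrors the inplace=False default described above: the input DataFrame is not modified.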

transform() DataFrame

Removes low variance and highly correlated features from the DataFrame stored in the FeatureSelector object.

Returns:

A DataFrame with low variance and highly correlated features removed.

Return type:

pd.DataFrame

qsar.preprocessing.feature_selector.display_data_cluster(df_corr: DataFrame, n_clusters: int = 8) None

Displays the correlated features in a clustered heatmap.

Parameters:
  • df_corr (pd.DataFrame) – Correlation DataFrame to be clustered.

  • n_clusters (int, optional) – Number of clusters to form. Defaults to 8.

  • n_init (int, optional) – Number of times the k-means algorithm will run with different centroid seeds. Defaults to 500.

  • max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run. Defaults to 1000.

qsar.preprocessing.feature_selector.display_elbow(df: DataFrame, max_num_clusters: int = 15) None

Displays the elbow curve for the given DataFrame, plotting the Within-Cluster Sum of Squares (WCSS) against the number of clusters.

Parameters:
  • df (pd.DataFrame) – A correlation dataframe to determine the optimal number of clusters for k-means clustering.

  • max_num_clusters (int, optional) – The maximum number of clusters to evaluate for the elbow curve. Defaults to 15.
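
To show what the elbow curve is built from, the sketch below computes the WCSS for a range of cluster counts, treating each row of a correlation matrix as one point. It uses a small NumPy implementation of Lloyd's k-means so that it runs without extra dependencies; the library itself may well rely on a different k-means implementation. The helper name kmeans_wcss and the synthetic data are hypothetical, and the plotting step is omitted.

```python
import numpy as np


def kmeans_wcss(points, k, n_iter=100, seed=0):
    """WCSS after one basic Lloyd's k-means run started from k random rows."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if it has none.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


# Synthetic data: 9 features made of three noisy copies of 3 latent signals.
rng = np.random.default_rng(42)
base = rng.normal(size=(50, 3))
X = np.repeat(base, 3, axis=1) + 0.1 * rng.normal(size=(50, 9))
corr = np.corrcoef(X, rowvar=False)

wcss = [kmeans_wcss(corr, k) for k in range(1, 7)]
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 4))
```

Plotting wcss against k gives the elbow curve; the number of clusters is usually read off at the point where adding another cluster stops reducing the WCSS noticeably.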