Cross Validator

The CrossValidator works with any model that conforms to the scikit-learn interface, which makes it suitable for QSAR model development and validation. The class supports both standard K-Fold and Stratified K-Fold cross-validation, so it can be used in a wide range of QSAR scenarios, including those with imbalanced datasets.

The evaluation methods in the CrossValidator class assess QSAR models with performance metrics such as R squared, cross-validation score, and mean squared error.

class qsar.utils.cross_validator.CrossValidator(df: DataFrame)

Bases: object

Class for cross-validation related functionalities.

Variables: df (pd.DataFrame) – DataFrame containing the data.

create_cv_folds(df: DataFrame | None = None, y: str = 'Log_MP_RATIO', n_folds: int = 3, n_groups: int = 5) → tuple

Create cross-validation folds.

Parameters:

df (pd.DataFrame, optional) – DataFrame to be used. If not provided, a default will be used.
y (str, optional) – Target column name. Defaults to 'Log_MP_RATIO'.
n_folds (int, optional) – Number of folds. Defaults to 3.
n_groups (int, optional) – Number of groups for stratified k-fold. Defaults to 5.

Returns:

A tuple containing a list of feature sets, a list of targets, a DataFrame with fold information, the target column name, and the number of folds.

Return type:

tuple
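
The interaction between n_folds and n_groups can be illustrated with a small standard-library sketch. It assumes that the continuous target is first binned into n_groups rank-based groups and that each group is then spread evenly over the folds; the binning that create_cv_folds actually performs is not described here, and the helper name assign_stratified_folds is hypothetical.

```python
from collections import Counter


def assign_stratified_folds(targets, n_folds=3, n_groups=5):
    """Return (fold index, group index) lists, one entry per target value.

    Hypothetical helper: illustrates stratifying a continuous target,
    not the library's own implementation.
    """
    n = len(targets)
    # Rank samples by target value so neighbouring ranks have similar targets.
    order = sorted(range(n), key=lambda i: targets[i])
    # Bin the ranked samples into n_groups groups of (nearly) equal size.
    group_of = [0] * n
    for rank, idx in enumerate(order):
        group_of[idx] = rank * n_groups // n
    # Within each group, deal samples to the folds round-robin.
    folds = [0] * n
    seen = [0] * n_groups
    for idx in order:
        group = group_of[idx]
        folds[idx] = seen[group] % n_folds
        seen[group] += 1
    return folds, group_of


targets = [0.1 * i for i in range(30)]
folds, groups = assign_stratified_folds(targets, n_folds=3, n_groups=5)
print(Counter(folds))  # number of samples assigned to each fold
```

With this scheme every fold receives samples from every target group, so low and high target values are represented in each fold.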

cross_value_score(model, df: DataFrame | None = None) → float

Compute cross-validation score for the given model.

Parameters:

model (Model) – The model to be evaluated.
df (pd.DataFrame, optional) – DataFrame to be used. If not provided, a default will be used.

Returns:

Mean cross-validation score.

Return type:

float
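
As a rough illustration of what a mean cross-validation score is, the sketch below trains a toy model on all folds but one, scores it on the held-out fold, and averages the scores. It uses only the standard library; SimpleLinearModel and mean_cv_score are hypothetical names, and the scoring metric (R squared) and fold assignment are assumptions rather than the behaviour of cross_value_score.

```python
class SimpleLinearModel:
    """Toy one-feature regressor with scikit-learn style fit/predict/score."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]

    def score(self, X, y):
        # R squared of the predictions on (X, y).
        pred = self.predict(X)
        mean_y = sum(y) / len(y)
        ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
        ss_tot = sum((t - mean_y) ** 2 for t in y)
        return 1 - ss_res / ss_tot


def mean_cv_score(model_factory, X, y, n_folds=3):
    """Average the held-out score over n_folds folds (hypothetical helper)."""
    n = len(X)
    scores = []
    for k in range(n_folds):
        # Sample i belongs to fold i % n_folds; fold k is held out.
        test_idx = [i for i in range(n) if i % n_folds == k]
        train_idx = [i for i in range(n) if i % n_folds != k]
        model = model_factory()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            model.score([X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return sum(scores) / len(scores)


X = [[float(i)] for i in range(12)]
y = [2.0 * row[0] + 1.0 for row in X]
score = mean_cv_score(SimpleLinearModel, X, y, n_folds=3)
print(score)
```

A fresh model is created for each fold so that no information from the held-out samples leaks into training.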

evaluate_model_performance(model, x_train, y_train, x_test, y_test) → dict

Compute various scores for model evaluation.

Parameters:

model (Model) – The model to be evaluated.
x_train (pd.DataFrame) – Training feature set.
y_train (pd.DataFrame) – Training target set.
x_test (pd.DataFrame) – Testing feature set.
y_test (pd.DataFrame) – Testing target set.

Returns:

A tuple containing the R squared score, CV score, custom CV score, and Q squared score.

Return type:

tuple
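
The R squared and Q squared scores can be sketched with plain Python. The formulas used by evaluate_model_performance are not given here; this example assumes the common QSAR convention in which R squared is computed on the training set and Q squared on the test set, using the training mean as the reference. The function names and data values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def q_squared(y_train, y_test, y_test_pred):
    """External Q squared: 1 - PRESS / SS around the training mean.

    Assumed definition; other Q squared variants exist.
    """
    mean_train = sum(y_train) / len(y_train)
    press = sum((t - p) ** 2 for t, p in zip(y_test, y_test_pred))
    ss_ext = sum((t - mean_train) ** 2 for t in y_test)
    return 1 - press / ss_ext


# Made-up observed values and predictions, for illustration only.
y_train = [1.0, 2.0, 3.0, 4.0]
y_train_pred = [1.1, 1.9, 3.2, 3.8]
y_test = [2.0, 5.0]
y_test_pred = [2.5, 4.5]

r2 = r_squared(y_train, y_train_pred)
q2 = q_squared(y_train, y_test, y_test_pred)
print(r2, q2)
```

A Q squared that is much lower than R squared usually suggests that the model fits the training data better than it generalises.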

static get_predictions(model, x_train: DataFrame, y_train: DataFrame, x_test: DataFrame) → tuple

Get predictions using the provided model.

Parameters:

model (object or model instance) – The model to be used for prediction.
x_train (pd.DataFrame) – Training feature set.
y_train (pd.DataFrame) – Training target set.
x_test (pd.DataFrame) – Testing feature set.

Returns:

A tuple containing predictions on the training set and predictions on the testing set.

Return type:

tuple
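
A minimal sketch of the fit-then-predict pattern that this signature suggests follows. It assumes the model is fitted on the training data inside the call, which the description above does not state explicitly; MeanModel and get_predictions_sketch are hypothetical stand-ins, not part of the library.

```python
class MeanModel:
    """Toy model with scikit-learn style fit/predict: predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def get_predictions_sketch(model, x_train, y_train, x_test):
    """Fit on the training data, then predict on both sets (assumed behaviour)."""
    model.fit(x_train, y_train)
    return model.predict(x_train), model.predict(x_test)


x_train = [[0.0], [1.0], [2.0]]
y_train = [1.0, 2.0, 3.0]
x_test = [[3.0], [4.0]]

train_pred, test_pred = get_predictions_sketch(MeanModel(), x_train, y_train, x_test)
print(train_pred, test_pred)
```

Any object that exposes fit and predict in this way can be passed in place of MeanModel.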