GAN Featurizer

The QsarGanFeaturizer class counts heavy atoms in molecules, filters molecules by atom count, determines an appropriate atom count from a dataset's distribution, and converts SMILES strings into unique, feature-encoded molecules suitable as GAN input. It is designed to streamline the preparation of chemical datasets for QSAR modeling in a GAN framework, with a focus on molecular feature extraction and preprocessing.

class qsar.gan.gan_featurizer.QsarGanFeaturizer(**kwargs)

Bases: MolGanFeaturizer

Featurizes molecules for a Generative Adversarial Network (GAN) model using the RDKit and DeepChem libraries.

The class is responsible for processing SMILES strings into a format suitable for GAN models in QSAR applications.

determine_atom_count(smiles: DataFrame, quantile: float = 0.95) → tuple[int, Series]

Determines the atom count to use for a DataFrame of SMILES strings, taken at the given quantile of the dataset's atom-count distribution.

Parameters:
  • smiles (pd.DataFrame) – A DataFrame of SMILES strings.

  • quantile (float) – The quantile to use when determining the atom count. Default is 0.95.

Returns:

A tuple containing the selected atom count and the per-molecule atom counts.

Return type:

tuple[int, Series]
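The quantile-based choice described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the helper name quantile_atom_count, the linear interpolation between neighbouring values, and the rounding down to a whole number of atoms are all assumptions made for illustration, and the heavy-atom counts are made-up inputs rather than values computed from real SMILES strings.

```python
import math


def quantile_atom_count(atom_counts, quantile=0.95):
    """Return the atom count at `quantile`, using linear interpolation.

    Hypothetical helper for illustration; not part of QsarGanFeaturizer.
    """
    if not atom_counts:
        raise ValueError("atom_counts must not be empty")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be in [0, 1]")

    ordered = sorted(atom_counts)
    # Fractional index of the requested quantile in the sorted counts.
    position = quantile * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    value = ordered[lower] + fraction * (ordered[upper] - ordered[lower])
    # Assumption: round down so the result is a whole number of atoms.
    return math.floor(value)


# Made-up heavy-atom counts for five molecules; 30 is an outlier.
counts = [5, 7, 9, 12, 30]
cutoff = quantile_atom_count(counts, quantile=0.95)

# Keep only the molecules that fit within the chosen atom count.
kept = [count for count in counts if count <= cutoff]
```

With these inputs the cutoff is 26, so the 30-atom outlier is filtered out while the other four molecules are kept. Choosing a quantile below 1.0 keeps a single very large molecule from dictating the size of every encoded input.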

get_features(smiles: DataFrame) → ndarray

Returns the features for a DataFrame of SMILES strings.

Parameters:

smiles (pd.DataFrame) – A DataFrame of SMILES strings.

Returns:

An array of features for the SMILES strings.

Return type:

np.ndarray
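Because the class derives from MolGanFeaturizer, the features are presumably graph-style encodings of each molecule. The sketch below is only a rough, standard-library illustration of that idea, namely a padded adjacency matrix plus a padded vector of atom labels. The function encode_graph, its arguments, and the default atom_labels are hypothetical, and the real featurizer works on RDKit molecules rather than on hand-written atom and bond lists.

```python
def encode_graph(atomic_numbers, bonds, max_atom_count,
                 atom_labels=(0, 6, 7, 8, 9)):
    """Encode a molecule as (adjacency, nodes), padded to max_atom_count.

    Hypothetical helper for illustration only.
    atomic_numbers: atomic number of each heavy atom, in index order.
    bonds: iterable of (i, j, order) tuples between atom indices.
    Returns None if the molecule is too large or has an unsupported atom.
    """
    n_atoms = len(atomic_numbers)
    if n_atoms > max_atom_count:
        return None

    # Map each supported atomic number to a small integer label;
    # label 0 is reserved for padding.
    label_of = {z: index for index, z in enumerate(atom_labels)}
    if any(z not in label_of for z in atomic_numbers):
        return None

    nodes = [label_of[z] for z in atomic_numbers]
    nodes += [0] * (max_atom_count - n_atoms)

    # Symmetric adjacency matrix holding the bond order, 0 where no bond.
    adjacency = [[0] * max_atom_count for _ in range(max_atom_count)]
    for i, j, order in bonds:
        adjacency[i][j] = order
        adjacency[j][i] = order

    return adjacency, nodes


# Ethanol (SMILES "CCO") written out by hand: C-C-O with single bonds.
ethanol = encode_graph([6, 6, 8], [(0, 1, 1), (1, 2, 1)], max_atom_count=4)

# A molecule with more atoms than max_atom_count cannot be encoded.
too_large = encode_graph([6] * 5, [], max_atom_count=4)
```

Padding every molecule to the same size is what lets a batch be stacked into a single array, which is why the atom count chosen for the dataset matters before featurization.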

static get_unique_smiles(nmols: ndarray) → list

Returns a list of unique SMILES strings.

Parameters:

nmols (np.ndarray) – An array of molecules.

Returns:

A list of unique SMILES strings.

Return type:

list
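As a rough illustration of the deduplication step, the sketch below removes repeated SMILES strings while keeping their first-seen order. It assumes the molecules have already been converted to canonical SMILES strings (for example with RDKit) and that molecules which failed conversion appear as None or an empty string. The helper name unique_smiles is hypothetical, and the actual method takes an array of molecules rather than strings.

```python
def unique_smiles(smiles_list):
    """Return the distinct, non-empty SMILES strings in first-seen order.

    Hypothetical helper for illustration; it compares strings exactly,
    so inputs must already be canonical for duplicates to be detected.
    """
    seen = set()
    unique = []
    for smiles in smiles_list:
        # Skip entries for molecules that could not be converted.
        if not smiles:
            continue
        if smiles not in seen:
            seen.add(smiles)
            unique.append(smiles)
    return unique


generated = ["CCO", None, "CCO", "c1ccccc1", "", "CCO"]
distinct = unique_smiles(generated)
```

Exact string comparison only detects duplicates when every molecule is written in the same canonical form, since one structure can have many valid SMILES spellings.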