Quickstart Guide for qsarKit
Welcome to the Quickstart Guide for qsarKit! This guide aims to help new users get started with qsarKit by providing a brief overview, detailed installation instructions, and a simple example to demonstrate basic usage.
What is qsarKit?
qsarKit is a Python package designed for robust predictive modeling using Quantitative Structure-Activity Relationship (QSAR) analysis. It is tailored for researchers and health professionals working with environmental contaminants in breast milk. Developed by Professor Nadia Tahiri’s team, qsarKit integrates multiple predictive models and offers tools for synthetic data generation using Generative Adversarial Networks (GANs).
Basic Usage
Once you have installed and activated your qsarKit environment, you are ready to use the package. qsarKit can be used in two ways: run the package as a complete pipeline, or import individual modules and use only the functionality you need.
Running the Complete Pipeline
To execute the entire pipeline, including preprocessing, data augmentation, model training/optimization, and prediction, you can use a single command. This approach is useful for users who wish to apply the standard workflow with minimal setup:
python main.py --config ridge_model.yaml --output results/
In this example, ridge_model.yaml should contain all the necessary configurations for each step of the pipeline. The results of each step will be saved in the results/ directory.
The advantages of running the full pipeline include simplicity and the assurance that all steps are executed in the correct order. However, this method provides less flexibility compared to running each step individually.
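If the pipeline needs to be launched from another Python script, the command above can be wrapped in a small helper. The following is a minimal sketch, not part of qsarKit: the function names build_pipeline_command and run_pipeline are hypothetical, and it assumes the script is run from the directory that contains main.py. Only the --config and --output flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline_command(config_path, output_dir):
    """Return the argument list for the pipeline command shown above.

    Hypothetical helper: it only assembles the documented flags.
    """
    return [
        sys.executable, "main.py",
        "--config", str(config_path),
        "--output", str(output_dir),
    ]


def run_pipeline(config_path, output_dir):
    """Check the config file, create the output directory, and run main.py."""
    if not Path(config_path).is_file():
        raise FileNotFoundError(f"Configuration file not found: {config_path}")
    # Creating the directory up front is a precaution; main.py may do this itself.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the pipeline exits with an error.
    return subprocess.run(build_pipeline_command(config_path, output_dir), check=True)
```

Calling run_pipeline("ridge_model.yaml", "results/") would then be equivalent to typing the command by hand, with an early error if the configuration file is missing.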
Using Package Functionalities Individually
In addition to running the complete qsarKit pipeline, users can leverage the package’s modular design to use individual functionalities. This approach provides greater flexibility and allows integration with other data processing or analysis workflows.
Data Extraction and Preprocessing
Start by extracting and preprocessing your dataset. Use the Extractor and PreprocessingPipeline classes for these tasks:
from qsar.utils.extractor import Extractor
from qsar.preprocessing.custom_preprocessing import PreprocessingPipeline
import pandas as pd
# Configuration for data extraction
datasets_config = {
    'full_train': 'path/to/full_train_dataset.csv',
    'full_test': 'path/to/full_test_dataset.csv'
}
target_column = 'desired_target_column'
# Extract data
extractor = Extractor(datasets_config)
df_full = pd.concat([extractor.get_df("full_train"), extractor.get_df("full_test")])
# Preprocess data
preprocessing = PreprocessingPipeline(target=target_column, variance_threshold=0.0, cols_to_ignore=[], verbose=False, threshold=0.9)
pipeline = preprocessing.get_pipeline()
df_processed = pipeline.fit_transform(df_full)
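To build intuition for the two numeric arguments above, here is a dependency-free sketch of the kind of filtering they suggest. It is an assumption, not a description of PreprocessingPipeline internals: it treats variance_threshold as a cutoff below which a feature is dropped, and threshold as a maximum absolute Pearson correlation between two kept features. The helper names and the toy feature table are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_low_variance(columns, variance_threshold=0.0):
    """Keep only features whose population variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > variance_threshold}


def drop_high_correlation(columns, threshold=0.9):
    """Keep a feature only if it is not too correlated with any feature kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical toy descriptors, one list per feature column.
features = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "logP_copy": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with logP
    "constant": [5.0, 5.0, 5.0, 5.0],   # zero variance
    "mass": [10.0, 3.0, 8.0, 1.0],
}
filtered = drop_high_correlation(drop_low_variance(features))
print(list(filtered))  # ['logP', 'mass']
```

With variance_threshold=0.0 only constant columns are removed, which matches the conservative setting used in the example above.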
Cross-validation and Model Evaluation
Perform cross-validation and evaluate model performance using the CrossValidator class:
from qsar.utils.cross_validator import CrossValidator
from qsar.utils.visualizer import Visualizer
# Setup cross-validation and visualization
cross_validator = CrossValidator(df_processed)
visualizer = Visualizer()
# Create cross-validation folds
X_list, y_list, df, y, n_folds = cross_validator.create_cv_folds()
visualizer.display_cv_folds(df, y, n_folds)
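For readers new to cross-validation, the idea behind splitting data into folds can be sketched without any dependencies. This is a generic k-fold split, not the implementation of create_cv_folds, whose strategy is not described here; the function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    base_size, remainder = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds receive one extra sample.
        size = base_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


for train_idx, test_idx in k_fold_indices(n_samples=6, n_folds=3):
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in exactly one test fold. In practice the rows are usually shuffled (or stratified on the target) before splitting, so that each fold covers a similar range of activity values.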
Model Training and Hyperparameter Optimization
Dynamically load machine learning models, optimize their hyperparameters, and train them:
from qsar.utils.hyperparameter_optimizer import HyperParameterOptimizer
from qsar.utils import get_class_from_path
# Define model configurations
models_config = [
    {'name': 'ridge', 'hyperparameters': {...}},
    # Add more models as needed
]
# Dynamically load and optimize models
for model_config in models_config:
    model_name = model_config['name']
    model_class = get_class_from_path("qsar.models." + model_name, model_name.capitalize() + "Model")
    model_instance = model_class()
    # Optimize model
    optimizer = HyperParameterOptimizer(model=model_instance, data=df_processed, direction='maximize', trials=100)
    study = optimizer.optimize()
    # Set best hyperparameters
    best_params = study.best_params
    model_instance.set_hyperparameters(**best_params)
    # Evaluate model performance
    R2, CV, custom_cv, Q2 = cross_validator.evaluate_model_performance(model_instance, X_list, y_list)
    visualizer.display_model_performance(model_name, R2, CV, custom_cv, Q2)
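The R2 and Q2 values returned above are named after two metrics common in QSAR work. As a rough guide, and assuming the usual convention (the exact definitions used by evaluate_model_performance may differ), R2 is the coefficient of determination on fitted data, while Q2 applies the same formula to predictions made on held-out folds. A minimal sketch with hypothetical values:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed activities and two sets of predictions.
y_observed = [1.0, 2.0, 3.0, 4.0]
y_fitted = [1.1, 1.9, 3.2, 3.8]       # predictions on the training data
y_out_of_fold = [1.4, 1.6, 3.5, 3.3]  # predictions made while each sample was held out

r2 = r_squared(y_observed, y_fitted)
# Assumed convention: Q2 is the same statistic on out-of-fold predictions.
q2 = r_squared(y_observed, y_out_of_fold)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

A Q2 well below R2, as in this toy example, is the usual sign that a model fits its training data better than it generalizes.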
Remember, these examples are only a guideline. Adapt them to your specific datasets, models, and requirements. The qsarKit package is designed to be modular, offering flexibility for diverse QSAR modeling needs.
This approach allows you to customize each step of the pipeline according to your needs. You can modify the configurations, substitute modules, or integrate qsarKit’s functionalities into larger systems.
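Substituting a module is straightforward when classes are resolved from a dotted module path and a class name, as the get_class_from_path call in the training example does. The sketch below shows how such a lookup can be written with the standard library. It is an illustration under that assumption, not the qsarKit implementation; the name load_class is hypothetical, and the demonstration loads a standard-library class so that it runs without qsarKit installed.

```python
import importlib


def load_class(module_path, class_name):
    """Import module_path and return the attribute named class_name from it."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstration with a standard-library class.
ordered_dict_class = load_class("collections", "OrderedDict")
print(ordered_dict_class.__name__)

# The naming pattern used in the training example, for a model called "ridge":
model_name = "ridge"
print("qsar.models." + model_name, model_name.capitalize() + "Model")
```

Under this pattern, adding an entry to models_config only works if a module and class following the same naming scheme exist, which is worth checking before a long optimization run.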
For further examples and detailed instructions on each module, refer to the tutorials included with the package. They cover every component of qsarKit in depth.
Further Resources
Tutorials: Explore the tutorials/ directory at https://github.com/tahiri-lab/QSAR/tree/main/tutorials for detailed guides on using qsarKit, including model training, data preprocessing, and synthetic data generation.
Documentation: Visit the official qsarKit documentation at https://tahiri-lab.github.io/QSAR/ for comprehensive information on all features and functionalities.
Contact: For additional support or feedback, please contact Professor Nadia Tahiri at Nadia.Tahiri@USherbrooke.ca.
Thank you for choosing qsarKit for your QSAR predictive modeling needs. We hope this guide helps you get started smoothly.