An Introduction to AI Model Training for Life Sciences: Part 2

Taking a closer look at each of the four steps within the training phase of AI model training.

When it comes to training an AI model, success doesn’t hinge only on the quality of your data. Your training process plays a direct role in the quality and accuracy of your model’s results. As we discussed in our previous post, the workflow for model training consists of the training phase and the inference phase.

Within the training phase, the ML workflow consists of four stages of data training.

The four stages of data training

  1. Data collection – acquiring relevant data from multiple sources for ML training.
  2. Data cleaning and preprocessing – removing inconsistencies, completing missing values, and transforming data to prepare it for training.
  3. Model selection and training – selecting the right training algorithm based on the problem and feeding it the prepared data.
  4. Model evaluation – assessing the model’s performance against known outcomes to confirm that it predicts accurately and that training succeeded.

Let’s take a closer look at each of these four steps. 

Data collection

The first step consists of collecting raw data from a variety of sources. The data can come in different formats, including PDFs, Excel spreadsheets, CSVs, databases, and data from applications and API messages. This data is sent through a preprocessing pipeline where it is filtered and prepared for cleaning. There may be additional post-pipeline checks made to the data to ensure it is adequately processed and cleaned. 
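As a minimal sketch of this stage, assuming two hypothetical raw inputs (a CSV file and a JSON API payload, both with made-up values), collection boils down to normalizing heterogeneous sources into one consistent record format before the data enters the preprocessing pipeline:

```python
import csv
import io
import json

# Hypothetical raw inputs in two common formats (contents are illustrative).
csv_text = "patient_id,age\nP001,54\nP002,61\n"
json_text = '[{"patient_id": "P003", "age": 47}]'

# Normalize both sources into one list of records for the preprocessing pipeline.
records = list(csv.DictReader(io.StringIO(csv_text)))
records += json.loads(json_text)

# Coerce types consistently across sources (CSV values arrive as strings).
for r in records:
    r["age"] = int(r["age"])

print(len(records))  # 3 records collected from two sources
```

In a real pipeline each source would come from a file, database, or API client, but the shape of the step is the same: read, normalize, and hand off for cleaning.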

Data cleaning

Why does data need to be cleaned within a preprocessing pipeline for ML model training? There are several reasons:

  • The data collected may have missing values within certain datasets that could lead to an inaccurate analysis.
  • It may contain outlier data points that differ significantly from others, potentially skewing results.
  • There may be repeated entries within datasets, causing over-representation of certain data points.

Collected data must be preprocessed and cleaned to prevent errors in the training process that could impact the model’s performance. 
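The three problems above map directly onto three cleaning operations. Here is a minimal sketch using pandas, assuming a made-up table of subject temperature readings (the column names, values, and the 90–110°F plausibility bounds are all illustrative):

```python
import pandas as pd

# Toy dataset with all three issues: a duplicated row, a missing value,
# and an implausible outlier (values are made up for illustration).
df = pd.DataFrame({
    "subject": ["S1", "S2", "S2", "S3", "S4"],
    "temp_f":  [98.6, 99.1, 99.1, None, 185.0],
})

df = df.drop_duplicates()                                   # repeated entries
df["temp_f"] = df["temp_f"].fillna(df["temp_f"].median())   # missing values
df = df[df["temp_f"].between(90, 110)]                      # implausible outliers
```

After these three steps the duplicate S2 row is gone, S3’s missing reading is filled with the median, and the 185°F outlier is dropped, leaving three clean rows for training.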

Model selection

The ML model you select will depend on key criteria related to your desired training outcomes, as well as on the types of training models available. To ensure you make the right selection, consider the following criteria:

  • The problem type
  • The data size
  • Interpretability
  • Domain specificity

You also need to be aware of the various types of models available for training:

  • Regression
  • Classification
  • Clustering
  • Neural Networks for NLP and vision
  • Decision trees for tabular data
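To make the connection between the criteria and the model types concrete, here is a toy heuristic (illustrative only, not a prescription) that maps a problem type and data shape to a candidate model family from the list above:

```python
# Toy heuristic: map selection criteria to a candidate model family.
# The mapping is illustrative, not a definitive recommendation.
def suggest_model(problem_type, tabular_data=True):
    if problem_type == "regression":
        return "regression model"
    if problem_type == "classification":
        return "decision tree" if tabular_data else "neural network"
    if problem_type == "clustering":
        return "clustering (e.g. k-means)"
    raise ValueError(f"unknown problem type: {problem_type}")

print(suggest_model("classification", tabular_data=True))   # decision tree
print(suggest_model("classification", tabular_data=False))  # neural network
```

In practice this decision also weighs interpretability and domain specificity, which no simple lookup can capture, but the exercise of writing the decision down is a useful starting point.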

Model training

Within model training, model parameters and hyperparameters control how the model learns from the input training data. Model parameters are values the model learns from the training data in order to make predictions. Hyperparameters are adjustable settings, set before training begins, that are used to optimize the model’s performance.
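The distinction is easy to see in code. In this sketch (assuming scikit-learn is available; the data is made up), the regularization strength is a hyperparameter you set before training, while the coefficients and intercept are parameters the model learns from the data:

```python
from sklearn.linear_model import Ridge

# Hyperparameter: the regularization strength is set *before* training.
model = Ridge(alpha=0.1)

# Toy training data (illustrative): y is roughly 2*x1 + 3*x2.
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
y = [2, 3, 5, 7]

model.fit(X, y)

# Parameters: the coefficients and intercept are *learned* from the data
# (shrunk slightly toward zero by the regularization).
print(model.coef_, model.intercept_)
```

Tuning hyperparameters (e.g. trying several values of `alpha`) is part of optimizing performance; the parameters themselves are never set by hand.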

There are different training techniques that can be utilized for model training, most notably batch training and stochastic gradient descent.

Batch training updates weights using the entire dataset, which provides stable and consistent weight updates. Stochastic training updates weights using one data point at a time, which offers faster but more variable updates.
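The contrast can be sketched on a tiny least-squares problem (the data and learning rate below are illustrative). Batch training averages the gradient over the whole dataset before each update; stochastic training updates after every individual data point:

```python
# Toy data following y = 2x, so the optimal weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
lr = 0.05  # learning rate (a hyperparameter, chosen for illustration)

# Batch training: one update per pass, averaged over the entire dataset.
w_batch = 0.0
for _ in range(100):
    grad = sum((w_batch * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w_batch -= lr * grad

# Stochastic training: one update per data point - noisier, but more frequent.
w_sgd = 0.0
for _ in range(100):
    for x, y in zip(xs, ys):
        w_sgd -= lr * (w_sgd * x - y) * x

print(round(w_batch, 3), round(w_sgd, 3))  # both converge toward 2.0
```

On clean data like this both converge to the same weight; on noisy, large datasets the trade-off between stability (batch) and update frequency (stochastic) becomes significant, which is why mini-batches are a common compromise.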

Other types of training include transfer learning, which consists of leveraging knowledge from an existing pre-trained model on a new, related task. This type of training is especially useful for NLP and vision. There are also ensemble methods for tabular data, which involve combining multiple models to improve your model’s predictions.
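The intuition behind ensembling is that individual models make different errors, which can partially cancel when their predictions are combined. A minimal averaging-ensemble sketch (the three "models" are hand-written stand-ins with deliberately different biases):

```python
# Three toy "trained models" for y = 2x, each slightly wrong in its own way
# (these stand-ins are illustrative, not real trained models).
def model_a(x): return 2.0 * x + 0.3   # biased high
def model_b(x): return 2.0 * x - 0.4   # biased low
def model_c(x): return 2.1 * x         # slightly wrong slope

def ensemble(x):
    """Average the individual predictions so their errors partially cancel."""
    preds = [model_a(x), model_b(x), model_c(x)]
    return sum(preds) / len(preds)

# At x = 1 the individual errors are 0.3, 0.4, and 0.1, but the
# ensemble's errors cancel almost completely.
print(ensemble(1.0))
```

Real ensemble methods (bagging, boosting, stacking) combine models in more sophisticated ways, but the core idea of error cancellation is the same.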

Model evaluation

After completing training, it’s time to evaluate how well your model has learned from its training data. Several metrics can help your team assess how well the model is performing during evaluation, including accuracy, precision, recall, F1 score (which combines precision and recall into a single score), and AUC-ROC. For NLP testing, perplexity is an important metric; for vision testing, IoU is key.
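The core classification metrics all derive from the same four confusion-matrix counts. A hand-computed sketch on a made-up set of binary labels makes the relationships explicit:

```python
# Toy binary classification results (labels are made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts: true/false positives and negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)          # fraction correct overall
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall    = tp / (tp + fn)                   # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

In practice a library such as scikit-learn computes these for you, but knowing the underlying counts makes it easier to diagnose *why* a metric is low.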

During evaluation, be sure to monitor for underfitting and overfitting in the results. Underfitting occurs when a model has failed to learn the patterns in the training data adequately and performs poorly on both the training data and new data. Overfitting occurs when a model gives accurate predictions on training data but not on new data.
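Those two definitions translate into a simple rule of thumb: compare performance on the training data against performance on held-out data. The thresholds in this sketch are illustrative, not standard values:

```python
# Rule-of-thumb diagnostic (thresholds are illustrative, not standards):
# underfitting = poor even on training data; overfitting = a large gap
# between training performance and held-out performance.
def diagnose(train_acc, test_acc, gap=0.10, floor=0.70):
    if train_acc < floor:
        return "underfitting"
    if train_acc - test_acc > gap:
        return "overfitting"
    return "ok"

print(diagnose(0.99, 0.72))  # overfitting: great on training, poor on new data
print(diagnose(0.55, 0.54))  # underfitting: poor even on training data
print(diagnose(0.90, 0.88))  # ok: strong and consistent on both
```

Sensible thresholds depend on the problem and the cost of errors, which is one more reason evaluation needs human judgment rather than a fixed rule.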

Post-training deployment and real-world considerations

With your model trained and ready for use, it’s time to deploy it within real-world operations. This could be within the cloud, on-premises, or via edge devices. There are domain-specific considerations to keep in mind based on the model’s deployment needs. For instance, if you need real-time processing for NLP or high image quality for vision, you may need to choose a different, more specialized deployment option.

Deployment inevitably comes with its own set of challenges to navigate. Scalability – how many people will use the model or require access to it – is important because it shapes your deployment strategy, as do latency requirements.

The model’s outputs also need to be continuously monitored to ensure they don’t “drift” (known as model drift), which results in a drop in output accuracy. If drift does occur, the monitoring team will need to collect new data for retraining.
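A very simple drift monitor compares a statistic of the model’s recent outputs against a baseline recorded at validation time. The rule and threshold below are a crude illustration (real monitors use statistical tests such as population-stability metrics), and all the score values are made up:

```python
import statistics

def drifted(baseline, current, threshold=0.5):
    """Flag drift when the current mean score shifts by more than
    `threshold` baseline standard deviations (a crude, illustrative rule)."""
    shift = abs(statistics.mean(current) - statistics.mean(baseline))
    return shift > threshold * statistics.stdev(baseline)

baseline = [0.50, 0.52, 0.48, 0.51, 0.49]   # scores recorded at validation time
stable   = [0.50, 0.49, 0.51]               # recent production scores, unchanged
drifting = [0.70, 0.72, 0.71]               # recent production scores, shifted

print(drifted(baseline, stable), drifted(baseline, drifting))  # False True
```

When the check fires, the monitoring workflow described above kicks in: collect the recent data, investigate the shift, and retrain if needed.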

Conclusion

The most important point we need to reiterate is that ML training and evaluation should never occur without a human monitoring the processes and results at all times.

There should always be a human governing AI and ML processes, from training to deployment and beyond. An ML model cannot – and should never – replace humans. At Saama, we like to call it keeping a “human in the loop” during all AI operations to augment your work. 

If you’d like to learn more about how our proprietary AI-powered platforms and solutions can enhance your clinical trial data management operations, book a demo with us. 
