Measuring and Evaluating a Machine Learning Solution

Answer these questions to learn how you can measure and validate your solution, and understand how it might be biased.

Q6.1 Qualitative

How will you evaluate the solution's performance in terms that are meaningful to stakeholders? For example, by examining system behaviour under specific conditions, or by checking results for known examples.

Additional context & tips

It can be difficult to understand what numerical performance metrics mean in real-world terms. This section should help you define qualitative ways to evaluate AI/ML solution performance; the purpose of this question is to plan an evaluation that is genuinely meaningful to stakeholders, not just numerically convenient.

Examine specific examples

When using Machine Learning, your models will produce a large number of outputs. It can be helpful to examine some specific samples in detail to get a better idea of how things work. You can select samples randomly, or deliberately select samples with unusual characteristics.

For example, you might look at samples which were known to be unusual or problematic for some reason. Or, you can identify samples where particular input feature values are present, and verify that the outputs make sense. SMEs are helpful here: they can pick out extreme or unusual input feature values and define what good outputs look like. For these deep-dives into specific samples, it also helps to build graphics or reports which make assessment easier.
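As a quick sketch of how you might pull out samples for SME review (the file and column names here are purely hypothetical):

```python
# Minimal sketch of selecting samples for SME review, assuming a pandas
# DataFrame of inputs with model predictions. All names are hypothetical.
import pandas as pd

df = pd.read_csv("predictions.csv")  # hypothetical file of inputs + predictions

# A handful of random samples for a general sanity check
random_sample = df.sample(n=10, random_state=42)

# Deliberately unusual samples: extreme values of one input feature
extreme_sample = df.nlargest(10, "component_size")  # hypothetical feature

# Save both for SME review alongside the model's outputs
review_set = pd.concat([random_sample, extreme_sample])
review_set.to_csv("sme_review_set.csv", index=False)
```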

Compare to naive, baseline models

Another way to gain insight into model behaviour is to define naive, baseline models such as "predict the average value", "predict the median value" or "predict the most frequent label all the time". Especially when your classes are imbalanced, this helps you see whether your ML models are actually any better than chance!
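Here's a minimal sketch of such a baseline using scikit-learn's Dummy estimators, on an illustrative imbalanced toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (sizes and class weights are illustrative only)
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Naive baseline: "predict the most frequent label all the time"
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:   ", model.score(X_test, y_test))
# DummyRegressor(strategy="mean") plays the same role for regression
```

If your real model can't beat the dummy, something is likely wrong with the features, the labels, or the problem framing.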

Feature importance

Next, you can look at feature importance. There are many feature-importance techniques, but all of them aim to explain which features contributed most to the outputs. Some models are more interpretable than others; explainable-AI techniques are useful for interpreting feature importance in complex, "black-box" ML models. Once you've pulled out some measures of feature importance, plan to check with your SMEs whether the right features are being used. Some models (such as logistic regression) have easily interpreted coefficients which also indicate the direction of effect; SMEs can verify these too.
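One way to do this in practice is permutation importance, a model-agnostic technique; the sketch below pairs it with logistic regression coefficients on a toy dataset (the feature names are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression().fit(X, y)

# Model-agnostic check: shuffle each feature, measure the drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
for name, imp in zip(feature_names, result.importances_mean):
    print(f"{name}: {imp:.3f}")

# Logistic regression coefficients also show direction of effect (sign),
# which SMEs can verify against domain knowledge:
print(dict(zip(feature_names, model.coef_[0].round(3))))
```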

Checking cost terms

Finally, if you're using Optimization techniques, you can dive into specific assignments or schedules and check that the implementation of constraints and cost terms is correct. These are interpretable by definition, so you just need your SMEs to confirm the equations and outputs are correct.
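As an illustrative (entirely hypothetical) example of auditing a cost term by hand, so an SME can check the equation and output line by line:

```python
# Recompute one cost term of a solver-produced assignment by hand.
# All names and numbers are illustrative, not from any specific solver.
def audit_cost(assignments, travel_cost, overtime_rate=50.0):
    total = 0.0
    for worker, task, hours in assignments:
        # Cost = travel cost + overtime premium beyond an 8-hour shift
        cost = travel_cost[(worker, task)] + overtime_rate * max(0, hours - 8)
        print(f"{worker} -> {task}: {cost:.2f}")  # line-by-line for SME review
        total += cost
    return total

assignments = [("alice", "site_a", 9), ("bob", "site_b", 7)]
travel_cost = {("alice", "site_a"): 20.0, ("bob", "site_b"): 35.0}

total = audit_cost(assignments, travel_cost)
# Compare against the objective value reported by your solver:
# assert abs(total - solver_objective) < 1e-6
```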

Q6.2 Quantitative

What numerical performance metrics can you use to evaluate your solution? Which variables or outputs will be measured with each metric? How good is "good enough", i.e. what minimum performance is necessary? Do you have any existing systems or human-performance figures which can act as a baseline?

Plan to measure in ways which will reflect real-world utility and establish the viability of the identified use-case.

Additional context & tips

To conduct systematic experiments and achieve objective progress, you need to agree on numerical measures of solution performance. You can use more than one metric; different metrics expose different weaknesses, and help you understand why your solution is failing and how to improve it.

What to measure

The first aspect to consider is which output(s) to measure. This may be obvious from your problem representation (i.e. the thing you're optimizing or training the model to do), but sometimes you can only approach the problem indirectly, and you then need aggregate output functions, in addition to the raw outputs of your models or algorithms, to assess overall performance meaningfully.

Classification metrics

In a classification problem representation there are many metrics available, and we recommend implementing several of the most popular. Many assume binary classification (two outcomes) but can be extended to multi-class and multi-label settings. They include:

  • Accuracy (correct answers / all answers)
  • Precision (fraction of predicted-true outputs which are actually true)
  • Recall (fraction of true labels which are predicted as true)
  • F-Score (harmonic mean of precision and recall - i.e. it considers both)
  • Sensitivity & Specificity (recall measured on the positive and negative classes, respectively)

Relying on just one metric can be misleading. For example, if you measure accuracy on an imbalanced dataset where 95% of answers are "0", you can get 95% accuracy by just guessing "0" all the time! This is obviously not intelligent or desirable.
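The toy sketch below demonstrates this failure mode with scikit-learn's metric functions: accuracy looks great while precision, recall and F-score all collapse to zero:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 95 + [1] * 5   # imbalanced: 95% of answers are "0"
y_pred = [0] * 100            # a "model" that just guesses "0" all the time

print("Accuracy: ", accuracy_score(y_true, y_pred))                     # 0.95
print("Precision:", precision_score(y_true, y_pred, zero_division=0))   # 0.0
print("Recall:   ", recall_score(y_true, y_pred, zero_division=0))      # 0.0
print("F1 score: ", f1_score(y_true, y_pred, zero_division=0))          # 0.0
```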

Regression metrics

In a regression problem you have a continuous range of output values. Metrics to measure output performance include:

  • R-squared (also known as the coefficient of determination; measures how well the input feature variables predict the output variable; note that several slightly different definitions exist)
  • MAE (Mean Absolute Error)
  • MSE (Mean Squared Error)
  • RMSE (Root Mean Squared Error)

When choosing between these, note that MAE is usually easier to interpret than RMSE, while RMSE penalizes large errors more heavily.
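A minimal sketch computing these metrics with scikit-learn and NumPy (the values are illustrative only):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # same units as the output
mse = mean_squared_error(y_true, y_pred)    # penalizes large errors more
rmse = np.sqrt(mse)                         # back in the output's units
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")
```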

Baselines and acceptance criteria

Is there any minimum performance necessary for solution viability, or any client acceptance criteria? You'll need to measure these to ensure you can meet them. It's also useful to measure the performance of any existing solution (even with 3 rounds of human expert review, errors always slip through, regardless of what people may claim). Having a human or existing-solution baseline helps to justify your project, and prevents unfair comparisons against anecdotal or subjective claims about how well the old way works. It's best to define these now, before you get into arguments about them!

Q6.3 Fairness and generalization

How will you evaluate your solution appropriately and fairly, minimizing bias?

How will you ensure your solution generalizes from your existing data to real-world conditions? How can you ensure your data is representative of the variability of future, real-world data?

Additional context & tips

Let me convince you that the performance of an ML solution on your existing data does not matter. What actually matters is the real-world performance of your solution, which you can't measure until you put it into production (and even then, you might be alienating some user groups without being aware of it). Generalization refers to using a model in the real world after training it on a limited sample of data.

To estimate how your solution might perform in real-world conditions (i.e. how it generalizes), you can plan to use validation techniques such as bootstrap resampling and cross-validation, which repeatedly train and evaluate your ML models on different subsets of data.
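For instance, a minimal cross-validation sketch with scikit-learn (the toy data and estimator are chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=42)

# 5-fold cross-validation: five train/evaluate rounds on different subsets
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Scores per fold:", np.round(scores, 3))
print(f"Mean {scores.mean():.3f} +/- {scores.std():.3f}")
# The spread across folds hints at how stable your estimate of real-world
# performance is; bootstrap resampling (sampling with replacement) gives a
# complementary view.
```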

Bias

Now let's talk about why those validation techniques might not work as well as you might hope. One common reason is bias, which can be present in your data, your models or algorithms, and your wider business processes! Bias isn't just something to worry about for ethical reasons. It can destroy the entire value proposition of your solution, at best rendering it useless and at worst destroying your business. And you wouldn't know until it's too late! This is serious stuff.

  • This is not only about models which deal with people. For example, imagine you're a fastener manufacturer who adjusts machines to handle a certain component size tolerance. If the statistics you collected about actual component tolerances are biased, you could produce a month's worth of stock and find that 95% of it is defective as a result!

Look at the figure below, which illustrates bias in data used to train an ML animal classifier. The classifier tries to predict whether a picture contains a dog or a cat:

[Figure: example of the concept of bias in data used for machine learning]

Using the training data provided, the model learns well and scores highly on performance metrics. But when we give it the "real world image" on the right, it is completely wrong. Worse, it is very confident in its wrong answer!

Can you work out why the model is wrong? The training data is biased because it is not representative of the variety of fur colours in real-world cats. In fact, all the dark-furred animals in the training data are dogs, a shortcut the classifier exploits to label any dark-furred animal as a dog.

Now that you understand bias, what steps will you take to detect and control bias in your AI/ML solution?

Ethical and Responsible AI

A number of organizations have published material to help you address bias by embracing ethical and responsible AI practices. You should aim to be familiar with the risks and mitigation strategies at the project design phase, and note the risks you have identified and the strategies you will adopt in response. These might include:

  • First, seeking to measure and understand whether your data is representative of real-world conditions
  • Seeking new data or surveying specific user groups
  • Re-weighting or re-sampling the data used
  • Explicitly evaluating performance in under-represented sample groups (see the sketch below)
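As a sketch of that last point (all column names and data here are hypothetical), you might compare a metric across sample groups like this:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical results table: true labels, predictions, and a column
# identifying which sample group each row belongs to
df = pd.DataFrame({
    "group":      ["a", "a", "a", "b", "b"],
    "label":      [1, 0, 1, 1, 0],
    "prediction": [1, 0, 1, 0, 1],
})

# Evaluate each group separately rather than only in aggregate
for group, subset in df.groupby("group"):
    acc = accuracy_score(subset["label"], subset["prediction"])
    print(f"group {group}: accuracy={acc:.3f} (n={len(subset)})")
# A large gap between groups, or a very small n for one group, suggests the
# data may be unrepresentative and needs re-weighting or new data.
```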