These articles about AI and ML project design are intended to address the real difficulties people experience when trying to define and scope their projects. Project design issues are widely regarded as a key cause of project failure. So to have the best chance of success, read our tips before you start, and review them regularly during your project.
These articles are organised as a set of questions you should ask yourself about each aspect of your
project. Most importantly, you do not need to be an AI or ML expert to answer the questions.
We also provide a free tool - a kind of Business Model Canvas for AI and ML projects:
Project designer
Using the tool, you can record your answers to these questions.
How will you evaluate the solution's performance in terms that are meaningful to stakeholders? For example, examination of system behaviour under specific conditions or results for known examples.
It can be difficult to understand what numerical performance metrics mean in real-world terms. This section should help you define qualitative ways to evaluate AI/ML solution performance, so that you are measuring it in ways that are genuinely meaningful to stakeholders.
Examine specific examples
When using Machine Learning, you will get lots of answers. It can be helpful to examine some specific samples in detail to get a better idea of how things work. You can select samples randomly, or deliberately select samples with unusual characteristics.
For example, you might look at samples which are known to be unusual or problematic for some reason, or identify samples where particular input feature values are present and verify that the outputs make sense. SMEs can help to pick out extreme and unusual input feature values and to define what good outputs would look like; for these deep-dives into specific samples, it also helps to produce graphics or reports which make assessment easier.
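As a rough sketch of how this might look in code, the snippet below collects a handful of random samples plus samples with extreme values of one input feature for SME review. The DataFrame `df`, the trained `model` and the "temperature" column are purely illustrative assumptions, not part of any specific project.

```python
# A minimal sketch, assuming a pandas DataFrame `df` of input features,
# a trained scikit-learn style `model`, and a hypothetical "temperature"
# column chosen only for illustration.
import pandas as pd

def collect_review_samples(df: pd.DataFrame, model, n_random: int = 5) -> pd.DataFrame:
    """Gather random and deliberately unusual samples for SME review."""
    # A few randomly chosen samples.
    random_samples = df.sample(n_random, random_state=42)

    # Deliberately pick samples with extreme feature values,
    # e.g. the top 1% of "temperature" readings.
    extreme_samples = df[df["temperature"] > df["temperature"].quantile(0.99)]

    review_set = pd.concat([random_samples, extreme_samples])
    review_set["prediction"] = model.predict(review_set[df.columns])
    return review_set

# The resulting table can be exported or plotted so SMEs can judge
# whether the outputs make sense for these specific cases.
```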
Compare to naive baseline models
Another way to gain insight into model behaviour is to define naive baseline models such as "predict the average value", "predict the median value" or "predict the most frequent label all the time". When your classes are imbalanced, this helps you see whether your ML models are actually any better than chance!
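If you're working in Python, scikit-learn ships these naive baselines out of the box. The sketch below uses synthetic, imbalanced data purely for illustration; substitute your own train/test split and metrics.

```python
# A minimal, self-contained sketch of naive baselines using scikit-learn,
# on synthetic data; replace with your own data split and metrics.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced synthetic data: roughly 95% of labels are 0.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Predict the most frequent label all the time"
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, pred))  # high, despite learning nothing
print("Baseline F1:", f1_score(y_test, pred))              # near zero, revealing the problem

# DummyRegressor(strategy="mean") or strategy="median" plays the same
# role for regression problems. If your trained model does not clearly
# beat these numbers, it is not yet adding value over chance.
```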
Feature importance
Next, you can look at feature importance. There are many feature-importance techniques, but all of them aim to explain which features contributed the most to some outputs. Some models are more interpretable than others; explainable AI techniques are useful to interpret feature importance for complex, "black-box" ML models. Once you've pulled out some measures of feature importance, plan to check with your SMEs if the right features are being used. Some models (such as logistic regression) have easily interpreted coefficients which also indicate the direction of effect; SMEs can verify these too.
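One widely used, model-agnostic technique is permutation importance. The sketch below uses scikit-learn's implementation on synthetic data; the model choice and feature names are illustrative assumptions only.

```python
# A minimal sketch of permutation feature importance with scikit-learn,
# on synthetic data; the feature names are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much performance degrades.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, importance in sorted(zip(feature_names, result.importances_mean),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")

# Review this ranking with your SMEs: are the features they expect
# to matter actually driving the predictions?
```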
Checking cost terms
Finally, if you're using Optimization techniques, you can dive into specific assignments or schedules and check that the implementation of constraints and cost terms is correct. These are interpretable by definition, so you just need your SMEs to check that the equations and outputs are correct.
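As an illustration, you can recompute a single cost term by hand for a specific candidate solution and compare it with the value your solver reports. The shift-scheduling scenario, the `schedule` dictionary and the rates below are entirely hypothetical.

```python
# A minimal sketch of checking a cost term by hand, assuming a hypothetical
# shift-scheduling problem. The schedule, rates and standard hours below are
# illustrative; substitute the terms from your own cost function.
OVERTIME_RATE = 1.5
STANDARD_HOURS = 40

schedule = {"alice": 45, "bob": 38, "carol": 50}  # worker -> assigned hours

def overtime_cost(schedule, hourly_rate=20.0):
    """Recompute the overtime cost term exactly as specified in the problem statement."""
    return sum(
        max(0, hours - STANDARD_HOURS) * hourly_rate * OVERTIME_RATE
        for hours in schedule.values()
    )

# Compare this hand-computed value against the cost the optimizer reports
# for the same schedule; any mismatch points to a bug in the cost formulation.
print("Overtime cost:", overtime_cost(schedule))
```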
What numerical performance metrics can you use to evaluate your solution? Which variables or outputs will be measured using each metric? How good is "good-enough", or what minimum performance is necessary? Do you have any existing systems or human performance which can act as a baseline?
Plan to measure in ways which will reflect real-world utility and establish the viability of the identified use-case.
To conduct systematic experiments and achieve objective progress you need to agree on numerical measures of solution performance. You can use more than one metric; different metrics expose different weaknesses, and help you understand why your solution is failing and how to improve it.
What to measure
The first aspect to consider is which output(s) to measure. This may be obvious from your problem representation (i.e. the thing you're optimizing or training the model to do), but sometimes you can only approach the problem indirectly, and you need aggregate output functions - in addition to the raw outputs of your models or algorithms - to assess overall performance meaningfully.
Classification metrics
In a classification problem representation, there are many metrics available, and we would recommend implementing all of the most popular. Many assume binary classification (two outcomes) but can be extended to multi-class and multi-label settings. They include accuracy, precision, recall, F1 score, ROC AUC and the confusion matrix.
Relying on just one metric can be misleading, e.g. if you measure accuracy in an imbalanced dataset where 95% of answers are "0", you can get 95% accuracy by just guessing "0" all the time! This is obviously not intelligent or desirable.
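To make this concrete, the sketch below computes several classification metrics side by side on synthetic, imbalanced data, so you can see how accuracy alone can paint a rosy picture that recall and F1 immediately contradict. The data and model here are illustrative only.

```python
# A minimal sketch computing several classification metrics at once on
# synthetic, imbalanced data, to show how a single metric can mislead.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, pred))   # can look great on imbalanced data
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))     # often reveals missed positives
print("F1       :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
```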
Regression metrics
In a regression problem you have a continuous range of output values. Metrics to measure output performance include mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE) and R².
See this article for intuition on choosing a metric. MAE is usually easier to interpret than RMSE.
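As a quick illustration of the difference, the sketch below computes MAE and RMSE on the same made-up predictions; a single large error inflates RMSE much more than MAE.

```python
# A minimal sketch comparing MAE and RMSE on the same predictions;
# y_true and y_pred are illustrative numbers only.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 50.0])   # one large value
y_pred = np.array([11.0, 11.0, 12.0, 20.0])   # one large error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# MAE stays in the original units and weights all errors equally;
# RMSE is dominated by the single large error.
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```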
Baselines and acceptance criteria
Is there any minimum performance necessary for solution viability, or client acceptance criteria? You'll need to measure these to ensure you can meet them. It's also useful to measure the performance of any existing solution (even with 3 rounds of human expert review, errors always slip through, regardless of what people may claim). Having a human or existing-solution baseline helps to justify your project, and prevents unfair comparisons with anecdotal or subjective claims about how well the old way works. It's best to define these now, before you get into arguments about them!
How will you evaluate your solution appropriately and fairly, minimizing bias?
How will you ensure your solution generalizes from your existing data, to real-world conditions? How can you ensure your data is representative of the variability of future, real-world data?
Let me convince you that the performance of an ML solution on your existing data does not matter. What actually matters is the real-world performance of your ML solution, which you can't measure until you put it into production (and even then, you might be alienating some user groups without being aware of it). Generalization refers to the use of a model in the real world, after training it on a limited sample of data.
To estimate how your solution might perform in real-world conditions (i.e. how it generalizes), you can plan to use validation techniques such as bootstrap resampling and cross-validation, which repeatedly train and evaluate your ML models on different subsets of data.
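As a sketch, scikit-learn's cross-validation utilities make this straightforward; the synthetic data, model choice and scoring metric below are placeholders for your own.

```python
# A minimal sketch of k-fold cross-validation with scikit-learn on synthetic
# data; substitute your own features, labels, model and scoring metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1")

# The spread across folds gives a rough idea of how performance may vary
# on unseen data; a large spread is a warning sign for generalization.
print("F1 per fold:", scores.round(3))
print("Mean F1:", scores.mean().round(3))
```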
Bias
Now let's talk about why using those validation techniques might not work as well as you might hope. One common reason is bias - which can be present in your data, your models or algorithms, and your wider business processes! Bias isn't just a thing you should worry about for ethical reasons. It can destroy the entire value proposition of your solution, rendering it useless at best and, at worst, destroying your business. And you wouldn't know until it's too late! This is serious stuff.
Look at the figure below, which illustrates bias in data used to train an ML animal classifier. The classifier tries to predict whether a picture contains a dog or a cat:
Using the training data provided, the model learns well and scores highly on performance metrics. But when we give it the "real world image" on the right, it is completely wrong. Worse, it is very confident in its wrong answer!
Can you work out why the model is wrong? The training data is biased because it is not representative of the variety of fur colours in real-world cats. In fact, all the dark-furred animals in the training data are dogs, a shortcut the classifier exploits to label any dark-furred animal as a dog.
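A simple, practical check for this kind of bias is to cross-tabulate your labels against attributes that should be irrelevant. The sketch below assumes a hypothetical metadata table with a "fur_colour" column; the column and values are illustrative only.

```python
# A minimal sketch of one simple bias check, assuming a hypothetical metadata
# table with a "fur_colour" attribute alongside the labels.
import pandas as pd

metadata = pd.DataFrame({
    "label":      ["dog", "dog", "dog", "cat", "cat", "cat"],
    "fur_colour": ["dark", "dark", "dark", "light", "light", "light"],
})

# Cross-tabulate labels against an attribute that should be irrelevant.
# A perfectly confounded table like this one means the model can "cheat"
# by learning fur colour instead of the animal itself.
print(pd.crosstab(metadata["label"], metadata["fur_colour"], normalize="index"))
```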
Now that you understand bias, what steps will you take to detect and control bias in your AI/ML solution?
Ethical and Responsible AI
A number of businesses have developed material to help you address bias by embracing ethical and responsible AI practices. You should aim to be familiar with the risks and mitigation strategies at the project design phase, and record the risks you have identified and the strategies you will adopt in response. This might include: