These articles about AI and ML project design are intended to address the real difficulties people experience when trying to define and scope their projects. Project design issues are widely regarded as a key cause of project failure. So to have the best chance of success, read our tips before you start, and review them regularly during your project.
These articles are organised as a set of questions you should ask yourself about each aspect of your
project. Most importantly, you do not need to be an AI or ML expert to answer the questions.
We also provide a free tool - a kind of Business Model Canvas for AI and ML projects:
Project designer
Using the tool, you can record your answers to these questions.
Take a minute to think about key entities and concepts involved in your solution. Make notes here. Try to identify the classes, samples, features and targets in your data.
Since you're considering an AI/ML project, you're going to be dealing with a quantity of data. This data will have structure which you should capture here. We'd like you to consider 4 aspects:
Class
A class is a name for a type of object or event. For example, "Car" is a name for a class of wheeled vehicles. Each individual car object is an instance of the Car class.
There might be several key entities in a complex project. Name the most important ones as classes. For example, if you're optimizing vehicle pickups from depots to minimize total travel time, your classes might be vehicles and depots. Individual vehicles or journeys might be samples (see below). Try to keep it simple - the aim here isn't to produce a detailed taxonomy or design software, just to identify key entities as classes.
Samples
You're not looking for a one-off calculation - you're looking for a repeatable process. You're going to repeat that process with different "things" of the same type - these are your samples.
Samples are instances of your classes. For example, if you're going to classify images according to their content, the images are your samples. Each image is one sample (or sample unit). Confusingly, people often use "sample" to mean either one thing or a group of things.
Features
Features are the attributes of the samples. For example, let's say our classes are cars, and in our sample of cars we have recorded year of manufacture, make and model. These three attributes are our features.
Targets
We should also think about how to evaluate solutions. You will need to provide some sort of target for learning or optimisation. Do you have labels, or numbers which are instances of correct answers? Or, do you have a scoring function which can evaluate candidate solutions to the problem? Note how you would measure the quality of outputs from your AI/ML methods.
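As a rough illustration of how these four aspects fit together, here is a minimal Python sketch based on the car example above. The `Car` class, its field names, the `resale_price` target and all values are hypothetical, chosen only to show the structure.

```python
from dataclasses import dataclass


@dataclass
class Car:
    """Class: the type of thing we are modelling (hypothetical example)."""
    year: int            # feature
    make: str            # feature
    model: str           # feature
    resale_price: float  # target: an illustrative "correct answer" to learn


# Samples: individual instances of the class (made-up values).
samples = [
    Car(2015, "Toyota", "Corolla", 9500.0),
    Car(2019, "Ford", "Focus", 12500.0),
]

# Split each sample into its features (inputs) and its target (output).
features = [(c.year, c.make, c.model) for c in samples]
targets = [c.resale_price for c in samples]
```

Every sample has the same features, and the targets are kept separate because they are what a model would be asked to reproduce.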
Will your solution be automation, decision support or insights creation?
Describe how your solution will do one of these things using specific terminology from your problem statement and explaining how it fulfills part of the value proposition.
What do you propose your solution will do? It should be something which addresses your problem statement and fulfils part of your value proposition.
Automation is a high stakes approach - you need to be sure you'll achieve such a high level of performance that mistakes will either never happen, or you'll be able to deal with the consequences. It's appropriate when the problem is well understood, especially if the statistical properties of the data do not change (for example, physical processes). It's not a good idea when the statistics of the data are nonstationary (meaning they constantly change), such as user behaviour. In this case, you'll need continual re-training and re-evaluation of your solution.
Automation can be made safer by building in human review and exception-handling processes from the start. Make sure you know the consequences of these exceptions and how you will identify them (other than by user complaints!).
A safer way to represent your problem is Decision Support. This means the solution will work with humans to help them do their jobs more effectively and efficiently. One way to do this is to get the AI/ML to make recommendations, which human experts can choose to review and accept, modify or reject. This process places exception-handling at the centre of the process - you will create workflows and user-interfaces for this to happen continually. Fail-safes are built-in. Decision support is appropriate when the AI/ML must perform specific tasks, but you know it won't be perfect and you need ways to deal with the exceptions.
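The accept, modify or reject workflow described above could be sketched as follows. This is only one possible shape, not a prescribed design: the `Recommendation` type, the decision strings and the `audit_log` list are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    sample_id: str
    suggested: str  # the output proposed by the AI/ML model


def apply_review(rec: Recommendation, decision: str,
                 correction: Optional[str] = None) -> Optional[str]:
    """Return the final outcome after a human decision on a recommendation."""
    if decision == "accept":
        return rec.suggested
    if decision == "modify":
        if correction is None:
            raise ValueError("a modified recommendation needs a correction")
        return correction
    if decision == "reject":
        return None
    raise ValueError(f"unknown decision: {decision}")


# Every review is recorded, so exceptions can be counted and studied later.
audit_log = []


def review_and_log(rec: Recommendation, decision: str,
                   correction: Optional[str] = None) -> Optional[str]:
    final = apply_review(rec, decision, correction)
    audit_log.append({"sample_id": rec.sample_id, "suggested": rec.suggested,
                      "decision": decision, "final": final})
    return final
```

Keeping a log of modified and rejected recommendations also gives you a source of corrected answers for later evaluation.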
The third problem representation is insights creation. This differs from decision support in that the insights are less targeted towards a specific use-case and more for open-ended exploration of the data. Insights solutions are similar to Business Intelligence (BI) platforms, but may still have very sophisticated algorithms or models underneath to generate specific insights. They are not simply data visualisation. Insights outputs might include anomaly detection or trend analyses, based on ML models. If your solution produces insights, your users must action those insights.
This question may require some AI/ML expertise, but have a go anyway. You can always change the answer later. Popular AI/ML problem representations are listed below. Which one will you use? How will it be fitted to your problem?
This is one of the more technical questions, but don't stress if you can't answer confidently. It's worth having a go and understanding some of the possibilities out there.
Supervised Learning
One of the most common ML approaches is Supervised Learning. Supervised means there's a way to supervise the behaviour of your model by comparing its output to a set of correct answers. These answers must be available, and you will need lots of them.
Answers can be categorical (e.g. correct labels for classes such as "Disease" and "Healthy") or numerical. If the answers are categorical, you may want to use a Classification problem-representation. Classification simply means "tell me the class of this sample".
If the answers are numerical, you can use a Regression problem-representation. Another way to think of regression is as function approximation - the model will learn a magical function to reproduce the correct output numbers given the input features.
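To make the difference concrete, here is a small sketch of both representations in plain Python. The methods are deliberately simple stand-ins (a nearest-neighbour rule for classification and a least-squares straight line for regression), and the body-temperature feature and all values are made up.

```python
def nearest_neighbour_label(train_x, train_y, x):
    """Classification: return the label of the closest training sample."""
    distances = [abs(tx - x) for tx in train_x]
    return train_y[distances.index(min(distances))]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Classification: the correct answers are categorical labels (toy values).
temperatures = [36.5, 36.8, 39.2, 40.1]
labels = ["Healthy", "Healthy", "Disease", "Disease"]
predicted = nearest_neighbour_label(temperatures, labels, 39.5)

# Regression: the correct answers are numbers (toy values on the line y = 2x + 1).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

In both cases the model is built from samples that already have correct answers, and is then asked about a new sample.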
Other problem representations
The figure below shows some other common problem-representations.
Optimisation
An optimisation problem involves searching through a space of potential solutions to find candidate solutions that maximise or minimise an objective function. The objective function must be able to provide a numerical score for any candidate solution. All possible solutions must be represented in the space; AI algorithms will try to search the space efficiently to find good solutions. Optimisation is typically used when the problem is well defined but highly constrained, and the primary difficulty is finding good candidate solutions. The methods are relatively simple and all outputs are interpretable.
Optimisation problem representations include Timetabling and Scheduling, Vehicle Routing, Bin-Packing and other assignment problems. They typically have "hard" constraints (must be satisfied) and "soft" constraints (do your best).
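A tiny bin-packing example shows the ingredients: a search space, an objective function, one hard constraint and one soft constraint. The weights, the capacity and the choice of exhaustive search are assumptions for illustration; real problems are usually far too large to enumerate and need smarter search.

```python
from itertools import product

weights = [4, 3, 2, 1]  # hypothetical item weights
capacity = 6            # hard constraint: no bin may exceed this load
n_bins = 2


def score(assignment):
    """Objective function: lower is better; None means a hard constraint is broken."""
    loads = [0] * n_bins
    for weight, bin_index in zip(weights, assignment):
        loads[bin_index] += weight
    if max(loads) > capacity:       # hard constraint (must be satisfied)
        return None
    return max(loads) - min(loads)  # soft constraint: prefer evenly loaded bins


# The search space: every way of placing each item into one of the bins.
candidates = product(range(n_bins), repeat=len(weights))
feasible = [(score(a), a) for a in candidates if score(a) is not None]
best_score, best_assignment = min(feasible)
```

Each position in `best_assignment` gives the bin chosen for the item at the same position in `weights`, which is why the output is easy to interpret.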
Reinforcement Learning
Reinforcement learning frames the problem as an Agent, which interacts with a World. The Agent receives Observations from the World and must learn to generate Actions which produce high Rewards. A Reward is simply a number which represents the quality of the most recent Agent Action. The Agent interacts with the world over a period of time, usually called an Episode, making many actions and accumulating many Rewards. You must be able to define the reward of any action in any state of the Agent and World, and also enumerate all potential actions, which do not change over time.
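The Agent, World, Observation, Action and Reward loop can be sketched with tabular Q-learning on a toy corridor. Q-learning is just one reinforcement learning method, chosen here because it is short. The corridor World, the reward of 1 at the goal, the random exploration and all parameter values are assumptions for illustration.

```python
import random

N_STATES = 5        # corridor positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)  # the fixed, enumerable set of actions: left or right


def step(state, action):
    """World: apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, max_steps=200, seed=0):
    """Tabular Q-learning where the Agent explores with purely random actions."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0  # each Episode starts at the left end of the corridor
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q = train()
# Greedy policy: the highest-valued action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

Note that the World here can return a Reward for any Action in any state, and the set of Actions never changes, which are the two requirements described above.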
Unsupervised Learning
Unsupervised learning is pattern or structure detection in data. It aims to reduce a large amount of data to a smaller set of model parameters which capture it as accurately and comprehensively as possible. For example, clustering of user behaviour - if you can find clusters in your data, you can start to think about what types of user those clusters represent and look for differences in behaviour between those clusters. Unsupervised learning can also be used for dimensionality reduction.
Unsupervised learning usually doesn't directly solve a problem, but it generates insights about the data.
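As a minimal sketch of clustering, the following plain-Python code runs a two-cluster k-means on a single feature. The "sessions per week" feature and its values are hypothetical, and starting the cluster centres at the minimum and maximum values is a simplification that assumes the data contains at least two distinct values.

```python
def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on one feature, starting from the min and max values."""
    centres = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each sample to the nearest cluster centre.
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            clusters[nearest].append(v)
        # Move each centre to the mean of the samples assigned to it.
        centres = [sum(c) / len(c) for c in clusters]
    return centres, clusters


# Hypothetical feature: sessions per week for eight users.
sessions = [1, 2, 2, 3, 20, 22, 25, 21]
centres, clusters = kmeans_1d(sessions)
```

Eight numbers are reduced to two model parameters (the centres). Deciding what the two groups mean, for example occasional and heavy users, is still up to you.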
AI/ML solutions rarely use structured (relational) data; instead, relational data is usually transformed to tabular format. Even if your data is images or video, it will usually still have the same structure - many samples, each with the same features. How will you transform your various data sources into a single tabular format? Pay particular attention to links between data and cardinality changes.
You have already explored data structure in a previous question (section on Data). In the previous questions, you have identified what your AI/ML solution should do, and the problem representation you might use. Given that knowledge, it's now time to think about how to fit the data to that representation.
Consider the figure below:
Let's say we want to explore the relationship between customer demographics and subscription cancellation.
In this figure, we have a relational database (or two input spreadsheets) which hold attributes of the class we're modelling: Customer-Subscriptions. The table at the bottom is our ML dataset, which includes details of customers and their subscriptions.
To produce an ML dataset we have to join multiple data sources together. We need to copy across attributes about Customers (such as age) and attributes about Subscriptions (e.g. status).
You will need to deal with cardinality changes (e.g. there are many subscriptions per customer) and think about how that will affect the algorithms, models and results.
If we have questions like "do older customers keep their subscriptions for longer?" we need to calculate new attributes during the transformation process - in this example, we have calculated "Subscription duration" from Subscription start and end dates.
Finally, which of these features are labels or target outputs, if you're using a problem representation that requires them?
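The customer and subscription example could be transformed along these lines. This is a sketch only: the table layouts, field names, dates and the `today` cut-off for open subscriptions are all made up for illustration.

```python
from datetime import date

# Hypothetical source tables; all names and values are made up.
customers = {
    1: {"age": 34},
    2: {"age": 61},
}
subscriptions = [
    {"customer_id": 1, "start": date(2023, 1, 1), "end": date(2023, 3, 1), "status": "cancelled"},
    {"customer_id": 2, "start": date(2022, 6, 1), "end": date(2023, 6, 1), "status": "cancelled"},
    {"customer_id": 2, "start": date(2023, 7, 1), "end": None, "status": "active"},
]


def build_dataset(customers, subscriptions, today):
    """Join the two sources, giving one output row (sample) per subscription."""
    rows = []
    for sub in subscriptions:
        customer = customers[sub["customer_id"]]
        end = sub["end"] or today  # open subscriptions are measured up to 'today'
        rows.append({
            "age": customer["age"],                      # copied from Customers
            "status": sub["status"],                     # copied from Subscriptions
            "duration_days": (end - sub["start"]).days,  # new derived feature
        })
    return rows


dataset = build_dataset(customers, subscriptions, today=date(2024, 1, 1))
```

Customer 2 has two subscriptions, so their age appears in two rows. This is the cardinality change mentioned above: a customer with many subscriptions carries more weight in the dataset. If you were predicting cancellation, `status` would be the target and the other columns the features.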
Sketch out the steps involved in obtaining and producing data for your solution on a continuous basis.
If you must provide "correct answers" for your chosen approach, how will those answers be produced? How will you conduct human or automatic feedback to continuously measure solution performance?
How will users or operators interact with the solution?
How are its outputs integrated into other systems?
In the transformation question, we asked you to provide detail on how the data would be transformed to fit the problem representation. Now we want you to take a step back and look at the bigger picture - the entire process or pipeline of steps that need to happen to feed data into your solution, and to take the results and provide them to users or stakeholders so they can benefit from them.
Consider the various systems and data sources you need to attach, and how you will record, distribute and present the outputs.
If your approach involves human review and continual evaluation, how will this happen? How will issues be detected, tracked and resolved? If human review is not part of your approach, how will you deal with faults and errors? Can you capture feedback from users of downstream systems?
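One simple way to measure performance continuously is to track recent feedback and flag when quality drops. The sketch below assumes that each piece of feedback can be reduced to a correct or incorrect verdict; the `FeedbackMonitor` name, the window size and the threshold are hypothetical.

```python
from collections import deque


class FeedbackMonitor:
    """Track the most recent verdicts and flag when quality drops."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.verdicts = deque(maxlen=window)  # keeps only the newest 'window' verdicts
        self.min_accuracy = min_accuracy

    def record(self, was_correct):
        """Store one verdict from a reviewer or a downstream system."""
        self.verdicts.append(bool(was_correct))

    def accuracy(self):
        """Fraction of recent outputs judged correct, or None if no feedback yet."""
        if not self.verdicts:
            return None
        return sum(self.verdicts) / len(self.verdicts)

    def needs_attention(self):
        """True when recent accuracy has fallen below the agreed threshold."""
        current = self.accuracy()
        return current is not None and current < self.min_accuracy
```

For example, with `FeedbackMonitor(window=4, min_accuracy=0.75)`, recording one correct verdict followed by three incorrect ones makes `needs_attention()` return `True`. In a real pipeline that signal would feed whatever issue tracking you already use.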