Designing an AI or ML Solution

These questions will help you transform your problem and requirements into the outline of a solution.

Q5.1 Identify key entities

Take a minute to think about key entities and concepts involved in your solution. Make notes here. Try to identify:

  • Classes - key types of entity
  • Samples - many independent instances of classes, to which inference or optimisation is applied
  • Features - properties or attributes of each sample, such as measurements
  • Targets - labels, known correct outputs, evaluation function, etc.
Additional context & tips

Since you're considering an AI/ML project, you're going to be dealing with a significant quantity of data. This data will have structure, which you should capture here. We'd like you to consider four aspects:

Class

A class is a name for a type of object or event. For example, "Car" is a name for a class of wheeled vehicles. Each individual car object is an instance of the Car class.

There might be several key entities in a complex project. Name the most important ones as classes. For example, if you're optimising vehicle pickups from depots to minimise total travel time, your classes might be vehicles and depots. Individual vehicles or journeys might be samples (see below). Try to keep it simple - the aim here isn't to produce a detailed taxonomy or design software, just to identify key entities as classes.

Samples

You're not looking for a one-off calculation - you're looking for a repeatable process. You're going to repeat that process with different "things" of the same type - these are your samples. 

Samples are instances of your classes. For example, if you're going to classify images according to their content, the images are your samples. Each image is one sample (or sample unit; confusingly, "sample" is sometimes used to mean a single instance and sometimes a group of instances).

Features

Features are the attributes of the samples. For example, let's say our class is Car, and for each car in our sample we have recorded year of manufacture, make and model. These three attributes are our features.
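As a concrete illustration, here is a minimal sketch in Python (assuming pandas as the data-handling library; the values are invented) of how the car example maps onto samples and features:

```python
import pandas as pd

# Each row is one sample (an individual car); each column is a feature.
cars = pd.DataFrame({
    "year_of_manufacture": [2012, 2018, 2021],
    "make": ["Ford", "Toyota", "Tesla"],
    "model": ["Focus", "Corolla", "Model 3"],
})

print(cars)        # three samples of the Car class, each with three features
print(cars.shape)  # (3, 3): number of samples x number of features
```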

Targets

We should also think about how to evaluate solutions. You will need to provide some sort of target for learning or optimisation. Do you have labels, or numbers which are instances of correct answers? Or, do you have a scoring function which can evaluate candidate solutions to the problem? Note how you would measure the quality of outputs from your AI/ML methods.
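If it helps, here is a small, hypothetical sketch of the two flavours of target: per-sample correct answers for supervised learning, and a scoring function for optimisation or reinforcement learning. The route and distances are invented purely for illustration:

```python
# Supervised targets: one known "correct answer" per sample.
labels = ["sold", "sold", "unsold"]           # categorical -> classification
resale_prices = [4500.0, 11000.0, 32000.0]    # numerical -> regression

# Optimisation / reinforcement targets: no per-sample answers, but a scoring
# function that can evaluate any candidate output.
def route_length(route, distances):
    """Lower is better: total distance over consecutive stops in the route."""
    return sum(distances[a][b] for a, b in zip(route, route[1:]))

print(route_length(["depot", "A", "B"], {"depot": {"A": 3.0}, "A": {"B": 2.5}}))  # 5.5
```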

Q5.2 Approach

Will your solution be:

  • Automation: The solution will perform a task without human intervention, although perhaps with human review.
  • Decision support: The solution will help people to complete a process, perhaps by making recommendations.
  • Insights: The solution will generate insights or data for people to use.

Describe how your solution will do one of these things, using specific terminology from your problem statement, and explain how it fulfils part of your value proposition.

Additional context & tips

What do you propose your solution will do? It should be something which addresses your problem statement and fulfils part of your value proposition.

Automation is a high-stakes approach - you need to be sure you'll achieve such a high level of performance that mistakes will either never happen, or that you'll be able to deal with the consequences. It's appropriate when the problem is well understood, especially if the statistical properties of the data do not change (for example, physical processes). It's not a good idea when the statistics of the data are non-stationary (meaning they constantly change), as with user behaviour. In that case, you'll need continual re-training and re-evaluation of your solution.

Automation can be made safer by building in human review and exception-handling processes from the start. Ensure you know the consequences of these exceptions and how you will be able to identify them (other than by user complaints!) 

A safer approach is Decision Support. This means the solution will work with humans to help them do their jobs more effectively and efficiently. One way to do this is to get the AI/ML to make recommendations, which human experts can choose to review and accept, modify or reject. This places exception-handling at the centre of the process - you will create workflows and user interfaces so that it can happen continually, and fail-safes are built in. Decision support is appropriate when the AI/ML must perform specific tasks, but you know it won't be perfect and you need ways to deal with the exceptions.

The third approach is insights creation. This differs from decision support in that the insights are less targeted towards a specific use-case and more for open-ended exploration of the data. Insights solutions are similar to Business Intelligence (BI) platforms, but may still have very sophisticated algorithms or models underneath to generate specific insights - they are not simply data visualisation. Insights outputs might include anomaly detection or trend analyses based on ML models. If your solution produces insights, your users must be willing and able to act on them.

Q5.3 Problem representation

This question may require some AI/ML expertise, but have a go anyway. You can always change the answer later. Popular AI/ML problem representations are listed below. Which one will you use? How will it be fitted to your problem?

  1. Optimisation (you must be able to generate and evaluate all possible solutions; AI can search through them efficiently)
  2. Unsupervised Learning (discover patterns in data)
  3. Supervised Learning - requires a large dataset of samples with "correct" answers. Will learn to generate "correct" answers for other samples. There are two main types:
    1. Classification: The answers are categorical labels, such as Case/Control or 0/1.
    2. Regression: Approximating a function; the answers are real numbers such as 5.18.
  4. Reinforcement Learning. You must create a function which defines the quality (reward) of any action or output of the solution. Used when there's no "correct" answer, but the quality of answers can be evaluated.
Additional context & tips

This is one of the more technical questions, but don't stress if you can't answer confidently. It's worth having a go and understanding some of the possibilities out there.

Supervised Learning

One of the most common ML approaches is Supervised Learning. Supervised means there's a way to supervise the behaviour of your model by comparing its output to a set of correct answers. You must have these answers available - and lots of them.

Answers can be categorical (e.g. correct labels for classes such as "Disease" and "Healthy") or numerical. If the answers are categorical, you may want to use a Classification problem representation. Classification simply means "tell me the class of this sample".

If the answers are numerical, you can use a Regression problem representation. Another way to think of regression is as function approximation - the model will learn a function to reproduce the correct output numbers given the input features.

Figure: Common approaches to ML - supervised learning, classification vs regression
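To make the distinction concrete, here is a minimal sketch using scikit-learn (an assumption - any supervised learning library would do) on synthetic data: the same features are used once with categorical targets (classification) and once with numerical targets (regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # 200 samples, 3 features

# Classification: targets are categorical labels (here 0/1).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict(X[:5]))                  # predicted labels for 5 samples

# Regression: targets are real numbers; the model approximates a function.
y_reg = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:5]))                  # predicted numbers for 5 samples
```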

Other problem representations

The figure below shows some other common problem representations.

Figure: Other problem representations - optimisation, reinforcement learning and unsupervised learning

Optimisation

An optimisation problem involves searching through a space of potential solutions to find candidates that maximise or minimise an objective function. The objective function must be able to provide a numerical score for any candidate solution. All possible solutions must be represented in the space; AI algorithms will try to search the space efficiently to find good solutions. Optimisation is typically used when the problem is well defined but highly constrained, and the primary difficulty is finding good candidate solutions. The methods are relatively simple and all outputs are interpretable.

Optimisation problem representations include Timetabling and Scheduling, Vehicle Routing, Bin-Packing and other assignment problems. They typically have "hard" constraints (which must be satisfied) and "soft" constraints (do your best).
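As a rough illustration of these ideas, the sketch below sets up a toy task-assignment problem with one hard constraint and one soft constraint, and searches the solution space with plain random search. The problem and numbers are invented; a real solution would use a proper solver or metaheuristic.

```python
import random

# Toy assignment problem: assign each of N_TASKS tasks to one of N_MACHINES machines.
# Hard constraint: no machine may take more than CAPACITY tasks.
# Soft constraint: prefer to spread tasks evenly across machines.
N_TASKS, N_MACHINES, CAPACITY = 12, 4, 4

def objective(assignment):
    """Return a score to minimise, or None if a hard constraint is violated."""
    loads = [assignment.count(m) for m in range(N_MACHINES)]
    if max(loads) > CAPACITY:                            # hard constraint: must be satisfied
        return None
    mean = N_TASKS / N_MACHINES
    return sum((load - mean) ** 2 for load in loads)     # soft constraint: do your best

def random_search(iterations=10_000, seed=0):
    """Search the space of candidate solutions, keeping the best feasible one."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(iterations):
        candidate = [rng.randrange(N_MACHINES) for _ in range(N_TASKS)]
        score = objective(candidate)
        if score is not None and score < best_score:
            best, best_score = candidate, score
    return best, best_score

print(random_search())
```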

Reinforcement Learning

Reinforcement learning frames the problem as an Agent which interacts with a World. The Agent receives Observations from the World and must learn to generate Actions which produce high Rewards. A Reward is simply a number which represents the quality of the most recent Agent Action. The Agent interacts with the World over a period of time, usually called an Episode, taking many Actions and accumulating many Rewards. You must be able to define the reward for any action in any state of the Agent and World, and you must be able to enumerate all possible actions; the set of actions should not change over time.
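The sketch below shows the bare Agent/World loop with a toy, invented World: the agent receives an Observation, chooses an Action from a fixed action set, and accumulates Reward over an Episode. No learning is shown - it only illustrates the interaction structure.

```python
import random

class World:
    """A toy World: the Agent is rewarded for choosing the action that matches the current state."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.choice([0, 1])

    def observe(self):
        return self.state                                # Observation passed to the Agent

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0    # Reward for the most recent Action
        self.state = self.rng.choice([0, 1])             # the World moves on
        return reward

ACTIONS = [0, 1]                                         # all possible Actions, fixed over time

def run_episode(policy, steps=100):
    """One Episode: the Agent acts repeatedly and accumulates Reward."""
    world, total_reward = World(), 0.0
    for _ in range(steps):
        observation = world.observe()
        action = policy(observation)                     # the policy maps Observation -> Action
        total_reward += world.step(action)
    return total_reward

print(run_episode(lambda obs: random.choice(ACTIONS)))   # agent that ignores observations
print(run_episode(lambda obs: obs))                      # agent that uses the observation
```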

Unsupervised Learning

Unsupervised learning is pattern or structure detection in data. It aims to reduce a large amount of data to a smaller set of model parameters which capture it as accurately and comprehensively as possible. For example, clustering of user behaviour - if you can find clusters in your data, you can start to think about what types of user those clusters represent and look for differences in behaviour between those clusters. Unsupervised learning can also be used for dimensionality reduction. 

Unsupervised learning usually doesn't directly solve a problem, but it generates insights about the data.
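For example, a minimal clustering and dimensionality-reduction sketch using scikit-learn (an assumption) on synthetic "user behaviour" features might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical behaviour features, e.g. sessions/week, avg session length, pages/visit.
behaviour = np.vstack([
    rng.normal(loc=[2, 5, 3], size=(100, 3)),    # one underlying user type
    rng.normal(loc=[10, 20, 8], size=(100, 3)),  # another user type
])

# Clustering: discover groups of similar users without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(behaviour)
print(np.bincount(clusters))                     # how many users fall into each cluster

# Dimensionality reduction: compress the features while preserving structure.
reduced = PCA(n_components=2).fit_transform(behaviour)
print(reduced.shape)                             # (200, 2)
```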

Q5.4 Data transformation

AI/ML methods rarely consume structured (relational) data directly; instead, relational data is usually denormalised into a single tabular format. Even if your data is images or video, it will usually still have the same structure - many samples, each with the same features. How will you transform your various data sources into a single tabular format? Pay particular attention to links between data sources and changes of cardinality.

Additional context & tips

You have already explored data structure in a previous question (see the section on Data), and in the questions above you have identified what your AI/ML solution should do and the problem representation you might use. Given that knowledge, it's now time to think about how to fit the data to that representation.

Consider the figure below:

Figure: Linking data structures and denormalising them to produce an ML dataset

Let's say we want to explore the relationship between customer demographics and subscription cancellation.

In this figure, we have a relational database (or two input spreadsheets) which holds attributes of the classes we're modelling: Customers and Subscriptions. The table at the bottom is our ML dataset, which includes details of customers and their subscriptions.

To produce an ML dataset, we have to join multiple data sources together. We need to copy across attributes about Customers (such as age) and attributes about Subscriptions (e.g. status).

You will need to deal with cardinality changes (e.g. there are many subscriptions per customer) and think about how that will affect the algorithms, models and results. 

If we have questions like "do older customers keep their subscriptions for longer?" we need to calculate new attributes during the transformation process - in this example, we have calculated "Subscription duration" from Subscription start and end dates. 

Finally, which of these features are labels or target outputs, if you're using a problem representation that requires them?
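Putting the figure's example together, here is a hedged sketch in pandas of the join, the derived duration feature, one way to handle the cardinality change, and a possible cancellation target. The table and column names are assumptions based on the description above, and the data is invented.

```python
import pandas as pd

# Hypothetical source tables mirroring the figure: Customers and Subscriptions.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "age": [34, 58],
})
subscriptions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "status": ["cancelled", "active", "cancelled"],
    "start_date": pd.to_datetime(["2022-01-01", "2023-03-01", "2021-06-15"]),
    "end_date":   pd.to_datetime(["2022-12-31", pd.NaT, "2023-06-14"]),
})

# Join (denormalise): one row per subscription, with customer attributes copied across.
ml_dataset = subscriptions.merge(customers, on="customer_id", how="left")

# Derived feature: subscription duration in days.
ml_dataset["duration_days"] = (ml_dataset["end_date"] - ml_dataset["start_date"]).dt.days

# Cardinality change: many subscriptions per customer. One option is to aggregate
# to one row per customer instead.
per_customer = ml_dataset.groupby("customer_id").agg(
    age=("age", "first"),
    n_subscriptions=("status", "size"),
    mean_duration_days=("duration_days", "mean"),
)

# A possible target for supervised learning: did the subscription get cancelled?
ml_dataset["cancelled"] = (ml_dataset["status"] == "cancelled").astype(int)

print(ml_dataset)
print(per_customer)
```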

Q5.5 Outline the Pipeline

Sketch out the steps involved in obtaining and producing data for your solution on a continuous basis.

If you must provide "correct answers" for your chosen approach, how will those answers be produced? How will you collect human or automatic feedback to continuously measure solution performance?

How will users or operators interact with the solution? 

How are its outputs integrated into other systems?

Additional context & tips

In the transformation question, we asked you to provide detail on how the data would be transformed to fit the problem representation. Now we want you to take a step back and look at the bigger picture - the entire process or pipeline of steps that need to happen to feed data into your solution, and to take the results and provide them to users or stakeholders so they can benefit from them.

Consider the various systems and data sources you need to connect to, and how you will record, distribute and present the outputs.

If your approach involves human review and continual evaluation, how will this happen? How will issues be detected, tracked and resolved? If human review is not part of your approach, how will you deal with faults and errors? Can you capture feedback from users of downstream systems?
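One way to sketch the pipeline is simply as a sequence of named stages, to be filled in for your own systems. Everything below is a hypothetical skeleton, not a prescribed architecture - the function names are placeholders.

```python
# Hypothetical pipeline skeleton: each stage is a placeholder to be replaced
# with real integrations for your data sources and downstream systems.

def ingest():
    """Pull raw data from source systems (databases, APIs, files)."""
    ...

def transform(raw):
    """Clean, join and denormalise raw data into the tabular ML dataset."""
    ...

def predict(dataset, model):
    """Run the trained model and produce outputs (predictions, scores, insights)."""
    ...

def publish(outputs):
    """Write outputs where users and downstream systems can consume them."""
    ...

def collect_feedback():
    """Capture human reviews, corrections and downstream error reports."""
    ...

def run_pipeline(model):
    raw = ingest()
    dataset = transform(raw)
    outputs = predict(dataset, model)
    publish(outputs)
    feedback = collect_feedback()
    return feedback   # feeds re-training, monitoring and evaluation
```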