Bias in machine learning refers to a systematic error in predictions caused by an oversimplified model or data that does not represent the true population.
In machine learning, bias refers to the tendency of a model to consistently predict outcomes that are systematically different from the true values. It occurs when the model is unable to capture the complex relationships between the input variables and the output variable, and instead relies on simpler, more easily learned relationships that may not be representative of the true data-generating process. Alternatively, the model may capture the relationships present in the data provided, but the data itself is not representative of all the real-world conditions in which the model will be used. You may also see discussion of Fairness; in practice this is a related problem, where we actively seek to understand and mitigate specific types of bias.
The figure below shows the training data for a Cat or Dog classifier model. Given an image, the model must predict whether the image contains a cat or a dog. When testing the model with the "real world data" image on the right, the model predicts "Dog" and even worse, it does so with very high confidence! Why is it wrong, and why is it so confident in its wrong answer?
In this case, bias in the data has produced a biased model after training. Notice the training set only contains cats with light coloured fur, whereas it does include dogs with dark fur. This training data is not representative of all cats - and in the real world, when we encounter a black cat, the model gets it wrong.
The Catalogue of Bias does a great job of explaining and enumerating the many ways bias can creep into data, models, and methods. The rest of this article will cover just a few that are likely to pop up in the type of study users of Causal Wizard will conduct, especially through the data you use.
Bias can manifest in different ways depending on the specific context of the problem being addressed. Some examples of bias in machine learning include (but are not limited to):
Underfitting (and the Bias-Variance tradeoff): When a model is too simple to capture the complexity of the data, it may result in high bias and low variance. This means that the model consistently predicts outcomes that are far from the true values, and the error remains high even as the amount of training data increases. An example of underfitting is when a linear regression model is used to predict a nonlinear relationship between the input and output variables (see the sketch after this list). Typically, the model is underpowered to represent the system being modelled.
Sampling bias: This is a problem with the data provided. When the training data is not representative of the population from which it is drawn, it may result in biased predictions. For example, if a model is trained on data from only one geographic region, it may not generalize well to other regions.
Label bias: When the labels (target outputs) in the training data are incorrect or incomplete, it may result in biased predictions. For example, if a model is trained to predict the risk of default on loans, but the training data only includes defaults from certain demographic groups, the model may not be able to generalize well to other groups.
Algorithmic bias: When the algorithms used in machine learning are designed or trained in a way that unfairly disadvantages certain groups, it may result in biased predictions. For example, facial recognition algorithms may be biased against certain ethnic groups due to the way the training data was collected.
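To make the underfitting example above concrete, here is a minimal sketch (assuming scikit-learn and numpy; the data and model names are purely illustrative) that fits a straight line to a quadratic relationship, then compares it to a model that can actually represent the curvature:

```python
# A minimal sketch of underfitting (high bias): fitting a straight line
# to data generated from a nonlinear (quadratic) relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)  # true relationship is quadratic

# Underfit: a plain linear model cannot represent the curvature.
linear = LinearRegression().fit(X, y)

# Better specified: allow quadratic terms.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Linear MSE:   ", mean_squared_error(y, linear.predict(X)))
print("Quadratic MSE:", mean_squared_error(y, quadratic.predict(X)))
# The linear model's error stays high no matter how much data we collect,
# because the bias comes from the model's limited capacity, not the sample size.
```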
To reduce bias in machine learning models, it is important to carefully choose the features used in the model, ensure that the training data is representative of the population, and regularly monitor the predictions to identify and correct any biases that may arise.
In Causal Wizard, you should endeavour to ensure you provide data that isn't biased, and/or that you understand the types of bias present and their implications for your use-case.
We also recommend using simpler models wherever possible, and comparing causal effect estimates from simpler models to estimates from more complex models. You can choose the type of model in the Edit Study Wizard page, after the Check step, before requesting a new Result.
Overfitting is the opposite problem: an overly complex model learns the noise and idiosyncrasies of its training data and fails to generalize. To address overfitting, it is important to use techniques such as regularization, cross-validation, and early stopping.
Regularization methods such as L1 or L2 regularization add a penalty term to the model's objective function to encourage it to be simpler and reduce the impact of irrelevant features.
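As a rough sketch of what those penalty terms do in practice, the following compares an unpenalised linear regression with Ridge (L2) and Lasso (L1) fits in scikit-learn; the dataset and alpha values here are purely illustrative, not tuned:

```python
# A minimal sketch of L1 and L2 regularization with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))          # 20 features, only the first 3 matter
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)       # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)       # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)       # L1 penalty drives irrelevant ones to exactly zero

print("Non-zero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-6)))
print("Non-zero Lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
print("Largest Ridge coefficient:  ", float(np.max(np.abs(ridge.coef_))))
```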
Cross-validation involves dividing the data into multiple subsets and training on different subsets while testing on the others to evaluate the model's ability to generalize. A similar process, Bootstrap resampling, is provided in Causal Wizard.
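Here is a minimal sketch of both ideas, assuming scikit-learn with a placeholder dataset and model (this is an illustration of the general techniques, not of how Causal Wizard implements its bootstrap):

```python
# A minimal sketch of k-fold cross-validation, plus bootstrap resampling
# as a related way to gauge how stable a fitted estimate is.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat.
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Bootstrap resampling: refit on resampled datasets to see how much the
# fitted coefficients (and hence any downstream estimate) vary.
boot_coefs = []
for _ in range(100):
    Xb, yb = resample(X, y)
    boot_coefs.append(LogisticRegression(max_iter=1000).fit(Xb, yb).coef_[0, 0])
print("Bootstrap SD of first coefficient: %.3f" % np.std(boot_coefs))
```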
Early stopping involves stopping the training process when the validation accuracy stops improving, rather than continuing until the training accuracy reaches 100%. These techniques help prevent overfitting and improve the model's ability to generalize to new data.
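For illustration only, here is one way early stopping can look in scikit-learn: a gradient boosting classifier holds out an internal validation set and stops adding boosting rounds once the validation score stops improving (the dataset and parameter values below are placeholders):

```python
# A minimal sketch of early stopping with an internal validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper limit on boosting rounds
    validation_fraction=0.1,   # 10% of the training data held out internally
    n_iter_no_change=10,       # stop after 10 rounds with no validation improvement
    random_state=0,
)
model.fit(X, y)

# Training stops well before the 500-round limit if the validation score plateaus.
print("Rounds actually used:", model.n_estimators_)
```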
The figure above presents an intuition for the balance between over- and under-fitting in a classification task (red vs. blue). The green line depicts the model's decision boundary. An ideal model will perform well both on your training data and on validation data, giving you the best chance of it also performing well in the real world. Without a validation regime, it's impossible to get a sense of whether a model is overfitted or underfitted, and therefore impossible to predict real-world performance. Overpowered models trained on too few samples will tend to exhibit over-fitting. Regularization smooths the decision boundary, helping to reduce over-fitting and excess variance.