Class Imbalance


Class imbalance means that your classes (also called cohorts or groups) differ greatly in size: one group is much larger than the other.

In machine learning and statistics, a class is a group of samples that share the same category, label, or group identity. Similarly, in scientific experiments, a group of subjects is sometimes called a cohort.

In Causal Wizard, we are interested in comparing two groups or cohorts: 

  • The control group
  • The treated group

The treatment variable determines which group each sample belongs to. We then compare the outcomes of the two groups.

Causal Wizard will refuse to proceed if either of these groups is empty (i.e. no samples are classified as Controls, or none are classified as Treated).

If one group outnumbers the other by more than 5 to 1 (that is, the ratio of Controls to Treated is above 5:1 or below 1:5), Causal Wizard will warn you that the classes are imbalanced.
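
The two checks described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Causal Wizard's actual code: the function name `check_class_balance`, the 0/1 encoding of the treatment (0 = Control, 1 = Treated) and the `max_ratio` parameter are all assumptions made for the example.

```python
from collections import Counter


def check_class_balance(treatment, max_ratio=5.0):
    """Count Controls (0) and Treated (1) and flag imbalance.

    Hypothetical helper: assumes `treatment` is an iterable of 0/1 values.
    """
    counts = Counter(int(bool(t)) for t in treatment)
    n_control, n_treated = counts[0], counts[1]

    # An empty group makes any comparison impossible, so stop here.
    if n_control == 0 or n_treated == 0:
        raise ValueError("Both the control and treated groups need samples.")

    # Size of the larger group relative to the smaller one (always >= 1).
    ratio = max(n_control, n_treated) / min(n_control, n_treated)

    return {
        "n_control": n_control,
        "n_treated": n_treated,
        "ratio": ratio,
        # Strictly greater than the threshold counts as imbalanced.
        "imbalanced": ratio > max_ratio,
    }


if __name__ == "__main__":
    # 22 Controls and 2 Treated: the larger group is 11 times the smaller.
    report = check_class_balance([0] * 22 + [1] * 2)
    print(report)
```

Taking the larger count over the smaller one means a single threshold covers both directions (5:1 and 1:5).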

Why worry about imbalanced data? Statistical models tend to fit the most common class best, whichever that is, and model the rare class less accurately. This means your results are likely to be less accurate.

If both classes have thousands of samples, this may be less of a concern. But if your dataset is small, and one of the classes is extremely small, this problem may be severe. It is left to your judgement how severe the impact may be in your case.
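
To get a feel for how group sizes affect precision, the sketch below computes the standard error of a simple difference in group means. It is a simplified illustration under stated assumptions (independent samples, the same outcome standard deviation `sigma` in both groups, and a plain difference-in-means estimate); it is not the model Causal Wizard fits, and the function name `se_diff_in_means` is made up for the example.

```python
import math


def se_diff_in_means(n_control, n_treated, sigma=1.0):
    """Standard error of (mean of treated - mean of control).

    Assumes independent samples and a common standard deviation `sigma`.
    """
    return sigma * math.sqrt(1.0 / n_control + 1.0 / n_treated)


if __name__ == "__main__":
    # Same total of 120 samples, split evenly or 5:1.
    print(f"{se_diff_in_means(60, 60):.3f}")      # 0.183
    print(f"{se_diff_in_means(100, 20):.3f}")     # 0.245
    # Same 5:1 split, but with thousands of samples.
    print(f"{se_diff_in_means(5000, 1000):.3f}")  # 0.035
```

Under these assumptions the smaller group dominates the uncertainty: a 5:1 split of a small dataset is noticeably noisier than an even split, while the same split of a large dataset still gives a small standard error.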

Possible remedies include:

  • Obtaining and using more data. This is the best approach, if possible.
  • If you already have a lot of samples, reducing the number of samples in the most populous (majority) class, whichever that is. Be careful not to introduce bias when resampling.
  • Imputing (synthetically generating) additional samples in the rare (minority) class. This is tricky, because the statistics of the imputed samples are likely to be biased.
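
As an illustration of the second remedy, here is a minimal sketch of random undersampling of the majority class. It is one possible approach, not a Causal Wizard feature: the function name `undersample_majority`, the 0/1 treatment encoding and the `max_ratio` and `seed` parameters are assumptions for the example. Dropping majority samples uniformly at random is meant to avoid favouring any particular kind of sample, but it still discards data.

```python
import random


def undersample_majority(samples, treatment, max_ratio=5.0, seed=0):
    """Randomly drop majority-class samples until the ratio is <= max_ratio.

    Hypothetical helper: `samples` and `treatment` are parallel lists, and
    `treatment` holds 0 (Control) or 1 (Treated) for each sample.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    control = [i for i, t in enumerate(treatment) if not t]
    treated = [i for i, t in enumerate(treatment) if t]
    if not control or not treated:
        raise ValueError("Both the control and treated groups need samples.")

    # Work out which group is the minority and which is the majority.
    minority, majority = sorted([control, treated], key=len)

    # Keep every minority sample and at most max_ratio times as many
    # majority samples, chosen uniformly at random without replacement.
    n_keep = min(len(majority), int(max_ratio * len(minority)))
    kept_majority = rng.sample(majority, n_keep)

    keep = sorted(minority + kept_majority)  # preserve original order
    return [samples[i] for i in keep], [treatment[i] for i in keep]


if __name__ == "__main__":
    samples = list(range(24))
    treatment = [1] * 2 + [0] * 22  # 2 Treated, 22 Controls
    new_samples, new_treatment = undersample_majority(samples, treatment)
    print(len(new_samples), sum(new_treatment))
```

With 2 Treated and 22 Controls, all Treated samples are kept and the Controls are cut to 10, giving a 5:1 ratio.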

Read more about how to handle class imbalance here