Does a training program increase participants' wages?
The study explores whether a training program increases participants' wages, given other confounding factors. Later updates to this study added new participants, and there was some controversy over whether the right participants were included to draw strong conclusions. But we're not interested in the final result here; we want to show how you can explore counterfactual outcomes using Causal Wizard.
Remember, a counterfactual outcome is one which didn't happen, but might have. Causal Wizard allows you to understand what would have happened, if different Treatment statuses (Treated, or Control) were applied to various sample sub-groups. Ok, let's get started.
Before we go any further, let's examine the data. To do this, click the Data tab. You'll see the variables available in the data file, and a sample of the data to help understand it. In particular, look at the variables Training and Wage_1978 and the values they contain. We can see that the Training variable (or column - same thing) contains Bool or Boolean values which are True or False, and that the Wage_1978 column contains Real (continuous numerical) values:
In this case study our treatment is the variable Training (a boolean value, True if the participant completed the training program, and False otherwise).
The Outcome we're interested in is Wage_1978, 3 years after completing the training program.
Set the Treatment and Outcome variables as described above.
Now let's start to build our Causal Diagram by adding the Treatment (Training) and Outcome (Wage_1978). Then add an edge between them, expressing that Treatment directly affects the Outcome:
A new feature of Causal Wizard is the ability to specify whether a variable should be interpreted as Categorical or Numerical. To change the variable type, click the variable in the Causal Diagram. In this case, click Training. A modal dialog will appear. Change "Interpret values as" to switch between Categorical and Numerical type.
In the Causal Diagram, each variable has an icon to represent its data type. A bar chart indicates categorical data, and a smooth curve indicates numerical data.
A boolean value can be interpreted as either Categorical or Numerical. If numerical, False = 0 and True = 1.
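To make the two interpretations concrete, here is a minimal Python sketch. The `training` list is made up for illustration and is not taken from the study data; the sketch only mirrors the False = 0, True = 1 mapping described above and says nothing about how Causal Wizard stores values internally.

```python
# Hypothetical Training values, for illustration only (not the study data).
training = [True, False, True, True]

# Numerical interpretation: False becomes 0 and True becomes 1.
as_numerical = [int(value) for value in training]

# Categorical interpretation: each value is treated as a label.
as_categorical = [str(value) for value in training]

print(as_numerical)    # [1, 0, 1, 1]
print(as_categorical)  # ['True', 'False', 'True', 'True']
```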
Next, click the Define groups button to specify which participants are Treated and Controls. Our Treatment variable data values are either False (0) or True (1). Boolean values can be represented as Numerical or Categorical, so there are two ways we could define the same separation of samples.
Numerical values are separated with a Threshold value.
Categorical values are separated via a set of user-defined values for each group.
Type each value in the relevant text input box and press the Add button. If you want to specify "Anything else", press that button instead. In the following example, the groups are defined as Control = False and Treated = True. "Anything else" can't be combined with other text values.
Experiment with the editor to understand how to define our two groups (Treated - Training is True, and Control - Training is False). Note that when using the Tags editor, you must enter True and False with a leading capital letter.
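If it helps to see the two kinds of group definition side by side, the sketch below expresses both in plain Python. Everything in it is hypothetical: the function names, the `ANYTHING_ELSE` marker, and the choice that values above the threshold count as Treated are assumptions made for illustration, not a description of Causal Wizard's internals.

```python
ANYTHING_ELSE = object()  # hypothetical stand-in for the "Anything else" button


def groups_by_threshold(values, threshold):
    # Assumption: values above the threshold are Treated, the rest are Control.
    return ["Treated" if v > threshold else "Control" for v in values]


def groups_by_category(values, treated, control):
    # treated / control are each a set of labels, or ANYTHING_ELSE on its own.
    labels = []
    for v in values:
        if treated is not ANYTHING_ELSE and v in treated:
            labels.append("Treated")
        elif control is not ANYTHING_ELSE and v in control:
            labels.append("Control")
        elif treated is ANYTHING_ELSE:
            labels.append("Treated")
        elif control is ANYTHING_ELSE:
            labels.append("Control")
        else:
            labels.append(None)  # value matches neither group
    return labels


# Made-up Training values, not the study data.
training = [True, False, False, True]

# Numerical view: False = 0, True = 1, separated by a threshold of 0.5.
by_threshold = groups_by_threshold([int(v) for v in training], 0.5)

# Categorical view: labels typed with a leading capital letter.
labels = [str(v) for v in training]
by_category = groups_by_category(labels, treated={"True"}, control={"False"})
by_anything_else = groups_by_category(labels, treated={"True"}, control=ANYTHING_ELSE)

print(by_threshold)
print(by_category == by_threshold, by_anything_else == by_threshold)
```

All three definitions split these samples the same way, which is the point: for a boolean Treatment, the Threshold and Categorical routes are interchangeable.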
We should add some additional confounding variables to complete the Causal Diagram. Note that Wage_1974 affects both Training and Wage_1978. This is an important confounder, and it illustrates how initial values for a later Outcome can be included in a Study, when the values vary over time.
Just for variety (not reflecting the original study) the Age variable only affects Wage, not whether the participant is given Training.
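One way to picture the finished diagram is as a list of cause-to-effect edges. The sketch below is only an illustration: it assumes that "Wage" in the Age edge means Wage_1978, and the helper `direct_common_causes` is a hypothetical, simplified check that looks only for variables pointing directly at both Treatment and Outcome. A full backdoor analysis considers whole paths, not just direct parents.

```python
# Edges of the Causal Diagram as (cause, effect) pairs.
edges = [
    ("Training", "Wage_1978"),   # Treatment directly affects the Outcome
    ("Wage_1974", "Training"),   # earlier wage affects who gets Training
    ("Wage_1974", "Wage_1978"),  # earlier wage affects later wage
    ("Age", "Wage_1978"),        # assumption: the text says only "Wage"
]


def direct_common_causes(edges, treatment, outcome):
    # Collect the direct parents of every variable in the diagram.
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    # A direct confounder is a parent of both the treatment and the outcome.
    return parents.get(treatment, set()) & parents.get(outcome, set())


confounders = direct_common_causes(edges, "Training", "Wage_1978")
print(confounders)  # {'Wage_1974'}
```

Age does not appear in the result because, in this diagram, it has no edge into Training.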
Press the Check button to confirm the Study is set up completely. If successful, you should be presented with a choice of models. Select Backdoor: Linear Regression because this includes Counterfactual analysis. Click Calculate to begin analysis.
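For intuition about what a backdoor adjustment with linear regression does, here is a rough sketch using NumPy on synthetic data. Nothing in it comes from the study: the sample size, the wage formula, the noise level and the `true_effect` of 1500 are all invented, and the code is not a description of how Causal Wizard computes its results. It simply regresses the Outcome on the Treatment plus the confounder and reads off the Treatment coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_effect = 1500.0  # invented effect of Training on Wage_1978

# Synthetic confounder: earlier wage, which also drives who gets Training.
wage_1974 = rng.uniform(0, 10_000, size=n)
p_training = np.where(wage_1974 < 5_000, 0.8, 0.2)
training = (rng.uniform(size=n) < p_training).astype(float)

# Synthetic outcome: linear in the confounder and the treatment, plus noise.
wage_1978 = 1_000 + 0.8 * wage_1974 + true_effect * training + rng.normal(0, 500, size=n)

# Naive comparison ignores the confounder, so it is biased here.
naive = wage_1978[training == 1].mean() - wage_1978[training == 0].mean()

# Backdoor adjustment: include the confounder as a regressor.
X = np.column_stack([np.ones(n), training, wage_1974])
coef, *_ = np.linalg.lstsq(X, wage_1978, rcond=None)
adjusted = coef[1]  # coefficient on Training

print(f"naive difference in means: {naive:.0f}")
print(f"adjusted estimate:         {adjusted:.0f}")
```

In this made-up setup, lower earners are more likely to be trained, so the naive difference understates the effect, while the adjusted estimate should land close to `true_effect`.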
For now, skip the other results and scroll down to the Counterfactual Outcomes table. Let's examine the results to see what they say and what insights we can find.
The table has four columns: Scenario, Samples, Sum and Mean outcomes:
Each row represents the outcome of a different scenario. In the image above, a red box highlights the scenario "If all samples were treated" i.e. in this study, if all participants received the training program. Scenarios can vary in the number of participants, for example, some scenarios include only Control or Treated samples. Therefore, the Samples column tells you how many samples are involved.
The Sum and Mean columns are summary statistics of each scenario. We can quickly sanity-check that these numbers make sense, for each scenario. Let's focus on Mean outcomes.
First, we can see that the Mean outcome for all actual controls is approx. 4500 and the Mean outcome for all actual treated is approx. 6300. So we might expect the mean outcomes of all the other scenarios to lie between these values.
Sure enough, we can see that if all samples were controls the mean outcome would be 4600 (slightly higher than actual controls) and if all samples were treated, the mean outcome would be 6200 - slightly lower than the actual treated outcomes.
Finally, two additional scenarios swap treatment status for Control and Treated groups separately. These values also lie within the expected ranges.
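To see how a table like this could be assembled, here is a small sketch with four made-up samples. It assumes a linear model in which switching Treatment status shifts an individual's outcome by a fixed amount, `effect`; that value, the outcomes and the scenario names are hypothetical, and this is not necessarily how Causal Wizard computes its table.

```python
# Made-up samples: (observed outcome, was the sample actually treated?)
samples = [(4000.0, False), (5000.0, False), (6000.0, True), (7000.0, True)]
effect = 1000.0  # hypothetical estimated effect of Treatment on the Outcome


def counterfactual(outcome, was_treated, set_treated):
    # Shift the observed outcome by the effect if the treatment status changes.
    return outcome + effect * (int(set_treated) - int(was_treated))


def summarise(scenario, values):
    return {"Scenario": scenario, "Samples": len(values),
            "Sum": sum(values), "Mean": sum(values) / len(values)}


controls = [s for s in samples if not s[1]]
treated = [s for s in samples if s[1]]

table = [
    summarise("Actual controls", [y for y, _ in controls]),
    summarise("Actual treated", [y for y, _ in treated]),
    summarise("If all samples were controls", [counterfactual(y, t, False) for y, t in samples]),
    summarise("If all samples were treated", [counterfactual(y, t, True) for y, t in samples]),
    summarise("If actual controls were treated", [counterfactual(y, t, True) for y, t in controls]),
    summarise("If actual treated were controls", [counterfactual(y, t, False) for y, t in treated]),
]

for row in table:
    print(row)
```

With these numbers the actual group means are 4500 and 6500, and every counterfactual mean (5000, 6000, 5500, 5500) falls between them, which is the same sanity check applied to the real table above.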
You can use counterfactual analysis on your own data in the same way, to estimate the effect of treatment on untreated samples, and vice versa.