A definitive guide to Refutation and Bootstrap-based statistical significance testing in DoWhy
This article was motivated by our need to fully understand the refutation tests in DoWhy, a popular Python library for Causal effect-size estimation. In particular, we wanted to understand what it means when a specific test “passes” or “fails”, and what that implies for the validity of the estimated causal effects.
To understand the refutation tests it’s essential to first understand the principle of the Bootstrap method, on which they’re based.
In Statistics, the Bootstrap method is the name for any process which repeatedly produces new datasets from an original dataset by randomly drawing samples from it.
Importantly, with Bootstrap, samples are drawn with replacement, meaning that the same sample-unit (i.e. individual participant, observation, or individual record) can be picked more than once. This is different to e.g. cross-validation, which divides a dataset into several parts, with each sample-unit occurring exactly once, in only one part.
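To make this concrete, here is a minimal sketch of drawing one Bootstrap resample with pandas; the toy column names are invented purely for illustration:

```python
import pandas as pd

# A toy "original dataset" of 5 sample-units (rows).
original = pd.DataFrame({"unit": [1, 2, 3, 4, 5],
                         "outcome": [2.3, 1.1, 4.0, 3.2, 0.7]})

# One Bootstrap resample: same size as the original, drawn WITH replacement,
# so the same row can appear more than once and some rows may not appear at all.
resample = original.sample(n=len(original), replace=True, random_state=0)
print(resample)
```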
The Bootstrap method is often used to explore the statistics of the resampled datasets and the performance of predictive models built from them, especially the stability of that performance, all starting from a single original dataset. It is often used in a retrospective, observational setting.
The Bootstrap method is popular because it is conceptually simple and makes few assumptions about data, methods and models.
However, one key assumption is that the original dataset is a good approximation of the real world data it was drawn from.
Given this assumption, we can characterise how our statistics and models would behave in new, unseen data, if it were drawn from the same original distribution.
For example, after building a set of predictive models from bootstrapped datasets, we can measure the mean and variance of model accuracy on these datasets, and thereby predict how our model will perform on unseen data from the real world. This is why the Bootstrap technique is often used to explore how machine-learning models will generalize to unseen data.
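As a sketch of that workflow, the following uses scikit-learn with a synthetic dataset as a stand-in for real data; the 200 replicates and the out-of-bag evaluation are illustrative choices made here, not something the Bootstrap method prescribes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for "the original dataset"; in practice this is your own data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

rng = np.random.default_rng(0)
accuracies = []
for _ in range(200):                                  # 200 Bootstrap datasets
    idx = rng.integers(0, len(y), size=len(y))        # sample row indices WITH replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)        # "out-of-bag" rows not drawn this round
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))    # evaluate on the rows the model never saw

print(f"mean accuracy: {np.mean(accuracies):.3f}, variance: {np.var(accuracies):.5f}")
```

The spread of these accuracies gives a sense of how stable the model's performance is likely to be on new data drawn from the same distribution.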
Bootstrap is a Frequentist method, but if you’re keen on a side-quest, there is a Bayesian Bootstrap!
DoWhy offers a wide range of refutation and validation tests, but we will cover three of the most popular. The three were chosen because they make few assumptions about the model and data, and are therefore widely applicable.
The refutation methods in DoWhy broadly fall into two categories:
All the tests involve some key concepts:
In all the refutation and sensitivity tests, we use Bootstrap to approximate how often the null distribution produces values at least as extreme as our test value. This frequency is interpreted as a probability, and provides a p-value for our null hypothesis. Therefore, a very small p-value (typically < 0.05) implies a small probability that the test-value is in the null distribution, and we should accept the alternative hypothesis.
Conversely, a large p-value (typically ≥ 0.05) implies a large probability that the test-value is in the null distribution, and we should therefore accept (strictly speaking, fail to reject) the null hypothesis. Simple!
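A minimal sketch of that calculation, with an invented function name and an artificial Normal null distribution purely for illustration (DoWhy's own implementation differs in its details):

```python
import numpy as np

def bootstrap_p_value(test_value, null_estimates):
    """Two-sided empirical p-value: the fraction of null-distribution estimates
    at least as extreme (in absolute magnitude) as the test value."""
    null_estimates = np.asarray(null_estimates)
    return np.mean(np.abs(null_estimates) >= abs(test_value))

# Hypothetical numbers: a test value of 2.1 against 1,000 null-distribution estimates.
rng = np.random.default_rng(0)
null = rng.normal(loc=0.0, scale=1.0, size=1000)
print(bootstrap_p_value(2.1, null))  # small p-value -> accept the alternative hypothesis
```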
A Bootstrap-derived test is used to generate a p-value for the causal effect estimate from DoWhy. What does this p-value actually mean? According to the source code, the method is based on “The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R” by Rousselet, Pernet, and Wilcox (2021).
In the figure above, plot A shows the distribution of causal effect estimates we would expect to observe given many Bootstrap datasets drawn from our original dataset. Some estimates are higher than our original, and some are lower. Our estimate is quite far from zero, and we rarely observe an estimate of zero. Things are looking good for our confirmation of a causal effect!
Plot B shows the null distribution used in the DoWhy statistical significance test. The distribution is now centered on zero, because the outcomes in our Bootstrap dataset are randomly permuted (shuffled) — this should destroy any causal effect, producing a causal-effect-estimate of zero! Of course, due to artefacts in the data, the estimates in the null distribution will not be exactly zero.
So how does the test determine whether our original estimate was significant?
If the original estimate (our test-value) has greater absolute magnitude than 95% of our null distribution estimates, the p-value returned by DoWhy will be < 0.05, and we can say that our result is significant, and accept our alternative hypothesis that the true causal effect is not zero.
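In DoWhy itself, the significance test can be requested when estimating the effect. The sketch below uses DoWhy's synthetic linear_dataset helper as a stand-in for real data; exact behaviour and output format may vary between versions, so treat it as a starting point rather than canonical usage:

```python
import dowhy.datasets
from dowhy import CausalModel

# Synthetic data with a known causal effect (beta=10), used only for illustration.
data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                     num_samples=1000, treatment_is_binary=True)

model = CausalModel(data=data["df"],
                    treatment=data["treatment_name"],
                    outcome=data["outcome_name"],
                    graph=data["gml_graph"])

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Request the significance test alongside the estimate; the printed estimate
# should then include the Bootstrap-derived p-value described above.
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching",
                                 test_significance=True)
print(estimate)
```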
What happens when the original estimate is not significant? Plot C (below) shows this scenario:
In this scenario, the original estimate (our test-value) is not more extreme than 95% of our null-distribution estimates (which were created with random outcomes, and therefore should have no causal effect). Therefore, it is quite plausible that we would have obtained this estimate even if there were no real causal effect, and we accept our null hypothesis that the test-value is in the null distribution; the result does not support the conclusion that the true causal effect is nonzero.
Note that although in these illustrations a roughly Normal distribution is shown, you should not expect any particular distribution in your data. The Normal distribution is purely for illustration purposes.
A common-cause is a variable which affects both the Treatment and Outcome variables; it therefore has the potential to change your estimate, which in DoWhy is the causal-effect of Treatment on Outcome.
An unobserved common-cause is such a variable which is not captured in your data, but whose invisible influence nevertheless disrupts your estimate of the causal effect.
The Random Common Cause (RCC) refutation test aims to address this possible criticism, by simulating the effect of a random, unobserved common cause and measuring how much this affects causal estimates. These RCC-affected estimates form our null distribution:
Using the figure above, we can grasp the intuition behind this test. We expect our null distribution of causal-effect-estimates to be stable despite the added RCC variable.
However, this means our interpretation of the results is reversed compared to the significance / permuted-outcomes test described earlier. Let me explain.
If the null distribution is not affected by the RCC, the original estimate is likely to be in the middle of the null distribution. This means the p-value will be large, and the result will not be significant. This supports the validity of our original causal-effect estimate.
By contrast, if the null distribution is affected by the RCC, the original estimate will be comparatively more extreme; the p-value may be small; the result could be significant. This refutes the validity of our original causal-effect estimate.
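A sketch of running this refuter in DoWhy is shown below. The setup repeats the earlier significance-test sketch so that the snippet is self-contained, and the synthetic dataset is again only a stand-in for real data:

```python
import dowhy.datasets
from dowhy import CausalModel

data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                     num_samples=1000, treatment_is_binary=True)
model = CausalModel(data=data["df"], treatment=data["treatment_name"],
                    outcome=data["outcome_name"], graph=data["gml_graph"])
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")

# Add an independent random variable as a common cause and re-estimate.
# Remember: a LARGE p-value here SUPPORTS the original estimate,
# because the interpretation is reversed.
refutation = model.refute_estimate(identified_estimand, estimate,
                                    method_name="random_common_cause")
print(refutation)
```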
A placebo treatment is a “fake” substitute for the real Treatment, which presumably has no causal effect (other than the placebo effect, of course!)
The Placebo Treatment refuter verifies that if you replace your real Treatment with a random variable, the causal effect disappears.
To implement this test we use zero as our test-value, not the original effect estimate.
The Bootstrap method is then used to generate a null distribution of causal effect estimates from data in which the treatment values have been randomly permuted or replaced with random values.
As with the RCC test, our interpretation of the results is reversed compared to the significance / permuted-outcomes test described earlier.
Next, we check whether the test-value zero lies within the null distribution. If it does, the p-value will be large, the result not significant, and the validity of the original causal effect estimate is supported.
If zero is extreme compared to our null distribution of causal effects, then the p-value will be small and the result may be significant. This does not support our original causal effect analysis:
Failing the Placebo-Treatment refuter suggests a methodological or programming error, data-leakage, or data which easily allows a falsely non-zero causal effect to be generated. You should definitely investigate if this happens.
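A corresponding sketch for the Placebo Treatment refuter, with the same caveats as the earlier DoWhy snippets (synthetic data, version-dependent output):

```python
import dowhy.datasets
from dowhy import CausalModel

data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                     num_samples=1000, treatment_is_binary=True)
model = CausalModel(data=data["df"], treatment=data["treatment_name"],
                    outcome=data["outcome_name"], graph=data["gml_graph"])
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")

# Replace the real treatment with a permuted (placebo) treatment and re-estimate.
# If the original analysis is sound, the new effect should be close to zero
# and the reported p-value should be large.
refutation = model.refute_estimate(identified_estimand, estimate,
                                    method_name="placebo_treatment_refuter",
                                    placebo_type="permute")
print(refutation)
```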
DoWhy may return a p-value range or interval, e.g. (0, 1], rather than a scalar (single value). This occurs when the test value is more extreme than all values in the null distribution. In this situation, we can’t quantify the p-value exactly; we can only say it’s more extreme than we can measure with the available samples. (With N estimates in the null distribution, the smallest p-value that can be resolved is roughly 1/N, so a test value more extreme than every null estimate can only be reported as a bound.) DoWhy returns an interval to indicate this has occurred.
The interval may be “half-open”, with mismatched brackets. The type of bracket indicates whether a range value is inclusive or exclusive. For example, (0,1] means greater than 0 and less than or equal to 1.
What is refutation, and why would you want to do it? If you’re from a machine-learning background, you may not have encountered refutation before.
Whereas validation more broadly seeks to estimate model performance on unseen data, refutation seeks to do this by modelling the results of specific, defined scenarios. Each refutation scenario attempts to “disprove” a potential alternative “explanation” of the original estimate.
Why would some researchers prefer refutation over other types of validation? Generally, ML folks have some fairly strong and narrow presumptions about their objectives:
Contrast this to many other users of Causal models, such as researchers or policy-makers in epidemiology or econometrics, who:
While both groups seek to use Causal methods to make statistical models which are better able to handle interventions or other changes, they have different objectives and use-cases.