Refutation and Significance testing in DoWhy

Categories: Statistics, Causal Effect, Causal Inference

A definitive guide to Refutation and Bootstrap-based statistical significance testing in DoWhy

This article was motivated by our need to fully understand the refutation tests in DoWhy, a popular Python library for Causal effect-size estimation. In particular, we wanted to understand what it means when a specific test “passes” or “fails”, and what that implies for the validity of the estimated causal effects.

To understand the refutation tests it’s essential to first understand the principle of the Bootstrap method, on which they’re based.

What is the Bootstrap?

In Statistics, the Bootstrap method is the name for any procedure that repeatedly produces new datasets by randomly drawing samples from an original dataset.

Importantly, with Bootstrap, samples are drawn with replacement, meaning that the same sample-unit (i.e. individual participant, observation, or individual record) can be picked more than once. This is different to e.g. cross-validation, which divides a dataset into several parts, with each sample-unit occurring exactly once, in only one part.
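As a minimal sketch of drawing one bootstrap resample (the dataset and column names here are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical dataset: one row per sample-unit (e.g. participant or record)
df = pd.DataFrame({
    "treatment": [0, 1, 0, 1, 1, 0],
    "outcome":   [2.1, 3.4, 1.9, 3.8, 3.1, 2.0],
})

# One bootstrap resample: same size as the original, drawn with replacement,
# so the same row can appear more than once while other rows are left out
boot_df = df.sample(n=len(df), replace=True, random_state=0)
print(boot_df)
```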

Why use the Bootstrap?

The Bootstrap method is often used to explore how statistics computed from a single dataset, and the performance of predictive models built from it, vary across resamples, and in particular how stable model performance is. It is typically applied in a retrospective, observational setting.

The Bootstrap method is popular because it is conceptually simple and makes few assumptions about data, methods and models.

However, one key assumption is that the original dataset is a good approximation of the real world data it was drawn from. 

Given this assumption, we can characterise how our statistics and models would behave in new, unseen data, if it were drawn from the same original distribution.

For example, after building a set of predictive models from bootstrapped datasets, we can measure the mean and variance of model accuracy on these datasets, and thereby predict how our model will perform on unseen data from the real world. This is why the Bootstrap technique is often used to explore how machine-learning models will generalize to unseen data.
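A short sketch of this idea, assuming a scikit-learn classifier and a synthetic dataset standing in for your own data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

rng = np.random.RandomState(0)
accuracies = []
for _ in range(200):
    # Bootstrap indices: sampled with replacement from the original rows
    idx = rng.randint(0, len(X), size=len(X))
    # Rows never picked ("out-of-bag") act as a stand-in for unseen data
    oob = np.setdiff1d(np.arange(len(X)), idx)
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))

print(f"accuracy: mean={np.mean(accuracies):.3f}, std={np.std(accuracies):.3f}")
```

The spread of the bootstrap accuracies gives a rough picture of how stable the model’s performance is likely to be on new data drawn from the same distribution.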

Bootstrap is a Frequentist method, but if you’re keen on a side-quest, there is a Bayesian Bootstrap!

Using the Bootstrap method to validate and refute Causal models

DoWhy offers a wide range of refutation and validation tests, but we will cover three of the most popular. The three were chosen because they make few assumptions about the model and data, and are therefore widely applicable.

The refutation methods in DoWhy broadly fall into two categories:

  • Sensitivity tests (i.e. how stable the results are in response to changes in the sampled data).
  • Refutations based on synthetic negative controls, which are expected to nullify the result. If they don’t, something is wrong with the model or data.

Refutation methods supported by DoWhy

All the tests involve some key concepts:

  • A test-value: a statistical quantity, typically the estimated causal effect, which is the key result you wish to obtain using DoWhy.
  • A null distribution: a distribution of the same quantity, generated using the Bootstrap method under specific conditions which vary from test to test (more details are provided with each example below).
  • A null hypothesis: the test-value is in the null distribution, i.e. it could plausibly have been drawn from it.
  • An alternative hypothesis: the test-value does not fall in the null distribution, i.e. it is more extreme than we would expect under the null conditions.

p-values

In all the refutation and sensitivity tests, we use the Bootstrap to approximate how often the null distribution produces values at least as extreme as the test-value. This frequency is interpreted as a probability, and provides a p-value for our null hypothesis. A very small p-value (typically < 0.05) therefore implies that the test-value is very unlikely to be in the null distribution, so we reject the null hypothesis and accept the alternative.

Conversely, a large p-value (typically ≥ 0.05) implies that the test-value is entirely plausible under the null distribution, and we therefore retain (fail to reject) the null hypothesis. Simple!
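As a rough sketch of the calculation (a simplified two-sided version, not DoWhy’s exact implementation), a bootstrap p-value can be computed from a null distribution like this:

```python
import numpy as np

def bootstrap_p_value(test_value, null_estimates):
    """Two-sided p-value: the fraction of null-distribution estimates
    that are at least as extreme as the test value."""
    null_estimates = np.asarray(null_estimates)
    more_extreme = np.sum(np.abs(null_estimates) >= abs(test_value))
    return more_extreme / len(null_estimates)

# Hypothetical example: a null distribution centred on zero,
# and an original causal-effect estimate of 1.8
rng = np.random.default_rng(0)
null_estimates = rng.normal(loc=0.0, scale=0.5, size=1000)
print(bootstrap_p_value(1.8, null_estimates))
```

If no null estimate is as extreme as the test-value, the fraction above is zero, and the best we can say is that the p-value is smaller than we can resolve with the available resamples (see Appendix I).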

DoWhy Statistical Significance test (Permuted outcomes)

A Bootstrap-derived test is used to generate a p-value for the causal effect estimate from DoWhy. What does this p-value actually mean? According to the source code, the method is based on “The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R” by Rousselet, Pernet, and Wilcox (2021).
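For context, requesting this test in DoWhy looks roughly like the following sketch, which uses DoWhy’s built-in synthetic dataset; exact arguments and output formatting may vary between versions:

```python
import dowhy.datasets
from dowhy import CausalModel

# Synthetic dataset shipped with DoWhy, with a known true causal effect (beta)
data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=3, num_samples=1000, treatment_is_binary=True
)

model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

# test_significance=True triggers the bootstrap significance test;
# the printed estimate then includes a p-value
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True,
)
print(estimate)
```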

Illustration to help explain the intuition behind the statistical significance test. In plot A, the distribution of causal effect estimates on samples from the original data is shown (this is not actually calculated by DoWhy; this is just for illustration). Plot B shows the null distribution of samples created with permuted (random) outcomes. The original estimate is in the most extreme tails of the null distribution, which means it is unlikely to have been generated by the null distribution, in which there is no causal effect. The result is statistically significant, and we accept the alternative hypothesis that the causal effect is nonzero.

In the figure above, plot A shows the distribution of causal effect estimates we would expect to observe given many Bootstrap datasets drawn from our original dataset. Some estimates are higher than our original, and some are lower. Our estimate is quite far from zero, and we rarely observe an estimate of zero. Things are looking good for our confirmation of a causal effect!

Plot B shows the null distribution used in the DoWhy statistical significance test. The distribution is now centered on zero, because the outcomes in our Bootstrap dataset are randomly permuted (shuffled), which should destroy any causal effect and produce a causal-effect estimate of zero! Of course, due to random sampling variation, the individual estimates in the null distribution will not be exactly zero.
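To make the construction of plot B concrete, here is a simplified, self-contained sketch (not DoWhy’s code) that builds a permuted-outcomes null distribution for a plain difference-in-means effect estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a genuine treatment effect of about 2
n = 500
treatment = rng.integers(0, 2, size=n)
outcome = 2.0 * treatment + rng.normal(size=n)

def effect_estimate(t, y):
    # Difference in means as a simple stand-in for a causal-effect estimator
    return y[t == 1].mean() - y[t == 0].mean()

original_estimate = effect_estimate(treatment, outcome)

# Null distribution: shuffle the outcomes so any real effect is destroyed
null_estimates = np.array([
    effect_estimate(treatment, rng.permutation(outcome)) for _ in range(1000)
])

print(f"original estimate: {original_estimate:.2f}")
print(f"null distribution: mean={null_estimates.mean():.2f}, std={null_estimates.std():.2f}")
```

The original estimate (roughly 2) lies far out in the tails of the permuted null distribution (mean near zero, small spread), which is exactly the plot B situation described above.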

So how does the test determine whether our original estimate was significant?

If the original estimate (our test-value) has greater absolute magnitude than 95% of our null distribution estimates, the p-value returned by DoWhy will be < 0.05, and we can say that our result is significant, and accept our alternative hypothesis that the true causal effect is not zero.

What happens when the original estimate is not significant? Plot C (below) shows this scenario:

In this scenario, the original estimate (our test-value) is not more extreme than 95% of our null-distribution estimates (which were created with random outcomes, and therefore should reflect no causal effect). It is therefore quite plausible that we would have obtained this estimate even if there were no real causal effect, so we retain our null hypothesis that the test-value is in the null distribution; the result does not support the conclusion that the true causal effect is nonzero.

Note that although in these illustrations a roughly Normal distribution is shown, you should not expect any particular distribution in your data. The Normal distribution is purely for illustration purposes.

Random Common Cause Refutation test

A common cause is a variable which affects both the Treatment and the Outcome variables; it can therefore change your estimate of the causal effect of Treatment on Outcome, which is the quantity DoWhy computes.

An unobserved common cause is such a variable that is not captured in your data; its invisible influence can nonetheless bias your estimate of the causal effect.

The Random Common Cause (RCC) refutation test aims to address this possible criticism, by simulating the effect of a random, unobserved common cause and measuring how much this affects causal estimates. These RCC-affected estimates form our null distribution:

Effect of a Random Common Cause (RCC) on the null distribution of causal-effect estimates. Our original estimate (blue line) is the test value. Our intuition is that the causal effect estimate should not be significantly affected by the presence of an unobserved, random common cause. This means the test-value should fall in the null distribution with the RCC. If it doesn’t, it means the causal effect estimates were affected by the RCC and the original estimate is less likely to be valid.

Using the figure above, we can grasp the intuition behind this test. We expect our null distribution of causal-effect-estimates to be stable despite the added RCC variable. 

However, this means our interpretation of the results is reversed compared to the significance / permuted-outcomes test described earlier. Let me explain.

If the null distribution is not affected by the RCC, the original estimate is likely to be in the middle of the null distribution. This means the p-value will be large, and the result will not be significant. This supports the validity of our original causal-effect estimate.

By contrast, if the null distribution is affected by the RCC, the original estimate will be comparatively more extreme; the p-value may be small, and the result could be significant. This refutes the validity of our original causal-effect estimate.
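Continuing the DoWhy sketch from the significance-test section (reusing the hypothetical model, identified_estimand and estimate objects defined there), the refuter is invoked roughly like this:

```python
# Random Common Cause refuter: adds an independently generated random variable
# as a common cause and re-estimates the effect to build the null distribution
rcc_refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause",
)
# The printed result reports the original effect, the new effect, and a p-value
print(rcc_refutation)
```

Here a large p-value is reassuring: it means the estimate was not noticeably disturbed by the added random common cause.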

Placebo Treatment Refutation test

A placebo treatment is a “fake” substitute for the real Treatment, which presumably has no causal effect (other than the placebo effect, of course!)

The Placebo Treatment refuter verifies that if you replace your real Treatment with a random variable, the causal effect disappears.

To implement this test we use zero as our test-value, not the original effect estimate. 

The Bootstrap method is then used to generate a null distribution of causal effect estimates from data in which the treatment values have been randomly permuted (or replaced with random values).

As with the RCC test, our interpretation of the results is reversed compared to the significance / permuted-outcomes test described earlier.

Next, we check whether the test-value of zero is in the null distribution. If it is, the p-value will be large and the result not significant, and the validity of the original causal effect estimate is supported.

If zero is extreme compared to our null distribution of causal effects, then the p-value will be small and the result may be significant. This does not support our original causal effect analysis:

Placebo treatment refutation which, in this scenario, refutes our original causal effect estimate! Zero is not in the null distribution of causal-effect estimates, which were generated with random (placebo) treatments and should therefore be centred on zero. This is not expected, and means there may be a problem with the data or analysis.

Failing the Placebo-Treatment refuter suggests a methodological or program error, data-leakage, or data which easily allows a falsely non-zero causal effect to be generated. You should definitely investigate if this happens.
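Again building on the earlier DoWhy sketch, the placebo refuter is invoked roughly as follows; placebo_type="permute" shuffles the existing treatment values:

```python
# Placebo Treatment refuter: replaces the treatment with permuted (fake) values
# and re-estimates the effect; the new effect should be close to zero
placebo_refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute",
)
# The printed result reports the original effect, the new (placebo) effect, and a p-value
print(placebo_refutation)
```

As with the RCC test, a large p-value here supports the original analysis, because it means the test-value of zero sits comfortably inside the placebo null distribution.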

Appendix I: p-value range or interval

DoWhy may return a p-value range or interval, such as (0, 1], rather than a scalar (single value). This occurs when the test value is more extreme than all values in the null distribution. In this situation, we can’t quantify the p-value exactly; we can only say it is more extreme than we can measure with the available samples. DoWhy returns an interval to indicate this has occurred.

The interval may be “half-open”, with mismatched brackets. The type of bracket indicates whether the boundary value is inclusive (square bracket) or exclusive (round bracket). For example, (0, 1] means greater than 0 and less than or equal to 1.

Appendix II: The motivation for refutation

What is refutation, and why would you want to do it? If you’re from a machine-learning background, you may not have encountered refutation before.

Whereas validation more broadly seeks to estimate model performance on unseen data, refutation does this by modelling the results of specific, defined scenarios. Each refutation scenario attempts to “disprove” a potential alternative “explanation” of the original estimate.

Why would some researchers prefer refutation over other types of validation? Generally, ML folks have some fairly strong and narrow presumptions about their objectives:

  • You’re probably trying to create predictive rather than descriptive models
  • You are concerned about the model’s predictive performance on unseen data encountered in the wild, which you hope will be statistically similar to your training data
  • You are using validation to estimate and confirm this generalisation performance
  • You’re probably interested in Causality because you want your model to make good predictions under changing or specific conditions (e.g. after an intervention)

Contrast this to many other users of Causal models, such as researchers or policy-makers in epidemiology or econometrics, who:

  • Are often trying to create descriptive rather than predictive models, that quantify interactions between specific variables (although of course almost all descriptive models can be used for prediction)
  • Already have all the (typically historical, observational) data that the model will ever be applied to
  • Want to understand the uncertainty in any conclusions reached
  • Are interested in Causal models to be able to accurately predict the effects of interventions or other changes to the systems being studied

While both groups seek to use Causal methods to make statistical models which are better able to handle interventions or other changes, they have different objectives and use-cases.
