p-values (Significance testing)


A p-value is a statistical measure that helps researchers determine the strength of evidence against a null hypothesis.

A p-value is the result of a statistical significance test. It measures the probability of obtaining a result as extreme as, or more extreme than, the one observed, if the null hypothesis were true.

Another way to think about it is: "If nothing interesting were really going on, how often would I see results at least this exciting purely as a freak occurrence?" Note that this is not quite the same as the probability that the null hypothesis is true.

In experimental research, the null hypothesis usually assumes that there is no difference or effect between two or more groups being compared. The p-value is important because it tells us how likely it is that we would observe data at least as extreme as the data we collected (or the result of a statistical analysis) if the null hypothesis were true. A low p-value (usually <0.05) indicates that a result like the observed one would be unlikely if only chance were at work, and so provides evidence against the null hypothesis. In other words, your result is more plausibly explained by a real phenomenon than by chance.
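As a rough illustration of the definition, a p-value for a difference between two groups can be estimated with a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. This is only a sketch using the Python standard library; the function name `permutation_p_value` and the measurements are hypothetical, invented for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1
    # Adding one to both counts keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for illustration only.
control = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7]
treated = [5.6, 5.4, 5.9, 5.3, 5.7, 5.5, 5.8, 5.2]

p = permutation_p_value(control, treated)
print(f"Estimated p-value: {p:.4f}")
```

Because the two hypothetical groups barely overlap, very few shuffles reproduce a difference as large as the observed one, so the estimated p-value is small.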

What's so special about 0.05?

There's nothing special about 0.05. It's just a convention to use this threshold to determine significance. It's not even that strict: when there is truly no effect, a test at this threshold will still report a significant result about 1 time in 20 purely by luck!
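The 1-in-20 figure can be checked by simulation. The sketch below repeatedly compares two groups drawn from the same distribution, so the null hypothesis is true by construction, and counts how often the p-value still falls below 0.05. It assumes a simple two-sample z-test with a known standard deviation; the helper `z_test_p_value` and all settings are illustrative choices, not part of any particular tool.

```python
import math
import random


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided two-sample z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    standard_error = sigma * math.sqrt(1 / n_a + 1 / n_b)
    z = diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(42)
n_experiments = 5_000
false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution: there is no real effect.
    a = [rng.gauss(0, 1) for _ in range(30)]
    b = [rng.gauss(0, 1) for _ in range(30)]
    if z_test_p_value(a, b) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"Share of 'significant' results with no real effect: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05, which is exactly what the threshold means: it controls how often a test raises a false alarm when nothing is going on.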

Smaller p-values indicate stronger evidence against the null hypothesis. However, it is often easy to get very small p-values in machine learning, and they may be misleading: small confounding effects or methodological mistakes can easily invalidate any p-value.

P-values are often misused, which is another thing to watch out for. Sometimes researchers keep trying experiment after experiment until one reaches the magical 0.05 significance threshold. This is known as "p-hacking", and it often produces invalid results that seem legitimate because the undocumented experimentation is hidden.
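A short calculation shows why repeated attempts are a problem. Assuming each attempt is an independent test at the 0.05 threshold and there is no real effect, the chance that at least one attempt comes out "significant" grows quickly with the number of attempts. The function name below is hypothetical, and real repeated analyses are rarely fully independent, so treat this as a sketch.

```python
def chance_of_false_positive(n_attempts, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_attempts


for attempts in (1, 5, 10, 20):
    print(f"{attempts:>2} attempts: {chance_of_false_positive(attempts):.1%}")
```

Under these assumptions, ten attempts give roughly a 40% chance of at least one spurious significant result, and twenty attempts give roughly 64%.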

p-values and sample size

It is more difficult to get small p-values with a smaller sample size (especially <100), which means you might wrongly dismiss a real effect as non-significant if your dataset is too small. When interpreting p-values, you should consider the number of samples involved. Conversely, if you have 100,000 samples and struggle to obtain a significant result, the effect you are looking at is probably very small or nonexistent, or your experiment is confounded somehow.
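The link between sample size and p-value can be made concrete with a small calculation. The sketch below assumes a two-sample z-test with equal group sizes and a known standard deviation, and asks what p-value the same observed difference (measured in standard deviations) would produce at different sample sizes. The function name and the chosen effect sizes are illustrative assumptions.

```python
import math


def two_sided_p_value(effect_in_sd, n_per_group):
    """p-value of a two-sample z-test when the observed difference in means
    equals effect_in_sd standard deviations (known, equal variances assumed)."""
    z = effect_in_sd * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


# The same modest observed difference (0.2 standard deviations) at several sample sizes.
for n in (20, 100, 500, 2000):
    print(f"n per group = {n:>5}: p = {two_sided_p_value(0.2, n):.2g}")

# A tiny observed difference (0.01 standard deviations) with 100,000 samples per group.
print(f"n per group = 100000: p = {two_sided_p_value(0.01, 100_000):.2g}")
```

With the same observed difference, the result is not significant at 20 or 100 samples per group but is clearly significant at 500. At 100,000 samples per group, even a difference of 0.01 standard deviations falls below 0.05, which is why a significant p-value from a very large dataset says little about how large the effect is.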

In Causal Inference

In a causal inference context, the p-value can help assess whether an estimated causal relationship between two variables is distinguishable from chance. The null hypothesis would be the absence of any causal effect.

If the p-value is low, it suggests that the observed relationship between the variables is unlikely to be due to chance alone, which is consistent with a causal link between them. However, it is important to note that a low p-value alone does not prove causation, as other factors may be at play. Therefore, researchers must also consider confounding variables and the strength of the association between the variables before concluding that a relationship is causal.
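The simulation below sketches how a confounder can produce a very low p-value without any causal link. Two variables are generated so that neither influences the other, but both depend on a shared third variable. All names and numbers are hypothetical, and the p-value uses the Fisher z-transform as an approximation. The adjustment step simply regresses out the confounder, which is possible here only because the confounder is measured.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_p_value(xs, ys):
    """Approximate two-sided p-value for zero correlation (Fisher z-transform)."""
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(len(xs) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def residuals(values, covariate):
    """What is left of values after removing a least-squares line on covariate."""
    n = len(values)
    mean_v, mean_c = sum(values) / n, sum(covariate) / n
    slope = (sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
             / sum((c - mean_c) ** 2 for c in covariate))
    return [(v - mean_v) - slope * (c - mean_c) for v, c in zip(values, covariate)]


rng = random.Random(7)
n = 500
confounder = [rng.gauss(0, 1) for _ in range(n)]
# Neither variable affects the other; both depend only on the confounder.
exposure = [c + rng.gauss(0, 1) for c in confounder]
outcome = [c + rng.gauss(0, 1) for c in confounder]

p_naive = correlation_p_value(exposure, outcome)
p_adjusted = correlation_p_value(residuals(exposure, confounder),
                                 residuals(outcome, confounder))
print(f"Ignoring the confounder:     p = {p_naive:.2g}")
print(f"After adjusting for it:      p = {p_adjusted:.2g}")
```

The unadjusted p-value is extremely small even though there is no causal effect by construction, which illustrates why a low p-value has to be read alongside the study design.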

In Causal Wizard, p-values are provided for all Results, including validation and refutation tests.
