The Limitations of p-Values and the Case for Bayesian Hypothesis Testing
By Ranran Li · November 2019
Reviewed by: Shiying Wu & Chuanpeng Hu
I. What is a p-value and why is it problematic?
The p-value is defined as: "Given that the null hypothesis (H0) is true, the probability of observing results as extreme as, or more extreme than, the data we actually observed."
Popularized by the British geneticist and statistician Ronald Fisher in the 1920s, the p-value was meant to serve as a reference point for judging whether a result is noteworthy. Fisher recommended a significance threshold of α = 0.05 (roughly two standard deviations from the mean of a normal distribution). The reasoning runs: if the p-value is small, then either a rare event has occurred under H0, or H0 should be rejected in favor of an alternative. This logic underlies Null Hypothesis Significance Testing (NHST).
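To make the NHST recipe concrete, here is a minimal worked example (the numbers are hypothetical, not from the article): a one-sided binomial test of whether a coin is fair, computed from first principles with only the standard library.

```python
from math import comb

# Hypothetical data: a coin lands heads 60 times in 100 flips.
# Under H0 (fair coin, theta = 0.5), the one-sided p-value is
# P(X >= 60) for X ~ Binomial(100, 0.5): the probability of data
# "as extreme or more extreme" than what we observed.
n, k = 100, 60
p_value = sum(comb(n, x) for x in range(k, n + 1)) / 2**n
print(f"p = {p_value:.4f}")  # ~0.028, below Fisher's 0.05 threshold
```

Under the NHST logic above, this p-value would lead us to reject H0 at α = 0.05; note that it says nothing about how probable the data are under any alternative.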
However, increasing numbers of researchers have questioned the concept of "statistical significance" and the limitations of p-values. Four major issues include:
- "As or more extreme": The definition of "more extreme" is unclear and depends on the experimental design (e.g., fixed trials vs. sequential trials). As a result, the very same observed data can yield different p-values under different sampling plans, which raises the question of what exactly a p-value is measuring.
- Researcher Degrees of Freedom and p-hacking: Researchers may collect more data, exclude outliers, or try multiple analyses to reach a p-value under 0.05. This flexibility increases false positives and undermines replicability.
- Statistical significance may suggest an effect where none exists: NHST often rejects H0 more readily than Bayesian inference would, because it never evaluates how well the data fit H1. A p-value just under .05 can therefore coexist with only weak evidence against the null.
- Misinterpretation: Many mistakenly believe the p-value is the probability the hypothesis is true given the data (P(H|data)). This is incorrect. The p-value tells us the probability of the observed (or more extreme) data given that H0 is true.
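The first point above can be made concrete with a classic stopping-rule example (the numbers are a standard textbook illustration, not from the article): the same data, 9 heads and 3 tails, produce different p-values depending on how the experiment was designed.

```python
from math import comb

# Same data -- 9 heads, 3 tails -- under two stopping rules.

# Design 1: flip exactly n = 12 times (fixed trials).
# "More extreme" means 9 or more heads out of 12 for a fair coin.
p_fixed = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Design 2: flip until the 3rd tail appears (sequential trials).
# "More extreme" now means observing 9 or more heads before the
# 3rd tail, a negative binomial tail probability:
# P(k heads before 3rd tail) = C(k + 2, 2) * 0.5**(k + 3)
p_seq = 1 - sum(comb(k + 2, 2) * 0.5**(k + 3) for k in range(9))

print(f"fixed-n p = {p_fixed:.4f}")   # ~0.073: not significant
print(f"sequential p = {p_seq:.4f}")  # ~0.033: significant
```

Identical observations, opposite verdicts at α = 0.05, purely because "more extreme" is defined by the sampling plan rather than by the data alone.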
II. NHST vs. Bayesian Hypothesis Testing: Why is Bayesian inference gaining traction?
Bayesian hypothesis testing differs in several key ways:
- Focuses on actual observed data and compares both H0 and H1: Bayes Factors (BF) quantify the relative likelihood of data under H1 versus H0. For example, BF10 = 10 means the data are 10 times more likely under H1 than under H0.
- Incorporates prior knowledge: Bayesian inference allows prior beliefs to inform analysis, which can then be updated based on new data to form posterior beliefs. Prior distributions can be default (diffuse) or informed (based on past studies).
- No fixed sample size required & allows for sequential analysis: Bayesian testing permits evidence to be monitored continuously. Once sufficient support for a hypothesis accumulates, data collection can stop — improving efficiency and reducing waste.
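The Bayes factor and sequential-monitoring ideas above can be sketched together in a toy simulation. This is a deliberate simplification (two point hypotheses rather than full prior distributions, and simulated coin flips), so the Bayes factor reduces to a running likelihood ratio:

```python
import random

# Simplified sketch of sequential Bayesian monitoring.
# H0: theta = 0.5 (fair coin); H1: theta = 0.7.
# Flips are simulated from a coin whose true bias is 0.7.
random.seed(42)
theta_true, theta0, theta1 = 0.7, 0.5, 0.7

bf10, n = 1.0, 0
while 1/10 < bf10 < 10:  # stop once evidence is "strong" either way
    heads = random.random() < theta_true
    # Multiply in each observation's likelihood ratio P(x|H1)/P(x|H0).
    bf10 *= (theta1 if heads else 1 - theta1) / (theta0 if heads else 1 - theta0)
    n += 1

print(f"stopped after {n} flips, BF10 = {bf10:.1f}")
```

Unlike the stopping-rule example in Section I, monitoring the Bayes factor and stopping early does not invalidate the evidence measure: BF10 retains its interpretation as the relative likelihood of the data under H1 versus H0 regardless of when you stop.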
III. Will Bayesian inference replace p-values?
Bayesian logic is increasingly adopted, especially in light of the replication crisis. But will it replace p-values?
I believe the two methods reflect fundamentally different logics. NHST estimates the probability of data given H0; Bayes compares the probability of data under multiple hypotheses. Thus, Bayes factors are not replacements but complements to p-values, helping improve interpretability.
Despite advantages, Bayesian methods have caveats:
- Strong priors can overpower data: If prior beliefs are too strong, the posterior may reflect the prior more than the actual observed data, especially in small samples.
- Model dependence: Bayes Factors compare specified models. A high BF could result from comparing a mediocre model with a very poor one.
- B-hacking: Like p-hacking, researchers might tune prior specifications or evidence thresholds until a strong Bayes Factor emerges. Though arguably less widespread than p-hacking, it warrants the same caution.
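The first caveat, a strong prior overpowering a small sample, is easy to see in a conjugate Beta-Binomial model (all numbers here are hypothetical):

```python
# Beta-Binomial update: prior Beta(a, b) + data (h heads, t tails)
# gives posterior Beta(a + h, b + t).
heads, tails = 8, 2  # hypothetical small sample: 8 heads in 10 flips

def posterior_mean(a, b):
    """Posterior mean of the coin's bias theta under a Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

flat = posterior_mean(1, 1)      # diffuse prior: posterior mean 0.75
strong = posterior_mean(50, 50)  # strong prior centered on 0.5: ~0.53

print(f"diffuse prior -> {flat:.2f}, strong prior -> {strong:.2f}")
```

With only ten observations, the Beta(50, 50) prior keeps the posterior pinned near 0.5 even though 80% of the flips were heads; with more data, the likelihood would eventually dominate either prior.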
In 2018, 72 scholars proposed redefining statistical significance from p < 0.05 to p < 0.005 to reduce false positives (Benjamin et al., 2018).
IV. My Opinion
If you have time and interest in methodology, I recommend learning Bayesian inference. Withhold judgment until you have engaged with it enough to evaluate it fairly.
When reporting results, it can be helpful to supplement p-values with Bayes factors. This is now easy thanks to JASP (a free, open-source statistical software for both Bayesian and frequentist analysis).
Since most researchers still rely on NHST, I suggest being especially cautious when interpreting p-values near the .05 threshold.
V. Learning Resources for Bayesian Analysis in JASP
- Hu et al. (2018). Bayesian factors and their implementation in JASP. Advances in Psychological Science, 26(6), 951–965.
- van Doorn, J. et al. (in press). An in-class demonstration of Bayesian inference. Psychology Learning and Teaching.
- Marsman & Wagenmakers (2017). Bayesian benefits with JASP. Eur. J. Dev. Psychol., 14, 545–555.
- Wagenmakers et al. (2018). Bayesian inference for psychology. Part II. Psychonomic Bulletin & Review, 25(1), 58–76.
- Rouder et al. (2009). Bayesian t tests. Psychonomic Bulletin & Review, 16, 225–237.
- Gronau et al. (2017). Bayesian model-averaged meta-analysis. Comprehensive Results in Social Psychology, 2, 123–138.
- van Doorn et al. (2019). The JASP guidelines for conducting and reporting a Bayesian analysis.
References
- Benjamin et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10.
- Chambers, C. (2017). The Seven Deadly Sins of Psychology. Princeton University Press.
- Jeffreys, H. (1961). Theory of Probability. Oxford University Press.
- Lee & Wagenmakers (2013). Bayesian Cognitive Modeling. Cambridge University Press.
- Lindley, D. V. (1993). The analysis of experimental data. Teaching Statistics, 15, 22–25.