p-Value Calculator

Convert any test statistic (z, t, F, or χ²) into a p-value. Supports one-tailed and two-tailed tests, and reports critical values, a plain-language significance interpretation, and Bonferroni-corrected thresholds for multiple comparisons.

How to Use This Calculator

  1. Select your distribution (Z, t, Chi-Square, or F).
  2. Enter your test statistic value.
  3. Enter degrees of freedom if required (t, χ², F).
  4. Choose one-tailed or two-tailed.
  5. Read the p-value and significance interpretation instantly.
  6. The Professional tab adds the Bonferroni correction and p-hacking warnings; a runnable sketch of the same computation appears after the Formula section below.

Formula

Z: p = 2 × P(Z > |z|) for two-tailed; p = P(Z > z) for an upper one-tailed test

t: p = 2 × P(T_df > |t|) for two-tailed

χ²: p = P(χ²_df > χ²_obs) (upper tail, the usual case)

F: p = P(F_{df1,df2} > F_obs) (upper tail)
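
These conversions can be sketched in a few lines of Python with scipy.stats; the helper names below are illustrative, not the calculator's internal implementation:

    # Convert a test statistic to a p-value using scipy.stats survival
    # functions, sf(x) = P(X > x). Helper names are illustrative only.
    from scipy import stats

    def p_from_z(z, two_tailed=True):
        p = stats.norm.sf(abs(z))          # P(Z > |z|)
        return 2 * p if two_tailed else p

    def p_from_t(t, df, two_tailed=True):
        p = stats.t.sf(abs(t), df)         # P(T_df > |t|)
        return 2 * p if two_tailed else p

    def p_from_chi2(x, df):
        return stats.chi2.sf(x, df)        # upper tail: P(chi2_df > x)

    def p_from_f(x, df1, df2):
        return stats.f.sf(x, df1, df2)     # upper tail: P(F_df1,df2 > x)

    print(p_from_z(1.96))          # ~0.0500
    print(p_from_t(2.093, 19))     # ~0.0500 (t critical value, 19 df)
    print(p_from_chi2(3.841, 1))   # ~0.0500 (chi-square critical, 1 df)
    print(p_from_f(4.351, 1, 20))  # ~0.0500 (F critical, df 1 and 20)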

Example

z = 1.96, two-tailed: p = 2 × (1 − Φ(1.96)) = 2 × 0.025 = 0.050 — just at the α=0.05 threshold.
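
The same number drops out of a one-liner, assuming the scipy sketch above:

    # 2 × (1 − Φ(1.96)) via the normal survival function
    from scipy.stats import norm
    print(2 * norm.sf(1.96))  # 0.04999579..., i.e. 0.050 to three decimals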

Frequently Asked Questions

  • What is a p-value? A p-value is the probability of observing a test statistic at least as extreme as the one computed from your sample, assuming the null hypothesis is true. More concisely: it measures how surprising your data would be if H₀ were correct. A small p-value (close to 0) means the observed data would be very unlikely under the null hypothesis, providing evidence against it. A large p-value means the data are consistent with H₀. Critically, the p-value does NOT tell you the probability that your hypothesis is true or false, the probability that the result was due to chance, or the probability that you'll get the same result if you repeat the study — these are common and serious misinterpretations. The p-value is computed from your sample, so it is itself a random variable that varies from sample to sample. R.A. Fisher introduced the p-value in 1925 as a measure of evidence against a null hypothesis, not as a binary accept/reject mechanism.
  • Why is 0.05 the standard threshold? The 0.05 threshold is largely a historical convention traced to R.A. Fisher's 1925 book 'Statistical Methods for Research Workers.' Fisher suggested that a result is 'significant' if it would occur by chance fewer than 1 in 20 times, but he never intended this as a rigid rule — he considered it a rough guide. Neyman and Pearson later formalized the decision-theoretic framework with fixed α and β (Type I and Type II error rates). The 0.05 threshold became entrenched through widespread adoption in textbooks, journals, and statistical software. Alternative thresholds exist for good reasons: α = 0.01 is used in high-stakes medical research where false positives are costly; α = 0.10 is used in exploratory research; and particle physics uses α ≈ 0.0000003 (the 5-sigma standard) because confirming a new particle requires extraordinary evidence. The 2016 ASA statement explicitly cautioned against treating 0.05 as a bright line and encouraged reporting exact p-values with effect sizes and confidence intervals.
  • How does a p-value differ from a confidence interval? A p-value and a confidence interval (CI) address related but different questions. A p-value tests whether an effect is zero under the null hypothesis — a yes/no question. A CI estimates the plausible range of the true effect size — a quantitative question. A 95% CI for a mean difference that excludes zero is equivalent to a two-tailed p-value below 0.05, but the CI provides far more information: it conveys both the statistical significance and the practical magnitude of the effect. A mean difference of 0.1 might be statistically significant with a huge sample (p = 0.001), yet a 95% CI of [0.05, 0.15] reveals the effect is tiny; conversely, a CI of [−2, 18] (spanning zero) shows the effect might be large but the study is underpowered. The ASA and most journals now encourage reporting CIs alongside (or instead of) p-values, and the APA Publication Manual recommends reporting both. In summary: the p-value answers 'is there an effect?'; the CI answers 'how large is the effect, and how precisely do we know it?' (The first sketch after this list illustrates the distinction.)
  • What is p-hacking? p-hacking refers to manipulating the data analysis until a statistically significant result is obtained — and then reporting only that result. Common practices include collecting data until p < 0.05, dropping inconvenient outliers, trying different statistical tests until one gives p < 0.05, selectively including or excluding covariates, or running the same test on many subgroups and reporting only the significant ones. Each additional test at α = 0.05 carries a 5% chance of a false positive, so with 20 tests you expect one significant result purely by chance; p-hacking therefore inflates the actual false positive rate far above the nominal α (the second sketch after this list simulates this inflation). John Ioannidis's landmark 2005 paper 'Why Most Published Research Findings Are False' argued that the majority of published findings in many fields are false positives, largely due to these practices. Solutions include pre-registration (publishing your analysis plan before data collection), multiple-comparison corrections (Bonferroni, FDR), open data and code sharing, and replication. The replication crisis in psychology and medicine (2011–present) made p-hacking a major focus of scientific reform.
  • Does p < 0.05 prove my hypothesis? No — p < 0.05 is a signal, not a verdict. First, statistical significance and practical significance are different concepts: a study with n = 100,000 might detect a mean difference of 0.001 units at p = 0.001, yet the effect is negligible in the real world, so always pair p-values with effect sizes (Cohen's d for means, η² for ANOVA, r for correlations) and confidence intervals. Second, statistical significance depends on sample size — the null hypothesis is rarely literally true (exact equality almost never holds in nature), so a large enough sample will reject virtually any point null. Third, context matters: α = 0.05 may be appropriate for exploratory research but too lenient for drug approval decisions; Benjamin et al. (2018) proposed lowering the standard threshold to α = 0.005 for claims of new discoveries. The consensus view is to treat the p-value as one piece of evidence among many — alongside effect size, CI, prior probability, and replication — rather than as the sole arbiter of truth.
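
A hedged illustration of the CI and effect-size points above. The data are simulated (a true mean difference of 0.1 with SD 1), so the exact numbers are not from any real study:

    # Statistically significant but practically tiny: with n = 100,000
    # per group, a 0.1-unit difference gives a minuscule p-value, while
    # the 95% CI and Cohen's d reveal how small the effect really is.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 100_000
    a = rng.normal(0.0, 1.0, n)   # "control" group
    b = rng.normal(0.1, 1.0, n)   # "treatment" group, true difference 0.1

    t_stat, p = stats.ttest_ind(a, b)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se              # approx. 95% CI
    d = diff / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # Cohen's d

    print(f"p = {p:.2e}")                    # far below 0.05
    print(f"95% CI = [{lo:.3f}, {hi:.3f}]")  # narrow and excludes 0
    print(f"Cohen's d = {d:.2f}")            # ~0.10: 'small' by convention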
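And a small simulation of the multiple-testing arithmetic from the p-hacking answer: 20 tests on pure noise, with and without the Bonferroni correction (again, all numbers are simulated):

    # Family-wise error rate with 20 independent tests on null data.
    # Theory: 1 - 0.95**20 ≈ 0.64 uncorrected; Bonferroni restores ~0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    alpha, m, trials = 0.05, 20, 2_000
    raw = bonf = 0

    for _ in range(trials):
        data = rng.normal(0.0, 1.0, size=(m, 30))  # true mean is exactly 0
        p = stats.ttest_1samp(data, 0.0, axis=1).pvalue
        raw  += (p < alpha).any()        # any of the 20 "significant"?
        bonf += (p < alpha / m).any()    # Bonferroni-corrected threshold

    print(f"uncorrected: {raw / trials:.2f}")   # ~0.64
    print(f"Bonferroni:  {bonf / trials:.2f}")  # ~0.05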

Sources & References (4)
  1. Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
  2. Wasserstein, R.L. & Lazar, N.A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose." The American Statistician, 70(2). American Statistical Association.
  3. NIST/SEMATECH Engineering Statistics Handbook, Hypothesis Testing. NIST.
  4. Stanford CS109: Probability for Computer Scientists. Stanford University.