Reference notes
Coverage
Test selection · power analysis · software · common mistakes
Sources
Statistical guidelines from NEJM, Nature, Lancet + CONSORT + STROBE
Last reviewed
March 2026
Prepared by the Manusights editorial team.
Methods-and-analysis guide
Statistical Resources for Biomedical Researchers
Statistical errors are one of the top reasons manuscripts get rejected, revised extensively, or retracted after publication. They're also one of the most fixable problems. Most statistical issues in biomedical papers come from a small number of repeated mistakes, not from complex methodology.
This guide covers the fundamentals: how to pick the right test for your data, how to do a proper power calculation before you collect data, the mistakes reviewers catch most often, and the software tools that work best for different types of research.
Quick orientation
Use this page when the analysis plan is becoming visible to reviewers and you want to remove the most common statistical weak points before submission.
This guide is a practical methods screen, not a full biostatistics course. It is most useful for test selection, power logic, figure-level statistical hygiene, and deciding when simple software is enough versus when the analysis needs something stronger.
Best used with
Reporting guidelines
Use it when the statistical cleanup also depends on the formal checklist required by the study design.
Systematic reviews
Move there when heterogeneity, pooled effects, or evidence synthesis become the main statistical issue.
Pre-submission checklist
Run the final package check once the analysis and figures are no longer obvious reviewer targets.
Start here first
The 4-part statistics workflow
If you are preparing a manuscript, first match the design to the test, then confirm power and sample-size logic, then remove figure- and methods-level errors that reviewers catch immediately, and only then choose software or advanced modeling tools.
Match design to test
Use the study design, data type, and pairing structure to choose the right comparison test.
Check power logic
Confirm the effect size, alpha, power target, and variance assumptions before final claims.
Fix reviewer-visible mistakes
Clean up replication, multiple testing, effect-size reporting, and figure labeling errors.
Choose the right tool
Use the software that matches the complexity of the analysis instead of defaulting blindly.
Methodology
What this statistics guide is built from
This workflow synthesizes statistical reporting guidance from major biomedical publishers, CONSORT and STROBE expectations, and standard biostatistics teaching references. The test-selection and common-mistakes sections are emphasized because they drive a large share of revision requests and avoidable reviewer criticism.
Choosing the Right Test
The choice depends on your data type, study design, and sample size. This table covers the most common scenarios in biomedical research.
| Scenario | Test | Note |
|---|---|---|
| Two groups, continuous, normal distribution | Unpaired t-test (or paired t-test for before/after in same subjects) | Check normality with Shapiro-Wilk; equal variances with Levene's test |
| Two groups, continuous, non-normal or small n | Mann-Whitney U (unpaired) or Wilcoxon signed-rank (paired) | No distribution assumption; ranks rather than raw values |
| Three or more groups, continuous, normal | One-way ANOVA, then post-hoc: Tukey (all pairwise) or Dunnett (vs. control) | A significant F-statistic only tells you some group differs, not which |
| Three or more groups, non-normal | Kruskal-Wallis, then Dunn's post-hoc | Non-parametric equivalent of one-way ANOVA |
| Two or more factors, continuous | Two-way ANOVA (tests main effects + interaction) | Check interaction term first; main effects are misleading if interaction is significant |
| Categorical outcomes, two groups | Chi-square test (expected count ≥ 5 per cell) or Fisher's exact test (small samples) | Fisher's is always valid; chi-square requires adequate expected cell counts |
| Time-to-event / survival data | Kaplan-Meier curves + log-rank test; Cox proportional hazards for multivariable | Cox assumes proportional hazards; test this assumption |
| Correlation between two continuous variables | Pearson (normal) or Spearman (non-normal or ordinal) | Spearman tests monotonic relationship, not just linear |
| Predicting an outcome from multiple variables | Linear regression (continuous), logistic regression (binary), Poisson (counts) | Check assumptions: linearity, independence, homoscedasticity, normality of residuals |
| Repeated measurements over time | Repeated-measures ANOVA or linear mixed-effects models | Mixed models handle missing data better and don't require sphericity |
When uncertain about which test to use, consult a biostatistician before collecting data. Choosing the wrong test after data collection creates problems that are hard to fix retrospectively.
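The first two rows of the table can be sketched as a small decision in code. This is an illustrative sketch using scipy (one of the Python packages mentioned below); the data values are invented.

```python
# Sketch: check assumptions, then choose unpaired t-test vs Mann-Whitney U.
# Data values are invented for illustration.
from scipy import stats

control = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2, 3.7]
treated = [5.0, 4.8, 5.6, 5.2, 4.9, 5.4, 5.1, 4.7]

# Shapiro-Wilk: p > 0.05 means no evidence against normality
norm_ok = all(stats.shapiro(g).pvalue > 0.05 for g in (control, treated))
# Levene's test for equal variances decides plain vs Welch t-test
equal_var = stats.levene(control, treated).pvalue > 0.05

if norm_ok:
    res = stats.ttest_ind(control, treated, equal_var=equal_var)
else:
    res = stats.mannwhitneyu(control, treated)
print(f"p = {res.pvalue:.4f}")
```

Note that with n = 8 per group, Shapiro-Wilk has little power to detect non-normality, which is exactly the small-sample caveat discussed under common mistakes below.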
Sample Size and Power Analysis
Power analysis should happen before data collection, not after. A study that's underpowered wastes time, money, and animal lives. A study that's overpowered wastes resources that could go to other experiments.
Effect size
The smallest difference that would be biologically or clinically meaningful. Don't use Cohen's "small/medium/large" conventions for biomedical research. Those are social science defaults and don't translate to molecular biology or clinical work.
Significance level (alpha)
Typically 0.05 (two-sided). Some fields use 0.01 for genome-wide studies. The alpha sets your false positive rate.
Desired power (1 - beta)
Typically 0.80 (80% probability of detecting a real effect). Some grant agencies want 0.90 for clinical trials.
Variability (SD)
Get this from prior studies or pilot data. If you're guessing the SD, your power calculation is also a guess.
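The four inputs above map directly onto a sample-size calculation. A minimal sketch using statsmodels, with a placeholder effect size and SD (in a real study these come from your pilot data, not from this example):

```python
# Sketch: sample size per group for a two-group comparison.
# meaningful_diff and sd are illustrative placeholders, not recommendations.
from statsmodels.stats.power import TTestIndPower

meaningful_diff = 5.0   # smallest biologically meaningful difference
sd = 8.0                # variability estimate from pilot data or prior studies
effect_size = meaningful_diff / sd  # standardized effect (Cohen's d ≈ 0.625)

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"n per group: {n_per_group:.1f}")  # round UP to the next whole animal/patient
```

G*Power performs the same calculation through a GUI; using both as a cross-check is a reasonable habit.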
Common Statistical Mistakes Reviewers Catch
These eight mistakes account for the majority of statistical concerns raised in peer review of biomedical manuscripts.
1. Multiple comparisons without correction
Running 20 t-tests without adjusting inflates your false positive rate to ~64%. Use Bonferroni (conservative), Benjamini-Hochberg FDR (less conservative), or pre-specify your primary endpoint and treat the rest as exploratory.
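Both corrections are one function call in statsmodels. A sketch with invented p-values, showing that Benjamini-Hochberg typically retains more discoveries than Bonferroni:

```python
# Sketch: Bonferroni vs Benjamini-Hochberg FDR on a set of raw p-values.
# The raw p-values are invented for illustration.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.020, 0.041, 0.300]

for method in ("bonferroni", "fdr_bh"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in adj_p], reject.tolist())
```

With these values Bonferroni rejects two nulls while FDR rejects three; the raw p = 0.041 survives neither correction, which is why pre-specifying a primary endpoint matters.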
2. Technical replicates counted as biological replicates
Three wells from the same mouse are NOT n = 3. They're one biological replicate measured three times. n refers to independent biological units: separate animals, separate patients, separate experiments done on different days. This is the single most common statistical error in preclinical biology papers.
3. Parametric tests on small non-normal samples
With n < 15-20 per group, normality is hard to verify and violations matter more. Use non-parametric tests or report results both ways. If results agree, your conclusions are robust regardless of the distribution.
4. No effect sizes reported
A p-value of 0.001 tells you an effect exists. It doesn't tell you the effect is meaningful. Always report the effect size (mean difference, odds ratio, hazard ratio, Cohen's d) with 95% confidence intervals. Reviewers increasingly flag papers that report only p-values.
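A sketch of reporting an effect size with its confidence interval alongside the p-value, using scipy and invented group data:

```python
# Sketch: mean difference with 95% CI and Cohen's d, not just a p-value.
# Group data are invented for illustration.
import numpy as np
from scipy import stats

a = np.array([12.1, 11.8, 13.0, 12.4, 11.5, 12.8, 12.2, 11.9])
b = np.array([13.4, 13.9, 14.2, 13.1, 14.0, 13.6, 13.8, 13.3])

diff = b.mean() - a.mean()
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))
d = diff / pooled_sd  # Cohen's d (standardized effect size)

# 95% CI on the raw mean difference via the t distribution
se = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))
dof = len(a) + len(b) - 2
ci = stats.t.interval(0.95, dof, loc=diff, scale=se)

p = stats.ttest_ind(a, b).pvalue
print(f"diff = {diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), d = {d:.2f}, p = {p:.4f}")
```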
5. ANOVA without post-hoc tests
A significant F-statistic means at least one group is different. It doesn't tell you which ones. Follow up with Tukey (all pairwise), Dunnett (vs. control), or Bonferroni-corrected t-tests depending on your question.
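The two-step logic looks like this in scipy (illustrative data; Tukey's HSD is used here, matching the all-pairwise case in the table above):

```python
# Sketch: one-way ANOVA, then Tukey's HSD to find WHICH groups differ.
# Group values are invented for illustration.
from scipy import stats

ctrl = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
low  = [5.6, 5.9, 5.7, 6.0, 5.8, 5.5]
high = [6.8, 7.1, 6.9, 7.3, 7.0, 6.7]

f, p = stats.f_oneway(ctrl, low, high)
print(f"ANOVA: F = {f:.1f}, p = {p:.2e}")  # only says SOME group differs

# Tukey HSD gives a p-value for every pair of groups
tukey = stats.tukey_hsd(ctrl, low, high)
print(tukey)
```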
6. Wrong test for paired data
Comparing treatment to baseline in the same group requires a paired test, not an unpaired one. Paired designs have more power because they control for between-subject variability. Using an unpaired test on paired data wastes statistical power and can miss real effects.
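A small numeric sketch of the power loss, with invented before/after data in which every subject improves slightly but baselines vary a lot between subjects:

```python
# Sketch: the same before/after data analyzed paired vs (wrongly) unpaired.
# Values are invented; each subject improves ~0.65 units, baselines vary widely.
from scipy import stats

before = [10.2, 15.1, 8.7, 12.9, 14.0, 9.5, 11.3, 13.2]
after  = [10.9, 15.8, 9.3, 13.4, 14.8, 10.1, 11.9, 13.9]

p_paired   = stats.ttest_rel(before, after).pvalue  # correct for this design
p_unpaired = stats.ttest_ind(before, after).pvalue  # ignores the pairing

print(f"paired p = {p_paired:.6f}, unpaired p = {p_unpaired:.3f}")
```

The paired test sees a consistent within-subject change; the unpaired test drowns it in between-subject variability and misses the effect entirely.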
7. Treating p = 0.05 as a binary threshold
A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Report exact p-values (p = 0.032, not p < 0.05) and emphasize confidence intervals and effect sizes. The "significance" threshold is a convention, not a law of nature.
8. Correlation described as causation
"X was associated with Y" is correct when your data shows a correlation. "X drove Y" or "X caused Y" requires experimental evidence of a causal mechanism. Observational data can suggest associations. It can't prove causation without additional study designs.
Figure-Level Mistakes Reviewers Notice Immediately
A lot of statistics criticism starts before a reviewer reaches the methods section. They see the figures first. If the figures look sloppy, the reader assumes the analysis may be sloppy too.
Reporting Statistics in Your Paper
For every statistical comparison, report: the test used, sample size per group, summary statistic (mean ± SD or SEM, defined explicitly), exact p-value, and effect size with confidence interval.
SD vs. SEM: know the difference
SD (standard deviation) describes how variable your data is. It's a property of the data itself.
SEM (standard error of the mean) describes how precisely you've estimated the mean. It shrinks as sample size increases (SEM = SD / √n), which means it can make data look less variable than it really is.
Using SEM to make error bars look smaller is a recognized problem. Nature, NEJM, Cell, and most clinical journals now recommend or require SD or 95% CI in figures. If you use SEM, state it explicitly and expect reviewer questions about why.
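The shrinkage is easy to see numerically. A one-liner sketch of SEM = SD / √n for a fixed SD:

```python
# Sketch: the same SD yields ever-smaller SEM as n grows (SEM = SD / sqrt(n)).
import math

sd = 4.0
for n in (5, 20, 100):
    sem = sd / math.sqrt(n)
    print(f"n = {n:3d}: SD = {sd:.2f}, SEM = {sem:.2f}")
```

The data are no less variable at n = 100 than at n = 5; only the precision of the mean estimate has improved, which is why SEM error bars understate biological spread.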
Software Tools
GraphPad Prism
Paid (free trial). Best for: cell biology, preclinical research. Good for t-tests, ANOVA, survival curves, nonlinear curve fitting. Interface-driven, no coding required. The default tool in many wet labs.
Limitation: Limited for complex models (mixed effects, multivariable regression). Expensive for individual licenses.
R
Free (open-source). Best for: complex statistical analysis and reproducible research; the industry standard. Key packages: ggplot2 (visualization), lme4 (mixed models), survival (Kaplan-Meier/Cox), tidyverse (data wrangling).
Limitation: Steep learning curve. Requires coding. But once you learn it, it handles everything.
SPSS
Paid (institutional). Best for: clinical research, social sciences, epidemiology. Menu-driven interface. Good for standard tests without coding.
Limitation: Less common in basic science. Limited flexibility compared to R. IBM licensing can be expensive.
Stata
Paid. Best for: epidemiology, public health, health economics. Excellent for logistic regression, survival analysis, survey-weighted analyses, panel data.
Limitation: Less intuitive than SPSS. Smaller user community than R.
Python (scipy, statsmodels)
Free (open-source). Best for: bioinformatics, computational biology, machine-learning pipelines. Growing rapidly in quantitative biology.
Limitation: Statistical ecosystem less mature than R for classical biostatistics. Fewer purpose-built packages for clinical trial analysis.
jamovi
Free (open-source). Best for: researchers who want GUI-driven analysis without the cost of SPSS. Point-and-click interface built on R; exports the underlying R code, so you can learn R alongside it.
Limitation: Fewer advanced features than R directly. Smaller community.
G*Power
Free. Best for: power analysis, where it is the standard tool. Covers most common test types. Available for Windows and Mac. Every grant application with a power calculation probably used this.
Limitation: Only does power analysis. For analysis itself, you need a different tool.
Frequently Asked Questions
Should I use mean ± SD or mean ± SEM in my figures?
Use SD. It tells readers how variable your data is, which is what they need to judge biological variability. SEM shrinks as your sample size grows, so it can make data look more consistent than it really is. Most high-impact journals now recommend or require SD or 95% CI in figures. If you choose SEM, define it explicitly in the figure legend and be ready for reviewers to ask why.
What's the difference between statistical significance and biological significance?
Statistical significance (p < 0.05) means an observed effect is unlikely to be due to chance alone. It says nothing about whether the effect matters. A trial with 100,000 participants might find a statistically significant 0.5 mmHg blood pressure reduction. Real, but no doctor would change their practice over it. Always report effect sizes with confidence intervals so readers can judge the magnitude, not just the existence, of an effect.
My data isn't normally distributed. Do I have to use non-parametric tests?
Not necessarily. With large samples (n > 30 per group), parametric tests hold up well against non-normality because of the central limit theorem. For smaller samples, use non-parametric alternatives (Mann-Whitney U, Kruskal-Wallis) or try a log transformation, which often works for the right-skewed distributions common in biological data. Report which assumption you tested and how in your methods section. Many reviewers will ask.
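A sketch of the log-transformation route from the answer above, with invented right-skewed values standing in for something like cytokine concentrations (the second group is simply the first scaled up, to keep the example deterministic):

```python
# Sketch: log-transforming right-skewed data before a parametric test.
# Values are invented; grp2 is grp1 scaled 3x to mimic a fold-change effect.
import numpy as np
from scipy import stats

grp1 = np.array([1.2, 0.8, 2.5, 1.1, 0.9, 3.8, 1.5, 1.0, 2.1, 0.7, 1.3, 6.2])
grp2 = 3.0 * grp1

# A fold-change effect is additive on the log scale, so logs suit the t-test
p_raw = stats.ttest_ind(grp1, grp2).pvalue
p_log = stats.ttest_ind(np.log(grp1), np.log(grp2)).pvalue
p_mwu = stats.mannwhitneyu(grp1, grp2).pvalue
print(f"raw t-test p = {p_raw:.4f}, log t-test p = {p_log:.4f}, "
      f"Mann-Whitney p = {p_mwu:.4f}")
```

Reporting which route you took, and why, is exactly the methods-section detail reviewers look for.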
References
- Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 4th ed. Oxford University Press, 2017. [oup.com]
- Krzywinski M, Altman N. Points of Significance (series). Nature Methods. 2013-2020. [nature.com]
- Altman DG, Bland JM. Statistics Notes. BMJ. 1994-present. [bmj.com]
- GraphPad Software. Prism Statistics Guide. Retrieved March 2026. [graphpad.com]
- Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods. 2007;39:175-191. [doi.org]
Ready to apply this to a real draft?
Move from reference guidance to a manuscript-specific check
Use the public submission-readiness path when you already have a manuscript and need a draft-specific signal, not just a general guide.
Best for researchers who want a fast readiness read before deciding whether to revise, retarget, or submit.
Related guides in this collection