Scope: Test selection · power analysis · software · common mistakes
Data: Statistical guidelines from NEJM, Nature, Lancet + CONSORT + STROBE
Last reviewed: March 2026
Source: Manusights editorial team (researchers with publications in Cell, Nature, Science)
Cite this guide ↓

Statistical Resources for Biomedical Researchers

Statistical errors are one of the top reasons manuscripts get rejected, revised extensively, or retracted after publication. They're also one of the most fixable problems. Most statistical issues in biomedical papers come from a small number of repeated mistakes, not from complex methodology.

This guide covers the fundamentals: how to pick the right test for your data, how to do a proper power calculation before you collect data, the mistakes reviewers catch most often, and the software tools that work best for different types of research.

Choosing the Right Test

The choice depends on your data type, study design, and sample size. This table covers the most common scenarios in biomedical research.

| Scenario | Test | Note |
| --- | --- | --- |
| Two groups, continuous, normal distribution | Unpaired t-test (or paired t-test for before/after in the same subjects) | Check normality with Shapiro-Wilk; equal variances with Levene's test |
| Two groups, continuous, non-normal or small n | Mann-Whitney U (unpaired) or Wilcoxon signed-rank (paired) | No distribution assumption; ranks rather than raw values |
| Three or more groups, continuous, normal | One-way ANOVA, then post-hoc: Tukey (all pairwise) or Dunnett (vs. control) | A significant F-statistic only tells you some group differs, not which |
| Three or more groups, non-normal | Kruskal-Wallis, then Dunn's post-hoc | Non-parametric equivalent of one-way ANOVA |
| Two or more factors, continuous | Two-way ANOVA (tests main effects + interaction) | Check the interaction term first; main effects are misleading if the interaction is significant |
| Categorical outcomes, two groups | Chi-square test (expected count ≥ 5 per cell) or Fisher's exact test (small samples) | Fisher's is always valid; chi-square requires adequate expected cell counts |
| Time-to-event / survival data | Kaplan-Meier curves + log-rank test; Cox proportional hazards for multivariable | Cox assumes proportional hazards; test this assumption |
| Correlation between two continuous variables | Pearson (normal) or Spearman (non-normal or ordinal) | Spearman tests monotonic relationships, not just linear ones |
| Predicting an outcome from multiple variables | Linear regression (continuous), logistic regression (binary), Poisson (counts) | Check assumptions: linearity, independence, homoscedasticity, normality of residuals |
| Repeated measurements over time | Repeated-measures ANOVA or linear mixed-effects models | Mixed models handle missing data better and don't require sphericity |
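The first two rows of the table can be sketched in Python with scipy (one of the tools covered below). The data here is simulated purely for illustration, and the 0.05 normality cutoff is a common convention, not a rule:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, size=12)   # simulated control measurements
group_b = rng.normal(60, 10, size=12)   # simulated treated measurements

# Shapiro-Wilk tests the null hypothesis that each sample is normally distributed
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    result = stats.ttest_ind(group_a, group_b)      # unpaired t-test
else:
    result = stats.mannwhitneyu(group_a, group_b)   # rank-based alternative

print(f"p = {result.pvalue:.4f}")
```

With small samples like these, a normality test has limited power of its own, which is one reason the guide recommends consulting a biostatistician when in doubt.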

When uncertain about which test to use, consult a biostatistician before collecting data. Choosing the wrong test after data collection creates problems that are hard to fix retrospectively.

Sample Size and Power Analysis

Power analysis should happen before data collection, not after. A study that's underpowered wastes time, money, and animal lives. A study that's overpowered wastes resources that could go to other experiments.

Effect size

The smallest difference that would be biologically or clinically meaningful. Don't use Cohen's "small/medium/large" conventions for biomedical research. Those are social science defaults and don't translate to molecular biology or clinical work.

Significance level (alpha)

Typically 0.05 (two-sided). Genome-wide association studies use a far stricter threshold (conventionally 5 × 10⁻⁸) to account for the millions of tests involved. Alpha sets your false positive rate.

Desired power (1 - beta)

Typically 0.80 (80% probability of detecting a real effect). Some grant agencies want 0.90 for clinical trials.

Variability (SD)

Get this from prior studies or pilot data. If you're guessing the SD, your power calculation is also a guess.
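As a sketch of how these four ingredients combine, using statsmodels' TTestIndPower (the difference and SD below are illustrative, not from any real study):

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

# Illustrative inputs (not from any specific study):
meaningful_diff = 10.0               # smallest biologically meaningful difference
sd = 10.0                            # SD from pilot data or a prior study
effect_size = meaningful_diff / sd   # Cohen's d

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"n per group: {ceil(n_per_group)}")   # round up to whole subjects
```

Only the difference and SD change for your own study; solve_power returns whichever quantity (n, power, or effect size) you leave unspecified.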

Reporting in your methods: "We estimated that n = X per group would provide 80% power to detect a Y% difference in [outcome] with a two-sided alpha of 0.05, based on a standard deviation of Z from [prior study or pilot data]."
Avoid post-hoc power analysis. Calculating power after seeing the data is circular and uninformative. If your study was underpowered, say so in the limitations and report confidence intervals for your effect size instead.

Common Statistical Mistakes Reviewers Catch

These eight mistakes account for the majority of statistical concerns raised in peer review of biomedical manuscripts.

1. Multiple comparisons without correction

Running 20 t-tests without adjusting inflates your false positive rate to ~64%. Use Bonferroni (conservative), Benjamini-Hochberg FDR (less conservative), or pre-specify your primary endpoint and treat the rest as exploratory.
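A minimal sketch of Benjamini-Hochberg correction with statsmodels, on 20 hypothetical p-values as if from 20 unadjusted t-tests:

```python
from statsmodels.stats.multitest import multipletests

# 20 hypothetical p-values, as if from 20 unadjusted t-tests
pvals = [0.001, 0.004, 0.012, 0.030, 0.040, 0.049] + [0.10 + 0.045 * i for i in range(14)]

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{sum(reject)} of {len(pvals)} comparisons survive FDR correction")
```

Note that several nominally "significant" raw p-values no longer survive once the correction is applied; passing method="bonferroni" instead gives the more conservative option.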

2. Technical replicates counted as biological replicates

Three wells from the same mouse are NOT n = 3. They're one biological replicate measured three times. n refers to independent biological units: separate animals, separate patients, separate experiments done on different days. This is the single most common statistical error in preclinical biology papers.

3. Parametric tests on small non-normal samples

With n < 15-20 per group, normality is hard to verify and violations matter more. Use non-parametric tests or report results both ways. If results agree, your conclusions are robust regardless of the distribution.

4. No effect sizes reported

A p-value of 0.001 tells you an effect exists. It doesn't tell you the effect is meaningful. Always report the effect size (mean difference, odds ratio, hazard ratio, Cohen's d) with 95% confidence intervals. Reviewers increasingly flag papers that report only p-values.
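A sketch of computing Cohen's d and a 95% CI for a mean difference by hand, on simulated data loosely matching the tumor-volume numbers used later in this guide (equal-n groups and the pooled-SD formula are assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(45, 12, size=10)   # simulated tumor volumes, treated arm
vehicle = rng.normal(78, 15, size=10)   # simulated tumor volumes, vehicle arm

diff = treated.mean() - vehicle.mean()

# Pooled SD and Cohen's d
pooled_sd = np.sqrt((treated.var(ddof=1) + vehicle.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# 95% CI of the mean difference from the pooled-variance t distribution
n1, n2 = len(treated), len(vehicle)
se_diff = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff
print(f"difference = {diff:.1f} mm³, d = {cohens_d:.2f}, 95% CI {ci_low:.1f} to {ci_high:.1f}")
```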

5. ANOVA without post-hoc tests

A significant F-statistic means at least one group is different. It doesn't tell you which ones. Follow up with Tukey (all pairwise), Dunnett (vs. control), or Bonferroni-corrected t-tests depending on your question.

6. Wrong test for paired data

Comparing treatment to baseline in the same group requires a paired test, not an unpaired one. Paired designs have more power because they control for between-subject variability. Using an unpaired test on paired data wastes statistical power and can miss real effects.
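The power gain from pairing is easy to see on simulated before/after data, where large between-subject variability hides a small but consistent within-subject change:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
baseline = rng.normal(100, 20, size=12)            # subjects vary widely at baseline
follow_up = baseline - 8 + rng.normal(0, 2, 12)    # small, consistent within-subject drop

p_paired = stats.ttest_rel(baseline, follow_up).pvalue      # correct: paired test
p_unpaired = stats.ttest_ind(baseline, follow_up).pvalue    # wrong test for this design
print(f"paired p = {p_paired:.2g}, unpaired p = {p_unpaired:.2g}")
```

The paired test detects the drop easily because it works on within-subject differences; the unpaired test dilutes the same effect in the between-subject spread.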

7. Treating p = 0.05 as a binary threshold

A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Report exact p-values (p = 0.032, not p < 0.05) and emphasize confidence intervals and effect sizes. The "significance" threshold is a convention, not a law of nature.

8. Correlation described as causation

"X was associated with Y" is correct when your data shows a correlation. "X drove Y" or "X caused Y" requires experimental evidence of a causal mechanism. Observational data can suggest associations. It can't prove causation without additional study designs.

Figure-Level Mistakes Reviewers Notice Immediately

Much statistical criticism starts before a reviewer reaches the methods section: they see the figures first. If the figures look sloppy, the reader assumes the analysis may be sloppy too.

Bar graphs hiding the raw data: For small-n animal or cell-based experiments, dot plots or box plots are often better. A bar with SEM error bars can hide ugly spread, outliers, or tiny sample sizes.
No definition of n: If the legend says n = 3, the reviewer wants to know: three mice, three donors, or three technical wells? Put the biological unit in the legend or methods.
Asterisks with no test named: **** is not a methods section. State the actual test, whether the comparison was paired, and whether multiple-comparison correction was used.
Inconsistent error bars across figures: If Figure 2 uses SD and Figure 4 uses SEM, it looks careless unless there is a good reason. Pick one convention and state it clearly.
Axis truncation that exaggerates effects: Sometimes a visually dramatic bar graph is just a y-axis starting at 92 instead of 0. Reviewers notice this fast, especially in translational and clinical papers.

Reporting Statistics in Your Paper

For every statistical comparison, report: the test used, sample size per group, summary statistic (mean ± SD or SEM, defined explicitly), exact p-value, and effect size with confidence interval.

"Treatment reduced tumor volume by 42% compared to vehicle (mean ± SD: 45 ± 12 mm³ vs. 78 ± 15 mm³; unpaired t-test: t(18) = 5.6, p = 0.00002, Cohen's d = 2.5, 95% CI of difference: 21–45 mm³)."

SD vs. SEM: know the difference

SD (standard deviation) describes how variable your data is. It's a property of the data itself.

SEM (standard error of the mean) describes how precisely you've estimated the mean. It shrinks as sample size increases (SEM = SD / √n), which means it can make data look less variable than it really is.
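The distinction is easy to demonstrate on simulated samples drawn from the same population: SD stays near the population value while SEM shrinks as n grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sds, sems = [], []
for n in (5, 50, 500):
    sample = rng.normal(100, 10, size=n)   # same population, growing sample size
    sds.append(sample.std(ddof=1))
    sems.append(stats.sem(sample))         # = SD / sqrt(n)
    print(f"n = {n:3d}: SD = {sds[-1]:5.2f}, SEM = {sems[-1]:5.2f}")
```

The data are no less variable at n = 500; only the estimate of the mean is more precise, which is exactly why SEM error bars can mislead.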

Using SEM to make error bars look smaller is a recognized problem. Nature, NEJM, Cell, and most clinical journals now recommend or require SD or 95% CI in figures. If you use SEM, state it explicitly and expect reviewer questions about why.

Minimum Stats Checklist for the Methods Section

1. Name the exact test used for each figure or analysis
2. Define n as the biological unit, not just the number
3. State whether tests were paired or unpaired, one-sided or two-sided
4. Name the software and version when relevant
5. State how multiple comparisons were handled
6. Say how outliers were defined and whether any were excluded
7. Report what error bars represent: SD, SEM, or 95% CI
8. For regression or survival analyses, state the model assumptions checked

Software Tools

GraphPad Prism

Paid (free trial)

Best for: Cell biology, preclinical research. Good for t-tests, ANOVA, survival curves, nonlinear curve fitting. Interface-driven, no coding required. The default tool in many wet labs.

Limitation: Limited for complex models (mixed effects, multivariable regression). Expensive for individual licenses.

R

Free (open-source)

Best for: Complex statistical analysis and reproducible research; the industry standard. Key packages: ggplot2 (visualization), lme4 (mixed models), survival (Kaplan-Meier/Cox), tidyverse (data wrangling).

Limitation: Steep learning curve. Requires coding. But once you learn it, it handles everything.

SPSS

Paid (institutional)

Best for: Clinical research, social sciences, epidemiology. Menu-driven interface. Good for standard tests without coding.

Limitation: Less common in basic science. Limited flexibility compared to R. IBM licensing can be expensive.

Stata

Paid

Best for: Epidemiology, public health, health economics. Excellent for logistic regression, survival analysis, survey-weighted analyses, panel data.

Limitation: Less intuitive than SPSS. Smaller user community than R.

Python (scipy, statsmodels)

Free (open-source)

Best for: Bioinformatics, computational biology, machine learning pipelines. Growing rapidly in quantitative biology.

Limitation: Statistical ecosystem less mature than R for classical biostatistics. Fewer purpose-built packages for clinical trial analysis.

jamovi

Free (open-source)

Best for: Researchers who want GUI-driven analysis without the cost of SPSS. A point-and-click interface built on R, and a good SPSS alternative. Exports R code so you can learn R alongside it.

Limitation: Fewer advanced features than R directly. Smaller community.

G*Power

Free

Best for: Power analysis before data collection; the de facto standard tool. Covers most common test types. Available for Windows and Mac. Every grant application with a power calculation probably used this.

Limitation: Only does power analysis. For analysis itself, you need a different tool.

Resources to Go Deeper

Nature Methods "Points of Significance" series: Short primers on specific statistical concepts. Free. Written by Krzywinski and Altman. Covers everything from error bars to multivariable regression in 1-2 page articles.
BMJ Statistics Notes (Altman & Bland): Over 60 short pieces on specific statistical problems. Free at bmj.com. Each one addresses a single concept in 600 words. The best quick reference for common questions.
Motulsky, Intuitive Biostatistics (4th ed., 2017): The most accessible statistics textbook for biomedical researchers. Written by the founder of GraphPad. Explains concepts without heavy math.
EQUATOR Network: For reporting guidelines (CONSORT for RCTs, STROBE for observational studies) that include statistical reporting requirements. If you're writing up a clinical study, check here first.
NEJM statistical review articles: The journal publishes periodic primers on statistical methods used in clinical research. Free on NEJM.org.

Frequently Asked Questions

Should I use mean ± SD or mean ± SEM in my figures?

Use SD. It tells readers how variable your data is, which is what they need to judge biological variability. SEM shrinks as your sample size grows, so it can make data look more consistent than it really is. Most high-impact journals now recommend or require SD or 95% CI in figures. If you choose SEM, define it explicitly in the figure legend and be ready for reviewers to ask why.

What's the difference between statistical significance and biological significance?

Statistical significance (p < 0.05) means an observed effect is unlikely to be due to chance alone. It says nothing about whether the effect matters. A trial with 100,000 participants might find a statistically significant 0.5 mmHg blood pressure reduction. Real, but no doctor would change their practice over it. Always report effect sizes with confidence intervals so readers can judge the magnitude, not just the existence, of an effect.

My data isn't normally distributed. Do I have to use non-parametric tests?

Not necessarily. With large samples (n > 30 per group), parametric tests hold up well against non-normality because of the central limit theorem. For smaller samples, use non-parametric alternatives (Mann-Whitney U, Kruskal-Wallis) or try a log transformation, which often works for the right-skewed distributions common in biological data. Report which assumption you tested and how in your methods section. Many reviewers will ask.
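A quick illustration with scipy: simulated right-skewed (lognormal) data fails a Shapiro-Wilk normality test, while its log transform does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
raw = rng.lognormal(mean=3.0, sigma=1.0, size=100)  # right-skewed, like many biomarker readouts

p_raw = stats.shapiro(raw).pvalue
p_log = stats.shapiro(np.log(raw)).pvalue
print(f"Shapiro-Wilk p: raw = {p_raw:.2g}, log-transformed = {p_log:.2g}")
```

If you analyze on the log scale, remember that back-transformed means are geometric means, and say so in the methods.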

References

  1. Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 4th ed. Oxford University Press, 2017. [oup.com ↗]
  2. Krzywinski M, Altman N. Points of Significance (series). Nature Methods. 2013-2020. [nature.com ↗]
  3. Altman DG, Bland JM. Statistics Notes. BMJ. 1994-present. [bmj.com ↗]
  4. GraphPad Software. Prism Statistics Guide. Retrieved March 2026. [graphpad.com ↗]
  5. Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods. 2007;39:175-191. [doi.org ↗]

Suggested Citation

APA

Manusights. (2026). Statistical resources for biomedical researchers. Retrieved from https://manusights.com/resources/statistical-resources-guide

MLA

Manusights. "Statistical Resources for Biomedical Researchers." Manusights, 2026, manusights.com/resources/statistical-resources-guide.

CC BY 4.0 - share and adapt freely with attribution to Manusights (manusights.com/resources).

About these resources: Manusights is a pre-submission manuscript review service staffed by researchers with publications in Cell, Nature, Science, and related journals. These reference guides are produced as free, independent resources for the research community. No sign-up required. Browse all resource guides or learn about Manusights.