Reference notes
Coverage
Test selection · power analysis · software · common mistakes
Sources
Statistical guidelines from NEJM, Nature, Lancet + CONSORT + STROBE
Last reviewed
March 2026
Prepared by the Manusights editorial team.
Methods-and-analysis guide
Statistical Resources for Biomedical Researchers
Statistical errors are one of the top reasons manuscripts get rejected, revised extensively, or retracted after publication. They're also one of the most fixable problems. Most statistical issues in biomedical papers come from a small number of repeated mistakes, not from complex methodology.
This guide covers the fundamentals: how to pick the right test for your data, how to do a proper power calculation before you collect data, the mistakes reviewers catch most often, and the software tools that work best for different types of research.
Quick orientation
Use this page when the analysis plan is becoming visible to reviewers and you want to remove the most common statistical weak points before submission.
This guide is a practical methods screen, not a full biostatistics course. It is most useful for test selection, power logic, figure-level statistical hygiene, and deciding when simple software is enough versus when the analysis needs something stronger.
Best used with
Reporting guidelines
Use it when the statistical cleanup also depends on the formal checklist required by the study design.
Systematic reviews
Move there when heterogeneity, pooled effects, or evidence synthesis become the main statistical issue.
Pre-submission checklist
Run the final package check once the analysis and figures are no longer obvious reviewer targets.
Start here first
The 4-part statistics workflow
If you are preparing a manuscript, first match the design to the test, then confirm power and sample-size logic, then remove figure- and methods-level errors that reviewers catch immediately, and only then choose software or advanced modeling tools.
Match design to test
Use the study design, data type, and pairing structure to choose the right comparison test.
Check power logic
Confirm the effect size, alpha, power target, and variance assumptions before final claims.
Fix reviewer-visible mistakes
Clean up replication, multiple testing, effect-size reporting, and figure labeling errors.
Choose the right tool
Use the software that matches the complexity of the analysis instead of defaulting blindly.
Methodology
What this statistics guide is built from
This workflow synthesizes statistical reporting guidance from major biomedical publishers, CONSORT and STROBE expectations, and standard biostatistics teaching references. The test-selection and common-mistakes sections are emphasized because they drive a large share of revision requests and avoidable reviewer criticism.
Choosing the Right Test
The choice depends on your data type, study design, and sample size. This table covers the most common scenarios in biomedical research.
| Scenario | Test | Note |
|---|---|---|
| Two groups, continuous, normal distribution | Unpaired t-test (or paired t-test for before/after in same subjects) | Check normality with Shapiro-Wilk; equal variances with Levene's test |
| Two groups, continuous, non-normal or small n | Mann-Whitney U (unpaired) or Wilcoxon signed-rank (paired) | No distribution assumption; ranks rather than raw values |
| Three or more groups, continuous, normal | One-way ANOVA, then post-hoc: Tukey (all pairwise) or Dunnett (vs. control) | A significant F-statistic only tells you some group differs, not which |
| Three or more groups, non-normal | Kruskal-Wallis, then Dunn's post-hoc | Non-parametric equivalent of one-way ANOVA |
| Two or more factors, continuous | Two-way ANOVA (tests main effects + interaction) | Check interaction term first; main effects are misleading if interaction is significant |
| Categorical outcomes, two groups | Chi-square test (expected count ≥ 5 per cell) or Fisher's exact test (small samples) | Fisher's is always valid; chi-square requires adequate expected cell counts |
| Time-to-event / survival data | Kaplan-Meier curves + log-rank test; Cox proportional hazards for multivariable | Cox assumes proportional hazards; test this assumption |
| Correlation between two continuous variables | Pearson (normal) or Spearman (non-normal or ordinal) | Spearman tests monotonic relationship, not just linear |
| Predicting an outcome from multiple variables | Linear regression (continuous), logistic regression (binary), Poisson (counts) | Check assumptions: linearity, independence, homoscedasticity, normality of residuals |
| Repeated measurements over time | Repeated-measures ANOVA or linear mixed-effects models | Mixed models handle missing data better and don't require sphericity |
When uncertain about which test to use, consult a biostatistician before collecting data. Choosing the wrong test after data collection creates problems that are hard to fix retrospectively.
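The first two rows of the table can be sketched as a small decision in code. This is an illustrative sketch using scipy (one of the Python packages mentioned below); the data values are invented.

```python
# Sketch: check assumptions, then choose unpaired t-test vs Mann-Whitney U.
# Data values are invented for illustration.
from scipy import stats

control = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2, 3.7]
treated = [5.0, 4.8, 5.6, 5.2, 4.9, 5.4, 5.1, 4.7]

# Shapiro-Wilk: p > 0.05 means no evidence against normality
norm_ok = all(stats.shapiro(g).pvalue > 0.05 for g in (control, treated))
# Levene's test for equal variances decides plain vs Welch t-test
equal_var = stats.levene(control, treated).pvalue > 0.05

if norm_ok:
    res = stats.ttest_ind(control, treated, equal_var=equal_var)
else:
    res = stats.mannwhitneyu(control, treated)
print(f"p = {res.pvalue:.4f}")
```

Note that with n = 8 per group, Shapiro-Wilk has little power to detect non-normality, which is exactly the small-sample caveat discussed under common mistakes below.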
Sample Size and Power Analysis
Power analysis should happen before data collection, not after. A study that's underpowered wastes time, money, and animal lives. A study that's overpowered wastes resources that could go to other experiments.
Effect size
The smallest difference that would be biologically or clinically meaningful. Don't use Cohen's "small/medium/large" conventions for biomedical research. Those are social science defaults and don't translate to molecular biology or clinical work.
Significance level (alpha)
Typically 0.05 (two-sided). Some fields use 0.01 for genome-wide studies. The alpha sets your false positive rate.
Desired power (1 - beta)
Typically 0.80 (80% probability of detecting a real effect). Some grant agencies want 0.90 for clinical trials.
Variability (SD)
Get this from prior studies or pilot data. If you're guessing the SD, your power calculation is also a guess.
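The four inputs above map directly onto a sample-size calculation. A minimal sketch using statsmodels, with a placeholder effect size and SD (in a real study these come from your pilot data, not from this example):

```python
# Sketch: sample size per group for a two-group comparison.
# meaningful_diff and sd are illustrative placeholders, not recommendations.
from statsmodels.stats.power import TTestIndPower

meaningful_diff = 5.0   # smallest biologically meaningful difference
sd = 8.0                # variability estimate from pilot data or prior studies
effect_size = meaningful_diff / sd  # standardized effect (Cohen's d ≈ 0.625)

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"n per group: {n_per_group:.1f}")  # round UP to the next whole animal/patient
```

G*Power performs the same calculation through a GUI; using both as a cross-check is a reasonable habit.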
Common Statistical Mistakes Reviewers Catch
These eight mistakes account for the majority of statistical concerns raised in peer review of biomedical manuscripts.
1. Multiple comparisons without correction
Running 20 t-tests without adjusting inflates your false positive rate to ~64%. Use Bonferroni (conservative), Benjamini-Hochberg FDR (less conservative), or pre-specify your primary endpoint and treat the rest as exploratory.
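Both corrections are one function call in statsmodels. A sketch with invented p-values, showing that Benjamini-Hochberg typically retains more discoveries than Bonferroni:

```python
# Sketch: Bonferroni vs Benjamini-Hochberg FDR on a set of raw p-values.
# The raw p-values are invented for illustration.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.020, 0.041, 0.300]

for method in ("bonferroni", "fdr_bh"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in adj_p], reject.tolist())
```

With these values Bonferroni rejects two nulls while FDR rejects three; the raw p = 0.041 survives neither correction, which is why pre-specifying a primary endpoint matters.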
2. Technical replicates counted as biological replicates
Three wells from the same mouse are NOT n = 3. They're one biological replicate measured three times. n refers to independent biological units: separate animals, separate patients, separate experiments done on different days. This is the single most common statistical error in preclinical biology papers.
3. Parametric tests on small non-normal samples
With n < 15-20 per group, normality is hard to verify and violations matter more. Use non-parametric tests or report results both ways. If results agree, your conclusions are robust regardless of the distribution.
4. No effect sizes reported
A p-value of 0.001 tells you an effect exists. It doesn't tell you the effect is meaningful. Always report the effect size (mean difference, odds ratio, hazard ratio, Cohen's d) with 95% confidence intervals. Reviewers increasingly flag papers that report only p-values.
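A sketch of reporting an effect size with its confidence interval alongside the p-value, using scipy and invented group data:

```python
# Sketch: mean difference with 95% CI and Cohen's d, not just a p-value.
# Group data are invented for illustration.
import numpy as np
from scipy import stats

a = np.array([12.1, 11.8, 13.0, 12.4, 11.5, 12.8, 12.2, 11.9])
b = np.array([13.4, 13.9, 14.2, 13.1, 14.0, 13.6, 13.8, 13.3])

diff = b.mean() - a.mean()
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))
d = diff / pooled_sd  # Cohen's d (standardized effect size)

# 95% CI on the raw mean difference via the t distribution
se = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))
dof = len(a) + len(b) - 2
ci = stats.t.interval(0.95, dof, loc=diff, scale=se)

p = stats.ttest_ind(a, b).pvalue
print(f"diff = {diff:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), d = {d:.2f}, p = {p:.4f}")
```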
5. ANOVA without post-hoc tests
A significant F-statistic means at least one group is different. It doesn't tell you which ones. Follow up with Tukey (all pairwise), Dunnett (vs. control), or Bonferroni-corrected t-tests depending on your question.
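The two-step logic looks like this in scipy (illustrative data; Tukey's HSD is used here, matching the all-pairwise case in the table above):

```python
# Sketch: one-way ANOVA, then Tukey's HSD to find WHICH groups differ.
# Group values are invented for illustration.
from scipy import stats

ctrl = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
low  = [5.6, 5.9, 5.7, 6.0, 5.8, 5.5]
high = [6.8, 7.1, 6.9, 7.3, 7.0, 6.7]

f, p = stats.f_oneway(ctrl, low, high)
print(f"ANOVA: F = {f:.1f}, p = {p:.2e}")  # only says SOME group differs

# Tukey HSD gives a p-value for every pair of groups
tukey = stats.tukey_hsd(ctrl, low, high)
print(tukey)
```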
6. Wrong test for paired data
Comparing treatment to baseline in the same group requires a paired test, not an unpaired one. Paired designs have more power because they control for between-subject variability. Using an unpaired test on paired data wastes statistical power and can miss real effects.
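A small numeric sketch of the power loss, with invented before/after data in which every subject improves slightly but baselines vary a lot between subjects:

```python
# Sketch: the same before/after data analyzed paired vs (wrongly) unpaired.
# Values are invented; each subject improves ~0.65 units, baselines vary widely.
from scipy import stats

before = [10.2, 15.1, 8.7, 12.9, 14.0, 9.5, 11.3, 13.2]
after  = [10.9, 15.8, 9.3, 13.4, 14.8, 10.1, 11.9, 13.9]

p_paired   = stats.ttest_rel(before, after).pvalue  # correct for this design
p_unpaired = stats.ttest_ind(before, after).pvalue  # ignores the pairing

print(f"paired p = {p_paired:.6f}, unpaired p = {p_unpaired:.3f}")
```

The paired test sees a consistent within-subject change; the unpaired test drowns it in between-subject variability and misses the effect entirely.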
7. Treating p = 0.05 as a binary threshold
A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Report exact p-values (p = 0.032, not p < 0.05) and emphasize confidence intervals and effect sizes. The "significance" threshold is a convention, not a law of nature.
8. Correlation described as causation
"X was associated with Y" is correct when your data shows a correlation. "X drove Y" or "X caused Y" requires experimental evidence of a causal mechanism. Observational data can suggest associations. It can't prove causation without additional study designs.
Figure-Level Mistakes Reviewers Notice Immediately
A lot of statistics criticism starts before a reviewer reaches the methods section. They see the figures first. If the figures look sloppy, the reader assumes the analysis may be sloppy too.
Reporting Statistics in Your Paper
For every statistical comparison, report: the test used, sample size per group, summary statistic (mean ± SD or SEM, defined explicitly), exact p-value, and effect size with confidence interval.
SD vs. SEM: know the difference
SD (standard deviation) describes how variable your data is. It's a property of the data itself.
SEM (standard error of the mean) describes how precisely you've estimated the mean. It shrinks as sample size increases (SEM = SD / √n), which means it can make data look less variable than it really is.
Using SEM to make error bars look smaller is a recognized problem. Nature, NEJM, Cell, and most clinical journals now recommend or require SD or 95% CI in figures. If you use SEM, state it explicitly and expect reviewer questions about why.
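The shrinkage is easy to see numerically. A one-liner sketch of SEM = SD / √n for a fixed SD:

```python
# Sketch: the same SD yields ever-smaller SEM as n grows (SEM = SD / sqrt(n)).
import math

sd = 4.0
for n in (5, 20, 100):
    sem = sd / math.sqrt(n)
    print(f"n = {n:3d}: SD = {sd:.2f}, SEM = {sem:.2f}")
```

The data are no less variable at n = 100 than at n = 5; only the precision of the mean estimate has improved, which is why SEM error bars understate biological spread.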
Software Tools
GraphPad Prism
Paid (free trial). Best for: cell biology, preclinical research. Good for t-tests, ANOVA, survival curves, nonlinear curve fitting. Interface-driven, no coding required. The default tool in many wet labs.
Limitation: Limited for complex models (mixed effects, multivariable regression). Expensive for individual licenses.
R
Free (open-source). Best for: complex statistical analysis and reproducible research; the industry standard. Key packages: ggplot2 (visualization), lme4 (mixed models), survival (Kaplan-Meier/Cox), tidyverse (data wrangling).
Limitation: Steep learning curve. Requires coding. But once you learn it, it handles everything.
SPSS
Paid (institutional). Best for: clinical research, social sciences, epidemiology. Menu-driven interface. Good for standard tests without coding.
Limitation: Less common in basic science. Limited flexibility compared to R. IBM licensing can be expensive.
Stata
Paid. Best for: epidemiology, public health, health economics. Excellent for logistic regression, survival analysis, survey-weighted analyses, panel data.
Limitation: Less intuitive than SPSS. Smaller user community than R.
Python (scipy, statsmodels)
Free (open-source). Best for: bioinformatics, computational biology, machine-learning pipelines. Growing rapidly in quantitative biology.
Limitation: Statistical ecosystem less mature than R for classical biostatistics. Fewer purpose-built packages for clinical trial analysis.
jamovi
Free (open-source). Best for: researchers who want GUI-driven analysis without the cost of SPSS. Point-and-click interface built on R; exports the underlying R code, so you can learn R alongside it.
Limitation: Fewer advanced features than R directly. Smaller community.
G*Power
Free. Best for: power analysis, where it is the standard tool. Covers most common test types. Available for Windows and Mac. Every grant application with a power calculation probably used this.
Limitation: Only does power analysis. For analysis itself, you need a different tool.
Frequently Asked Questions
Should I use mean ± SD or mean ± SEM in my figures?
Use SD. It tells readers how variable your data is, which is what they need to judge biological variability. SEM shrinks as your sample size grows, so it can make data look more consistent than it really is. Most high-impact journals now recommend or require SD or 95% CI in figures. If you choose SEM, define it explicitly in the figure legend and be ready for reviewers to ask why.
What's the difference between statistical significance and biological significance?
Statistical significance (p < 0.05) means an observed effect is unlikely to be due to chance alone. It says nothing about whether the effect matters. A trial with 100,000 participants might find a statistically significant 0.5 mmHg blood pressure reduction. Real, but no doctor would change their practice over it. Always report effect sizes with confidence intervals so readers can judge the magnitude, not just the existence, of an effect.
My data isn't normally distributed. Do I have to use non-parametric tests?
Not necessarily. With large samples (n > 30 per group), parametric tests hold up well against non-normality because of the central limit theorem. For smaller samples, use non-parametric alternatives (Mann-Whitney U, Kruskal-Wallis) or try a log transformation, which often works for the right-skewed distributions common in biological data. Report which assumption you tested and how in your methods section. Many reviewers will ask.
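A sketch of the log-transformation route from the answer above, with invented right-skewed values standing in for something like cytokine concentrations (the second group is simply the first scaled up, to keep the example deterministic):

```python
# Sketch: log-transforming right-skewed data before a parametric test.
# Values are invented; grp2 is grp1 scaled 3x to mimic a fold-change effect.
import numpy as np
from scipy import stats

grp1 = np.array([1.2, 0.8, 2.5, 1.1, 0.9, 3.8, 1.5, 1.0, 2.1, 0.7, 1.3, 6.2])
grp2 = 3.0 * grp1

# A fold-change effect is additive on the log scale, so logs suit the t-test
p_raw = stats.ttest_ind(grp1, grp2).pvalue
p_log = stats.ttest_ind(np.log(grp1), np.log(grp2)).pvalue
p_mwu = stats.mannwhitneyu(grp1, grp2).pvalue
print(f"raw t-test p = {p_raw:.4f}, log t-test p = {p_log:.4f}, "
      f"Mann-Whitney p = {p_mwu:.4f}")
```

Reporting which route you took, and why, is exactly the methods-section detail reviewers look for.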
References
- Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 4th ed. Oxford University Press, 2017. [oup.com]
- Krzywinski M, Altman N. Points of Significance (series). Nature Methods. 2013-2020. [nature.com]
- Altman DG, Bland JM. Statistics Notes. BMJ. 1994-present. [bmj.com]
- GraphPad Software. Prism Statistics Guide. Retrieved March 2026. [graphpad.com]
- Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods. 2007;39:175-191. [doi.org]
Ready to apply this to a real draft?
Move from reference guidance to a manuscript-specific check
Use the public submission-readiness path when you already have a manuscript and need a draft-specific signal, not just a general guide.
Best for researchers who want a fast readiness read before deciding whether to revise, retarget, or submit.
Related guides in this collection