Manuscript Preparation · 9 min read · Updated Apr 21, 2026

Statistical Mistakes That Get Papers Rejected (A Reviewer's Checklist)

Reviewers don't need a statistics PhD to spot these errors. Here are the 10 statistical mistakes that get papers rejected, and how to fix each one before you submit.

Assistant Professor, Cardiovascular & Metabolic Disease

Author context

Works across cardiovascular biology and metabolic disease, with expertise in navigating high-impact journal submission requirements for Circulation, JACC, and European Heart Journal.


Quick answer: The statistical mistakes papers are rejected for most often are not exotic. Reviewers usually flag underpowered designs, multiple comparisons without correction, pseudoreplication, missing effect sizes, and causal language that the design cannot support. Most of these problems are visible before peer review if the paper is checked with reporting discipline rather than optimism.

Statistical problems are the single fastest way to lose a reviewer's trust. Not because reviewers are stats pedants, but because statistical mistakes signal a lack of rigor. Once a reviewer spots one, every other claim gets more scrutiny.

The eLife paper (Makin & Orban de Xivry, 2019) found that in a simple 2x3x3 experimental design, the probability of finding at least one spurious significant result is 30%, even when no real effect exists. The same mistakes show up across every field. Fix them before you submit, and you remove the easiest reason for a reviewer to say no.
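The arithmetic behind that 30% figure is easy to check: a 2x3x3 factorial ANOVA tests seven effects (three main effects, three two-way interactions, one three-way interaction), each at a 5% false-positive rate under the null:

```python
# Probability of at least one spurious "significant" effect in a
# 2x3x3 factorial ANOVA tested at alpha = 0.05 when no real effect exists.
alpha = 0.05
n_tests = 7  # 3 main effects + 3 two-way interactions + 1 three-way interaction

p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {p_at_least_one:.3f}")  # ~0.302
```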

What reviewers do with common statistical mistakes

| Statistical mistake | What reviewers usually conclude | Fastest honest fix | Where it hurts most |
| --- | --- | --- | --- |
| No power logic or sample-size justification | The paper may be underpowered and unreliable | Add power reasoning or narrow the claim | Editorial triage and methods review |
| Multiple comparisons without correction | Significant results may be chance findings | Correct, disclose all tests, and relabel exploratory work | Reviewer trust in the results section |
| Pseudoreplication | The p-values are invalid at the unit of analysis used | Reanalyze at the real experimental unit or use mixed models | Biology, neuroscience, animal work |
| P-values without effect sizes or intervals | Magnitude and precision are unclear | Report effect sizes and confidence intervals | Clinical and translational papers |
| Causal wording from observational data | The interpretation outruns the design | Rewrite the claim in association language | Abstract, discussion, cover letter |

1. P-hacking and multiple comparisons without correction

What it looks like: You tested 20 different outcomes and reported the three that were statistically significant, without mentioning the other 17. Or you compared six treatment groups to each other (15 pairwise comparisons) and reported the p-values from individual t-tests.

A specific example: A study compares gene expression across four patient groups. Instead of an ANOVA, the authors run six pairwise t-tests and report two comparisons with p < 0.05. They don't apply Bonferroni correction. With six comparisons at α = 0.05, you'd expect about one "significant" result by chance alone.

Why reviewers flag it: Because it inflates false positives. If you run enough tests, something will be significant by accident. Every stats-literate reviewer knows this.

How to fix it: Use an appropriate correction (Bonferroni, Benjamini-Hochberg, Tukey's HSD). Report all outcomes in a supplementary table, not just the hits. Pre-register your primary outcomes if possible. For scale: twenty independent tests at alpha = 0.05 yield a 64% chance of at least one false positive.
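To see what correction does to a set of nominal p-values, here is a hand-rolled sketch of the Bonferroni and Benjamini-Hochberg adjustments (in practice you would use R's `p.adjust` or statsmodels' `multipletests`; the six p-values below are invented to mirror the six-pairwise-t-tests example above):

```python
# Sketch of two standard multiple-comparison adjustments.

def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: p_(rank) * m / rank, made monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Six pairwise t-tests, two nominally "significant" at 0.05:
raw = [0.012, 0.034, 0.21, 0.48, 0.66, 0.90]
print(bonferroni(raw))           # smallest becomes 0.072 -- no longer < 0.05
print(benjamini_hochberg(raw))   # smallest becomes 0.072 here too
```

After either adjustment, neither of the two nominal hits survives at 0.05, which is exactly the point: with six comparisons you expect roughly one chance finding.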

2. Small sample sizes with no power analysis

What it looks like: n=3 per group for an experiment that would need n=15 to detect a meaningful effect. No power calculation. No justification for the sample size.

A specific example: A drug reduces tumor volume by 20% compared to control. Sounds promising. But with n=4 mice per group, the confidence intervals span from -10% to +50%. The effect could easily be zero. A power analysis would show they'd need at least 12 animals per group to reliably detect a 20% difference.

Why reviewers flag it: Because underpowered studies produce unreliable results. A "significant" finding from a tiny sample is likely a false positive or a grossly overestimated effect size.

How to fix it: Do a power analysis before the experiment and report it in your methods. G*Power (free) or R's pwr package make this straightforward. If you couldn't achieve the ideal sample size, say so. Reviewers respect honesty about constraints far more than silence.
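If you don't have G*Power open, the logic can be sketched with the standard normal approximation for a two-sample comparison (the exact t-based answer from G*Power or R's pwr comes out slightly higher per group):

```python
# Back-of-envelope sample size per group for a two-sample t-test,
# using the normal approximation. d is Cohen's standardized effect size.
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.8))   # large effect: about 25 per group
print(n_per_group(0.5))   # medium effect: about 63 per group
```

Note how fast the requirement grows as the effect shrinks; this is why n=3 per group almost never survives a power check.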

3. Using parametric tests on non-normal data

What it looks like: Running a t-test or ANOVA on data that's clearly skewed, bimodal, or has massive outliers. No test for normality mentioned.

A specific example: Comparing hospital length-of-stay between two patient groups using a t-test. Length-of-stay data is almost never normally distributed. The median is 4 days, but the mean is 11. A t-test on these data compares means that don't represent either group well.

Why reviewers flag it: Parametric tests assume normal distribution. When that assumption is violated, the p-value can be meaningless.

How to fix it: Test for normality (Shapiro-Wilk for n < 50, Q-Q plots for visual assessment). If your data isn't normal, use non-parametric alternatives: Mann-Whitney U instead of a t-test, Kruskal-Wallis instead of ANOVA. Or log-transform skewed data. Report which test you used and why.
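In practice `scipy.stats.shapiro` and `scipy.stats.mannwhitneyu` handle this; as a sketch of what the Mann-Whitney U statistic actually counts (the length-of-stay numbers below are invented), U is just the number of cross-group pairs where one group wins, with ties scored as half:

```python
# What Mann-Whitney U counts: for every (x, y) pair across groups,
# score 1 if x > y and 0.5 on a tie. scipy.stats.mannwhitneyu computes
# this plus the p-value; the sketch shows only the statistic.

def mann_whitney_u(x, y):
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical length-of-stay data (days), skewed by a few long stays:
group_a = [2, 3, 3, 4, 5, 28]
group_b = [4, 5, 6, 7, 9, 35]
print(mann_whitney_u(group_a, group_b))  # small U: group A tends to be lower
```

Because U depends only on rank order, the 28- and 35-day outliers that would wreck a t-test's mean comparison carry no extra weight here.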

4. Confusing correlation with causation in the writing

What it looks like: The results show a correlation. The discussion says "causes" or "drives" or "leads to."

A specific example: "Higher coffee consumption was associated with lower mortality (r = -0.32, p < 0.01). These findings demonstrate that coffee consumption reduces mortality risk." No. People who drink more coffee might also exercise more or have higher income. The study measured both variables and found they moved together. That's it.

How to fix it: Use precise language. "Associated with," not "caused." "Correlated with," not "driven by." If you believe the relationship is causal, make the argument explicitly using evidence from the study design.

5. Cherry-picking subgroups after seeing results

What it looks like: The overall analysis showed no effect. But when the authors split by age, gender, and disease severity, they found a significant effect in women over 65 with severe disease. The paper focuses on this subgroup as if it were the planned analysis.

A specific example: A clinical trial shows no difference in the primary endpoint (p=0.34). The authors test the drug in 12 pre-specified and 8 post-hoc subgroups. They find a significant benefit in patients with baseline CRP > 10 (p=0.03). The abstract leads with this subgroup finding.

How to fix it: Pre-specify your subgroups and register them. If you find an interesting subgroup effect post-hoc, label it clearly: "In an exploratory analysis..." Don't bury the negative primary result.

6. Missing confidence intervals (reporting only p-values)

"Treatment improved survival (p=0.03)." Improved by how much? One day? One year? You can't tell.

A specific example: A cardiovascular trial reports that a new antiplatelet agent "reduced major adverse cardiac events (p=0.04)." But the absolute risk reduction was 0.3% (3.2% vs 3.5%). The number needed to treat is 333. Statistically significant, but clinically marginal.

Why reviewers flag it: A p-value tells you whether an effect is likely to be non-zero, not how large or precise it is. The ASA's 2016 statement on p-values makes this point explicitly: "Scientific conclusions should not be based only on whether a p-value passes a specific threshold."

How to fix it: Report effect sizes with 95% confidence intervals for every comparison. Include absolute numbers, not just relative measures. CONSORT requires both absolute and relative effect measures for clinical trials, and journals like PNAS now require effect sizes for all statistical comparisons.
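Using the trial numbers above, with arm sizes assumed for illustration (n = 30,000 per arm, chosen so the result lands near p = 0.04), the absolute effect and a Wald 95% confidence interval look like this:

```python
# Absolute risk reduction, number needed to treat, and a Wald 95% CI
# for the risk difference. Event rates are from the example in the text;
# the arm sizes are an assumption for illustration only.
import math
from statistics import NormalDist

n1 = n2 = 30_000
p_treat, p_ctrl = 0.032, 0.035

arr = p_ctrl - p_treat                 # absolute risk reduction
nnt = 1 / arr                          # number needed to treat
se = math.sqrt(p_treat * (1 - p_treat) / n1 + p_ctrl * (1 - p_ctrl) / n2)
z = NormalDist().inv_cdf(0.975)
ci = (arr - z * se, arr + z * se)

print(f"ARR = {arr:.3%}, NNT = {nnt:.0f}")
print(f"95% CI for risk difference: ({ci[0]:.3%}, {ci[1]:.3%})")
```

With these assumed arm sizes the interval runs from roughly 0.01% to 0.59%: statistically significant, but the reader can now see how small the benefit could plausibly be, which is exactly the information "p = 0.04" alone hides.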

7. Inappropriate use of bar graphs for continuous data

What it looks like: Bar graphs with error bars for continuous data with 5-20 data points per group.

A specific example: Two treatment groups, n=8 each. The bar graph shows nearly identical means with overlapping error bars. But the individual data reveal that Group B has a bimodal distribution: four complete responders and four non-responders. The bar graph hides the most interesting finding.

Why reviewers flag it: Weissgerber et al. (2015) in PLOS Biology showed that dramatically different distributions can produce identical bar graphs. Their paper has been cited over 3,000 times, and many journals updated their figure policies.

How to fix it: Use dot plots, box plots, or violin plots. Show individual data points. JBC, PLOS, and eLife now require individual data points for small n.

8. Survivorship bias in cohort selection

What it looks like: A study follows patients from diagnosis to outcome, but only includes patients who survived long enough to be enrolled. The sickest patients died before they could be included.

A specific example: A study examining five-year outcomes after chemotherapy requires participants to have completed at least two cycles. Patients who died after cycle one aren't in the data. The reported 70% five-year survival rate is inflated because the highest-risk patients were excluded by design.

Why reviewers flag it: This creates an artificially favorable picture of outcomes.

How to fix it: Use intention-to-treat analysis where appropriate. Draw a CONSORT flow diagram showing exactly how many patients were excluded at each stage and why. If your enrollment criteria exclude high-risk patients, run a sensitivity analysis showing how results change if they're included.

9. Not accounting for clustering in hierarchical data

What it looks like: A neuroscience paper reports n=200 neurons. But those came from 4 mice (50 per animal). The paper runs statistics as if each neuron is independent, but neurons within the same mouse are correlated.

A specific example: The effective sample size isn't 200; it's closer to 4. The intraclass correlation within a single animal might be 0.2 or higher, meaning the 200 "independent" observations carry the statistical weight of roughly 13-19 truly independent measurements. The paper's p-value of 0.001 recalculated at the animal level might be 0.15.

Why reviewers flag it: This is one of the most common errors in biology. The eLife review identified it as particularly widespread in neuroscience.

How to fix it: Analyze at the level of the independent unit, or use mixed-effects models with a random effect for animal/subject/cluster. Report both the number of independent units and total observations. ARRIVE 2.0 now requires explicit reporting of the experimental unit for each analysis.
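The deflation can be sketched with the standard design-effect formula, 1 + (m - 1) × ICC, where m is the cluster size (the ICC values below are illustrative):

```python
# Effective sample size under clustering, for the 200-neurons-from-4-mice
# example: 200 observations in clusters of 50 per animal.

def effective_n(n_total, cluster_size, icc):
    """Divide the nominal n by the design effect 1 + (m - 1) * ICC."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

n_total, cluster_size = 200, 50
for icc in (0.1, 0.2, 0.3):
    print(f"ICC = {icc}: effective n ~ {effective_n(n_total, cluster_size, icc):.1f}")
```

Even a modest within-animal correlation collapses 200 nominal observations to a couple of dozen effective ones, which is why the p-value changes so dramatically when the analysis is redone at the animal level.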

10. Misusing fold-change without absolute numbers

What it looks like: "Treatment increased expression 10-fold (p < 0.01)." Impressive, until you learn baseline expression was 0.001 units.

A specific example: A 15-fold increase from 0.2 pg/mL to 3.0 pg/mL sounds dramatic. But when the normal physiological range is 50-200 pg/mL, you're still in the noise floor. Conversely, a 1.3-fold increase in serum creatinine (from 1.0 to 1.3 mg/dL) is clinically meaningful: it indicates acute kidney injury under KDIGO criteria.

How to fix it: Always report absolute numbers alongside fold-changes. Include reference ranges when available. If your fold-change is large but the absolute values are near the assay's detection limit, acknowledge that limitation.
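A tiny sketch of the check worth doing before reporting a fold-change, using the cytokine numbers from the example above (the helper function name is mine):

```python
# Fold-change with context: is the post-treatment value even inside
# the physiological reference range? Values from the example in the text.

def describe_change(baseline, treated, ref_low, ref_high):
    """Return the fold-change and whether the treated value is in range."""
    fold = treated / baseline
    in_range = ref_low <= treated <= ref_high
    return fold, in_range

fold, in_range = describe_change(0.2, 3.0, ref_low=50, ref_high=200)
print(f"{fold:.0f}-fold increase; within physiological range: {in_range}")
# 15-fold, yet still far below the 50-200 pg/mL reference range
```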

Your pre-submission stats checklist

Before you submit, run through this:

Design and power:

  • Did you do a power analysis (and report it)?
  • Is your sample size justified?
  • Are your primary outcomes pre-specified?

Analysis:

  • Did you check distributional assumptions before choosing your test?
  • If you ran multiple comparisons, did you correct for them?
  • Did you account for clustering/nesting in hierarchical data?
  • Did you use the right level of analysis (independent experimental units, not pseudoreplicates)?

Reporting:

  • Did you report effect sizes with confidence intervals, not just p-values?
  • Did you include absolute numbers alongside fold-changes or percentages?
  • Did you show individual data points for small sample sizes?
  • Do numbers match across text, tables, and figures?
  • Are all subgroup analyses labeled as pre-specified or exploratory?

Interpretation:

  • Does your language match your evidence? (association ≠ causation)
  • Are your claims proportional to your data?
  • Have you acknowledged limitations from sample size, design, or statistical constraints?

Visualization:

  • Are you using dot plots or box plots instead of bar graphs for small n?
  • Do your figures show variability accurately (SD, not just SEM)?

The 10 Most Common Statistical Mistakes by Field

Statistical errors vary by discipline. Here's what reviewers flag most often in each area:

| Field | Most common mistake | Why it leads to rejection |
| --- | --- | --- |
| Clinical medicine | Missing power calculation | Reviewers can't assess whether the study was adequately powered for the primary endpoint |
| Epidemiology | Confounding not addressed | Observational studies without propensity matching or sensitivity analysis are considered unreliable |
| Cell biology | No quantification of imaging data | Western blots and microscopy without densitometry or cell counts are treated as anecdotal |
| Genomics | Multiple testing not corrected | Genome-wide analyses without Bonferroni or FDR correction inflate false discovery rates |
| Chemistry | Missing error bars or replicates | Single-run characterization data doesn't demonstrate reproducibility |
| Physics | Systematic uncertainty not separated | Lumping systematic and statistical errors together makes the result uninterpretable |
| Psychology | P-hacking or HARKing | Post-hoc hypothesis generation presented as a priori; reviewers detect this from the study design |
| Ecology | Pseudoreplication | Treating non-independent observations as independent inflates the apparent sample size and the false-positive rate |
| Machine learning | Train/test leakage | Using test data features during training produces artificially inflated performance metrics |
| Social sciences | Inappropriate causal claims from observational data | "X causes Y" when the design only supports "X is associated with Y" |

What we see in pre-submission review

In our pre-submission review work, the most damaging statistics problems are often the ones authors think are too small to matter. A paper may have respectable analyses overall, but one subgroup result is oversold, one figure legend hides the error-bar definition, or one animal study treats measurements from the same subject as independent. After that, reviewers start distrusting the rest of the paper.

According to ICMJE recommendations, authors should describe statistical methods in enough detail that a knowledgeable reader can verify the reported results. ARRIVE 2.0 also requires authors to state the experimental unit explicitly. Those requirements line up with what we see in rejections: papers do not only fail because the statistics are wrong, they fail because the manuscript does not make the statistical logic auditable.

The Bottom Line

Statistical errors are the most common fixable reason for desk rejection. Most don't require rerunning analyses; they require correct framing and transparent reporting. A manuscript readiness and journal-fit check flags the statistical signaling issues that editors and reviewers look for before you submit.

What to do in the next 48 hours

| Time window | Action | Why |
| --- | --- | --- |
| Next 24 hours | Mark every place where the claim is stronger than the design allows | This catches the fastest reviewer objection |
| Next 24 to 48 hours | Verify the experimental unit, corrections for multiple testing, and effect-size reporting | These are the mistakes papers are rejected for most often |
| Before submission | Rewrite the abstract and figure legends so the analysis is auditable | Reviewers trust clear reporting more than confident rhetoric |


Submit If / Think Twice If

Submit if:

  • the statistics section is mostly sound and you need a rejection-prevention checklist
  • you want to test whether the claims, plots, and reported tests line up
  • the study design is fixed and the remaining risk is analysis and reporting discipline

Think twice if:

  • the paper would need new data rather than cleaner analysis or wording
  • you still cannot identify the real experimental unit
  • the manuscript is using statistical polish to cover a weak design

Frequently asked questions

Which statistical mistakes most often get papers rejected?

The most common are: p-hacking and multiple comparisons without correction, small sample sizes with no power analysis, using parametric tests on non-normal data, confusing correlation with causation, cherry-picking subgroups, and reporting only p-values without confidence intervals or effect sizes.

Can a study with a small sample size still get published?

Yes. Do a power analysis before the experiment and report it in your methods. If you could not achieve the ideal sample size, say so and acknowledge the limitation. Reviewers respect honesty about constraints far more than silence.

Which statistical problem raises credibility concerns fastest?

Underpowered studies: studies that don't have enough participants to reliably detect the effect size they claim. Reviewers and editors at high-IF journals check sample-size justification early. An underpowered study that finds a significant result raises immediate credibility concerns.

Do I need a statistician to review my manuscript before submitting?

For clinical trials, epidemiological studies, and complex multivariable analyses, yes. For straightforward experimental work with standard statistical tests, a careful self-review against the relevant reporting guideline (CONSORT, STROBE, ARRIVE) is usually sufficient.

How can I protect my paper from p-hacking concerns?

Pre-register your primary endpoint and analysis plan if you haven't already, report all outcomes including non-significant ones, and use confidence intervals alongside p-values. If your analysis plan deviated from the original protocol, be transparent about it.

References

  1. Makin & Orban de Xivry, "Ten common statistical mistakes to watch out for," eLife (2019)
  2. Borg et al., "Ten Common Statistical Errors from All Phases of Research," PM&R (2020)
  3. Weissgerber et al., "Beyond Bar and Line Graphs," PLOS Biology (2015)
  4. ASA statement on statistical significance and p-values (2016)
  5. ICMJE Recommendations for statistical reporting
  6. EQUATOR Network reporting guidelines by study type
  7. CONSORT statement for randomized trials
  8. STROBE statement for observational studies
  9. ARRIVE 2.0 guidelines for animal research
  10. Common mistakes in biostatistics, Clinical Kidney Journal (2024)
