
Statistical Mistakes That Get Papers Rejected (A Reviewer's Checklist)

Assistant Professor, Cardiovascular & Metabolic Disease

Works across cardiovascular biology and metabolic disease, with expertise in navigating high-impact journal submission requirements for Circulation, JACC, and European Heart Journal.


Decision cue: If you need a yes/no submission call today, compare your draft with 3 recent accepted papers from this journal and only submit when scope, methods depth, and claim strength line up.

Related: How to choose a journal · How to avoid desk rejection · Pre-submission checklist

I've reviewed a lot of papers. And I can tell you: statistical problems are the single fastest way to lose a reviewer's trust.

Not because reviewers are stats pedants (some are, but most aren't). Because statistical mistakes usually signal something deeper. A lack of rigor in the analysis, or worse, a result that might not hold up. When I see a stats error in a manuscript, I start reading everything else more carefully. Every claim gets more scrutiny.

The thing is, most of these mistakes are fixable. They're not complex methodological failures. They're the same 10 or so errors that show up again and again, across every field I've reviewed for. Fix them before you submit, and you remove the easiest reason for a reviewer to say no.

Here they are, in roughly the order I notice them.

1. P-hacking and multiple comparisons without correction

What it looks like: You tested 20 different outcomes and reported the three that were statistically significant, without mentioning the other 17. Or you compared six treatment groups to each other (15 pairwise comparisons) and reported the p-values from individual t-tests.

A specific example: A study compares gene expression across four patient groups. Instead of an ANOVA, the authors run six pairwise t-tests and report two comparisons with p < 0.05. They don't apply Bonferroni correction. With six uncorrected comparisons at α = 0.05, there's roughly a one-in-four chance of at least one spurious "significant" result even when no real differences exist.

Why reviewers flag it: Because it inflates false positives. If you run enough tests, something will be significant by accident. Every stats-literate reviewer knows this.

How to fix it: Use an appropriate correction (Bonferroni, Benjamini-Hochberg, Tukey's HSD). If you tested many outcomes, report all of them in a supplementary table, not just the hits. Pre-register your primary outcomes if possible.
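If you work in Python, a minimal sketch of both corrections might look like this (statsmodels is assumed to be available, and the six p-values are invented for illustration):

    # Sketch: correcting a set of pairwise p-values for multiple comparisons.
    # The six raw p-values below are made up for illustration.
    from statsmodels.stats.multitest import multipletests

    raw_p = [0.012, 0.034, 0.21, 0.48, 0.66, 0.91]  # six pairwise t-tests

    # Bonferroni: conservative, controls the family-wise error rate.
    rej_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

    # Benjamini-Hochberg: controls the false discovery rate, less conservative.
    rej_bh, p_bh, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")

    for raw, b, bh in zip(raw_p, p_bonf, p_bh):
        print(f"raw p={raw:.3f}  Bonferroni={b:.3f}  BH={bh:.3f}")

Bonferroni is the stricter choice; Benjamini-Hochberg is usually the better fit when you tested many outcomes and want to control the proportion of false discoveries rather than the chance of any single one.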

2. Small sample sizes with no power analysis

What it looks like: n=3 per group for an experiment that would need n=15 to detect a meaningful effect. No power calculation. No justification for the sample size.

A specific example: A drug reduces tumor volume by 20% compared to control. Sounds promising. But with n=4 mice per group, the confidence intervals span from -10% to +50%. The effect could easily be zero. A power analysis would show they'd need at least 12 animals per group to reliably detect a 20% difference.

Why reviewers flag it: Because underpowered studies produce unreliable results. A "significant" finding from a tiny sample is likely a false positive or a grossly overestimated effect size.

How to fix it: Do a power analysis before the experiment. Report it in your methods. If you couldn't achieve the ideal sample size, say so and acknowledge the limitation. Reviewers respect honesty about constraints far more than silence.
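As a rough illustration, here's what an a priori calculation might look like in Python with statsmodels; the effect size and power target are assumptions you'd replace with estimates from pilot data or the literature:

    # Sketch: a priori power analysis for a two-group comparison.
    # Effect size and power target are illustrative, not taken from any real study.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    # Cohen's d ~ 1.1 might correspond to a 20% difference given the observed SD;
    # plug in your own estimate.
    n_per_group = analysis.solve_power(effect_size=1.1, alpha=0.05, power=0.8,
                                       alternative="two-sided")
    print(f"Required n per group: {n_per_group:.1f}")

G*Power or the pwr package in R will give you the same answer; what matters is that the calculation happens before the experiment and is reported in the methods.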

3. Using parametric tests on non-normal data

What it looks like: Running a t-test or ANOVA on data that's clearly skewed, bimodal, or has massive outliers. No test for normality mentioned.

A specific example: Comparing hospital length-of-stay between two patient groups using a t-test. Length-of-stay data is almost never normally distributed. The median is 4 days, but the mean is 11. A t-test on these data compares means that don't represent either group well.

Why reviewers flag it: Parametric tests assume approximately normally distributed data. When that assumption is badly violated, the p-value can be misleading.

How to fix it: Test for normality (Shapiro-Wilk, Q-Q plots). If your data isn't normal, use non-parametric alternatives: Mann-Whitney U instead of a t-test, Kruskal-Wallis instead of ANOVA. Or log-transform skewed data. Report which test you used and why.
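A minimal sketch with scipy, using invented numbers that mimic skewed length-of-stay data:

    # Sketch: check normality, then choose the test accordingly.
    # The two arrays are invented, right-skewed values in days.
    import numpy as np
    from scipy import stats

    group_a = np.array([2, 3, 3, 4, 4, 5, 6, 9, 14, 30])
    group_b = np.array([3, 4, 4, 5, 6, 7, 8, 12, 21, 45])

    # Shapiro-Wilk: a small p-value suggests departure from normality.
    print(stats.shapiro(group_a))
    print(stats.shapiro(group_b))

    # Skewed data: compare groups with Mann-Whitney U instead of a t-test.
    print(stats.mannwhitneyu(group_a, group_b, alternative="two-sided"))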

4. Confusing correlation with causation in the writing

What it looks like: The results show a correlation. The discussion says "causes" or "drives" or "leads to."

A specific example: "Higher coffee consumption was associated with lower mortality (r = -0.32, p < 0.01). These findings demonstrate that coffee consumption reduces mortality risk." No. People who drink more coffee might also exercise more, have higher income, or have other confounding factors. The study didn't manipulate coffee intake. It measured both variables and found they moved together.

Why reviewers flag it: Because this is Logic 101, and getting it wrong undermines every other claim in the paper.

How to fix it: Use precise language. "Associated with," not "caused." "Correlated with," not "driven by." If you believe the relationship is causal, make the argument explicitly using evidence from the study design.

5. Cherry-picking subgroups after seeing results

What it looks like: The overall analysis showed no effect. But when the authors split by age, gender, and disease severity, they found a significant effect in women over 65 with severe disease. The paper focuses on this subgroup as if it were the planned analysis.

A specific example: A clinical trial shows no difference in the primary endpoint (p=0.34). The authors then test the drug's effect in 12 pre-specified and 8 post-hoc subgroups. They find a significant benefit in patients with baseline CRP > 10 (p=0.03). The abstract leads with this subgroup finding.

Why reviewers flag it: Because with enough subgroups, you'll always find one that looks significant. This is just p-hacking in a trench coat.

How to fix it: Pre-specify your subgroups. Register them. If you find an interesting subgroup effect post-hoc, label it clearly: "In an exploratory analysis..." Don't bury the negative primary result.
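To see why exploratory subgroups are so seductive, here's a back-of-the-envelope calculation (assuming, for simplicity, 20 independent subgroup tests and no real effects anywhere):

    # Sketch: chance of at least one "significant" subgroup when every true effect is null.
    # Independence across 20 subgroup tests is a simplifying assumption.
    alpha, k = 0.05, 20
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"P(at least one p < 0.05 by chance) = {p_any_false_positive:.2f}")  # ~0.64

Real subgroups aren't independent, so the exact number varies, but the direction doesn't: the more slices you test, the more likely a spurious hit.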

6. Missing confidence intervals (reporting only p-values)

"Treatment improved survival (p=0.03)." Improved by how much? One day? One year? You can't tell. A p-value tells you whether an effect is likely to be non-zero, not how large or precise it's. A result can be statistically significant and practically meaningless (p < 0.001, effect size = 0.01). Report effect sizes with 95% confidence intervals for every comparison. The ASA's 2016 statement on p-values recommends this explicitly. Journals like PNAS now require effect sizes for all statistical comparisons.

7. Inappropriate use of bar graphs for continuous data

Bar graphs with error bars for continuous data with 5-20 data points per group hide everything interesting. Two treatment groups, n=8 each: the bar graph shows nearly identical means, but the individual data reveals Group B has a bimodal distribution (four responders, four non-responders). Use dot plots, box plots, or violin plots. Show individual data points. Many journals now explicitly require this for small n.
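A minimal plotting sketch, with invented data in which the second group is bimodal; a bar of the mean would hide that entirely:

    # Sketch: box plot with individual points overlaid, instead of a bar of the mean.
    # Both small groups are invented; group_b is deliberately bimodal.
    import numpy as np
    import matplotlib.pyplot as plt

    group_a = np.array([5.1, 5.4, 5.0, 5.6, 5.2, 5.3, 5.5, 5.1])
    group_b = np.array([3.0, 3.2, 3.1, 2.9, 7.4, 7.6, 7.2, 7.5])

    fig, ax = plt.subplots()
    ax.boxplot([group_a, group_b], showfliers=False)
    for i, data in enumerate([group_a, group_b], start=1):
        jitter = np.random.normal(i, 0.04, size=len(data))  # spread points horizontally
        ax.scatter(jitter, data, alpha=0.7)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Group A", "Group B"])
    ax.set_ylabel("Outcome (arbitrary units)")
    plt.show()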

8. Survivorship bias in cohort selection

A study follows patients from diagnosis to outcome, but only includes patients who survived long enough to be enrolled. The sickest patients died before they could be included. Example: a study examining five-year outcomes requires completing at least two chemotherapy cycles. Patients who died after cycle one aren't in the data. The reported 70% survival rate is inflated. Use intention-to-treat analysis where appropriate, draw a CONSORT flow diagram, and acknowledge survivorship bias in your limitations.

9. Not accounting for clustering in hierarchical data

A neuroscience paper reports n=200 neurons. But those came from 4 mice (50 per animal). The paper runs statistics as if each neuron is independent, but neurons within the same mouse are correlated. The effective sample size isn't 200, it's closer to 4. This is one of the most common errors in biology. PLOS ONE specifically checks for this in their rigor-only review model. Analyze at the level of the independent unit, or use mixed-effects models with a random effect for animal. Report both the number of independent units and total observations.
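A sketch of the mixed-model fix in Python; the file name and column names here are assumptions for illustration, not a prescribed format:

    # Sketch: random-intercept model so neurons from the same mouse aren't treated
    # as independent observations.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed layout: one row per neuron, with firing_rate, treatment, mouse_id.
    df = pd.read_csv("neurons.csv")

    # Fixed effect of treatment, random intercept per mouse.
    model = smf.mixedlm("firing_rate ~ treatment", data=df, groups=df["mouse_id"])
    result = model.fit()
    print(result.summary())

Whatever tool you use, the key sentence in the methods is the same: state the independent unit, and state how within-unit correlation was handled.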

10. Misusing fold-change without absolute numbers

"Treatment increased expression 10-fold (p < 0.01)." Impressive, until you learn baseline expression was 0.001 units. A 15-fold increase from 0.2 pg/mL to 3.0 pg/mL sounds dramatic, but when the normal range is 50-200 pg/mL, you're still deep in the noise floor. Always report absolute numbers alongside fold-changes. Include reference ranges when available. Let the reader judge whether the magnitude matters.

The AI paper problem

One more thing. AI-generated manuscripts are especially prone to statistical inconsistencies.

LLMs can produce text that sounds statistically literate without understanding what the numbers mean. I've seen AI-assisted drafts that report a p-value of 0.03 in the text and p < 0.001 in the corresponding table. Or a sample size in the methods that doesn't match the degrees of freedom in the results. A human who ran the analysis knows the numbers because they looked at the output. An AI is pattern-matching, and sometimes the patterns conflict.

If you're using AI to help draft your stats sections, check every number against your actual analysis output. Every single one.

Your pre-submission stats checklist

Before you submit, run through this:

Design and power:

  • Did you do a power analysis (and report it)?
  • Is your sample size justified?
  • Are your primary outcomes pre-specified?

Analysis:

  • Did you check distributional assumptions before choosing your test?
  • If you ran multiple comparisons, did you correct for them?
  • Did you account for clustering/nesting in hierarchical data?
  • Did you use the right level of analysis (independent experimental units, not pseudoreplicates)?

Reporting:

  • Did you report effect sizes with confidence intervals, not just p-values?
  • Did you include absolute numbers alongside fold-changes or percentages?
  • Did you show individual data points for small sample sizes?
  • Do numbers match across text, tables, and figures?
  • Are all subgroup analyses labeled as pre-specified or exploratory?

Interpretation:

  • Does your language match your evidence? (association ≠ causation)
  • Are your claims proportional to your data?
  • Have you acknowledged limitations from sample size, design, or statistical constraints?

Visualization:

  • Are you using dot plots or box plots instead of bar graphs for small n?
  • Can readers see the actual data, not just summaries?
  • Do your figures show variability accurately (SD, not just SEM)?

If you can check every box, your statistics section is solid enough that a reviewer won't trip over the easy stuff and start questioning the hard stuff.


Not sure if your statistics hold up? Our reviewers include scientists who've served on editorial boards and reviewed for journals across biomedicine, clinical research, and computational biology. We'll flag the problems reviewers would flag, before they see your paper.


The Bottom Line

Statistical errors are the most common fixable reason for desk rejection and peer review failure. Most of them don't require rerunning analyses; they require correct framing and transparent reporting. Our diagnostic flags the statistical issues that editors and reviewers look for before your manuscript leaves your hands.

Sources

  • Published editorial guidelines from high-impact journals
  • International Committee of Medical Journal Editors (ICMJE) reporting standards
  • CONSORT, PRISMA, STROBE, and ARRIVE reporting guidelines
  • Pre-Submission Checklist: a 25-point audit before you submit
