Statistical Mistakes That Get Papers Rejected (A Reviewer's Checklist)
Reviewers don't need a statistics PhD to spot these errors. Here are the 10 statistical mistakes that get papers rejected, and how to fix each one before you submit.
Assistant Professor, Cardiovascular & Metabolic Disease
Author context
Works across cardiovascular biology and metabolic disease, with expertise in navigating high-impact journal submission requirements for Circulation, JACC, and European Heart Journal.
Next step
Choose the next useful decision step first.
Use the guide or checklist that matches this page's intent before you ask for a manuscript-level diagnostic.
Quick answer: The statistical mistakes papers get rejected for most often are not exotic. Reviewers usually flag underpowered designs, multiple comparisons without correction, pseudoreplication, missing effect sizes, and causal language that the design cannot support. Most of these problems are visible before peer review if the paper is checked with reporting discipline rather than optimism.
Statistical problems are the single fastest way to lose a reviewer's trust. Not because reviewers are stats pedants, but because statistical mistakes signal a lack of rigor; once a reviewer spots one, every other claim gets more scrutiny.
The eLife paper (Makin & Orban de Xivry, 2019) found that in a simple 2x3x3 experimental design, the probability of finding at least one spurious significant result is 30%, even when no real effect exists. The same mistakes show up across every field. Fix them before you submit, and you remove the easiest reason for a reviewer to say no.
What reviewers do with common statistical mistakes
| Statistical mistake | What reviewers usually conclude | Fastest honest fix | Where it hurts most |
|---|---|---|---|
| No power logic or sample-size justification | the paper may be underpowered and unreliable | add power reasoning or narrow the claim | editorial triage and methods review |
| Multiple comparisons without correction | significant results may be chance findings | correct, disclose all tests, and relabel exploratory work | reviewer trust in the results section |
| Pseudoreplication | the p-values are invalid at the unit of analysis used | reanalyze at the real experimental unit or use mixed models | biology, neuroscience, animal work |
| P-values without effect sizes or intervals | magnitude and precision are unclear | report effect sizes and confidence intervals | clinical and translational papers |
| Causal wording from observational data | the interpretation outruns the design | rewrite the claim to association language | abstract, discussion, cover letter |
1. P-hacking and multiple comparisons without correction
What it looks like: You tested 20 different outcomes and reported the three that were statistically significant, without mentioning the other 17. Or you compared six treatment groups to each other (15 pairwise comparisons) and reported the p-values from individual t-tests.
A specific example: A study compares gene expression across four patient groups. Instead of an ANOVA, the authors run six pairwise t-tests and report two comparisons with p < 0.05. They don't apply Bonferroni correction. With six uncorrected comparisons at α = 0.05, there's roughly a 26% chance of at least one "significant" result arising by chance alone.
Why reviewers flag it: Because it inflates false positives. If you run enough tests, something will be significant by accident. Every stats-literate reviewer knows this.
How to fix it: Use an appropriate correction (Bonferroni, Benjamini-Hochberg, Tukey's HSD). Report all outcomes in a supplementary table, not just the hits. Pre-register your primary outcomes if possible. Twenty independent tests at alpha = 0.05 yield a 64% chance of at least one false positive.
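To see how fast the family-wise error rate grows, and what a Bonferroni cut-off looks like in practice, here is a minimal sketch in pure Python (the function names are illustrative, not from any particular library):

```python
# Family-wise error rate: probability of at least one false positive
# across k independent tests, each run at significance level alpha.
def familywise_error(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

# Bonferroni correction: compare each p-value to alpha / k.
def bonferroni_significant(pvals, alpha: float = 0.05):
    k = len(pvals)
    return [p < alpha / k for p in pvals]

print(round(familywise_error(6), 2))   # six pairwise t-tests: ~0.26
print(round(familywise_error(20), 2))  # twenty outcomes: ~0.64
print(bonferroni_significant([0.01, 0.2]))  # only 0.01 survives alpha/2
```

The same logic is available ready-made (e.g. `statsmodels.stats.multitest.multipletests` also implements Benjamini-Hochberg), but the arithmetic above is the whole story for Bonferroni.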
2. Small sample sizes with no power analysis
What it looks like: n=3 per group for an experiment that would need n=15 to detect a meaningful effect. No power calculation. No justification for the sample size.
A specific example: A drug reduces tumor volume by 20% compared to control. Sounds promising. But with n=4 mice per group, the confidence intervals span from -10% to +50%. The effect could easily be zero. A power analysis would show they'd need at least 12 animals per group to reliably detect a 20% difference.
Why reviewers flag it: Because underpowered studies produce unreliable results. A "significant" finding from a tiny sample is likely a false positive or a grossly overestimated effect size.
How to fix it: Do a power analysis before the experiment and report it in your methods. G*Power (free) or R's pwr package make this straightforward. If you couldn't achieve the ideal sample size, say so. Reviewers respect honesty about constraints far more than silence.
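For a quick sanity check without dedicated software, the standard normal-approximation formula for a two-sided two-sample comparison can be sketched in a few lines (an approximation only; G*Power or R's pwr will return slightly larger numbers because of the t-distribution correction):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n to detect standardized effect size d
    in a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.8))  # large effect: ~25 per group
print(n_per_group(0.5))  # medium effect: ~63 per group
```

Note how quickly the requirement grows as the effect shrinks; n=3 per group can only detect enormous effects reliably.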
3. Using parametric tests on non-normal data
What it looks like: Running a t-test or ANOVA on data that's clearly skewed, bimodal, or has massive outliers. No test for normality mentioned.
A specific example: Comparing hospital length-of-stay between two patient groups using a t-test. Length-of-stay data is almost never normally distributed. The median is 4 days, but the mean is 11. A t-test on these data compares means that don't represent either group well.
Why reviewers flag it: Parametric tests assume normal distribution. When that assumption is violated, the p-value can be meaningless.
How to fix it: Test for normality (Shapiro-Wilk for n < 50, Q-Q plots for visual assessment). If your data isn't normal, use non-parametric alternatives: Mann-Whitney U instead of a t-test, Kruskal-Wallis instead of ANOVA. Or log-transform skewed data. Report which test you used and why.
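The decision logic above, check normality first, then fall back to a rank-based test, might look like this with scipy (the length-of-stay numbers are simulated stand-ins, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical length-of-stay data: right-skewed, like most duration data.
group_a = rng.lognormal(mean=1.4, sigma=0.8, size=40)
group_b = rng.lognormal(mean=1.7, sigma=0.8, size=40)

# Step 1: check the normality assumption before reaching for a t-test.
_, p_norm = stats.shapiro(group_a)
if p_norm < 0.05:
    # Step 2: evidence of non-normality -> rank-based alternative.
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test_used = "Mann-Whitney U"
else:
    stat, p = stats.ttest_ind(group_a, group_b)
    test_used = "t-test"

print(f"{test_used}: p = {p:.4f} (Shapiro p = {p_norm:.4f})")
```

Whichever branch fires, report the test you used and why; reviewers object to unexplained test choices more than to either test itself.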
4. Confusing correlation with causation in the writing
What it looks like: The results show a correlation. The discussion says "causes" or "drives" or "leads to."
A specific example: "Higher coffee consumption was associated with lower mortality (r = -0.32, p < 0.01). These findings demonstrate that coffee consumption reduces mortality risk." No. People who drink more coffee might also exercise more or have higher income. The study measured both variables and found they moved together. That's it.
How to fix it: Use precise language. "Associated with," not "caused." "Correlated with," not "driven by." If you believe the relationship is causal, make the argument explicitly using evidence from the study design.
5. Cherry-picking subgroups after seeing results
What it looks like: The overall analysis showed no effect. But when the authors split by age, gender, and disease severity, they found a significant effect in women over 65 with severe disease. The paper focuses on this subgroup as if it were the planned analysis.
A specific example: A clinical trial shows no difference in the primary endpoint (p=0.34). The authors test the drug in 12 pre-specified and 8 post-hoc subgroups. They find a significant benefit in patients with baseline CRP > 10 (p=0.03). The abstract leads with this subgroup finding.
How to fix it: Pre-specify your subgroups and register them. If you find an interesting subgroup effect post-hoc, label it clearly: "In an exploratory analysis..." Don't bury the negative primary result.
6. Missing confidence intervals (reporting only p-values)
"Treatment improved survival (p=0.03)." Improved by how much? One day? One year? You can't tell.
A specific example: A cardiovascular trial reports that a new antiplatelet agent "reduced major adverse cardiac events (p=0.04)." But the absolute risk reduction was 0.3% (3.2% vs 3.5%). The number needed to treat is 333. Statistically significant, but clinically marginal.
Why reviewers flag it: A p-value tells you whether an effect is likely to be non-zero, not how large or precise it is. The ASA's 2016 statement on p-values makes this point explicitly: "Scientific conclusions should not be based only on whether a p-value passes a specific threshold."
How to fix it: Report effect sizes with 95% confidence intervals for every comparison. Include absolute numbers, not just relative measures. CONSORT requires both absolute and relative effect measures for clinical trials, and journals like PNAS now require effect sizes for all statistical comparisons.
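Computing the absolute risk reduction and number needed to treat from event counts is straightforward; this sketch uses a Wald interval and hypothetical arm sizes chosen to match the 3.2% vs 3.5% example:

```python
from statistics import NormalDist

def risk_difference_ci(e1: int, n1: int, e2: int, n2: int, conf: float = 0.95):
    """Absolute risk difference (p2 - p1) with a Wald confidence interval.
    e = events, n = group size. A rough sketch; use exact methods for small counts."""
    p1, p2 = e1 / n1, e2 / n2
    diff = p2 - p1
    se = ((p1 * (1 - p1)) / n1 + (p2 * (1 - p2)) / n2) ** 0.5
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical trial arms (n assumed) giving 3.2% vs 3.5% event rates.
arr, lo, hi = risk_difference_ci(e1=160, n1=5000, e2=175, n2=5000)
print(f"ARR = {arr:.3%}, 95% CI [{lo:.3%}, {hi:.3%}], NNT ~ {1/arr:.0f}")
```

With these assumed arm sizes the interval comfortably crosses zero, which is exactly the information "p=0.04" alone conceals.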
7. Inappropriate use of bar graphs for continuous data
What it looks like: Bar graphs with error bars for continuous data with 5-20 data points per group.
A specific example: Two treatment groups, n=8 each: the bar graph shows nearly identical means with overlapping error bars. But the individual data reveals Group B has a bimodal distribution, four complete responders and four non-responders. The bar graph hides the most interesting finding.
Why reviewers flag it: Weissgerber et al. (2015) in PLOS Biology showed that dramatically different distributions can produce identical bar graphs. Their paper has been cited over 3,000 times, and many journals updated their figure policies.
How to fix it: Use dot plots, box plots, or violin plots. Show individual data points. JBC, PLOS, and eLife now require individual data points for small n.
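A tiny numeric illustration of why bars mislead: two hypothetical groups with identical means, one unimodal and one bimodal, produce the same bar height while telling opposite stories:

```python
from statistics import mean

# Hypothetical response data (made-up numbers): same mean, different biology.
group_a = [4.6, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.4]  # unimodal around 5
group_b = [0.9, 1.0, 1.1, 1.2, 8.8, 8.9, 9.0, 9.1]  # 4 non-responders, 4 responders

print(mean(group_a), mean(group_b))  # identical means -> identical bars

# A crude bimodality check: the largest gap between sorted neighbours.
gaps = [b - a for a, b in zip(sorted(group_b), sorted(group_b)[1:])]
print(max(gaps))  # a gap of ~7.6 units that a bar graph never shows
```

Any plot that shows the individual points (dot, box, or violin) makes that gap impossible to miss.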
8. Survivorship bias in cohort selection
What it looks like: A study follows patients from diagnosis to outcome, but only includes patients who survived long enough to be enrolled. The sickest patients died before they could be included.
A specific example: A study examining five-year outcomes after chemotherapy requires participants to have completed at least two cycles. Patients who died after cycle one aren't in the data. The reported 70% five-year survival rate is inflated because the highest-risk patients were excluded by design.
Why reviewers flag it: This creates an artificially favorable picture of outcomes.
How to fix it: Use intention-to-treat analysis where appropriate. Draw a CONSORT flow diagram showing exactly how many patients were excluded at each stage and why. If your enrollment criteria exclude high-risk patients, run a sensitivity analysis showing how results change if they're included.
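A quick simulation makes the inflation concrete. This sketch uses exponential event times and a hypothetical 6-month enrollment requirement standing in for "completed two cycles" (all numbers are illustrative):

```python
import random

random.seed(42)

# Hypothetical cohort: time to death in months, mean 40 months.
cohort = [random.expovariate(1 / 40) for _ in range(10_000)]

FIVE_YEARS = 60  # months
ENROLL_AT = 6    # must survive this long to enter the study

true_survival = sum(t > FIVE_YEARS for t in cohort) / len(cohort)
enrolled = [t for t in cohort if t > ENROLL_AT]  # early deaths silently excluded
observed_survival = sum(t > FIVE_YEARS for t in enrolled) / len(enrolled)

print(f"True 5-year survival:    {true_survival:.1%}")
print(f"Among enrolled patients: {observed_survival:.1%}")  # inflated
```

Even this mild 6-month gate inflates the estimate; longer run-in requirements or sicker cohorts inflate it much more.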
9. Not accounting for clustering in hierarchical data
What it looks like: A neuroscience paper reports n=200 neurons. But those came from 4 mice (50 per animal). The paper runs statistics as if each neuron is independent, but neurons within the same mouse are correlated.
Why it matters: The effective sample size isn't 200. With an intraclass correlation of 0.2-0.3 within each animal, the 200 "independent" observations carry the statistical weight of roughly 13-19 truly independent measurements, not far above the 4 animals themselves. The paper's p-value of 0.001, recalculated at the animal level, might be 0.15.
Why reviewers flag it: This is one of the most common errors in biology. The eLife review identified it as particularly widespread in neuroscience.
How to fix it: Analyze at the level of the independent unit, or use mixed-effects models with a random effect for animal/subject/cluster. Report both the number of independent units and total observations. ARRIVE 2.0 now requires explicit reporting of the experimental unit for each analysis.
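The effective sample size under clustering follows the design-effect formula n_eff = n / (1 + (m - 1) * ICC). A short sketch with the neuron example's numbers and a range of hypothetical ICC values:

```python
def effective_sample_size(n_total: int, cluster_size: int, icc: float) -> float:
    """Effective n after the design effect: n / (1 + (m - 1) * ICC),
    where m is observations per cluster and ICC is the intraclass correlation."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

# 200 neurons from 4 mice (50 per animal), hypothetical ICC values.
for icc in (0.1, 0.2, 0.3):
    print(icc, round(effective_sample_size(200, 50, icc), 1))
```

Even a modest ICC of 0.1 cuts the effective n from 200 to about 34; at 0.3 it falls below 13, which is why the mouse-level analysis gives such a different p-value.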
10. Misusing fold-change without absolute numbers
What it looks like: "Treatment increased expression 10-fold (p < 0.01)." Impressive, until you learn baseline expression was 0.001 units.
A specific example: A 15-fold increase from 0.2 pg/mL to 3.0 pg/mL sounds dramatic. But when the normal physiological range is 50-200 pg/mL, you're still in the noise floor. Conversely, a 1.3-fold increase in serum creatinine (from 1.0 to 1.3 mg/dL) is clinically meaningful: a 0.3 mg/dL rise meets KDIGO criteria for acute kidney injury.
How to fix it: Always report absolute numbers alongside fold-changes. Include reference ranges when available. If your fold-change is large but the absolute values are near the assay's detection limit, acknowledge that limitation.
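One way to enforce this in your own reporting is a small helper that refuses to state a fold-change without the absolute values (illustrative code, not from any package):

```python
def report_change(baseline: float, treated: float, unit: str, ref_range=None) -> str:
    """Format a fold-change alongside its absolute values and, when known,
    the reference range, so the magnitude can't be quoted out of context."""
    fold = treated / baseline
    msg = f"{fold:.1f}-fold change ({baseline} -> {treated} {unit})"
    if ref_range:
        msg += f"; reference range {ref_range[0]}-{ref_range[1]} {unit}"
    return msg

print(report_change(0.2, 3.0, "pg/mL", ref_range=(50, 200)))  # big fold, tiny absolute
print(report_change(1.0, 1.3, "mg/dL"))                       # small fold, meaningful
```

Reading the two outputs side by side makes the point: the 15-fold change is still far below the reference range, while the 1.3-fold change is an absolute shift a clinician would act on.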
Your pre-submission stats checklist
Before you submit, run through this:
Design and power:
- Did you do a power analysis (and report it)?
- Is your sample size justified?
- Are your primary outcomes pre-specified?
Analysis:
- Did you check distributional assumptions before choosing your test?
- If you ran multiple comparisons, did you correct for them?
- Did you account for clustering/nesting in hierarchical data?
- Did you use the right level of analysis (independent experimental units, not pseudoreplicates)?
Reporting:
- Did you report effect sizes with confidence intervals, not just p-values?
- Did you include absolute numbers alongside fold-changes or percentages?
- Did you show individual data points for small sample sizes?
- Do numbers match across text, tables, and figures?
- Are all subgroup analyses labeled as pre-specified or exploratory?
Interpretation:
- Does your language match your evidence? (association ≠ causation)
- Are your claims proportional to your data?
- Have you acknowledged limitations from sample size, design, or statistical constraints?
Visualization:
- Are you using dot plots or box plots instead of bar graphs for small n?
- Do your figures show variability accurately (SD, not just SEM)?
The 10 Most Common Statistical Mistakes by Field
Statistical errors vary by discipline. Here's what reviewers flag most often in each area:
| Field | Most common mistake | Why it leads to rejection |
|---|---|---|
| Clinical medicine | Missing power calculation | Reviewers can't assess whether the study was adequately powered for the primary endpoint |
| Epidemiology | Confounding not addressed | Observational studies without propensity matching or sensitivity analysis are considered unreliable |
| Cell biology | No quantification of imaging data | Western blots and microscopy without densitometry or cell counts are treated as anecdotal |
| Genomics | Multiple testing not corrected | Genome-wide analyses without Bonferroni or FDR correction inflate false discovery rates |
| Chemistry | Missing error bars or replicates | Single-run characterization data doesn't demonstrate reproducibility |
| Physics | Systematic uncertainty not separated | Lumping systematic and statistical errors together makes the result uninterpretable |
| Psychology | P-hacking or HARKing | Post-hoc hypothesis generation presented as a priori; reviewers detect this from the study design |
| Ecology | Pseudoreplication | Treating non-independent observations as independent inflates the false-positive rate |
| Machine learning | Train/test leakage | Using test data features during training produces artificially inflated performance metrics |
| Social sciences | Inappropriate causal claims from observational data | "X causes Y" when the design only supports "X is associated with Y" |
What we see in pre-submission review
In our pre-submission review work, the most damaging statistics problems are often the ones authors think are too small to matter. A paper may have respectable analyses overall, but one subgroup result is oversold, one figure legend hides the error-bar definition, or one animal study treats measurements from the same subject as independent. After that, reviewers start distrusting the rest of the paper.
According to ICMJE recommendations, authors should describe statistical methods in enough detail that a knowledgeable reader can verify the reported results. ARRIVE 2.0 also requires authors to state the experimental unit explicitly. Those requirements line up with what we see in rejections: papers do not only fail because the statistics are wrong, they fail because the manuscript does not make the statistical logic auditable.
The Bottom Line
Statistical errors are the most common fixable reason for desk rejection. Most don't require rerunning analyses; they require correct framing and transparent reporting. A manuscript readiness and journal-fit check flags the statistical signaling issues that editors and reviewers look for before you submit.
What to do in the next 48 hours
| Time window | Action | Why |
|---|---|---|
| Next 24 hours | mark every place where the claim is stronger than the design allows | this catches the fastest reviewer objection |
| Next 24 to 48 hours | verify experimental unit, corrections for multiple testing, and effect-size reporting | these are the mistakes papers are rejected for most often |
| Before submission | rewrite the abstract and figure legends so the analysis is auditable | reviewers trust clear reporting more than confident rhetoric |
Readiness check
Run the scan while the topic is in front of you.
See score, top issues, and journal-fit signals before you submit.
Submit If / Think Twice If
Submit if:
- the statistics section is mostly sound and you need a rejection-prevention checklist
- you want to test whether the claims, plots, and reported tests line up
- the study design is fixed and the remaining risk is analysis and reporting discipline
Think twice if:
- the paper would need new data rather than cleaner analysis or wording
- you still cannot identify the real experimental unit
- the manuscript is using statistical polish to cover a weak design
Frequently asked questions

What are the most common statistical mistakes that get papers rejected?
The most common are: p-hacking and multiple comparisons without correction, small sample sizes with no power analysis, using parametric tests on non-normal data, confusing correlation with causation, cherry-picking subgroups, and reporting only p-values without confidence intervals or effect sizes.

Can sample-size concerns be addressed before submission?
Yes. Do a power analysis before the experiment and report it in your methods. If you could not achieve the ideal sample size, say so and acknowledge the limitation. Reviewers respect honesty about constraints far more than silence.

What gets flagged fastest at high-impact journals?
Underpowered studies: studies that don't have enough participants to reliably detect the effect size they claim. Reviewers and editors at high-IF journals check sample size justification early. An underpowered study that finds a significant result raises immediate credibility concerns.

Do I need a statistician to review my paper before submission?
For clinical trials, epidemiological studies, and complex multivariable analyses, yes. For straightforward experimental work with standard statistical tests, a careful self-review against the relevant reporting guideline (CONSORT, STROBE, ARRIVE) is usually sufficient.

How do I protect my paper against p-hacking concerns?
Pre-register your primary endpoint and analysis plan if you haven't already, report all outcomes including non-significant ones, and use confidence intervals alongside p-values. If your analysis plan deviated from the original protocol, be transparent about it.
Sources
- Makin & Orban de Xivry, "Ten common statistical mistakes to watch out for," eLife (2019)
- Borg et al., "Ten Common Statistical Errors from All Phases of Research," PM&R (2020)
- Weissgerber et al., "Beyond Bar and Line Graphs," PLOS Biology (2015)
- ASA statement on statistical significance and p-values (2016)
- ICMJE Recommendations for statistical reporting
- EQUATOR Network reporting guidelines by study type
- CONSORT statement for randomized trials
- STROBE statement for observational studies
- ARRIVE 2.0 guidelines for animal research
- Common mistakes in biostatistics, Clinical Kidney Journal (2024)
Reference library
Use the core publishing datasets alongside this guide
This article answers one part of the publishing decision. The reference library covers the recurring questions that usually come next: whether the package is ready, what drives desk rejection, how journals compare, and what the submission requirements look like across journals.
Checklist system / operational asset
Elite Submission Checklist
A flagship pre-submission checklist that turns journal-fit, desk-reject, and package-quality lessons into one operational final-pass audit.
Flagship report / decision support
Desk Rejection Report
A canonical desk-rejection report that organizes the most common editorial failure modes, what they look like, and how to prevent them.
Dataset / reference hub
Journal Intelligence Dataset
A canonical journal dataset that combines selectivity posture, review timing, submission requirements, and Manusights fit signals in one citeable reference asset.
Dataset / reference guide
Peer Review Timelines by Journal
Reference-grade journal timeline data that authors, labs, and writing centers can cite when discussing realistic review timing.
Before you upload
Move from this article into the next decision-support step. The scan works best once the journal and submission plan are clearer.
Use the scan once the manuscript and target journal are concrete enough to evaluate.
Anthropic Privacy Partner. Zero-retention manuscript processing.