How Manusights Audit works

Full algorithm disclosure for the four consistency checks Audit runs against a paste of a Results section: statcheck-equivalent p-recompute, GRIM, GRIMMER, and DEBIT. Every check ties back to a citable published algorithm. Recompute math is closed-form and runs in pure code — no LLM in the verdict path.

Last reviewed: April 2026 · Audit v1.0

Architecture: LLM extraction + deterministic recompute

Audit splits the pipeline into two stages by design:

  1. Extraction (LLM): Claude Haiku 4.5 reads the pasted Results section and emits structured JSON via a tool-use schema — one record per reported NHST claim, descriptive triple, and binary proportion. Haiku is used because regex-only extraction (the original statcheck approach) has ~60% recall on real psychology papers per Nuijten et al. (2016). Modern manuscripts use LaTeX, Markdown, mixed prose conventions, and copy-pasted Word tables that break regex.
  2. Recompute (pure TypeScript): all consistency math runs in deterministic code. No LLM in the verdict path. Mathematical certainty is the point.
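The extraction stage's output can be sketched as a TypeScript type. This is an illustrative shape only — the field names (`testType`, `df1`, `decimalsReported`, etc.) are hypothetical, not the production tool-use schema:

```typescript
// Hypothetical shape of one extracted NHST record (illustrative field names,
// not the actual production schema).
type TestType = "t" | "F" | "chi2" | "r" | "z";

interface NhstClaim {
  testType: TestType;
  statistic: number;          // reported test statistic value
  df1?: number;               // t/chi2/r df, or F numerator df
  df2?: number;               // F denominator df
  reportedP: number;
  decimalsReported: number;   // decimals of the reported p; drives the tolerance
  oneTailed: boolean;         // paper explicitly states directionality
  correctionApplied: boolean; // Bonferroni/FDR/Holm noted near the reported p
}

// Example extraction for a claim like "t(28) = 2.20, p = .036"
const claim: NhstClaim = {
  testType: "t",
  statistic: 2.2,
  df1: 28,
  reportedP: 0.036,
  decimalsReported: 3,
  oneTailed: false,
  correctionApplied: false,
};
```

Keeping the record this explicit is what lets the second stage stay deterministic: every field the recompute needs is pinned down before any math runs.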

The split exists because reviewers and meta-analysts who use these checks specifically want closed-form math. “Claude says your p is wrong” is not defensible in peer review; “the recomputed p from the reported t and df is 0.078, not 0.048” is.

p-value recompute (statcheck-equivalent)

Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts (2016)

What it checks

For every reported NHST test (t, F, χ², r, z) with degrees of freedom and a reported p-value, Audit recomputes the two-tailed p from the test-statistic CDF and compares it to the reported value. Reported and recomputed are considered consistent when they agree within a rounding tolerance of 0.5 × 10^(-decimalsReported).
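The tolerance rule above reduces to a one-line comparison. A minimal sketch (illustrative, not the production code):

```typescript
// Statcheck-style consistency: reported and recomputed p agree when the
// recomputed value falls within half a unit of the last reported decimal.
function pConsistent(
  reportedP: number,
  recomputedP: number,
  decimalsReported: number,
): boolean {
  const tol = 0.5 * Math.pow(10, -decimalsReported);
  return Math.abs(recomputedP - reportedP) <= tol + 1e-12; // epsilon guards float noise
}
```

For example, a reported p = .048 against a recomputed 0.078 fails at three decimals (tolerance 0.0005), while a reported p = .05 against a recomputed 0.0512 passes at two decimals (tolerance 0.005).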

How the math is computed

  • t with df: two-tailed p = I_x(df/2, 1/2) where x = df / (df + t²) and I_x is the regularized incomplete beta function (Numerical Recipes 6.4 continued fraction).
  • F with df1, df2: p = I_x(df2/2, df1/2) where x = df2 / (df2 + df1 · F).
  • χ² with df: p = 1 − P(df/2, χ²/2) where P is the regularized lower incomplete gamma function (Numerical Recipes 6.2).
  • r with df: converted to t = r · sqrt(df / (1 − r²)), then evaluated as a t test with the same df.
  • z: two-tailed p = 2 · (1 − Φ(|z|)), where Φ uses the Abramowitz & Stegun 7.1.26 erf approximation.
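The t case above can be sketched end to end. This is an illustrative reimplementation of the Numerical Recipes–style regularized incomplete beta named in the text, not the production code:

```typescript
// Lanczos approximation to ln Gamma(x) (Numerical Recipes 6.1).
function gammln(xx: number): number {
  const cof = [
    76.18009172947146, -86.50532032941677, 24.01409824083091,
    -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5,
  ];
  let y = xx;
  let tmp = xx + 5.5;
  tmp -= (xx + 0.5) * Math.log(tmp);
  let ser = 1.000000000190015;
  for (let j = 0; j < 6; j++) ser += cof[j] / ++y;
  return -tmp + Math.log((2.5066282746310005 * ser) / xx);
}

// Continued-fraction core of the incomplete beta (NR 6.4, modified Lentz).
function betacf(a: number, b: number, x: number): number {
  const MAXIT = 200, EPS = 3e-12, FPMIN = 1e-30;
  const qab = a + b, qap = a + 1, qam = a - 1;
  let c = 1;
  let d = 1 - (qab * x) / qap;
  if (Math.abs(d) < FPMIN) d = FPMIN;
  d = 1 / d;
  let h = d;
  for (let m = 1; m <= MAXIT; m++) {
    const m2 = 2 * m;
    let aa = (m * (b - m) * x) / ((qam + m2) * (a + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d; h *= d * c;
    aa = (-(a + m) * (qab + m) * x) / ((a + m2) * (qap + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    const del = d * c;
    h *= del;
    if (Math.abs(del - 1) < EPS) break;
  }
  return h;
}

// Regularized incomplete beta I_x(a, b), with the symmetry switch for stability.
function betai(a: number, b: number, x: number): number {
  if (x <= 0) return 0;
  if (x >= 1) return 1;
  const bt = Math.exp(
    gammln(a + b) - gammln(a) - gammln(b) + a * Math.log(x) + b * Math.log(1 - x),
  );
  if (x < (a + 1) / (a + b + 2)) return (bt * betacf(a, b, x)) / a;
  return 1 - (bt * betacf(b, a, 1 - x)) / b;
}

// Two-tailed p for a t statistic: p = I_x(df/2, 1/2) with x = df / (df + t^2).
function twoTailedPFromT(t: number, df: number): number {
  const x = df / (df + t * t);
  return betai(df / 2, 0.5, x);
}
```

As a sanity check, `twoTailedPFromT(2.228, 10)` lands at ≈ 0.050, matching the textbook critical value t(10) = 2.228 at α = 0.05.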

The CDFs are accurate to within 1e−9 across the test-statistic ranges that occur in real manuscripts. The math is verified against textbook values in apps/web/scripts/test_audit_math.ts (13 deterministic test cases, all passing).

Severity classification

  • Decision-flipping. The recomputed p crosses the α = 0.05 boundary in the opposite direction from the reported value: the paper claims significance when the recompute says non-significant, or vice versa. This is the highest-priority bucket because it changes the substantive interpretation. Nuijten et al. (2016) found a decision-affecting (gross) inconsistency in roughly one in eight papers in a sample of 30,000+ psychology papers.
  • Inconsistency. Math doesn’t add up but significance call doesn’t flip.
  • Benign-likely. Two-tailed math disagrees but a one-tailed alternative or correction-applied alternative would resolve. Surfaces a “recompute as one-tailed” affordance instead of a hard flag.
  • Unverifiable. Inputs missing (e.g. df not reported), correction explicitly applied, or extraction ambiguous. Surfaced rather than silently dropped.
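The first two buckets above follow directly from the tolerance rule plus a significance-flip test. A minimal sketch (illustrative; α fixed at 0.05 as in the text, and the benign-likely/unverifiable branches omitted because they depend on extraction context):

```typescript
// Sketch of the decision-flip vs. plain-inconsistency split described above.
type Severity = "decision-flipping" | "inconsistency" | "consistent";

function classify(
  reportedP: number,
  recomputedP: number,
  decimalsReported: number,
): Severity {
  const tol = 0.5 * Math.pow(10, -decimalsReported);
  if (Math.abs(recomputedP - reportedP) <= tol + 1e-12) return "consistent";
  const alpha = 0.05;
  const flips = (reportedP < alpha) !== (recomputedP < alpha);
  return flips ? "decision-flipping" : "inconsistency";
}
```

For example, reported .048 vs. recomputed 0.078 is decision-flipping (the significance call reverses), while reported .020 vs. recomputed 0.035 is a plain inconsistency (both sides of the comparison stay significant).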

Suppression cases

  • Multiple-comparisons correction (Bonferroni / FDR / Holm). When the paper explicitly applies a correction near the reported p, recompute is suppressed and a “correction applied” note is surfaced. Correction math is family-specific and depends on what was corrected for; we don’t guess.
  • One-tailed test. Recomputed as one-tailed instead of suppressed, when the paper explicitly states directionality.

Reference

Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2

Nuijten, M. B., & Polanin, J. R. (2020). “statcheck”: Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods, 11(5), 574–579. https://doi.org/10.1002/jrsm.1408

GRIM (Granularity-Related Inconsistency of Means)

Brown & Heathers (2017)

What it checks

GRIM tests whether a reported mean of integer-bounded data is mathematically possible. For an integer-data sample of size N, the mean must equal k / N for some integer k in the legal scale range. Audit checks whether the reported mean is within rounding distance of any such legal value. When inconsistent, we surface the three nearest legal values for the “did you mean ___?” UX.
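The core of the check fits in a few lines. A minimal sketch (illustrative reimplementation, without the scale-range and nearest-legal-values handling):

```typescript
// GRIM sketch: a mean of N integers must equal k/N for some integer k,
// so the nearest achievable mean is round(mean * N) / N.
function grimConsistent(
  mean: number,
  n: number,
  decimalsReported: number,
): boolean {
  const tol = 0.5 * Math.pow(10, -decimalsReported);
  const k = Math.round(mean * n); // nearest achievable integer sum
  return Math.abs(k / n - mean) <= tol + 1e-12;
}
```

For example, with N = 25 a reported mean of 3.48 is achievable (87/25 exactly), while 3.47 is not — no integer sum over 25 values rounds to it at two decimals.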

When it applies

GRIM is informative for N ≤ 200 with integer-bounded scales — Likert items, count outcomes (correct trials, error rates as integer counts), and similar. Above N ≈ 200, the rounding band swallows the integer scale and almost any decimal value passes; in that range Audit surfaces “skipped” rather than risking a false positive.

GRIM does not apply to:

  • real-valued continuous measures (reaction time, weight, concentration)
  • percentages or rates expressed as decimals
  • composite scores (e.g. a sum of Likert items divided by 7), which Audit currently treats conservatively as integer-scale and may therefore flag false positives

The LLM extractor is instructed to set integerScale: false when in doubt, because false-positive integer-scale flags produce wrong GRIM verdicts and damage tool credibility.

Reference

Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science, 8(4), 363–369. https://doi.org/10.1177/1948550616673876

GRIMMER (Granularity-Related Inconsistency of Means Mapped to Error Repeats)

Anaya (2017); Heathers & Brown (2019)

What it checks

GRIMMER extends GRIM to the standard deviation. Given a GRIM-passing mean, an integer scale, and N, only certain SD values are mathematically achievable. The tightest bound comes from the case where all values cluster at the scale extremes; SDs above this maximum are impossible.
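The extreme-clustering bound above has a closed form: for values confined to [lo, hi] with mean m, the sum of squared deviations is maximized by putting every value at an endpoint, giving maxSD = sqrt((m − lo)(hi − m) · N / (N − 1)). A minimal sketch of the v1.0 upper-bound check (illustrative; `lo`/`hi` are the integer-scale endpoints):

```typescript
// Largest sample SD achievable for data in [lo, hi] with the given mean:
// all mass at the scale extremes.
function maxSampleSD(mean: number, n: number, lo: number, hi: number): number {
  return Math.sqrt(((mean - lo) * (hi - mean) * n) / (n - 1));
}

// Upper-bound GRIMMER check: flag SDs that exceed the theoretical maximum.
function sdWithinUpperBound(
  sd: number,
  mean: number,
  n: number,
  lo: number,
  hi: number,
  decimalsReported: number,
): boolean {
  const tol = 0.5 * Math.pow(10, -decimalsReported);
  return sd <= maxSampleSD(mean, n, lo, hi) + tol;
}
```

For example, on a 1–5 scale with mean 3 and N = 10, the maximum sample SD is sqrt(2 · 2 · 10/9) ≈ 2.11, so a reported SD of 2.50 is impossible while 1.05 is within bounds.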

v1.0 limitation: upper-bound check only

Audit v1.0 implements an upper-bound check: SDs that exceed the theoretical maximum given the integer scale and N are flagged. Anaya (2017)’s full integer-partition enumeration ships in a future update — it would additionally catch sub-maximal SDs that are nonetheless impossible (e.g. SDs achievable only with fractional counts of integer values).

Practical implication: flagged cases in v1.0 are correct (the math is one-sided), but some genuinely-impossible sub-maximal SDs may currently pass. We surface this limitation prominently in the result UI rather than silently under-flagging.

References

Anaya, J. (2017). The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints 5:e2400v1. https://doi.org/10.7287/peerj.preprints.2400v1

Heathers, J. A. J., & Brown, N. J. L. (2019). DEBIT: A simple consistency test for binary data. OSF Preprints. https://osf.io/5vb3u/ (covers GRIMMER context as part of the consistency-test family)

DEBIT (Descriptive Binary data Inconsistency Test)

Heathers, van der Zee, & Jung (2018); Heathers & Brown (2019)

What it checks

For binary data (0/1 outcomes), the proportion p and the standard deviation are linked: SD = sqrt(p × (1 − p) × N / (N − 1)). DEBIT tests whether a reported (proportion, SD, N) triple is mathematically consistent within rounding tolerance. When inconsistent, we surface the nearest legal proportion values.
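Since the SD is fully determined by p and N, the check is a direct comparison. A minimal sketch (illustrative, without the nearest-legal-proportion output):

```typescript
// For 0/1 data the sample SD follows from the proportion and N alone.
function debitSD(p: number, n: number): number {
  return Math.sqrt((p * (1 - p) * n) / (n - 1));
}

// DEBIT sketch: does the reported (p, SD, N) triple cohere within rounding?
function debitConsistent(
  p: number,
  sd: number,
  n: number,
  decimalsReported: number,
): boolean {
  const tol = 0.5 * Math.pow(10, -decimalsReported);
  return Math.abs(debitSD(p, n) - sd) <= tol + 1e-12;
}
```

For example, p = 0.5 with N = 100 implies SD ≈ 0.5025, so a reported SD of 0.50 at two decimals is consistent while 0.40 is not.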

When it applies

DEBIT applies only to genuinely binary outcomes — proportion correct, fraction male, fraction yes, etc. The LLM extractor is instructed to set isBinary: true only when the outcome is unambiguously 0/1. Continuous proportions (e.g. percentage of variance explained, fraction of a continuous quantity) are not subject to DEBIT.

References

Heathers, J. A. J. (2018). The DEBIT method: A tool for spotting reporting errors in studies of binary data. Medium (@jamesheathers); later formalized in Heathers & Brown (2019).

Heathers, J. A. J., van der Zee, T., & Jung, A. (2018). Method development for DEBIT (binary-data consistency test). OSF.

Heathers, J. A. J., & Brown, N. J. L. (2019). DEBIT: A simple consistency test for binary data. OSF Preprints. https://osf.io/5vb3u/

Data handling

  • Pasted text is never used to train any model. Anthropic’s API runs in zero-retention mode under our contract. Audit does not call OpenAI.
  • Cache TTL: 7 days. Results are cached server-side keyed by content hash so repeat pastes return instantly. The original Results paste is not stored alongside the cache key.
  • No account, no email gate. The tool works without sign-up. Rate limits are per-IP, not per-account.
  • Rate limits. 30 audits per hour per IP and 3 audits per browser per day (localStorage). Share URLs are not indexed (per-result snapshots, no SEO duplication).
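Content-hash keying means the cache can be consulted without retaining the paste. A hypothetical sketch of the idea (assuming Node's `crypto`; `cacheKey` is an illustrative name, not the production implementation):

```typescript
import { createHash } from "node:crypto";

// Derive a cache key from the pasted Results text: only this SHA-256 digest
// is stored alongside cached results, never the paste itself.
function cacheKey(pastedResults: string): string {
  return createHash("sha256").update(pastedResults, "utf8").digest("hex");
}
```

Identical pastes map to the same 64-hex-character key, so repeat lookups hit the cache, while the key reveals nothing about the original text.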

How to cite Manusights Audit

If you reference Audit results in a manuscript, methods section, or supplementary materials, please cite as:

Manusights. (2026). Audit v1.0: Stats Sanity Checker (statcheck-equivalent
  p-recompute + GRIM, GRIMMER, and DEBIT consistency checks) [Free academic tool].
  https://manusights.com/tools/stats-audit
  Methodology: https://manusights.com/tools/stats-audit/methodology
  (Accessed: YYYY-MM-DD)

When citing the underlying algorithms directly (which we recommend for methods-section disclosures), please credit the original authors of statcheck (Nuijten et al. 2016), GRIM (Brown & Heathers 2017), GRIMMER (Anaya 2017), and DEBIT (Heathers, van der Zee, & Jung 2018; Heathers & Brown 2019). See the About page for full credits.

Want manuscript-level statistical rigor, not paste-level? The full Manusights Readiness Scan reads your entire manuscript: it runs the same statcheck/GRIM/GRIMMER/DEBIT consistency suite and, alongside those arithmetic checks, flags missing power analyses, multiple-comparison gaps, methodology issues, and reviewer-flag patterns. Free preview, $29 only if you want the full report.

Run the full readiness scan