Research Brief · March 2026 · v2.0

The verification gap in scientific publishing

Scientific publishing now faces an integrity problem that is operational, not only ethical. Paper mills, high-volume redundant publication, and fluent AI-generated text all exploit the same weakness: evidence is often judged after prose quality, not before.

Executive summary

  • Scientific fraud now includes organized, resilient entities that operate at industrial scale, rather than only isolated misconduct.[1]
  • Large-scale screening studies detect paper mill-like signatures across substantial portions of biomedical corpora, including cancer literature.[2]
  • Redundant publication appears to have increased sharply in the generative AI period, stressing existing editorial controls.[3]
  • Citation hallucination remains frequent in LLM-assisted scientific writing and can persist unless explicit verification workflows are applied.[4][5][8]
  • A practical response is a mandatory reference-integrity layer with defined failure thresholds and transparent uncertainty handling.[6][10]

1. Scope and research question

This brief asks a narrow question: what is the minimum verification architecture needed to reduce fabricated, unstable, or misleading citation chains before publication decisions are finalized?

We focus on three linked threat classes: paper mills, AI slop, and citation hallucination. We define AI slop as high-volume, low-rigor text that is publication-shaped but weakly grounded in verifiable evidence.

2. Evidence base and confidence grading

We prioritized peer-reviewed studies and editorials from journals focused on research integrity, medical publishing, and applied AI safety. We classified evidence as high confidence (large cross-sectional datasets or methodologically explicit studies), medium confidence (field reports and case studies), and exploratory (pilot or niche domain studies).

The strongest evidence in this brief comes from corpus-scale studies in PNAS, BMJ, and BMC Medicine.[1][2][3] Case reports and discipline-specific analyses are used to characterize failure modes, not to estimate universal prevalence.[6][11][12]

3. What changed: from episodic misconduct to system risk

Research fraud historically appeared as episodic anomalies. Recent evidence suggests a structural shift. The PNAS analysis on fraud-enabling entities describes a resilient ecosystem that adapts as journals harden controls.[1]

The BMJ machine-learning screening study in cancer literature pushes this further by showing that paper mill-like patterns can be detected at very large scale.[2] The implication is operational: integrity threats are no longer rare enough to depend on ad hoc reviewer intuition.

In parallel, BMC Medicine reports dramatic increases in redundant publication in the generative AI era.[3] Redundant claims are not always simple copy-paste events. They can appear as syntactic novelty with overlapping scientific assertions, which weakens the utility of text-similarity screening alone.

4. Threat taxonomy: paper mills, AI slop, and citation instability

Paper mills: organized workflows producing low-integrity manuscripts, often with templated structure, manipulated images, and synthetic references.[1][13][14]

AI slop: high-volume generated text where fluency exceeds evidential grounding. It may be non-malicious, but still risky when integrated into formal publication pipelines without verification.

Citation instability: references that are fabricated, mismatched, partially true, or semantically misused. This includes real citations attached to unsupported claims and non-existent citations that appear plausible.[4][5][8]

These categories overlap in output signatures: polished language, weak reproducibility context, and low traceability from claim to source.
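The citation-instability outcomes described above can be represented as a small classification. A hypothetical sketch in Python (all names are illustrative, not an existing schema):

```python
from dataclasses import dataclass
from enum import Enum, auto

class CitationStatus(Enum):
    """Outcome classes for a single reference check (names are illustrative)."""
    VERIFIED = auto()    # resolves, metadata matches, claim supported
    FABRICATED = auto()  # no record in any trusted index
    MISMATCHED = auto()  # record exists, but title/author/year disagree
    MISUSED = auto()     # real source, but does not support the in-text claim

@dataclass
class CitationCheck:
    reference_id: str
    status: CitationStatus
    note: str = ""

# Example: a real citation attached to an unsupported claim
check = CitationCheck("ref-07", CitationStatus.MISUSED,
                      "Cited trial reports no effect; manuscript claims benefit.")
```

Distinguishing MISUSED from FABRICATED matters operationally: the former passes any existence check and can only be caught by claim-linkage sampling.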

5. Pipeline failure modes by stage

Stage A, authoring: generated drafts can include fabricated references or stale evidence. Human review often focuses on readability first.

Stage B, submission triage: editorial checks emphasize formatting, scope fit, and novelty framing under time pressure.

Stage C, peer review: reviewers rarely have bandwidth to manually validate every citation chain. Emerging case reports suggest AI-generated review artifacts are already entering workflows.[6][7]

Stage D, post-publication correction: retraction and correction lag allows unstable claims to accumulate downstream citations before cleanup.

6. Evidence table: representative findings

Study | Design / Scope | Key finding
PNAS 2025[1] | System-level fraud network analysis | Fraud-enabling entities are large and adaptive
BMJ 2026[2] | ML screening, multi-million cancer corpus | Paper mill-like signatures detectable at scale
BMC Medicine 2025[3] | Trend analysis in the AI era | Redundant publication increased sharply
Cureus 2023[4] | Empirical evaluation of generated references | High fabricated/inaccurate citation rates
ESE 2025[5] | Editorial/policy analysis | Calls for full-text reference deposit and verification
RIPR 2025 case studies[6][7] | Workflow incident reports | AI artifacts can bypass social trust assumptions

7. Proposed minimum verification protocol

We propose a five-step protocol for manuscript-level reference integrity checks. The goal is not perfection; it is bounded risk before an editorial decision is made.

  1. Existence: every reference must resolve in at least one trusted index (Crossref, PubMed, OpenAlex, Semantic Scholar).
  2. Metadata coherence: title, author set, journal, and year must align. Significant mismatch yields an uncertainty flag.
  3. Identifier integrity: DOI and PMID resolution checks. Redirects or dead identifiers are logged.
  4. Claim linkage: sample-check whether the in-text claim is actually supported by the cited source, not merely topically related.
  5. Thresholding: define unresolved-reference tolerance. Above threshold, classify as elevated integrity risk and request revision before progression.

This protocol aligns with the broader retrieve-summarize-verify framing in medical information workflows.[10]
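Steps 1, 2, and 5 above can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the 0.9 title-similarity floor, and the 5% unresolved-reference tolerance are all illustrative choices, and a real existence check (step 1) would query Crossref, PubMed, or OpenAlex rather than run offline.

```python
from difflib import SequenceMatcher

def metadata_coherent(claimed: dict, indexed: dict,
                      title_sim_floor: float = 0.9) -> bool:
    """Step 2: title, first author, and year must align with the index record.
    Uses a fuzzy title match so trivial formatting differences do not flag."""
    title_sim = SequenceMatcher(None, claimed["title"].lower(),
                                indexed["title"].lower()).ratio()
    return (title_sim >= title_sim_floor
            and claimed["first_author"].lower() == indexed["first_author"].lower()
            and claimed["year"] == indexed["year"])

def classify_manuscript(unresolved: int, total: int,
                        tolerance: float = 0.05) -> str:
    """Step 5: above the unresolved-reference tolerance, classify the
    manuscript as elevated integrity risk and request revision."""
    if total == 0:
        return "no-references"
    return ("elevated-integrity-risk" if unresolved / total > tolerance
            else "within-tolerance")
```

For example, 3 unresolved references out of 40 (7.5%) exceeds a 5% tolerance and would trigger a revision request, while 1 out of 40 (2.5%) would not.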

8. Mini case studies

Case 1, generated references in medical content: empirical work found a high fraction of fabricated or inaccurate references in LLM-generated outputs, with plausible formatting masking invalid sources.[4]

Case 2, reviewer-side risk: research-integrity case studies describe AI-generated peer review experiences and false authorship incidents that expose identity and accountability gaps.[6][7]

Case 3, mitigation by retrieval and verification: domain-specific studies show that retrieval-augmented and bibliometrics-guided approaches can reduce hallucination and improve citation accuracy, though they do not fully remove interpretive error.[9][10][12]

9. Implementation roadmap for journals and institutions

Level 1 (baseline): mandatory AI-use disclosure and random citation audits.

Level 2 (operational): automated existence/metadata checks for all submissions.

Level 3 (assurance): claim-to-citation sampling and unresolved-reference thresholds tied to editorial decisions.

Level 4 (networked integrity): cross-journal fraud signal sharing, standardized integrity metrics, and retraction-lag monitoring.
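The random citation audits in Level 1 are more defensible when the sample is reproducible, so that an editor, an author, or a later auditor can reconstruct exactly which references were checked. One way to achieve this is to seed the sample with the manuscript identifier; a hypothetical sketch (the function name and seeding scheme are illustrative, not an existing tool):

```python
import hashlib
import random

def audit_sample(manuscript_id: str, references: list[str], k: int = 3) -> list[str]:
    """Pick k references for manual audit, seeded by the manuscript ID so the
    same manuscript always yields the same reproducible sample."""
    seed = int(hashlib.sha256(manuscript_id.encode()).hexdigest(), 16)
    k = min(k, len(references))
    return random.Random(seed).sample(references, k)
```

Because the seed is derived from the manuscript ID rather than a global RNG state, re-running the audit months later reproduces the original selection without storing it.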

10. Limits and non-claims

This brief does not claim that all AI-assisted writing is deceptive. It does not claim that reference verification can establish experimental validity, novelty, or causal inference quality.

It does claim something narrower: in the current publishing environment, reference-layer verification is a necessary control. Without it, polished text can outrun evidence quality.

11. Conclusion

Publishing integrity is now an architecture question. High-volume manuscript generation and industrial fraud both scale faster than manual trust-based checks. A journal can no longer assume that coherent prose implies reliable evidence.

The first practical correction is straightforward: promote reference verification from optional diligence to mandatory infrastructure.

References

  1. Richardson R, Hong SS, Byrne JA, Stoeger T, Nunes Amaral LA. The entities enabling scientific fraud at scale are large, resilient, and growing rapidly. PNAS (2025). DOI: 10.1073/pnas.2420092122
  2. Scancar B, Byrne JA, Causeur D, Barnett AG. Machine learning based screening of potential paper mill publications in cancer research. BMJ (2026). DOI: 10.1136/bmj-2025-087581
  3. Maupin D, Suchak T, Barnett A, Spick M. Dramatic increases in redundant publications in the Generative AI era. BMC Medicine (2025). DOI: 10.1186/s12916-025-04569-y
  4. Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus (2023). DOI: 10.7759/cureus.39238
  5. Glynn A. Guarding against artificial intelligence hallucinated citations: The case for full-text reference deposit. European Science Editing (2025). DOI: 10.3897/ese.2025.e153973
  6. Lo Vecchio N. Personal experience with AI-generated peer reviews: a case study. Research Integrity and Peer Review (2025). DOI: 10.1186/s41073-025-00161-3
  7. Spinellis D. False authorship: an explorative case study around an AI-generated article published under my name. Research Integrity and Peer Review (2025). DOI: 10.1186/s41073-025-00165-z
  8. Jain A, Nimonkar P, Jadhav P. Citation integrity in the age of AI. Journal of Cranio-Maxillofacial Surgery (2025). DOI: 10.1016/j.jcms.2025.08.004
  9. Hoshiar MH. Artificial intelligence reliability in implant dentistry. Journal of Prosthetic Dentistry (2025). DOI: 10.1016/j.prosdent.2025.11.005
  10. Jin Q, Leaman R, Lu Z. Retrieve, Summarize, and Verify: How Will ChatGPT Affect Information Seeking from the Medical Literature? JASN (2023). DOI: 10.1681/ASN.0000000000000166
  11. Kendall G, Teixeira da Silva JA. Risks of abuse of large language models in scientific publishing. Learned Publishing (2023). DOI: 10.1002/leap.1578
  12. Kurland DB et al. Augmenting Large Language Models With Automated, Bibliometrics-Powered Literature Search. Neurosurgery (2025). DOI: 10.1227/neu.0000000000003354
  13. Seifert R. How Naunyn-Schmiedeberg's Archives of Pharmacology deals with fraudulent papers from paper mills. (2021). DOI: 10.1007/s00210-021-02056-8
  14. van der Heyden MAG. The 1-h fraud detection challenge. Naunyn-Schmiedeberg's Archives of Pharmacology (2021). DOI: 10.1007/s00210-021-02120-3