Validity in machine learning for extreme event attribution

Overall Feedback

NO-GO / revise before submission

If you read nothing else in this dossier, read this.

Verdict: NO-GO / revise before submission · Likely outcome: rapid editorial rejection at Nature Climate Change, most likely on scope-positioning grounds before full external review, with additional concern about methodological anchoring and reporting completeness

See §5.1 Editor's Letter for the specific critical issues and §5.17 Pre-Submission Reviewer Report for the full 12-issue audit.

Recommended target: Weather and Climate Extremes (fit 92/100 · ~56% accept) — after revision; see §5.2 cascade for fallback options.

How to read this dossier

This dossier is organized in the order you actually need it: decide, pack, prepare, fix, verify. Read top-to-bottom for the full picture, or jump to whichever section matches what you're doing right now.

§	Section	What it answers
1	Submission Decision	Should I submit, to which journal, and what's my fallback if rejected?...

Ready to paste into your submission

A Full Review tells you what to fix. Your Dossier also writes the submission materials for you: drafts you revise, not blank pages you start from.

Cover letter

Drafted to your target journal, addressing the editor’s findings.

Cassandra C. Chou
Department of Biostatistics, Johns Hopkins University
[Address]
cchou24@jh.edu

[Date]

The Editors
Nature Climate Change
Springer Nature, London

Dear Editors,

Why this work belongs in Nature Climate Change

Extreme event attribution now informs climate policy, adaptation planning and legal proceedings, where an unstable attribution estimate can have consequences beyond the scientific record. Our manuscript, "Validity in machine learning for extreme event attribution", shows that machine-learning-based attribution of California wildfires from 2003 to 2020 is vulnerable to three validity threats: sensitivity to algorithmic design choices, misleading model selection by common predictive metrics and degradation under climate-scenario distribution shift. We also propose a more robust workflow based on aggregate attribution estimates, mean calibration error and diagnostics for subgroup performance and propensity overlap. This matters to the broad Nature Climate Change readership spanning climate attribution, climate impacts, AI for climate science, policy analysis and climate law, because it addresses when fast, flexible machine-learning tools can be trusted for high-stakes attribution claims. The manuscript complements your recent Bevacqua et al. (2025) paper on interpreting a year above 1.5 °C in relation to the Paris Agreement by focusing on the statistical validity of claims that connect climate change to consequential events. It also complements Andre et al. (2024) on public support for climate action, since credible attribution evidence is one of the empirical inputs linking climate science to public and institutional decision-making.

The specific advance

We make four interconnected contributions, three empirical findings and one proposed workflow:

Individual-event estimates are unstable. In California wildfire data from 2003 to 2020, attribution estimates for single events varied substantially across machine-learning design choices. This instability is consequential because EEA compares observed climate conditions with counterfactual pre-industrial conditions, and individual-event estimates are often the quantities most visible in policy and legal contexts.
Prediction metrics do not guarantee attribution validity. In simulation analyses, area under the ROC curve and Brier score were not strongly correlated with attribution error. As a result, model selection based on familiar predictive-performance criteria can select models that are poorly suited for causal or counterfactual attribution.
Climate distribution shift degrades performance. Temperature changes across climate scenarios produced distribution shift that substantially reduced predictive performance. This identifies a climate-specific machine-learning validity problem: models can appear adequate under historical evaluation while failing under the counterfactual or projected conditions central to EEA.
A practical robustness workflow improves interpretability. We propose aggregate machine-learning attribution estimates, mean calibration error as an additional model-performance metric and subgroup plus propensity diagnostics to assess distribution shift. Together, these steps give attribution researchers a transparent framework for evaluating when machine-learning estimates are credible.

Submission declarations

The manuscript is not under review elsewhere.
All co-authors have approved this manuscript and agree with its submission.
Competing interests: The authors declare no competing interests.
Funding: [funders and grant numbers].
Ethics and IRB approval: Not applicable; the study used environmental and simulation data and did not involve human participants or animals.
Informed consent: Not applicable.
Data availability: The analyses use California wildfire and climate data from 2003 to 2020 as described in the manuscript; derived analysis data will be made available upon publication.
Code availability: Analysis code implementing the simulation analyses, aggregate attribution estimates, mean calibration error and distribution-shift diagnostics will be made available in a public repository upon publication.
Preprint disclosure: A preprint of this work is available at arXiv:2511.19039v1.

Thank you for considering our manuscript. We look forward to your editorial decision.

Sincerely,

Cassandra C. Chou
Department of Biostatistics, Johns Hopkins University

Title & abstract

Three ranked title options plus a paste-ready rewritten abstract.

Title Critique

Current title: Validity in machine learning for extreme event attribution

Specificity: 3/5 | Key-finding visibility: 2/5 | Length: too short

Issues:

The title identifies the broad topic, machine learning for extreme event attribution, but does not name the empirical setting, California wildfires from 2003-2020.
The title does not reveal the headline finding: event-level attribution is sensitive to algorithmic choices, standard predictive metrics can fail to track attribution error, and distribution shift degrades performance.
At 8 words, the title is concise but underspecified for Nature Climate Change, where titles often communicate a clear result or conceptual advance.
The phrase "Validity in" is abstract and less journal-ready than a title that states what validity threat was found or what methodological caution follows.
Jargon is mostly appropriate for the field, but "validity" is broad and should be anchored to specific threats or robustness in attribution.

Three Alternative Titles, Ranked by Predicted Impact

1. (highest impact)

Rationale: This title foregrounds the empirical setting and the two most actionable threats to validity identified in the study.

2. (high impact)

Rationale: This title emphasizes the counterintuitive methodological finding that good prediction metrics may not imply low attribution error.

3. (moderate impact)

Rationale: This is a conservative, accurate title that adds the case study while retaining the manuscript's validity framing.

Abstract Critique

Structure: The abstract follows an implicit Background, Methods, Results, Conclusion structure, but it would be clearer if the motivation, study design, three validity threats, and proposed diagnostics were separated more explicitly.

Key-finding clarity: The main findings are clearly enumerated, but they are not quantified. The abstract states sensitivity, weak metric correlation, and degraded performance under distribution shift without reporting effect sizes, error magnitudes, uncertainty intervals, or p-values.

Missing elements:

Quantitative effect sizes for sensitivity to algorithmic design choices, if available
Correlation values between area under the ROC curve, Brier score, and attribution error, if available
Magnitude of predictive degradation under temperature-driven distribution shift, if available
Sample size or number of wildfire observations analyzed, if available
Primary attribution outcome definition, if available
Uncertainty intervals or p-values for simulation analyses, if available
Clear statement of how the findings generalize beyond California wildfires

Word count: current ≈ 146, target ≈ 250

Named entities preserved in revision: Extreme event attribution (EEA), Machine learning, Simulation analyses, California wildfire data from 2003-2020, Individual event attribution estimates, Algorithmic design choices, Area under the ROC curve, Brier score, Attribution error, Distribution shift, Temperature across climate scenarios, Aggregate machine learning estimates, Mean calibration error, Subgroup diagnostics, Propensity diagnostics, Climate policy, Legal proceedings

Jargon-handling decisions:

Extreme event attribution → defined inline: The term is standard in climate science but was retained with a concise definition because readers across Nature Climate Change may include policy and legal audiences.
Distribution shift → defined inline: The term is common in machine learning but was defined as changes in temperature distributions across climate scenarios to clarify its meaning in this application.
Mean calibration error → kept as is: The term is a specific model-performance metric proposed by the manuscript and is interpretable in context as an added calibration diagnostic.
Propensity diagnostics → kept as is: The term is a specific diagnostic approach proposed by the manuscript and is paired with subgroup diagnostics to indicate its methodological role.

Revised Abstract (paste-ready)

Extreme event attribution (EEA), which estimates how climate change alters the probability of damaging weather events, is increasingly used to inform climate policy, adaptation planning, and legal proceedings. Machine learning could expand EEA by modelling rare events that are difficult or computationally intensive to represent with conventional simulation approaches, but its validity for attribution remains uncertain. Here we use machine learning and simulation analyses to evaluate EEA with California wildfire data from 2003-2020. We identify three threats to valid attribution. First, individual event attribution estimates are highly sensitive to algorithmic design choices, indicating that event-level conclusions can depend strongly on modelling decisions. Second, standard predictive performance metrics, including area under the ROC curve and Brier score, are not strongly correlated with attribution error, so models selected for prediction can produce suboptimal attribution estimates. Third, distribution shift, defined here as changes in temperature distributions across climate scenarios, substantially degrades predictive performance. To reduce these risks, we propose an attribution workflow based on aggregate machine learning estimates rather than isolated event-level estimates, mean calibration error as an additional performance metric, and subgroup and propensity diagnostics to assess distribution shift. These results show that machine learning can support EEA only when predictive accuracy is evaluated alongside attribution-specific validity checks, especially when models are transferred across climate scenarios.

Keywords

Issues:

No keywords were provided, which reduces discoverability across climate science, machine learning, wildfire, and attribution indexing systems.
The keyword set should include terms not already prominent in the title where possible, while still capturing the core methods, application, and policy relevance.
For Nature Climate Change, concise topical keywords are preferable to long phrases or manuscript-specific wording.

Suggested keywords (paste-ready):

Plain-language summary

Lay summary and significance statement, submission-ready.

Format notes: Default format used because no Nature Climate Change-specific requirements for a plain-language summary (PLS) or significance statement were provided. The significance statement is kept within the common PNAS-style 120-word cap.

Plain-Language Summary (149 words)

We asked whether machine learning can safely help judge how much climate change contributes to extreme events. This question matters for climate policy and court cases after disasters. We studied California wildfire data from 2003 to 2020 using machine learning and computer simulations. We found three main problems. First, estimates for a single fire changed when researchers adjusted the model’s “recipe,” such as its design choices. Second, common model report cards, including area under the ROC curve and Brier score, did not track attribution error well. This means a model could look strong but still give poor answers for blame-sharing. Third, models struggled when temperatures changed across climate scenarios, like a map failing when the road has shifted. We suggest using grouped, not single-event, estimates; adding mean calibration error to check whether predicted risks match reality; and using subgroup and propensity checks to spot climate mismatch. For decision-makers, this offers a safer way to use machine learning evidence.

Audience check: The summary uses short sentences, defined terms, and concrete metaphors suitable for an educated high-school reader.

Significance Statement (116 words)

Extreme event attribution (EEA) informs climate policy and legal proceedings, but machine learning applications to rare weather events face unresolved concerns about validity, bias, and robustness. We combine machine learning and simulation analyses of California wildfire data from 2003-2020 to evaluate when attribution estimates are reliable relative to traditional simulation-based approaches. We identify a shared validity problem: model performance does not necessarily imply attribution accuracy. Individual event attribution estimates are sensitive to algorithmic design choices; area under the ROC curve and Brier score correlate weakly with attribution error; and temperature-driven distribution shift across climate scenarios degrades prediction. Aggregate machine learning estimates, mean calibration error, and subgroup and propensity diagnostics support more reliable EEA for policy and legal use.

Warnings:

Significance statement is 116 words, close to the 120-word PNAS-style cap but within it.
No manuscript-body findings beyond the abstract were available in the supplied excerpt, so outputs are based on the abstract and visible introduction.

Response to reviewers

A pre-emptive reply to the objections reviewers will raise.

For each major or critical comment surfaced by the pre-submission reviewer report (§5.17), here is a draft response paragraph the author can adapt for the rebuttal letter when the journal returns a major-revision decision. Each response is grounded in the same verbatim manuscript evidence the reviewer cited.

Response to Comment 0: Simulation ground truth is bootstrapped from the same models being evaluated

Severity: critical | Stance: agrees and commits

Reviewer Comment:

Author Response:

We agree that using model-derived probabilities as the simulation target can overstate recoverability and risks a circular design. In the revised manuscript, we will add independent data-generating processes (DGPs), including a parametric logistic DGP with fixed coefficients and a second emulator-based DGP that is architecturally distinct from the evaluated learners. We will then re-estimate the comparative performance of calibration, discrimination, and Brier-based metrics under these alternative truths. This will allow us to test whether mean calibration error remains the most informative indicator of error in the fraction of attributable risk (FAR) when the target is not mechanically aligned with the candidate models.

Manuscript Change:

Methods and Supplementary Simulation section; new robustness simulations in Results

Response to Comment 1: Asymmetric FAR estimator mixes empirical frequencies with model predictions and departs from standard EEA practice

Severity: critical | Stance: partially agrees, revises scope

Reviewer Comment:

Author Response:

We appreciate this point and agree that the estimator should be made fully transparent. Our intent was to anchor the factual arm to the observed event frequency as a descriptive baseline, but we agree that this choice should not be conflated with a symmetric probabilistic FAR formulation. In the revision, we will report both the empirical-baseline estimator and the fully model-based symmetric FAR estimator, with the empirical proportion clearly labeled as a calibration reference for the factual arm. We will also quantify how the choice of estimator affects the multiplicity results and whether any conclusions are sensitive to this design decision.

Manuscript Change:

Methods section defining FAR; new sensitivity analysis in Results and Supplement

Response to Comment 2: Universal validity claims are supported by a single hazard, single region, single 18-year window

Severity: critical | Stance: agrees and commits

Reviewer Comment:

Author Response:

We agree that the current empirical study is specific to California wildfire attribution and should not be read as a universal proof across all hazards. Our broader claim is that three validity threats can arise in ML-based attribution workflows, not that their magnitude is identical across hazards. In the revision, we will narrow the framing to wildfire attribution and explicitly distinguish general methodological implications from hazard-specific empirical evidence. We will also add language noting that extension to heat waves or other hazards remains an important direction for future work.

Manuscript Change:

Title, Abstract, Introduction, and Conclusions

Response to Comment 3: Temperature subgroup analysis conflates training-data sparsity with covariate shift

Severity: major | Stance: agrees and commits

Reviewer Comment:

Author Response:

This is a valuable clarification, and we agree that the current analysis does not separate sparsity from distribution shift cleanly. We will report the number of observations in each temperature bin and add an analysis that controls for bin sample size when relating temperature to calibration error. In addition, we will include a subsampling experiment that equalizes bin sizes to assess whether the temperature effect persists after removing sparsity as a confounder. If the residual association remains, we will interpret it as evidence more consistent with covariate shift; otherwise we will revise the interpretation accordingly.

Manuscript Change:

Results subsection on temperature bins; new Supplementary analysis

Response to Comment 4: No code repository, environment specification, or accessible Supplement

Severity: major | Stance: agrees and commits

Reviewer Comment:

Author Response:

We agree that reproducibility needs to be improved substantially. We will release a version-controlled repository with all preprocessing, training, simulation, and bootstrap code, along with an archived snapshot and complete environment specification. We will also ensure that the full Supplementary Information, including propensity checks and weighted-metric analyses, is publicly accessible alongside the revised preprint and submission. These materials will allow independent readers to audit the pipeline end to end.

Manuscript Change:

Data and Code Availability statement; Supplementary Information availability; repository link in revised manuscript

Response to Comment 5: Hyperparameter tuning on the full present-day dataset risks leakage into counterfactual predictions

Severity: major | Stance: agrees and commits

Reviewer Comment:

Author Response:

We appreciate this concern and agree that the training and tuning protocol must be specified unambiguously. In the revision, we will provide explicit pseudocode detailing how folds are assigned for factual prediction, counterfactual prediction, and simulation, including the exact separation between tuning and evaluation. If any part of the current counterfactual analysis relied on full-dataset tuning, we will rerun it using nested cross-validation with hyperparameters selected only on training folds. We will then report whether the multiplicity statistics or model rankings change under the leakage-free protocol.

Manuscript Change:

Methods section and Supplementary pseudocode; revised sensitivity analysis if needed

Response to Comment 6: Mean calibration error is presented as proposed but is a long-standing forecasting metric

Severity: major | Stance: agrees and commits

Reviewer Comment:

Author Response:

We agree and thank the reviewer for pointing out that this metric has a long history in probabilistic forecasting and verification. In the revision, we will replace any wording that suggests novelty of the metric itself with language emphasizing our empirical recommendation of this metric for EEA applications. We will cite the calibration and proper-scoring-rule literature, including Murphy's decomposition and subsequent work on reliability and bias. We will also refocus our stated contribution on the finding that this simple calibration summary aligns more closely with FAR error than discrimination metrics in the settings we study.

Manuscript Change:

Introduction and Discussion; references added to forecasting and calibration literature

Response to Comment 7: No benchmark against physics-based or extreme-value EEA estimates for the same wildfire events

Severity: major | Stance: partially agrees, revises scope

Reviewer Comment:

Author Response:

We agree that an external benchmark would strengthen the context of the study and help readers interpret the scope of our findings. In the revision, we will add a comparison to at least one published non-ML attribution estimate or fire-weather-based analysis relevant to California wildfire risk, where such a benchmark is available and methodologically comparable. We will use this comparison to clarify whether the identified validity threats are unique to ML or reflect broader challenges in attribution workflows. If a fully matched benchmark is not available for the exact event definition, we will state that limitation explicitly and frame the comparison as contextual rather than definitive.

Manuscript Change:

Results and Discussion; new benchmark or contextual comparison subsection

Notes for the author

We have kept the responses concise, constructive, and journal-appropriate while committing to concrete revisions where feasible.

Data availability statement

A compliant statement you can drop straight into the manuscript.

Concerns to address before pasting:

(critical) No repository DOI or accession is provided for the processed datasets, simulation outputs, model predictions, and code supporting the study. Nature-family journals generally require stable public identifiers for deposited research data.
- Action: Deposit the full reproducibility package in Zenodo, Dryad, or Figshare before submission or revision, then replace the author-fill slot with the final DOI.
(major) The manuscript describes extensive machine-learning model training, cross-validation, simulation analyses, and metric calculations, but no code repository is identified. Reproducibility will be difficult without deposited scripts and software environment information.
- Action: Include all analysis scripts, model-training code, simulation code, figure-generation code, and an environment file or requirements file in the same DOI-linked repository.
(major) The Methods cite public source datasets, but the exact Brown et al. dataset location and any derived data products generated for this study are not specified in the excerpt. The data statement should not rely only on narrative source descriptions.
- Action: Add persistent links or citations for the Brown et al. dataset and include all derived analysis-ready tables and generated outputs in the deposited reproducibility package.

Reviewer findings

High severity.
#1 · Major Comments · §5.17
Simulation ground truth is bootstrapped from the same models being evaluated
We generate datasets by treating our predicted probabilities of extreme daily growth in the observed and counterfactual scenarios as the true probabilities of extreme daily growth.
I will grant that constructing a synthetic ground truth is necessary when true counterfactual probabilities are unobservable. But the authors have not avoided the most basic pitfall of such a design: they treat predicted probabilities from their own ML models as 'true' probabilities, then evaluate whether ML models can recover those truths. This is circular—the 'truth' inherits the biases, calibration patterns, and functional form of the generating models. The headline finding that aggregate FAR is recovered with median log RR error of 0.031 may be a near-tautology, and the correlation of r=0.87 between mean calibration error and FAR error may simply reflect that calibration of a model against itself is mechanically tight. This single design choice undermines the paper's central empirical claim about metric superiority.

Suggested action: Re-run the simulation experiments with at least one independent data-generating process: (a) a parametric logistic model with hand-specified coefficients, and (b) probabilities derived from a physics-based wildfire/climate emulator architecturally distinct from the five ML algorithms under test. Demonstrate that mean calibration error retains...
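
Option (a) can be pictured with a minimal sketch along the following lines. It draws synthetic fire-days from a logistic model with hand-specified coefficients, so the "true" probabilities never pass through any of the learners under test. The single temperature covariate, the coefficient values, the sample size, and the counterfactual temperature offset are all illustrative assumptions, not values from the manuscript.

```python
import math
import random


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_fire_days(n, intercept, temp_coef, temp_offset, seed):
    """Draw n synthetic fire-days from a hand-specified logistic DGP.

    Every parameter value passed in is an illustrative assumption.
    `temp_offset` shifts the standardized temperature anomaly to mimic
    a different climate scenario.
    Returns (true_probabilities, binary_outcomes).
    """
    rng = random.Random(seed)
    probs, outcomes = [], []
    for _ in range(n):
        temp = rng.gauss(0.0, 1.0) + temp_offset  # standardized temperature anomaly
        p = logistic(intercept + temp_coef * temp)
        probs.append(p)
        outcomes.append(1 if rng.random() < p else 0)
    return probs, outcomes


def mean(xs):
    return sum(xs) / len(xs)


# Factual world (no offset) and a cooler counterfactual world (offset of -0.5).
p_fact, y_fact = simulate_fire_days(5000, -4.0, 0.8, 0.0, seed=1)
p_cf, y_cf = simulate_fire_days(5000, -4.0, 0.8, -0.5, seed=2)

# Because the DGP is known, the true risk ratio is available as a benchmark
# against which any candidate learner's estimate can be scored.
true_rr = mean(p_fact) / mean(p_cf)
print(f"true factual risk = {mean(p_fact):.4f}, true RR = {true_rr:.2f}")
```

Candidate learners would then be fit to the simulated outcomes only, and their attribution error measured against `true_rr` rather than against their own predictions.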
High severity.
#2 · Major Comments · §5.17
Asymmetric FAR estimator mixes empirical frequencies with model predictions and departs from standard EEA practice
The fire-day average probability of extreme daily growth in the real world is calculated as the fraction of extreme daily growth days in our observed dataset: 380/17910.
The conventional probabilistic FAR (Stott et al., Allen) uses model-estimated probabilities for both the factual and counterfactual arms. Here the authors plug the raw empirical proportion 380/17910 into E[Y(0)] while using ML-predicted means for E[Y(1)]. This asymmetry means model error is propagated into only one arm of the ratio, structurally biasing FAR estimates and confounding the multiplicity analysis (since the factual baseline is artificially fixed across models). It also creates an internal inconsistency with Section 4.2.3, where 'truth' for the factual world is defined from model-predicted probabilities, not the empirical fraction.

Suggested action: Use model-predicted probabilities consistently in both arms of FAR, with the empirical proportion reported only as a calibration check on E[Y(0)]. Re-derive all RR/FAR estimates under the symmetric estimator and report how much of the apparent multiplicity in Section 2.1 is attributable to the asymmetric design vs. genuine model disagreement on counterfactual predictions.
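
To make the distinction concrete, the sketch below computes FAR both ways, using the standard definition FAR = 1 − p_counterfactual / p_factual. The empirical fraction 380/17910 comes from the manuscript quote above; the two sets of per-day predictions are made-up numbers, chosen so that the two hypothetical models differ only by a common calibration offset.

```python
# Observed fraction of extreme-growth fire-days quoted from the manuscript.
EMPIRICAL_FACTUAL = 380 / 17910


def mean(xs):
    return sum(xs) / len(xs)


def far_from_probs(p_factual, p_counterfactual):
    """FAR = 1 - p_counterfactual / p_factual (equivalently 1 - 1/RR)."""
    return 1.0 - p_counterfactual / p_factual


def symmetric_far(factual_preds, counterfactual_preds):
    """Both arms come from model predictions."""
    return far_from_probs(mean(factual_preds), mean(counterfactual_preds))


def asymmetric_far(counterfactual_preds, empirical=EMPIRICAL_FACTUAL):
    """Factual arm fixed at the empirical fraction; only the counterfactual arm is modelled."""
    return far_from_probs(empirical, mean(counterfactual_preds))


# Hypothetical predictions: model_b is model_a scaled by 1.2 in both worlds.
model_preds = {
    "model_a": {"factual": [0.010, 0.030, 0.020],
                "counterfactual": [0.0075, 0.0225, 0.0150]},
    "model_b": {"factual": [0.012, 0.036, 0.024],
                "counterfactual": [0.0090, 0.0270, 0.0180]},
}

results = {}
for name, preds in model_preds.items():
    results[name] = {
        "symmetric": symmetric_far(preds["factual"], preds["counterfactual"]),
        "asymmetric": asymmetric_far(preds["counterfactual"]),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

In this toy case the symmetric estimator gives both models the same FAR, while the asymmetric estimator turns their shared calibration offset into apparent disagreement. Reporting both estimators side by side, as suggested, would show how much of the multiplicity comes from the estimator design.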
High severity.
#3 · Major Comments · §5.17
Universal validity claims are supported by a single hazard, single region, single 18-year window
we only consider wildfires – we do not assess other types of extreme events. As such, we identify issues and solutions related broadly to concepts and metrics for machine learning methods, rather than seeking the best specific models, which will vary depending on context.
Rigor, Novelty, and Domain reviewers converged here, but the deepest framing is domain-specific: wildfires have spatial autocorrelation, fuel-load confounding, and fire-weather feedbacks that differ fundamentally from heat waves, tropical cyclones, or flooding—the very hazards the paper invokes in its related-work section to claim breadth. For Nature Climate Change, a paper titled 'Validity in machine learning for extreme event attribution' must demonstrate that the three identified threats (multiplicity, metric misalignment, distribution shift) are not artifacts of California wildfire data structure (e.g., the 2.1% base rate, the 380 positive days, the 2003–2020 fire regime).

Suggested action: Add at least one additional hazard case study—heat waves are the most tractable given existing EEA literature—and show whether the three validity threats manifest with comparable magnitude. Alternatively, retitle and reframe as 'Validity in machine learning for wildfire attribution' and explicitly scope all claims to fire-weather settings. The current framing oversells a case study as a general framework.
Medium severity.
#4 · Major Comments · §5.17
Temperature subgroup analysis conflates training-data sparsity with covariate shift
Predictive performance decreases with temperature in historically observed data, with average mean calibration error increasing from 0.000821 at the lowest temperatures to 0.0288 at the highest temperatures, corresponding to a relative increase of 3408%
This is a Rigor specialty issue the other personas did not catch. The 3,408% increase in mean calibration error from coolest to hottest temperature bin is presented as evidence that distribution shift degrades performance under warmer counterfactual climates. But the same pattern is fully predicted by training-data sparsity at high temperatures, which is unrelated to covariate shift in the EEA sense. The two mechanisms have opposing implications: sparsity is curable with more data; genuine non-stationarity is not. The causal claim driving Recommendation #3 of the paper rests on this confound.

Suggested action: Report n per temperature bin alongside calibration error. Fit a partial regression of calibration error on temperature, controlling for bin sample size, and a complementary simulation in which sample size is held constant across bins (e.g., by subsampling). Only the residual temperature effect after controlling for sparsity supports a covariate-shift interpretation.
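
The subsampling check could look roughly like the sketch below, which reports n and mean calibration error per temperature bin and then recomputes the error after shrinking every bin to the size of the smallest one. The bin edges, the synthetic records, and the constant prediction of 0.02 are illustrative assumptions. The sketch equalizes evaluation sample sizes only; to address training-data sparsity the same subsampling would have to be applied before model fitting.

```python
import random
from collections import defaultdict


def mean_calibration_error(preds, outcomes):
    """|mean predicted probability - observed event fraction|."""
    return abs(sum(preds) / len(preds) - sum(outcomes) / len(outcomes))


def bin_records(records, edges):
    """Group (temperature, pred, outcome) records into bins defined by ascending edges."""
    bins = defaultdict(list)
    for temp, pred, outcome in records:
        index = sum(temp >= edge for edge in edges)  # count of edges at or below temp
        bins[index].append((pred, outcome))
    return dict(bins)


def per_bin_error(bins):
    """Return {bin_index: (n, mean calibration error)} using all rows in each bin."""
    return {
        b: (len(rows), mean_calibration_error([p for p, _ in rows], [y for _, y in rows]))
        for b, rows in sorted(bins.items())
    }


def equalized_bin_error(bins, n_draws, seed):
    """Average error per bin after subsampling every bin to the smallest bin's size."""
    rng = random.Random(seed)
    size = min(len(rows) for rows in bins.values())
    result = {}
    for b, rows in sorted(bins.items()):
        errors = []
        for _ in range(n_draws):
            sample = rng.sample(rows, size)
            errors.append(
                mean_calibration_error([p for p, _ in sample], [y for _, y in sample])
            )
        result[b] = sum(errors) / n_draws
    return size, result


# Synthetic data: the same calibrated prediction everywhere, but far fewer hot days.
rng = random.Random(0)
records = []
for temp_center, n in [(15.0, 4000), (25.0, 1000), (35.0, 100)]:
    for _ in range(n):
        temp = rng.gauss(temp_center, 2.0)
        pred = 0.02
        records.append((temp, pred, 1 if rng.random() < pred else 0))

bins = bin_records(records, edges=[20.0, 30.0])
print("full bins (n, error):", per_bin_error(bins))
print("equalized (size, error):", equalized_bin_error(bins, n_draws=200, seed=1))
```

If the hot-bin error stays elevated once all bins are evaluated at the same size, small-sample noise in the absolute error is ruled out as the explanation for that part of the gap.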
Medium severity.
#5 · Major Comments · §5.17
No code repository, environment specification, or accessible Supplement
We train and tuned models using cross validation on the full present-day dataset to predict the probability of extreme daily growth in the pre-industrial and SSP5-8.5 scenarios. For each time period, we train our models using five different machine learning algorithms: LightGBM [36], Random Forest [37], XGBoost [38], logistic regression, and Elastic Net Regression [39].
Reproducibility and Rigor both flagged opacity in the pipeline, but the deepest reproducibility issue is the complete absence of executable artifacts: no GitHub link, no Zenodo DOI, no requirements.txt, no container, and the Supplement (which contains the propensity robustness checks, PPI details, and weighted-metric analyses repeatedly referenced) is not publicly available with the arXiv preprint. For a paper whose central claim is that ML-based EEA is vulnerable to cherry-picking, the inability of an independent party to re-execute the analysis is fatal to the recommendations' credibility in a litigation context.

Suggested action: Release a version-controlled repository (GitHub + archived Zenodo snapshot) containing: data preprocessing, all five model training scripts with fixed hyperparameter grids, the 300-replicate simulation, propensity model code, bootstrap routines with explicit random seeds, and an environment specification pinning package versions. Make the full Supplement publicly accessible at submission. Without this, the paper's prescriptions for 'responsible' EEA cannot themselves be audited.
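
As one small piece of such a release, a run manifest can record the seed and installed package versions next to every output file. The sketch below assumes a Python pipeline, which the manuscript excerpt does not confirm, and the package names and seed value are placeholders.

```python
import json
import platform
import random
import sys
from importlib import metadata


def build_manifest(seed, packages):
    """Collect the interpreter, platform, seed, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }


SEED = 20240101  # arbitrary illustrative seed
random.seed(SEED)

# Placeholder package list; replace with whatever the real pipeline imports.
manifest = build_manifest(SEED, ["lightgbm", "xgboost", "scikit-learn"])
print(json.dumps(manifest, indent=2))
```

A manifest like this complements a pinned environment file rather than replacing it, since it only records what was installed at run time.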
Medium severity.
#6 · Major Comments · §5.17
Hyperparameter tuning on the full present-day dataset risks leakage into counterfactual predictions
We train and tuned models using cross validation on the full present-day dataset to predict the probability of extreme daily growth in the pre-industrial and SSP5-8.5 scenarios.
Rigor specialty: the manuscript describes two seemingly contradictory training regimes—temporal 3-year-fold CV for out-of-sample factual prediction, and tuning 'on the full present-day dataset' for counterfactual prediction. If the latter genuinely uses all six time periods to select hyperparameters and then predicts on counterfactual data, model selection is contaminated by data used for evaluation elsewhere, and the multiplicity analysis (which compares across algorithms tuned this way) is biased toward agreement. Given the paper's thesis turns on what 'similar predictive performance' means, the tuning protocol is load-bearing.

Suggested action: Provide explicit pseudocode showing, at each stage (factual out-of-sample prediction, counterfactual prediction, simulation), exactly which folds are used for tuning, training, and evaluation. If full-dataset tuning was indeed used for counterfactual prediction, redo this analysis using nested CV with hyperparameters selected only on training folds, and report how multiplicity statistics change.
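
The fold bookkeeping for nested CV can be sketched without any modelling code. The example assumes six temporal folds, matching the six time periods mentioned above. The function name and dictionary keys are hypothetical. The point it illustrates is that the fold held out for evaluation never appears in any tuning split.

```python
def nested_temporal_splits(fold_ids):
    """Yield one plan per outer fold; tuning only ever sees the outer-training folds."""
    folds = sorted(set(fold_ids))
    for test_fold in folds:
        outer_train = [f for f in folds if f != test_fold]
        inner = [
            {
                "tune_train": [f for f in outer_train if f != val_fold],
                "tune_validate": val_fold,
            }
            for val_fold in outer_train
        ]
        yield {"evaluate": test_fold, "final_train": outer_train, "inner": inner}


# Six 3-year folds is an assumption based on the 2003-2020 window.
plan = list(nested_temporal_splits(range(6)))
for step in plan:
    print(f"evaluate on fold {step['evaluate']}, "
          f"tune with {len(step['inner'])} inner splits on folds {step['final_train']}")
```

A table generated from a plan like this, one row per outer fold, would answer the reviewer's question about which data touched which stage.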
Medium severity.
#7 · Major Comments · §5.17
Mean calibration error is presented as proposed but is a long-standing forecasting metric
we propose a fourth metric: mean calibration error. This is represented by the absolute value of the difference between the mean predicted probability and the true proportion of outcomes
Novelty specialty. The 'fourth metric' the authors 'propose'—|mean predicted probability − mean observed outcome|—is reliability/bias, the calibration-in-the-large component of Murphy's decomposition of the Brier score and a staple of probabilistic forecast verification (Gneiting & Raftery, DeGroot & Fienberg, Bröcker). Without crediting this lineage the contribution looks overstated, and reviewers familiar with calibration theory will note that the Brier skill score the authors dismiss already contains this term as a sub-component. The genuine contribution is the empirical demonstration that calibration-in-the-large dominates discrimination for FAR estimation under low-to-moderate distribution shift; that framing is both true and defensible.

Suggested action: Reframe from 'we propose' to 'we recommend, and empirically validate for EEA, mean calibration error (a.k.a. calibration-in-the-large / reliability bias)'. Cite Murphy's Brier decomposition, Gneiting & Raftery's proper scoring rules review, and the meteorological reliability literature. Then sharpen the genuinely novel claim: that calibration-in-the-large correlates with FAR error far more tightly than discrimination metrics under the conditions you specify.
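
To make the distinction between calibration-in-the-large and discrimination concrete, here is a small worked example using the definition quoted above (absolute difference between mean predicted probability and observed event rate). The two toy forecasts are invented for illustration and are not from the manuscript.

```python
def mean_calibration_error(probs, outcomes):
    """|mean predicted probability - observed event proportion|."""
    return abs(sum(probs) / len(probs) - sum(outcomes) / len(outcomes))


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and binary outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy data (illustrative only): one event in four days.
outcomes = [0, 0, 0, 1]
ranks_well_but_biased = [0.4, 0.4, 0.4, 0.9]    # orders the event correctly, overpredicts on average
no_skill_but_unbiased = [0.25, 0.25, 0.25, 0.25]  # constant forecast at the base rate

for name, probs in [("ranks well, biased", ranks_well_but_biased),
                    ("no skill, unbiased", no_skill_but_unbiased)]:
    print(name,
          "MCE =", round(mean_calibration_error(probs, outcomes), 4),
          "Brier =", round(brier_score(probs, outcomes), 4))
```

The first forecast separates the event day perfectly yet overstates the average probability, which is the kind of error that propagates directly into a ratio of mean probabilities. The second has no discrimination at all but matches the base rate.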
Medium severity.
#8 · Major Comments · §5.17
No benchmark against physics-based or extreme-value EEA estimates for the same wildfire events
Attribution analyses typically depend on global climate models, using physics-based simulations of the climate system under different forcing or warming scenarios to assess human influence on extreme weather events.
Domain specialty. The paper's implicit comparator is 'traditional simulation methods,' yet no ML FAR estimate is benchmarked against a published physics-based attribution or extreme-value-theory estimate for California wildfire risk. Readers cannot determine whether ML's validity problems are unique, worse than, or better than the alternatives the authors implicitly defend. For Nature Climate Change, where readers must decide whether to trust ML-based attribution in court, this comparison is the difference between a useful critique and a sectarian one.

Suggested action: Add a benchmark section comparing the ML-derived aggregate RR for California wildfire risk under the pre-industrial counterfactual against at least one published physics-based or EVT-based estimate (e.g., from CMIP6 large-ensemble fire-weather studies). Discuss whether multiplicity, metric misalignment, and distribution shift are unique to ML or are simply more legible in ML than in physics-based ensembles where the same uncertainties exist implicitly.
Medium severity.
#9 · Major Comments · §5.17
Bootstrap and headline statistics ignore temporal/within-fire clustering and lack confidence bounds
We use a bootstrap method to estimate a 95% confidence interval for our estimates of ATE and FAR. We sample 17,910 matching days with replacement from our real and counterfactual world dataset and estimate ATE and FAR using the four methods, repeating this process 1,000 times.
Rigor specialty consolidating two related problems: (a) the 1,000-replicate bootstrap resamples 17,910 days IID despite acknowledged within-fire dependence, producing anti-conservative CIs; (b) the headline 41.84% conflicting-sign statistic is reported as a point estimate without any uncertainty quantification, even though it is the central empirical hook of Section 2.1. Both flow from treating dependent daily observations as exchangeable.

Suggested action: Replace IID bootstrap with a block or cluster bootstrap keyed to individual wildfires (or to the 3-year temporal folds already in use). Use this to attach a CI to the 41.84% multiplicity statistic and all aggregate RR estimates. Report sensitivity of CI width to block length. Also stratify the conflicting-sign analysis by predicted-probability bin to separate genuine multiplicity from near-threshold flipping in a 2.1%-base-rate setting.
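
A cluster bootstrap keyed to fire identifiers could look roughly like the sketch below. It resamples whole fires with replacement, so days from the same fire always travel together, and then takes percentile bounds. The function name, the fire IDs, and the day-level flags are hypothetical. The statistic passed in would be whatever produces the conflicting-sign rate or RR in the real pipeline.

```python
import random
from collections import defaultdict
from statistics import mean


def cluster_bootstrap_ci(values, cluster_ids, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for stat(values), resampling whole clusters with replacement."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for value, cluster in zip(values, cluster_ids):
        by_cluster[cluster].append(value)
    clusters = list(by_cluster)

    estimates = []
    for _ in range(n_boot):
        sample = []
        for cluster in rng.choices(clusters, k=len(clusters)):
            sample.extend(by_cluster[cluster])  # keep all days of a fire together
        estimates.append(stat(sample))

    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Synthetic example (NOT manuscript data): 1 = models disagree on sign for that day.
fire_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
conflict_flags = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
lower, upper = cluster_bootstrap_ci(conflict_flags, fire_ids, mean)
print(f"cluster-bootstrap 95% CI for conflict rate: [{lower:.3f}, {upper:.3f}]")
```

Running the same routine with the temporal fold as the cluster key gives the block-length sensitivity check at almost no extra cost.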
Medium severity.
#10 · Major Comments · §5.17
Shallow engagement with domain adaptation theory weakens the distribution-shift contribution
We calculated proxy-A distance as a measure of distribution shift, which is represented as: $\hat{d}_A = 2 - 4e$, $e = 1 - \mathrm{acc}$.
Novelty specialty. Covariate shift, importance weighting, and proxy-A distance come from a deep ML theory literature (Shimodaira; Sugiyama; Ben-David et al., who introduced the very proxy-A distance the paper imports). Existing H-divergence bounds make explicit predictions about when out-of-distribution generalization should fail. The paper cites Ben-David for the formula but does not engage with the theory that could (a) explain why mean calibration error decouples from FAR error under SSP5-8.5, and (b) predict the threshold of proxy-A distance beyond which ML EEA becomes unusable. This is where the paper could move from observation to mechanism.

Suggested action: Add a subsection connecting the empirical distribution-shift results to H-divergence and importance-weighting theory. Specifically, test whether the theoretical bound on target-domain error from Ben-David et al. predicts the observed degradation in mean calibration error between pre-industrial (proxy-A = 0.405) and SSP5-8.5 (proxy-A = 1.38). A confirmed or refuted prediction here is a genuine conceptual contribution.
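
For orientation, the proxy-A formula quoted above can be inverted to see what domain-classifier accuracy each reported distance implies. The sketch below does that arithmetic. The implied accuracies are back-calculated from the two reported distances under the quoted formula and are not figures stated in the manuscript.

```python
def proxy_a_distance(domain_classifier_accuracy):
    """Proxy-A distance from the accuracy of a source-vs-target classifier: 2 - 4e, e = 1 - acc."""
    error = 1.0 - domain_classifier_accuracy
    return 2.0 - 4.0 * error


def accuracy_from_proxy_a(distance):
    """Invert the formula: the classifier accuracy implied by a given proxy-A distance."""
    return 1.0 - (2.0 - distance) / 4.0


for label, distance in [("pre-industrial", 0.405), ("SSP5-8.5", 1.38)]:
    print(f"{label}: proxy-A = {distance}, implied accuracy = {accuracy_from_proxy_a(distance):.3f}")
```

Under this formula a chance-level classifier (accuracy 0.5) gives a distance of 0 and a perfect classifier gives 2, which is a useful scale to state explicitly when discussing where ML-based EEA stops being usable.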
Medium severity.
#11 · Major Comments · §5.17
Conflation of storyline and probabilistic FAR-based attribution paradigms
This approach is a form of storyline attribution, where computational models are used to assess how historical events may have played out differently under different climate conditions.
Domain specialty. The paper labels its approach 'a form of storyline attribution' while estimating FAR/RR—the canonical estimands of probabilistic event attribution (Stott et al. 2004; Allen 2003), which Shepherd's storyline framework was developed precisely as an alternative to. Climate scientists reading Nature Climate Change will read this conflation as a misunderstanding of the field's central methodological debate, which materially weakens the paper's authority on validity questions in EEA.

Suggested action: Pick one paradigm and stay in it. Given FAR/RR are the estimands, frame the work as probabilistic attribution and cite Allen 2003 and Stott et al. 2004. If a storyline framing is desired, drop FAR/RR and use conditional storyline metrics (e.g., changes in event intensity given large-scale circulation), citing Shepherd et al. 2018 and the subsequent debate. Either path is defensible; the current straddle is not.
Low severity.
#12 · Minor Comments · §5.17
SSP5-8.5 end-of-century used as policy-relevant comparator without noting its contested likelihood
we therefore analyze two counterfactual climate scenarios: a standard pre-industrial scenario and a worst-case SSP5-8.5 end-of-century scenario.
Domain specialty. SSP5-8.5 to 2100 is now widely characterized as a low-likelihood, high-warming pathway rather than a baseline policy scenario (Hausfather & Peters 2020; Pielke & Ritchie 2021). Using it as a distribution-shift stress test is methodologically fine, but the paper frames it as a co-equal policy-relevant comparison alongside pre-industrial, which is precisely the kind of framing Nature Climate Change editors have explicitly cautioned against. The distinction matters because legal and policy audiences—the very stakeholders the paper invokes—may overweight worst-case ML degradation if the scenario's probability is not contextualized.

Suggested action: Explicitly frame SSP5-8.5-end-of-century as a methodological stress test for distribution shift, not a likely policy scenario, and cite recent commentary on its declining plausibility. Add an intermediate scenario such as SSP2-4.5 or SSP3-7.0 to characterize how the three validity threats scale with moderate warming—this is also a more useful operating regime for litigation and adaptation planning.

Full analysis, every section

§ 1 · Submission Decision · 4 sections · about 14 min

§ 3 · Reviewer Strategy · 1 section · about 6 min

§ 4 · Pre-Submission Audit · 9 sections · about 15 min

§ 5 · Verification Evidence (skim) · 6 sections · about 7 min

§ 1

Submission Decision

4 sections · 14 min

§ 3

Reviewer Strategy

1 section · 6 min

§5.18

Predicted Reviewer Profiles

6 min

Editors typically pick 3 reviewers from a pool that includes the author's suggested + excluded lists. These profiles predict which researcher archetypes are most likely to be picked, what concerns each will raise, and which groups to consider asking the editor to exclude. Anchored to the specific issues the pre-submission reviewer report (5.17) surfaced for this manuscript.

Suggested Reviewers (6)

Lead with named candidates the model has high confidence in; lower-confidence entries surface as search signals so you can verify before pasting into the portal.

#	Name	Affiliation	Identifying paper	Confidence	Action
1	S. Salcedo-Sanz	unknown — see identifying paper	Analysis, characterization, prediction, and attribution of extreme atmospheric events with machine learning and deep lea… (2023, Theoretical and Applied Climatology; DOI: 10.1007/s00704-023-04571-5)	`high`	Suggest
2	G. Camps-Valls	unknown — see identifying paper	Artificial intelligence for modeling and understanding extreme weather and climate events (2025, Nature Communications; DOI: 10.1038/s41467-025-56573-8)	`high`	Suggest
3	Nafsika Antoniadou	unknown — see identifying paper	Comparison of data-driven methods for linking extreme precipitation events to local and large-scale meteorological varia… (2023, Stochastic Environmental Research and Risk Assessment; DOI: 10.1007/s00477-023-02511-3)	`high`	Suggest
4	Olivier C. Pasche	unknown — see identifying paper	Validating Deep-Learning Weather Forecast Models on Recent High-Impact Extreme Events (2024, ArXiv / DOI record: 10.1175/aies-d-24-0033.1)	`high`	Suggest
5	Ethan Pickering	unknown — see identifying paper	[no verified citation — author should search Google Scholar / OpenAlex to confirm]	`search_signal_only`	Suggest
6	Zi‐ying Xuan	unknown — see identifying paper	Improving the Assimilation Ability for the Extreme Events by Proposing a Nonlinear Machine Learning Data Assimilation Ap… (2025, Geophysical Research Letters; DOI: 10.1029/2025gl118319)	`high`	Suggest

1. S. Salcedo-Sanz

Archetype: extreme-atmospheric-events ML + attribution reviewer

Why this person helps: This reviewer is well positioned to assess whether the manuscript's validity framework is genuinely informative for machine-learning-based attribution rather than a relabeling of existing climate-ML practice.

Concerns they will raise:

Issue 2: whether the asymmetric FAR estimator is defensible relative to standard extreme-event attribution practice.
Issue 3: whether claims about validity/generalizability are overstated given the single hazard, region, and 18-year window.
Issue 10: whether the distribution-shift discussion is adequately grounded in existing domain-adaptation theory.

2. G. Camps-Valls

Archetype: climate-AI methods and robustness reviewer

Why this person helps: This reviewer would be constructive on the manuscript's cross-disciplinary claim about when ML-based attribution is scientifically valid under climate-data distribution shift.

Concerns they will raise:

Issue 10: whether the paper's treatment of domain adaptation and distribution shift is deep enough for the strength of its claims.
Issue 7: whether mean calibration error is being presented as novel when it is already established in forecasting/ML evaluation.
Issue 3: whether the empirical demonstration is broad enough to justify the paper's framing.

3. Nafsika Antoniadou

Archetype: environmental-statistics benchmark reviewer

Why this person helps: This reviewer can evaluate whether the manuscript distinguishes ML validity from ordinary data sparsity, dependence, and benchmark-selection problems in environmental extremes.

Concerns they will raise:

Issue 4: whether the temperature subgroup analysis is separating covariate shift from simple sample sparsity.
Issue 8: whether ML attribution should be benchmarked against physics-based or extreme-value attribution estimates on the same events.
Issue 9: whether temporal and within-fire clustering invalidate the current bootstrap and uncertainty summaries.

4. Olivier C. Pasche

Archetype: validation-of-deep-learning-on-extremes reviewer

Why this person helps: This reviewer is a good fit for stress-testing the paper's central validation logic, especially around how one validates ML systems on rare, high-impact events.

Concerns they will raise:

Issue 1: whether the manuscript's simulation ground truth is circular because it is derived from the same models being evaluated.
Issue 6: whether hyperparameter tuning on the full present-day dataset leaks information into counterfactual predictions.
Issue 5: whether the absence of code, environment specification, and accessible Supplement undermines reproducibility.

5. Ethan Pickering

Archetype: ML robustness for extreme events reviewer

Why this person helps: This reviewer can assess whether the manuscript's claims about validity under rare-event and distribution-shift conditions are methodologically sound from a modern ML perspective.

Concerns they will raise:

Issue 1: whether the validation target is independent enough to support causal or attributional conclusions.
Issue 6: whether model selection and tuning were separated cleanly from the evaluation setting.
Issue 3: whether a single hazard/region/time window can support broad claims about ML validity in extreme-event attribution.

6. Zi‐ying Xuan

Archetype: extreme-regime ML under shift reviewer

Why this person helps: This reviewer would be useful for evaluating whether the manuscript has correctly framed extreme-event validity as a shift/assimilation problem rather than only a predictive-accuracy problem.

Concerns they will raise:

Issue 4: whether the subgroup analysis confounds regime shift with data sparsity.
Issue 3: whether the stated conclusions extend beyond the specific case study configuration.
Issue 10: whether the manuscript engages enough with the theoretical literature on adaptation under distributional change.

Reviewers to Consider Excluding (2)

#	Name	Affiliation	Confidence	COI category	Reason
1	Jared T Trok	unknown — see identifying paper	`high`	`scooped_or_scoopable`	Jared T Trok is first author of the closely overlapping paper 'Machine learning–based extreme event attribution' (2024, Science Advances; DOI: 10.1126/sciadv.adl3242), which addresses the same central methodological space as the present manuscript.
2	S. Salcedo-Sanz	unknown — see identifying paper	`high`	`scooped_or_scoopable`	S. Salcedo-Sanz is lead author on a recent paper focused on machine-learning-based analysis and attribution of extreme atmospheric events (2023, Theoretical and Applied Climatology; DOI: 10.1007/s00704-023-04571-5), making the present submission a direct extension/contrast within the same methodological niche.

1. Jared T Trok

Archetype: directly overlapping ML-based EEA competitor

Why exclude: Jared T Trok is first author of the closely overlapping paper 'Machine learning–based extreme event attribution' (2024, Science Advances; DOI: 10.1126/sciadv.adl3242), which addresses the same central methodological space as the present manuscript.

Paste into cover letter / 'request to exclude' field:

2. S. Salcedo-Sanz

Archetype: closely overlapping ML-for-attribution researcher

Why exclude: S. Salcedo-Sanz is lead author on a recent paper focused on machine-learning-based analysis and attribution of extreme atmospheric events (2023, Theoretical and Applied Climatology; DOI: 10.1007/s00704-023-04571-5), making the present submission a direct extension/contrast within the same methodological niche.

Paste into cover letter / 'request to exclude' field:

Editor-Facing Rationale (paste into 'Suggested reviewers' field)

Paste-ready

This manuscript's main contribution is to clarify when machine-learning-based extreme event attribution can be considered scientifically valid, with particular attention to validation design, counterfactual estimation, and distribution shift. We therefore suggest reviewers spanning climate-extremes AI, methodological validation of deep learning on rare/high-impact events, and environmental-statistical benchmarking. Collectively, the proposed reviewers are well suited to assess the manuscript's core issues around circular validation, leakage, benchmark comparison against established attribution approaches, uncertainty under clustered events, and the strength of its generalizability claims. We have aimed for reviewers who can evaluate both the climate-attribution framing and the ML-validity arguments in a constructive, cross-disciplinary way.

§ 4

Pre-Submission Audit

9 sections · 15 min · 1 appendix

§5.4

Reporting Guideline Compliance

1 min

Study type: observational_cohort | Guideline: STROBE | Items checked: 18 | Critical items missing: 1

Items needing attention:

[6] Participants — partially_addressed (critical): Add explicit eligibility criteria and selection steps for wildfire days, including inclusion/exclusion rules and how the 17,910 fire days were derived from source data.
[8] Data sources / measurement — partially_addressed (critical): Provide variable-by-variable measurement details, preprocessing, spatial/temporal resolution, units, and how comparability was maintained across observed and counterfactual climate datasets.
[12c] Statistical methods — missing (critical): Add a statement describing whether any variables had missing data and, if so, the extent of missingness and the imputation, exclusion, or model-based handling approach used.
[13] Participants flow — partially_addressed (critical): Add a flow diagram or table showing the number of wildfire days identified, excluded, eligible, included in model training/validation/testing, and analyzed for each climate scenario.
[19] Limitations — partially_addressed (critical): Expand the limitations section to state the likely direction and approximate magnitude or sensitivity of each major potential bias or imprecision source.
[22] Funding — missing (minor): Add a funding statement naming all funding sources and describing the funders’ role in study design, analysis, interpretation, writing, and publication decisions, or state that there was no external funding.

§5.5

Format Compliance

1 min

Article type: (unknown) | Limits source: (none — manual review needed)

Mechanical checks:

? word_count: actual=2870, limit=—, status=unknown
? figure_count: actual=5, limit=—, status=unknown
? reference_count: actual=0, limit=—, status=unknown

Missing structural elements:

coi_statement (missing): Add a conflict of interest statement declaring any competing interests or stating that the authors have none.
funding_declaration (missing): Add a funding declaration identifying all financial support and sponsor roles, or state that no funding was received.
author_contributions (missing): Add an author contributions statement describing each author’s roles according to ICMJE or CRediT taxonomy.
data_availability (partial): Add a dedicated data availability statement with repository links, accession identifiers, or precise instructions for accessing the datasets used.
code_availability (missing): Add a code availability statement indicating where the analysis code can be accessed or explaining why it is not available.

§5.6

Citation Audit (+ 5.10 Context Verification)

Appendix

3 min

Two audits running together: this section (5.6) finds MISSING-citation candidates the manuscript should add; 5.10 Citation Context Verification (below) checks the cited papers' content actually supports the manuscript's claim. Read both before submitting.

Missing-Citation Candidates (5)

Recall target: 5-8 missing-citation candidates per manuscript. Surfaced this run: 5. Higher counts trade precision for recall; lower counts may indicate a comprehensive existing reference list OR an underweighted backend-augmented search. Backend used: paperclip.

1. Stott et al (2004) — landmark

Where it should appear: Introduction, paragraph beginning "EEA approaches generally involve comparing the probability of extreme events..."
Why needed: This is the canonical early probabilistic extreme event attribution paper and a foundational example of estimating anthropogenic changes in event probability. Any manuscript using FAR-like ratios of counterfactual to factual event probabilities should anchor that framing in this work.
DOI: 10.1038/nature03089 ✓ Crossref-verified

2. Hannart et al (2016) — methodology_precedent

Where it should appear: Introduction, paragraph defining observed and counterfactual climate scenarios; also near the discussion of FAR as the scientific target
Why needed: The manuscript repeatedly frames EEA as a counterfactual comparison, but does not cite the key formal causal-counterfactual treatment of event attribution. This paper is directly relevant to the validity claims because it clarifies what attribution estimands mean and under what assumptions they are interpretable.
DOI: 10.1175/bams-d-14-00034.1 ✓ Crossref-verified

3. Shepherd, Theodore G. (2016) — landmark

Where it should appear: Introduction, paragraph beginning "This approach is a form of storyline attribution..."
Why needed: The manuscript invokes storyline attribution but does not cite the paper that systematized the distinction between probabilistic attribution and storyline/conditional approaches. A reviewer would expect this citation because the paper directly bears on whether the manuscript’s ML counterfactual exercise is estimating event probability changes, conditional event changes, or a hybrid estimand.
DOI: 10.1007/s40641-016-0033-y ✓ Crossref-verified

4. Abatzoglou et al (2016) — landmark

Where it should appear: Introduction, where California wildfire attribution is introduced; also in the data/application section describing the wildfire case study
Why needed: This is one of the central attribution papers linking anthropogenic warming to increased fuel aridity and wildfire activity in western U.S. forests. Because the manuscript analyzes California wildfire data and makes claims about climate-change effects on wildfire risk, omission of this paper would be conspicuous.
DOI: 10.1073/pnas.1607171113 ✓ Crossref-verified

5. Williams et al (2019) — recent_comparator

Where it should appear: Introduction and wildfire application section, especially where the California 2003–2020 case study is motivated
Why needed: This is a directly comparable California wildfire attribution study and is highly relevant to any analysis of California wildfire data from the 2000s onward. The manuscript should position its ML-based estimates against this existing physical/statistical attribution literature.
DOI: 10.1029/2019ef001210 ✓ Crossref-verified

Paste-ready additions (top 3 with DOIs):

Drop these into Zotero / Mendeley / EndNote via DOI import; they'll resolve to full reference entries.

Citation Context Verification (5.10)

Refs attempted: 0 | verified: 0 | partial: 0 | unsupported: 0

No verification attempts ran for this manuscript (typically because no context backend was supplied OR no parsed refs matched the lookup criteria).

§5.9

Novelty Assessment

1 min

ℹ️ All 3 extracted claims matched the manuscript's OWN preprint (bioRxiv / medRxiv / arXiv version of this work). The novelty audit treats this as expected extends_own_prior_work — NOT a real novelty concern. Action: cite your own preprint in the Methods or Acknowledgments to make the relationship explicit.

§5.12

Statistical Rigor Audit

1 min

Statistical claims extracted by GPT-5.4-mini, then verified by validated open-source forensics: scipy p-value recomputation (statcheck-style), GRIM (impossible means), SPRITE-lite (implausible SDs), DEBIT (impossible proportions), and statsmodels post-hoc power. Every flagged finding below is the output of a deterministic check — not LLM judgment.

Claims extracted: 6 NHST · 0 descriptive · 1 binary · 0 power

Findings: 0 critical · 0 major · 0 minor

All 7 extracted statistical claims passed every deterministic check (statcheck p-value recompute, GRIM impossible-mean detection, SPRITE-lite SD plausibility, DEBIT impossible-proportion detection, and post-hoc power). No statistical-rigor anomalies detected by automated forensics.

Methodology issues

Reviewer-flagged methodology concerns identified by GPT-5.5 review of the Methods + Results sections. Each item names the specific reviewer-ask pattern + a paste-ready fix.

🟡 [multiple_comparisons] (minor): The Results report several hypothesis tests across multiple metrics and climate scenarios, including AUC, Brier skill score, and mean calibration error correlations, without mentioning any multiplicity adjustment. A reviewer may ask whether these p-values are intended as formal inference and, if so, how family-wise error or false discovery was controlled.
- Paste-ready
  
  Traditional predictive performance metrics such as AUC and Brier skill score showed weak to moderate correlations with log risk ratio error in simulation analyses (r = −0.26 and −0.46, respectively; p < 0.001).
- Fix: Clarify whether p-values are descriptive or inferential. If inferential, apply and report a correction such as Holm, Bonferroni, or FDR across the tested metric-scenario combinations.
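
If the p-values are meant inferentially, the Holm step-down adjustment is simple enough to apply by hand. The sketch below is a plain-Python version. The input p-values are hypothetical placeholders, since the manuscript reports only a "p < 0.001" threshold rather than exact values.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[index] = running_max
    return adjusted


# Hypothetical raw p-values for three metric-scenario tests (placeholders only).
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

Reporting the adjusted values next to the raw ones, with the family of tests stated explicitly, answers the multiplicity question in a single table.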

§5.13

Figure Critique

2 min

Figures critiqued: 4 | critical: 2 | major: 2 | minor: 2 | Publication readiness: major_revisions

page_9 (p9): The figure is broadly publication-ready and the caption aligns with the content. The main improvement needed is increasing the size of small subplot text and legends for readability.
- [resolution_cropping] (minor): The figure is not cropped, but some small internal labels, especially AUC annotations and legends in the density plots, are difficult to read at page scale. → fix: Increase font sizes for small annotations and legends or enlarge the subpanels.
page_10 (p10): The figure is mostly readable and structurally complete, but it needs revision before submission. The most important problems are the caption-axis mismatch for the main metric and the lack of a clear legend for the model-specific colors.
- [legend_palette] (major): Multiple colors are used for model estimates, but there is no legend mapping colors to machine-learning methods in panel a. The palette also relies on several similar hues and would be difficult to distinguish in grayscale. → fix: Add a clear legend or direct labels for model colors and consider using colorblind-safe colors plus distinct line/marker styles.
- [caption_claim_match] (critical): The caption states that panel a shows estimated fractions of attributable risk, but the visible x-axis is labeled Estimated Risk Ratio and uses risk-ratio values. This metric mismatch could substantially affect interpretation. → fix: Revise either the caption or axis label so the plotted quantity is consistently described as risk ratio or FAR, and clarify any transformation if both are related.
page_11 (p11): The figure is mostly publication-ready but would benefit from clearer axis wording and improved accessibility of the color/marker encoding before submission.
- [axis_labels] (minor): Both axes are labeled, but the Y-axis reads "Log Risk Ratio Error" while the caption describes the absolute value of the log risk ratio error. This could cause ambiguity about whether negative errors were transformed. → fix: Change the Y-axis label to "Absolute log risk ratio error" or "|log risk ratio error|".
- [legend_palette] (major): The legend is present and identifies the models, but the groups appear to rely entirely on color with identical point shapes. Some colors are similar in luminance and may be difficult to distinguish in grayscale or for colorblind readers. → fix: Use a colorblind-safe palette and/or distinguish models with different point shapes or line styles.
page_12 (p12): The figure is generally well labeled and readable, with appropriate count legends and no need for error bars. However, the caption-to-axis mismatch around attributable-risk accuracy versus log risk ratio error should be corrected before submission.
- [caption_claim_match] (critical): The caption says the figure shows accuracy of the fraction of attributable risk estimate, but every y-axis is labeled 'Log Risk Ratio Error'. This is a substantive mismatch unless log risk ratio error is explicitly intended as the proxy for attributable-risk accuracy. → fix: Revise the caption or y-axis label to use consistent terminology, or explicitly state that log risk ratio error is being used as the accuracy metric for the attributable-risk estimate.
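
If the captions are kept in FAR terms while the axes stay in risk-ratio terms, stating the conversion once removes the ambiguity. The sketch below assumes the conventional definitions, RR = P1/P0 and FAR = 1 - P0/P1, where P1 is the factual and P0 the counterfactual event probability. The manuscript's own estimator may differ, so treat this as the textbook relationship rather than the authors' formula.

```python
def far_from_risk_ratio(risk_ratio):
    """Fraction of attributable risk under the conventional definition FAR = 1 - 1/RR."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    return 1.0 - 1.0 / risk_ratio


for rr in (0.5, 1.0, 2.0, 4.0):
    print(f"RR = {rr}: FAR = {far_from_risk_ratio(rr):.2f}")
```

Note that FAR is bounded above by 1 but unbounded below when RR is less than 1, so errors on the log-RR scale and errors on the FAR scale are not interchangeable. That is one more reason for captions and axes to name the same quantity.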

§5.14

Reproducibility Assessment

3 min

🔴 Sample size & power justification (weak) — critical: The manuscript reports the observational sample size and number of events, and it also specifies the number of simulation replicates. However, it does not provide an a priori power calculation or a principled rationale for the time window, number of
- Evidence: > To replicate experiments from Brown et al., we use their data on wildfire days across California from 2003- 2020 (n = 17,910 fire days).
- Fix: Add a sample-size/rationale paragraph in Methods explaining why 2003–2020 and n = 17,910 were used, whether all eligible data were included, and why 300 simulation replicates were sufficient for stable estimates.
🔴 Data availability (weak) — critical: The manuscript identifies broad public data sources and states that it uses data from Brown et al. However, it does not provide a formal data availability statement with repository links, accession IDs/DOIs, exact datasets or versions, licenses, or i
- Evidence: > Their data are obtained from a variety of publicly available sources: wildfire days are sourced from MODIS satellite estimates from NASA, predictor variables from reanalysis produced from the National
- Fix: Add a Data Availability section naming the exact repositories, dataset identifiers/DOIs/accession numbers, versions, licenses, and any processed-data location or controlled-access constraints.
🔴 Code & software availability (weak) — critical: The manuscript names the main algorithms and describes the analysis workflow in detail. However, it does not provide analysis code, a GitHub/Zenodo/Code Ocean repository, version tags, software/package versions, random seeds, or scripts needed to rep
- Evidence: > For each time period, we train our models using five different machine learning algorithms: LightGBM [36], Random Forest [37], XGBoost [38], logistic regression, and Elastic Net Regression [39].
- Fix: Add a Code Availability section with a public repository and archived release DOI, plus package/software versions, environment files, random seeds, and instructions for rerunning the analyses.
🔴 Statistical reporting completeness (weak) — critical: The manuscript reports many statistical quantities, including n, formulas, uncertainty intervals, bootstrap procedures, correlations, and p-value thresholds. Key reporting elements are still incomplete: the correlation test type is not specified, p-v
- Evidence: > Traditional predictive performance metrics such as AUC and Brier skill score showed weak to moderate correlations with log risk ratio error in simulation analyses (r = −0.26 and −0.46, respectively; p
- Fix: Add a statistical analysis subsection specifying exact tests, correlation type, software and package versions, random seeds, exact p-values where feasible, and the construction of all uncertainty intervals.
· Randomization & blinding (not_applicable): This is an observational and computational machine-learning/simulation study, not an interventional experiment involving treatment allocation or blinded outcome assessment. Randomization and blinding of participants, providers, or assessors are there
· Materials & reagents (RRIDs) (not_applicable): The study is computational and uses environmental/climate datasets rather than biological reagents, antibodies, cell lines, animal strains, plasmids, or viral vectors. RRIDs and reagent catalog information are therefore not applicable.
· Replication & validation (adequate): The manuscript includes several internal validation and robustness components: replication of a prior approach, temporal cross-validation, multiple machine-learning models, simulation datasets with known truth, bootstrapping, and stated robustness ch
- Evidence: > We have five different “truth” scenarios representing each machine learning method of generating the predicted probabilities, and generate 300 datasets per scenario.
· Pre-registration (not_applicable): The work is framed as an exploratory methodological and simulation analysis evaluating threats to validity in machine-learning extreme event attribution. There is no indication that it is a confirmatory clinical, interventional, or systematic-review
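
One way to support the claim that 300 simulation replicates are sufficient is to report the Monte Carlo standard error of the summary metric and show that it has stabilized by 300. The sketch below assumes the per-replicate absolute log risk ratio errors are available as a list. The data here are synthetic placeholders, not manuscript results, and `monte_carlo_se` and `se_by_replicates` are hypothetical helper names.

```python
import math
import random
import statistics


def monte_carlo_se(values):
    """Standard error of the mean across simulation replicates."""
    return statistics.stdev(values) / math.sqrt(len(values))


def se_by_replicates(values, sizes):
    """Monte Carlo SE using only the first k replicates, for each k in sizes."""
    return {k: monte_carlo_se(values[:k]) for k in sizes}


# Placeholder data: synthetic per-replicate errors, NOT manuscript results.
rng = random.Random(2024)  # fixed seed so the sketch is reproducible
errors = [abs(rng.gauss(0.0, 0.3)) for _ in range(300)]

for k, se in se_by_replicates(errors, [50, 100, 200, 300]).items():
    print(f"replicates={k:3d}  Monte Carlo SE={se:.4f}")
```

A short table of these values in Methods or the Supplement, together with the seed used, would answer both the replicate-count and the random-seed concerns raised above.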

§5.37

Numeric Consistency Audit

1 min

Deterministic regex sweep for cross-section numeric inconsistencies — the kind reviewers reliably catch and authors reliably miss. Extracts every "n=" sample-size claim and every percentage with its noun-phrase anchor + section location, clusters by the underlying quantity, and flags clusters where the SAME quantity has DIFFERENT values across manuscript sections. No LLM — pure pattern matching, $0 cost.

Verdict: ✅ Clean

Claims extracted: 1 sample-size · 8 percentage | Sections detected: 6 | Findings: 0 critical · 0 major · 0 minor

No cross-section numeric inconsistencies detected via deterministic regex sweep. This is a NEGATIVE result with honest limits: the sweep cannot detect inconsistencies where the SAME number has different MEANINGS across sections (semantic mismatch), nor inconsistencies in figures that cite raw data we can't parse from the text. Manuscript still needs human review for those.
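
The kind of sweep described above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the tool's actual implementation: it handles only "n =" claims, uses the one or two words after the number as the noun-phrase anchor, and the pattern `N_CLAIM` and both function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "n = 17,910 fire days" -> value "17,910", anchor "fire days".
N_CLAIM = re.compile(r"\bn\s*=\s*(\d[\d,]*)\s+((?:[a-z]+\s?){1,2})",
                     re.IGNORECASE)


def extract_claims(sections):
    """Map each noun-phrase anchor to the set of (value, section) pairs."""
    claims = defaultdict(set)
    for section, text in sections.items():
        for match in N_CLAIM.finditer(text):
            value = int(match.group(1).replace(",", ""))
            anchor = match.group(2).strip().lower()
            claims[anchor].add((value, section))
    return claims


def find_inconsistencies(sections):
    """Return anchors whose reported value differs between sections."""
    flagged = {}
    for anchor, pairs in extract_claims(sections).items():
        if len({value for value, _ in pairs}) > 1:
            flagged[anchor] = sorted(pairs)
    return flagged


# Hypothetical input: the Methods value is deliberately altered to show a flag.
demo = {
    "Abstract": "We analyse wildfire days (n = 17,910 fire days).",
    "Methods": "The sample (n = 17,190 fire days) spans 2003-2020.",
}
print(find_inconsistencies(demo))
```

As the limits above note, a pattern match like this cannot tell when the same number carries different meanings in different sections.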

§5.38

Required Statements Audit

2 min

Most major journals (Cell, Nature, Lancet, BMJ, NEJM, JAMA, eLife, PLOS family) require two paste-ready statements that the §5.36 ethics audit doesn't cover: a CRediT author-contributions statement (using the formal Contributor Roles Taxonomy) and a Conflict of Interest declaration (per ICMJE). Both are deterministically detected here; paste-ready templates provided when missing.

Verdict: 🔴 2 major gaps — likely desk-return

CRediT statement: ❌ missing (taxonomy terms found: 5 of 14) | COI declaration: ❌ missing

🔴 MAJOR (missing_credit_statement): No CRediT (Contributor Roles Taxonomy) author-contributions statement detected. Required by Cell, Nature family, Lancet, BMJ, eLife, PLOS family, and increasingly by mid-tier biomed journals — typically auto-flagged at submission portal level.
🔴 MAJOR (missing_coi_declaration): No competing-interests / conflicts-of-interest declaration detected. ICMJE requires this for all clinical journals + most basic-research journals — absence is a guaranteed desk-return.

Paste-ready CRediT statement

Add this to your manuscript before References, filling in each author's contribution. Use the 14 official CRediT taxonomy terms (listed below); journals' submission portals validate against this exact vocabulary.

The 14 CRediT taxonomy terms (use these EXACT phrases):

Conceptualization · Methodology · Software · Validation · Formal analysis · Investigation · Resources · Data curation · Writing – original draft · Writing – review & editing · Visualization · Supervision · Project administration · Funding acquisition
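
Before submission, a drafted contributions statement can be checked against the 14 role names with a short script. This is a minimal sketch and not the audit's own detector: matching is case- and dash-insensitive and uses plain substring search, and `credit_roles_found` plus the author initials in the example are hypothetical.

```python
import re

# The 14 CRediT roles, in the order listed above.
CREDIT_ROLES = [
    "Conceptualization", "Methodology", "Software", "Validation",
    "Formal analysis", "Investigation", "Resources", "Data curation",
    "Writing – original draft", "Writing – review & editing",
    "Visualization", "Supervision", "Project administration",
    "Funding acquisition",
]


def _normalize(text):
    """Lower-case, turn hyphen/dash variants into spaces, collapse whitespace."""
    text = re.sub(r"[\u2010-\u2015-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def credit_roles_found(statement):
    """Return the CRediT roles that appear in the statement text."""
    haystack = _normalize(statement)
    return [role for role in CREDIT_ROLES if _normalize(role) in haystack]


# Hypothetical draft statement with placeholder initials.
draft = "A.B.: Conceptualization, Methodology, Writing - Original Draft."
found = credit_roles_found(draft)
print(f"{len(found)} of {len(CREDIT_ROLES)} roles used: {found}")
```

Because this is substring matching, running it over a whole manuscript will also count incidental words such as "software" or "validation" in the body text, so it is best applied to the contributions statement alone.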

Paste-ready competing-interests declaration

Add this to your manuscript before References. Choose the version that matches your situation:

If no competing interests:

If one or more authors have competing interests:

Disclose any potentially-perceivable conflict — speaker honoraria, travel reimbursement, family-member employment, patent royalties, equity holdings (any amount). Editors treat under-disclosed conflicts much more harshly than fully-disclosed ones.

§ 5

Verification Evidence (skim)

6 sections · 7 min · 1 appendix

§5.11

AI Fingerprint

1 min

Pangram verdict: Human Written | AI: 0.0% | AI-assisted: 0.0% | Human: 100.0% | Disclosure recommendation: NO_DISCLOSURE_NEEDED

Pangram v3 classified every analyzed prose window as human-written (0 AI-flagged segments across 14 windows): "We believe that this document is fully human-written." This is an AI-policy risk screen, not proof of authorship.

Calibration: With AI < 5% and AI-assisted < 10%, this manuscript reads as essentially human-authored at the granularity Pangram detects. No journal AI-policy currently in force requires disclosure at this level. If you used AI for narrow grammar/phrasing assistance, most policies (Nature, Cell, Springer Nature) explicitly exempt copy-editing from disclosure requirements.

§5.24

Author Identity Verification

1 min

For each named author, verify ORCID-recorded employment matches the affiliation claimed in the manuscript byline. Affiliation drift is a common reviewer/editor flag; missing ORCIDs are increasingly required by major journals.

Authors extracted: 3 | With ORCID: 0 | Affiliation verified: 0 | Mismatches: 0 | Missing ORCID: 3

Verified institutions (via Research Organization Registry)

Even when ORCIDs aren't printed, the Research Organization Registry (ror.org) lets us canonicalize each claimed affiliation against ~110,000 verified institutions. Score 0-1; ≥0.95 is a strong match. Paste the ROR ID into your submission portal where supported (Crossref, Datacite, NIH PMC all use ROR canonical IDs).

Author	Verified institution	ROR ID	Country	Score
Cassandra C. Chou	Johns Hopkins University	`00za53h95`	United States	1.00

1 unique institution ROR-verified across 3 authors.

⚠️ No ORCID identifiers detected for any author

All 3 authors are listed without ORCID iDs in the manuscript byline. This is normal in preprint PDFs (the upload-version often strips ORCIDs that the author has set in their submission portal), but most major journals — including Nature, Cell, JAMA, eLife, PLOS, BMJ, Lancet, Science, PNAS — now REQUIRE the corresponding author to provide an ORCID at submission, and increasingly require all co-authors to do the same.

Action — paste into the byline:

For Cassandra C. Chou (corresponding author): add (ORCID: 0000-XXXX-XXXX-XXXX) immediately after the name. Get the iD at https://orcid.org/register if not already created.
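
Before pasting an iD into the byline, it is worth confirming that it was transcribed correctly. An ORCID iD is 16 characters in four hyphen-separated groups, and the final character is a check digit computed with ISO 7064 MOD 11-2. The sketch below implements that check; the function names are hypothetical, and the test value is ORCID's documented sample iD, not an author's.

```python
import re

ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the iD has the right shape and a correct check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's sample iD
print(is_valid_orcid("0000-XXXX-XXXX-XXXX"))  # unfilled placeholder
```

A passing checksum only shows the string is well formed; it does not confirm that the iD belongs to the named author.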

Why this matters beyond compliance: ORCID lets editors + reviewers verify your career trajectory (publications, funding, affiliations) in 30 seconds. Authors without ORCIDs trigger extra editorial scrutiny — even when everything else is clean.

⚠️ Issues requiring author attention

Cassandra C. Chou: No ORCID printed for this author — most journals now require ORCID for at least the corresponding author. Affiliation canonicalized via ROR: Johns Hopkins University (ROR:00za53h95, United States, score 1.00).

§5.29

Journal Legitimacy Check

1 min

🟢 GREEN — verified legitimate

Target journal: Nature Climate Change

Nature Climate Change is verified as legitimate: it is listed in DOAJ and indexed in major databases (OpenAlex is_core). Safe to submit per the standard verification signals.

Signal-by-signal check

Signal	Result	Source
DOAJ presence	✅ Listed	doaj.org
OpenAlex `is_core` (Scopus/WoS-like indexing)	✅ Yes	openalex.org
Beall's archived predatory-journal list	✅ Not present (of ~1,317 archived journals)	github.com/stop-predatory-journals

Journal-level metrics (OpenAlex)

h-index: 372 — top-tier (>100 = leading journal in any biomed field)
2-year mean citedness: 11.55 (JCR-Impact-Factor analog on the open OpenAlex graph; ~12 — comparable to top-tier impact factors)

What this check DOES and DOES NOT cover

Covers: open-database legitimacy signals (DOAJ presence, OpenAlex core-indexing flag, Beall's archived predatory list, venue h-index + citedness). These are the signals an editor would check at desk-review.

Does NOT cover: editorial fit (does this journal publish your kind of work?), realistic acceptance rate, or the journal's review-process culture. See §5.2 Cross-Journal Cascade for editorial-fit and alternative-venue analysis.

Paste-ready submission tracker line

When logging this submission to your tracker / advisor email:

§5.30

Materials & Reagents Audit

1 min

For each named antibody, cell line, mouse strain, and software tool: validate against the RRID Portal (canonical resource IDs) and Cellosaurus (cell-line authentication + ICLAC misidentified-line database). Missing RRIDs + ICLAC-flagged cell lines are reviewer-bait issues that desk-reject in major journals.

Antibodies: 0 named, 0 with RRID (0%) | Cell lines: 0 named, 0 flagged as problematic by ICLAC | Mouse strains: 0 | Software: 3 named, 0 with RRID

Software tools missing RRIDs (3 of 3)

Find canonical SCR RRIDs at https://scicrunch.org/resources

Weather Research Forecasting model
LightGBM
XGBoost

§5.35

Related-Work Recommender

Appendix

2 min

We search a 200M-paper academic corpus to identify the published work most similar to yours. Two tiers below: high-cited papers you should VERIFY are in your reference list (concrete action), and recent adjacent work for novelty calibration (background context).

Verify these 7 high-cited papers are in your reference list

Action: open your manuscript's References + Ctrl-F each title below. Any that's NOT cited is a reviewer-bait gap — either add it OR add a one-sentence differentiation of why your work isn't redundant with it.

Artificial intelligence for modeling and understanding extreme weather and climate events — G. Camps-Valls, M. Fernandez-Torres et al., Nature Communications (2025) — 92 citations · doi:10.1038/s41467-025-56573-8
A machine learning approach for monitoring ship safety in extreme weather events — A. Rawson, M. Brito et al., Safety Science (2021) — 81 citations · doi:10.1016/j.ssci.2021.105336
Discovering and forecasting extreme events via active learning in neural operators — Ethan Pickering, G. Karniadakis et al., Nature computational science (2022) — 73 citations · doi:10.48550/arxiv.2204.02488
Analysis, characterization, prediction, and attribution of extreme atmospheric events with machine learning and deep learning techniques: a review — S. Salcedo-Sanz, J. Pérez-Aracil et al., Theoretical and Applied Climatology (2023) — 51 citations · doi:10.1007/s00704-023-04571-5
Machine learning–based extreme event attribution — Jared T Trok, E. A. Barnes et al., Science Advances (2024) — 31 citations · doi:10.1126/sciadv.adl3242
Validating Deep-Learning Weather Forecast Models on Recent High-Impact Extreme Events — Olivier C. Pasche, Jonathan Wider et al., ArXiv (2024) — 28 citations · doi:10.1175/aies-d-24-0033.1
Analysis, Characterization, Prediction and Attribution of Extreme Atmospheric Events with Machine Learning: a Review — S. Salcedo-Sanz, Jorge Pérez-Aracil et al., ArXiv (2022) — 16 citations · doi:10.48550/arxiv.2207.07580
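
The Ctrl-F step above can be scripted if the reference list is available as plain text. This is a rough sketch under assumptions: a paper counts as cited if either its DOI or its normalized title appears in the references, `missing_from_references` is a hypothetical helper, and the reference excerpt in the example is made up for illustration.

```python
import re


def normalize(text):
    """Lower-case and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def missing_from_references(candidates, references_text):
    """Return titles whose DOI and normalized title are both absent."""
    ref_norm = normalize(references_text)
    ref_lower = references_text.lower()
    missing = []
    for title, doi in candidates:
        if doi.lower() in ref_lower or normalize(title) in ref_norm:
            continue
        missing.append(title)
    return missing


# Two entries from the list above, as (title, DOI) pairs.
CANDIDATES = [
    ("Machine learning–based extreme event attribution",
     "10.1126/sciadv.adl3242"),
    ("Artificial intelligence for modeling and understanding extreme "
     "weather and climate events", "10.1038/s41467-025-56573-8"),
]
# Made-up reference-list excerpt, for illustration only.
refs = ("Trok, J. T. et al. Machine learning-based extreme event "
        "attribution. Sci. Adv. (2024).")

print(missing_from_references(CANDIDATES, refs))
```

Title matching of this kind is brittle for abbreviated titles and for accented characters, so treat anything it reports as missing as a prompt for a manual check.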

Recent adjacent work (3 papers, 0-4 citations)

These are recent (last 1-2 years) papers Semantic Scholar ranks as similar to yours. Most have not yet accumulated citations — useful for novelty calibration ('this work is in active competition with X recent groups') but lower-priority for the reference list. Skim only if you have spare time before submission.

Validity in machine learning for extreme event attribution · ? (2025)
Comparison of data-driven methods for linking extreme precipitation events to local and large-scale meteorological varia · Stochastic Environmental Research and Ri (2023) · doi:10.1007/s00477-023-02511-3
Improving the Assimilation Ability for the Extreme Events by Proposing a Nonlinear Machine Learning Data Assimilation Ap · Geophysical Research Letters (2025) · doi:10.1029/2025gl118319

§5.36

Ethics / IRB Statement Audit

1 min

Most journals require an explicit ethics statement when a manuscript reports human-subject research, animal research, or secondary use of human-derived material. Missing statements are one of the top desk-return reasons across biomedical publishing. This component classifies the requirement type, checks for an existing statement, and provides paste-ready templates when gaps are detected.

Verdict: ✅ Exempt — no ethics statement required

Manuscript category: exempt_computational | Classifier confidence: high

The manuscript is a machine-learning/simulation study using California wildfire data and climate scenarios, with no human participants, biospecimens, or animal experiments.

Paste-ready exempt-status statement (optional but recommended)

Some journals' submission portals require an explicit no-subjects declaration even for computational / perspective work. Adding this preempts a portal-form blocker:

Limits of this audit

This is a deterministic regex sweep + LLM classifier check on the manuscript text. It detects PRESENCE of standard ethics-statement keywords; it does NOT verify that the named IRB or approval ID is real, that the consent process was adequate, or that the local ethics committee's terms cover what your study actually did. This is not legal or IRB advice. Verify with your institution's IRB office before submission.
