For each major or critical comment surfaced by the pre-submission reviewer report (5.17), here is a draft response paragraph the author can adapt for the rebuttal letter when the journal returns a major-revision decision. Each response is grounded in the same verbatim manuscript evidence the reviewer cited.
Response to Comment 0: Simulation ground truth is bootstrapped from the same models being evaluated
Severity: critical | Stance: agrees and commits
Reviewer Comment:
Author Response:
We agree that using model-derived probabilities as the simulation target can overstate recoverability and risks a circular design. In the revised manuscript, we will add independent data-generating processes, including a parametric logistic DGP with fixed coefficients and a second emulator-based DGP that is architecturally distinct from the evaluated learners. We will then re-estimate the comparative performance of calibration, discrimination, and Brier-based metrics under these alternative truths. This will allow us to test whether mean calibration error remains the most informative indicator of FAR error when the target is not mechanically aligned with the candidate models.
Manuscript Change:
Methods and Supplementary Simulation section; new robustness simulations in Results
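Illustrative sketch for Comment 0 (not the manuscript's code): one way the parametric logistic DGP with fixed coefficients could be set up so that the true event probabilities are known and do not depend on any evaluated learner. The coefficient values, the standard-normal covariates, and the name simulate_logistic_dgp are assumptions chosen for illustration only.

```python
import math
import random


def simulate_logistic_dgp(n, coefs, intercept, seed=0):
    """Draw (covariates, true probability, outcome) from a fixed logistic model.

    Hypothetical helper: covariates are independent standard normals, which
    is an assumption and not the manuscript's covariate distribution.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
        p = 1.0 / (1.0 + math.exp(-eta))  # known ground-truth probability
        y = 1 if rng.random() < p else 0  # Bernoulli draw from that probability
        rows.append((x, p, y))
    return rows


# Placeholder coefficients; the real values would be fixed in the Methods.
sample = simulate_logistic_dgp(n=5000, coefs=[0.8, -0.5], intercept=-2.0, seed=42)
true_probs = [p for _, p, _ in sample]
event_rate = sum(y for _, _, y in sample) / len(sample)
```

Because the true probabilities are stored alongside each outcome, any candidate model's FAR error can be scored against a target it had no part in producing.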
Response to Comment 1: Asymmetric FAR estimator mixes empirical frequencies with model predictions and departs from standard EEA practice
Severity: critical | Stance: partially agrees, revises scope
Reviewer Comment:
Author Response:
We appreciate this point and agree that the estimator should be made fully transparent. Our intent was to anchor the factual arm to the observed event frequency as a descriptive baseline, but we agree that this choice should not be conflated with a symmetric probabilistic FAR formulation. In the revision, we will report both the empirical-baseline estimator and the fully model-based symmetric FAR estimator, with the empirical proportion clearly labeled as a calibration reference for the factual arm. We will also quantify how the choice of estimator affects the multiplicity results and whether any conclusions are sensitive to this design decision.
Manuscript Change:
Methods section defining FAR; new sensitivity analysis in Results and Supplement
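Illustrative sketch for Comment 1: the two estimators named in the response, written against the standard definition FAR = 1 - p0/p1, where p1 is the factual and p0 the counterfactual event probability. How the manuscript aggregates predictions is not stated here, so the simple averaging below, the function names, and the toy numbers are all assumptions.

```python
from statistics import fmean


def far(p_factual, p_counterfactual):
    """Fraction of attributable risk: FAR = 1 - p0 / p1."""
    if p_factual <= 0:
        raise ValueError("factual probability must be positive")
    return 1.0 - p_counterfactual / p_factual


def far_empirical_baseline(y_observed, pred_counterfactual):
    """Asymmetric form: observed frequency for p1, model mean for p0."""
    return far(fmean(y_observed), fmean(pred_counterfactual))


def far_symmetric(pred_factual, pred_counterfactual):
    """Symmetric form: model means for both arms."""
    return far(fmean(pred_factual), fmean(pred_counterfactual))


# Toy inputs (assumed): observed frequency 0.3, factual model mean 0.4,
# counterfactual model mean 0.2.
y_observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
pred_factual = [0.4] * 10
pred_counterfactual = [0.2] * 10

far_asym = far_empirical_baseline(y_observed, pred_counterfactual)
far_sym = far_symmetric(pred_factual, pred_counterfactual)
```

In this toy input the factual model mean (0.4) exceeds the observed frequency (0.3), so the two estimators disagree (about 0.33 versus 0.5). Reporting both, as the response commits to, makes that kind of gap visible.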
Response to Comment 2: Universal validity claims are supported by a single hazard, single region, single 18-year window
Severity: critical | Stance: agrees and commits
Reviewer Comment:
Author Response:
We agree that the current empirical study is specific to California wildfire attribution and should not be read as a universal proof across all hazards. Our broader claim is that three validity threats can arise in ML-based attribution workflows, not that their magnitude is identical across hazards. In the revision, we will narrow the framing to wildfire attribution and explicitly distinguish general methodological implications from hazard-specific empirical evidence. We will also add language noting that extension to heat waves or other hazards remains an important direction for future work.
Manuscript Change:
Title, Abstract, Introduction, and Conclusions
Response to Comment 3: Temperature subgroup analysis conflates training-data sparsity with covariate shift
Severity: major | Stance: agrees and commits
Reviewer Comment:
Author Response:
This is a valuable clarification, and we agree that the current analysis does not separate sparsity from distribution shift cleanly. We will report the number of observations in each temperature bin and add an analysis that controls for bin sample size when relating temperature to calibration error. In addition, we will include a subsampling experiment that equalizes bin sizes to assess whether the temperature effect persists after removing sparsity as a confounder. If the residual association remains, we will interpret it as evidence more consistent with covariate shift; otherwise we will revise the interpretation accordingly.
Manuscript Change:
Results subsection on temperature bins; new Supplementary analysis
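Illustrative sketch for Comment 3: one possible form of the subsampling experiment, in which every temperature bin is repeatedly subsampled down to the size of the smallest bin before its calibration error is computed. The sign convention (mean predicted probability minus observed frequency), the number of draws, and the bin labels are assumptions, not the manuscript's settings.

```python
import random
from collections import defaultdict


def mean_calibration_error(preds, outcomes):
    """Mean predicted probability minus observed event frequency (assumed sign)."""
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)


def equalized_bin_mce(records, n_draws=200, seed=0):
    """records: iterable of (bin_label, predicted_prob, outcome).

    Returns the common subsample size and, per bin, the calibration error
    averaged over n_draws subsamples of that size (drawn without replacement).
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for label, p, y in records:
        by_bin[label].append((p, y))
    n_min = min(len(rows) for rows in by_bin.values())
    result = {}
    for label, rows in by_bin.items():
        draws = []
        for _ in range(n_draws):
            sub = rng.sample(rows, n_min)
            draws.append(mean_calibration_error([p for p, _ in sub],
                                                [y for _, y in sub]))
        result[label] = sum(draws) / n_draws
    return n_min, result


# Toy data (assumed): a dense, well-calibrated "cool" bin and a sparse,
# under-predicting "hot" bin.
records = [("cool", 0.2, 1 if i < 20 else 0) for i in range(100)]
records += [("hot", 0.2, 1 if i < 5 else 0) for i in range(10)]
n_min, mce_by_bin = equalized_bin_mce(records)
```

If the gap between bins survives equalization, as it does in this toy data by construction, sparsity alone cannot explain it.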
Response to Comment 4: No code repository, environment specification, or accessible Supplement
Severity: major | Stance: agrees and commits
Reviewer Comment:
Author Response:
We agree that reproducibility needs to be improved substantially. We will release a version-controlled repository with all preprocessing, training, simulation, and bootstrap code, along with an archived snapshot and complete environment specification. We will also ensure that the full Supplementary Information, including propensity checks and weighted-metric analyses, is publicly accessible alongside the revised preprint and submission. These materials will allow independent readers to audit the pipeline end to end.
Manuscript Change:
Data and Code Availability statement; Supplementary Information availability; repository link in revised manuscript
Response to Comment 5: Hyperparameter tuning on the full present-day dataset risks leakage into counterfactual predictions
Severity: major | Stance: agrees and commits
Reviewer Comment:
Author Response:
We appreciate this concern and agree that the training and tuning protocol must be specified unambiguously. In the revision, we will provide explicit pseudocode detailing how folds are assigned for factual prediction, counterfactual prediction, and simulation, including the exact separation between tuning and evaluation. If any part of the current counterfactual analysis relied on full-dataset tuning, we will rerun it using nested cross-validation with hyperparameters selected only on training folds. We will then report whether the multiplicity statistics or model rankings change under the leakage-free protocol.
Manuscript Change:
Methods section and Supplementary pseudocode; revised sensitivity analysis if needed
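Illustrative sketch for Comment 5: a minimal nested cross-validation loop in which the hyperparameter is chosen using only the outer training fold, so the held-out fold never influences tuning. The toy k-nearest-neighbour learner, the Brier loss, the fold counts, and the synthetic data are stand-ins chosen to keep the example self-contained; they are not the manuscript's models or protocol.

```python
import random


def make_folds(indices, k, rng):
    """Shuffle indices and split them into k disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_prob(train, x, k):
    """Toy learner: event frequency among the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def brier(train, test, k):
    """Brier score on test for a learner fitted on train with neighbour count k."""
    return sum((knn_prob(train, x, k) - y) ** 2 for x, y in test) / len(test)


def nested_cv(data, k_grid, outer_k=5, inner_k=3, seed=0):
    rng = random.Random(seed)
    outer = make_folds(range(len(data)), outer_k, rng)
    results = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        # Inner folds are built from the outer training indices only.
        inner = make_folds(train_idx, inner_k, rng)

        def inner_loss(k):
            losses = []
            for m, val_idx in enumerate(inner):
                fit_idx = [j for f, fold in enumerate(inner) if f != m for j in fold]
                losses.append(brier([data[j] for j in fit_idx],
                                    [data[j] for j in val_idx], k))
            return sum(losses) / len(losses)

        best_k = min(k_grid, key=inner_loss)
        # The held-out outer fold is touched only here, after tuning.
        score = brier([data[j] for j in train_idx],
                      [data[j] for j in test_idx], best_k)
        results.append({"fold": i, "best_k": best_k, "brier": score,
                        "train_idx": train_idx, "test_idx": test_idx})
    return results


# Synthetic data (assumed): event probability equals the single covariate x.
data_rng = random.Random(1)
toy = []
for _ in range(60):
    x = data_rng.random()
    toy.append((x, 1 if data_rng.random() < x else 0))
results = nested_cv(toy, k_grid=[1, 5, 15])
```

Recording the selected hyperparameter and the index sets per outer fold, as in the results list, gives an audit trail that the Supplementary pseudocode could mirror.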
Response to Comment 6: Mean calibration error is presented as proposed but is a long-standing forecasting metric
Severity: major | Stance: agrees and commits
Reviewer Comment:
Author Response:
We agree and thank the reviewer for pointing out that this metric has a long history in probabilistic forecasting and verification. In the revision, we will replace any wording that suggests the metric itself is novel with language emphasizing our empirical recommendation of this metric for EEA applications. We will cite the calibration and proper-scoring-rule literature, including Murphy's decomposition and subsequent work on reliability and bias. We will also restate our contribution more precisely: the finding that this simple calibration summary aligns more closely with FAR error than discrimination metrics do in the settings we study.
Manuscript Change:
Introduction and Discussion; references added to forecasting and calibration literature
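Illustrative sketch for Comment 6: a toy demonstration of why a discrimination metric can miss a bias that mean calibration error detects. AUC depends only on the ranking of predictions, so any strictly increasing rescaling leaves it unchanged while shifting the mean predicted probability. The sign convention for calibration error and the numbers below are assumptions for illustration.

```python
def mean_calibration_error(preds, outcomes):
    """Mean predicted probability minus observed event frequency (assumed sign)."""
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)


def auc(preds, outcomes):
    """Pairwise AUC: share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
preds = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]
halved = [p / 2 for p in preds]  # same ranking, systematically too low

auc_original, auc_halved = auc(preds, outcomes), auc(halved, outcomes)
mce_original = mean_calibration_error(preds, outcomes)
mce_halved = mean_calibration_error(halved, outcomes)
```

Both prediction sets rank every event above every non-event, so the two AUC values are identical, yet the halved predictions understate the event frequency far more. Since FAR is built from probability levels rather than rankings, only the calibration summary registers the change.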
Response to Comment 7: No benchmark against physics-based or extreme-value EEA estimates for the same wildfire events
Severity: major | Stance: partially agrees, revises scope
Reviewer Comment:
Author Response:
We agree that an external benchmark would strengthen the context of the study and help readers interpret the scope of our findings. In the revision, we will add a comparison to at least one published non-ML attribution estimate or fire-weather-based analysis relevant to California wildfire risk, where such a benchmark is available and methodologically comparable. We will use this comparison to clarify whether the identified validity threats are unique to ML or reflect broader challenges in attribution workflows. If a fully matched benchmark is not available for the exact event definition, we will state that limitation explicitly and frame the comparison as contextual rather than definitive.
Manuscript Change:
Results and Discussion; new benchmark or contextual comparison subsection
Notes for the author
- We have kept the responses concise, constructive, and journal-appropriate while committing to concrete revisions where feasible.