Dominance analysis of Strokes Gained, honest archetype clustering, and leakage-aware predictive models across 333 tournaments and 29,180 player–event rows.
Across 29,180 player–event rows with complete Strokes Gained attribution, LMG dominance analysis splits the explained variance in finish position as follows: approach 34.0% [32.8, 35.3], putting 31.7% [30.1, 33.2], off-the-tee 20.4% [18.9, 22.0], and around-the-green 13.9% [12.0, 15.2].
The v1 "19-of-20 top players in a single ball-striker cluster" claim is largely circular: absolute-scope clustering includes the SG_total level dimension, and removing it via residual clustering drops top-20 concentration from 90% to 45%.
A same-tournament cut classifier achieves AUC 0.89 under any CV scheme but is near-tautological; a genuine trailing-SG cut classifier achieves AUC 0.63 and tells us what recent form is actually worth for predicting the upcoming cut.
Neither winner SG nor within-tournament scoring dispersion exhibits a detectable linear trend or 2018 Chow structural break, indicating that this tour era was structurally stable.
This is a peer-review revision. v1 reported an additive-identity decomposition, raw within-tournament betas, circular top-20 concentration, and random 5-fold CV only. v2 replaces each analysis with a methodologically honest counterpart. The v1 analyses remain in git history (tag v1); the full mapping from reviewer concerns to code / paper changes is in CHANGELOG_V2.md.
| Concern | v1 | v2 |
|---|---|---|
| Identity vs importance | Unique R² in SG_total identity was called "variance decomposition" of skill importance | Identity and LMG dominance (Shapley attribution of R² on finish position) reported side-by-side with 95% CIs |
| Within-tournament betas | Raw means only | Raw + 95% CIs + attenuation correction + VIFs; APP and PUTT confirmed co-dominant (their CIs overlap) |
| Archetype clustering | k=4, silhouette 0.205 accepted as weak structure; 19/20 top players in C0 | Gaussian-null benchmark (99 perms), residual clustering (45% vs 90%), 200-boot cluster stability |
| Cut classifier | AUC 0.89 reported as predictive achievement | Relabelled sanity check; genuine trailing-SG AUC 0.63 as honest predictive benchmark |
| CV | Random 5-fold only | Random + GroupKFold-tourn + GroupKFold-player + OOT 2021-22 |
| Stationarity | Asserted from visual inspection | Linear-trend test, season-FE joint F with HC0 SEs, 2018 Chow break test |
| sg_t2g in §4 / Fig 7 | Double-counted OTT+APP+ARG | Removed |
| Title "Elite" | Ambiguous | Retitled to explicit "within-tour" framing |
The source is the ASA All PGA Raw Data – Tournament Level CSV distributed on Kaggle: one row per player–tournament, with Strokes Gained attribution on 79.2% of rows covering eight PGA Tour seasons. We coerce “NA” strings to missing, drop rows lacking complete SG attribution, and rank only the cut-making subset, which avoids the systematic bias introduced when stroke totals of cut-makers are compared with those of players who missed the cut.
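The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the repository's loader: the column names (`sg_ott`, `sg_app`, `sg_arg`, `sg_putt`, `made_cut`, `strokes`, `tournament_id`) and the tiny inline CSV are assumptions made for the example, and the real file may use different names.

```python
import io
import pandas as pd

SG_COLS = ["sg_ott", "sg_app", "sg_arg", "sg_putt"]  # hypothetical column names

def clean_and_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce SG columns to numeric, drop incomplete rows, rank cut-makers only."""
    out = df.copy()
    # Literal "NA" strings (and any other non-numeric value) become NaN.
    out[SG_COLS] = out[SG_COLS].apply(pd.to_numeric, errors="coerce")
    out = out.dropna(subset=SG_COLS)
    # Rank by total strokes within each tournament, among cut-makers only.
    made = out[out["made_cut"] == 1].copy()
    made["pos"] = made.groupby("tournament_id")["strokes"].rank(method="min")
    return made

# Toy input; keep_default_na=False leaves "NA" as a string so the coercion is visible.
raw = pd.read_csv(io.StringIO(
    "tournament_id,player,made_cut,strokes,sg_ott,sg_app,sg_arg,sg_putt\n"
    "1,A,1,270,0.5,1.0,0.1,0.4\n"
    "1,B,1,275,0.1,NA,0.0,0.2\n"
    "1,C,0,145,-0.5,-1.0,-0.2,-0.3\n"
    "1,D,1,272,0.2,0.3,0.1,0.1\n"
), keep_default_na=False)
ranked = clean_and_rank(raw)
```

In the toy input, player B is dropped for incomplete SG attribution and player C is excluded from the ranking because the two-round stroke total of a missed cut is not comparable with four-round totals.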
Because SG_total is defined as the sum of its four components, a regression of total on components returns an identity with R² ≈ 0.99; the unique R² per component is a property of the SG metric's distribution, not a statement about which skill matters for winning. We therefore also fit Lindeman/Merenda/Gold (LMG) dominance analysis (Shapley-value attribution of R²) on an actual outcome, finish position.
The two panels answer different questions. In the identity panel, APP and PUTT each carry ~34% of Var(SG_total). In the LMG panel, APP carries 34.0% and PUTT 31.7% of the finish-position R², with OTT at 20.4% and ARG at 13.9%. This is the importance ordering the paper reports.
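For readers unfamiliar with LMG, the attribution can be sketched in a few lines of NumPy: each predictor's share is its average gain in R² over all orders in which predictors can enter the model. The function names and the synthetic data below are assumptions for illustration, and the bootstrap CIs reported above are not reproduced here.

```python
from itertools import combinations
from math import factorial
import numpy as np

def r_squared(X, y, cols):
    """R² of an OLS fit of y on an intercept plus the given columns of X."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def lmg_shares(X, y):
    """Shapley (LMG) attribution of the full-model R² to each column of X."""
    p = X.shape[1]
    shares = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for subsets of this size that exclude predictor j.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                gain = r_squared(X, y, subset + (j,)) - r_squared(X, y, subset)
                shares[j] += weight * gain
    return shares

# Synthetic stand-in for four SG components and a finish-position outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, 0.9, 0.6, 0.4]) + rng.normal(size=500)
shares = lmg_shares(X, y)
full_r2 = r_squared(X, y, (0, 1, 2, 3))
percent = 100 * shares / shares.sum()   # shares of R², comparable to the percentages above
```

The shares sum to the full-model R² by construction, which is why they can be reported as percentages of explained variance.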
The extreme-groups Cohen's d between top-decile and bottom-decile cut-makers is 1.88 for APP, 1.64 for PUTT, 1.36 for OTT, and 1.02 for ARG (500-bootstrap 95% CIs). Because dichotomisation inflates d relative to a continuous-r effect, we also report the Fitzsimons back-transform and the paired full-sample Pearson r with −pos.
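A minimal sketch of the extreme-groups effect size with a percentile bootstrap follows. It assumes a pooled-SD Cohen's d and decile cutoffs on finish position (lower is better), runs on synthetic data, and does not implement the Fitzsimons back-transform.

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-SD Cohen's d for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def extreme_groups_d(skill, pos, n_boot=500, seed=0):
    """d between best-decile and worst-decile finishers, with a percentile bootstrap CI."""
    lo, hi = np.quantile(pos, [0.1, 0.9])
    top, bottom = skill[pos <= lo], skill[pos >= hi]
    rng = np.random.default_rng(seed)
    boots = [
        cohens_d(rng.choice(top, len(top)), rng.choice(bottom, len(bottom)))
        for _ in range(n_boot)
    ]
    return cohens_d(top, bottom), np.percentile(boots, [2.5, 97.5])

# Synthetic data: higher skill leads to a lower (better) finish position.
rng = np.random.default_rng(1)
skill = rng.normal(size=2000)
pos = -skill + rng.normal(size=2000)
d, ci = extreme_groups_d(skill, pos)
r = np.corrcoef(skill, -pos)[0, 1]   # full-sample Pearson r with -pos
```

Comparing `d` with `r` on the same data shows how much discarding the middle 80% of the sample inflates the apparent effect.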
In v1 we reported a 4-cluster k-means solution on the absolute SG profile with silhouette 0.205 and claimed "19 of 20 top players in one ball-striker cluster". The peer reviewer identified three methodological issues with this. v2 addresses all three.
Finding (a): the observed silhouette at k=4 is 0.205, below the Gaussian-null 95th percentile of 0.214, so there is no cluster structure above a Gaussian baseline with matched covariance.
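One way to build such a benchmark is sketched below with scikit-learn. It assumes the null is generated by drawing from a single multivariate normal with the sample mean and covariance and re-running k-means on each draw; the helper names, the toy data, and the reduced number of draws are illustrative choices, not the repository's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k, seed=0):
    """Silhouette of a k-means partition of X."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

def gaussian_null_silhouettes(X, k, n_draws=99, seed=0):
    """Silhouettes of k-means on single-Gaussian data with matched mean and covariance."""
    rng = np.random.default_rng(seed)
    mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    return np.array([
        kmeans_silhouette(rng.multivariate_normal(mean, cov, size=len(X)), k, seed=i)
        for i in range(n_draws)
    ])

# Toy stand-in for the four-dimensional SG profile matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
observed = kmeans_silhouette(X, k=4)
null = gaussian_null_silhouettes(X, k=4, n_draws=19)   # fewer draws to keep the demo fast
threshold = np.percentile(null, 95)
has_structure = observed > threshold
```

The point of the benchmark is that k-means always returns k clusters, so a positive silhouette alone says nothing until it is compared with what structureless data would produce.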
Finding (b): absolute-scope clustering inflates the top-20 concentration by construction, because the components sum to SG_total so the highest-centroid cluster contains the top players mechanically. Residual-scope clustering (subtracting each player's mean level) drops the concentration from 90% to 45%.
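The residual transform can be illustrated on two hypothetical profiles. The sketch assumes "mean level" means the average of a row's four SG components; if the paper instead subtracts a per-player career mean, the idea is the same but the subtraction differs.

```python
import numpy as np

def residual_profile(sg):
    """Remove each row's overall level, leaving only the shape of the SG profile."""
    return sg - sg.mean(axis=1, keepdims=True)

# Columns: OTT, APP, ARG, PUTT (made-up values).
sg = np.array([
    [0.8, 1.2, 0.4, 0.6],     # strong across the board
    [-0.2, 0.2, -0.6, -0.4],  # same shape, one stroke lower in every component
])
resid = residual_profile(sg)
```

In absolute scope these two rows sit far apart, so k-means would separate them by level alone; in residual scope they are identical, which is why concentration of top players in one cluster is only informative after the level has been removed.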
Finding (c): under 200 bootstrap refits with Hungarian-matched labels, 70% of players have per-player agreement above 0.70 in absolute scope and 64% in residual scope; mean stability is 0.76 in both.
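A sketch of Hungarian-matched bootstrap stability follows, using `scipy.optimize.linear_sum_assignment` to align cluster labels between refits. The definition of per-player agreement used here (the fraction of bootstrap refits that place a point in its reference cluster) is an assumption, as are the function names and the toy data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def match_labels(ref, new, k):
    """Relabel `new` so that it agrees with `ref` on as many points as possible."""
    overlap = np.zeros((k, k), dtype=int)
    np.add.at(overlap, (new, ref), 1)             # overlap[i, j] = #points with new=i, ref=j
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximise total overlap
    mapping = np.empty(k, dtype=int)
    mapping[rows] = cols
    return mapping[new]

def bootstrap_stability(X, k, n_boot=200, seed=0):
    """Per-point share of bootstrap refits that reproduce the reference cluster."""
    rng = np.random.default_rng(seed)
    ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    agree = np.zeros(len(X))
    for b in range(n_boot):
        idx = rng.integers(0, len(X), len(X))     # resample rows with replacement
        km = KMeans(n_clusters=k, n_init=10, random_state=b).fit(X[idx])
        agree += match_labels(ref, km.predict(X), k) == ref
    return agree / n_boot

# Toy data: three well-separated blobs, so stability should be close to 1.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
stability = bootstrap_stability(X, k=3, n_boot=20)
share_above_070 = float(np.mean(stability > 0.70))
```

The matching step matters because k-means label numbers are arbitrary: without it, two identical partitions with permuted labels would look like complete disagreement.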
We run three formal tests: a linear-trend regression on the 8 annual means, a season fixed-effects OLS on the tournament-level panel with HC0 robust SEs and a joint F-test for "all season dummies = 0", and a Chow structural-break test at the 2018 boundary. All three fail to reject stationarity by a wide margin.
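As an illustration of the third test, the sketch below implements a textbook Chow F-test for a simple linear trend with NumPy and SciPy. The regression specification (intercept plus season), the synthetic series, and the helper names are assumptions; the paper's exact specification may differ.

```python
import numpy as np
from scipy import stats

def ssr(X, y):
    """Sum of squared OLS residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def chow_test(x, y, break_mask):
    """Chow F-test for equal intercept and slope before and after a break."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    k = X.shape[1]
    ssr_pooled = ssr(X, y)
    ssr_split = ssr(X[~break_mask], y[~break_mask]) + ssr(X[break_mask], y[break_mask])
    df2 = len(y) - 2 * k
    f_stat = ((ssr_pooled - ssr_split) / k) / (ssr_split / df2)
    return f_stat, stats.f.sf(f_stat, k, df2)

# Synthetic tournament-level series: 8 seasons x 40 events.
rng = np.random.default_rng(0)
season = np.repeat(np.arange(2015, 2023), 40)
x = (season - 2018).astype(float)                  # centre the regressor for conditioning
stable = rng.normal(size=season.size)              # no break
shifted = stable + 3.0 * (season >= 2018)          # artificial level shift from 2018 on
f_stable, p_stable = chow_test(x, stable, season >= 2018)
f_shift, p_shift = chow_test(x, shifted, season >= 2018)
```

The artificially shifted series is included only to show what a rejection looks like; the result reported above corresponds to the non-rejecting case.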
Every model is evaluated under 4 CV schemes: random 5-fold (for comparability with v1), GroupKFold by tournament, GroupKFold by player, and out-of-time 2021–22.
| Task | Model | Random 5-fold | GKF tournament | GKF player | OOT 2021-22 |
|---|---|---|---|---|---|
| (a) Sanity cut (AUC) | Logistic | 0.885 | 0.885 | 0.885 | 0.871 |
| | Random Forest | 0.893 | 0.894 | 0.893 | 0.885 |
| (b) Genuine trailing (AUC) | Logistic | 0.636 | 0.635 | 0.635 | 0.644 |
| | Random Forest | 0.634 | 0.632 | 0.631 | 0.634 |
| (c) Finish regressor (R²) | Ridge | 0.545 | 0.542 | 0.587 | 0.775 |
| | Random Forest | 0.521 | 0.537 | 0.574 | 0.712 |
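The four evaluation schemes can be set up with scikit-learn roughly as follows. The features, labels, and group identifiers are synthetic stand-ins, and the out-of-time split is assumed to train on seasons before 2021 and test on 2021–22.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, KFold

def cv_auc(X, y, splits):
    """Mean held-out AUC of a logistic model over (train, test) index pairs."""
    scores = []
    for train, test in splits:
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        scores.append(roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))                      # stand-in for trailing SG features
y = (X.sum(axis=1) + rng.normal(scale=3, size=n) > 0).astype(int)
tournament = rng.integers(0, 50, n)              # hypothetical group ids
player = rng.integers(0, 200, n)
season = rng.integers(2015, 2023, n)

schemes = {
    "random_5fold": KFold(5, shuffle=True, random_state=0).split(X),
    # GroupKFold keeps all rows of a tournament (or player) in the same fold.
    "gkf_tournament": GroupKFold(5).split(X, y, groups=tournament),
    "gkf_player": GroupKFold(5).split(X, y, groups=player),
    "oot_2021_22": [(np.flatnonzero(season < 2021), np.flatnonzero(season >= 2021))],
}
results = {name: cv_auc(X, y, splits) for name, splits in schemes.items()}
```

If a model's score drops sharply when moving from random folds to grouped or out-of-time folds, the random-fold number was benefiting from leakage across rows of the same tournament, player, or era.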
@misc{xiong2026sgminepga,
  title  = {Within-Tour Determinants of Finish Position on the PGA Tour,
            2015--2022: Dominance Analysis of Strokes Gained, Honest
            Archetype Clustering, and Leakage-Aware Predictive Models},
  author = {Xiong, Edward},
  year   = {2026},
  note   = {Version 2 (peer-review revised)},
  url    = {https://github.com/edwardxiong2027/sgmine-pga}
}
Broadie, M. (2012). Assessing Golfer Performance on the PGA Tour. Interfaces 42(2), 146–165.
Fearing, D., Acimovic, J., & Graves, S. C. (2011). How to Catch a Tiger: Understanding Putting Performance on the PGA Tour. Journal of Quantitative Analysis in Sports 7(1).
Grömping, U. (2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician 61(2), 139–147.
Fitzsimons, G. J. (2008). Death to Dichotomizing. Journal of Consumer Research 35(1), 5–8.
Chow, G. C. (1960). Tests of Equality Between Sets of Coefficients in Two Linear Regressions. Econometrica 28(3), 591–605.