Novel topography-based classification for mountain basins: utility and limitations of unsupervised machine learning, A
Pfaff, Brian
Date
Date Issued
2023
Abstract
Unsupervised machine learning algorithms are commonly used data analytic methods with applications spanning many disciplines. In the field of hydrology, K-means and similar clustering methods have been shown to be useful for discerning differences in hydrologic signatures between catchments. Specifically, these approaches apply a clustering algorithm to hydrologic data, including hydroclimatic data, land cover data, and topographic data. Topographic attributes alone have also been clustered to establish distinct distributions of mountain ranges to understand species response to climate change and habitat availability, but similar approaches have not been applied in hydrology to help understand categories of hydrologic response to climate change. Here, I show that distributions of elevation, slope, and aspect calculated for 604 delineated catchments in the USGS GAGES II dataset, along with elevation data originating from the USGS National Elevation Dataset, can be used to partition catchments into two clusters. Two clusters were determined to be optimal for this dataset according to the elbow method and the average silhouette method. The finding that two clusters are optimal suggests there are minimal differences between clusters when the topographies of catchments are simplified to only distributions of elevation, slope, and aspect. Modeling experiments were conducted using HEC-HMS to assess the hydrologic response of clustered catchments to the same hydroclimatic forcing parameters. A sensitivity analysis of models run for the basins closest to the centers of Cluster 1 and Cluster 2 indicated that Cluster 1 Center shows more variation in the timing of peak flow, while Cluster 2 Center shows a shorter time to peak flow (TP), a shorter time to baseflow from peak flow (TB), and a faster recession. A Kolmogorov-Smirnov (KS) test indicated that the summary statistics TP, TB, and R² differ significantly between Cluster 1 Center and Cluster 2 Center.
However, when randomly selected basins in each cluster were modeled, the hydrographs showed no visual difference and a KS test indicated no statistically significant difference for TP, TB, or the slope of the recession rate (b). Together, these results suggest no hydrological difference between Cluster 1 and Cluster 2. The linear regression fit of recession rate against discharge performed well for Cluster 1 Center, as indicated by a high R² (0.86), but performed poorly for Cluster 2 Center and for the randomly selected basins. These results demonstrate pitfalls of the K-means clustering algorithm that must be considered. This study is unusual in that I use an external validation method to determine the extent to which the K-means clustering produced meaningful categories. This external validation indicated no differences between clusters and showed that the K-means algorithm did not create meaningful classifications. I propose that a validation method, such as hydrologic modeling of the clustered catchments, is needed to assess the integrity of K-means results and to ensure that classifications are distinct and meaningful. Without proper validation techniques, the results of K-means and of unsupervised clustering methods more broadly should not be assumed to be correctly partitioned.
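The external validation step described in the abstract can be illustrated with a small sketch of a two-sample Kolmogorov-Smirnov test. This is not the author's code: the abstract does not say which KS variant or software was used, so the two-sample form, the asymptotic p-value approximation, the function name `ks_two_sample`, and the time-to-peak values in the usage example are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an approximate (asymptotic) p-value.

    Illustrative sketch only; `a` and `b` would be one summary statistic
    (for example time to peak) collected for the basins of each cluster.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # D is the largest gap between the two empirical CDFs, which can only
    # change at observed values, so it is enough to check those points.
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))

    # Asymptotic Kolmogorov distribution with a common small-sample
    # correction to the effective sample size (an approximation).
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is effectively 1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2) for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    # Hypothetical time-to-peak values (hours), NOT results from the thesis.
    tp_cluster_1 = [10.5, 11.0, 12.2, 13.1, 14.0, 15.4]
    tp_cluster_2 = [10.8, 11.3, 12.0, 13.5, 14.2, 15.0]
    d, p = ks_two_sample(tp_cluster_1, tp_cluster_2)
    print(f"D = {d:.3f}, approximate p = {p:.3f}")
```

A large p-value for every summary statistic would be consistent with the abstract's conclusion that the clusters are not hydrologically distinct. For real analyses, an established routine such as SciPy's `scipy.stats.ks_2samp` is likely a better choice than this hand-written approximation.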
Rights
Copyright of the original work is retained by the author.