Geospatial data harmonization and machine learning for large-scale water quality modelling
Date
2022-10-11
Abstract
Agricultural pollution continues to cause a global deterioration of freshwater quality. Water quality modelling plays an important part in developing effective water management measures. Large-scale water quality modelling, however, requires input data with good spatial coverage. The aim of the work was to improve and harmonize the datasets needed for water quality modelling and to develop a machine learning framework that could be used for national-scale water quality modelling. One output of the work is the Estonian soil database EstSoil-EH. The attributes of EstSoil-EH served as input to a machine learning model that I used to predict soil organic carbon content. The environmental conditions at the sampling locations turned out to affect the accuracy of the model's predictions. To improve global water quality data, the Global River Water Quality Archive (GRQA) database was created from five datasets.
Based on what was learned while building the soil carbon model, a framework was developed for Estonia-wide water quality modelling. The model predicted nutrient concentrations in 242 Estonian river catchments. The accuracy of the resulting models is comparable to that of models previously applied in the Baltic states. Model accuracy was affected by catchment size, as predictions were generally less accurate in smaller catchments. Fewer than half of the features were enough to achieve satisfactory accuracy, which shows that the descriptive power of the features matters more than their number. The resulting machine learning models are therefore applicable in regions where coverage of the input data needed to derive features is limited.
Freshwater quality continues to deteriorate worldwide due to agricultural pollution. Water quality modeling can support more effective management of water resources, but large-scale water quality models depend on input datasets with good spatial coverage. The aim of the thesis was to improve and harmonize datasets for water quality modeling and to create a machine learning framework for national-scale modeling. We created EstSoil-EH, a new numerical soil database for Estonia, by converting the text-based soil properties in the Estonian Soil Map to machine-readable values. We used it to predict soil organic carbon content with the random forest machine learning method and found that the conditions at sampling locations affected prediction accuracy. We improved the global coverage of water quality data by producing the Global River Water Quality Archive (GRQA), compiled from five existing large-scale datasets. The compilation involved harmonizing the corresponding metadata, flagging outliers, calculating time series characteristics and detecting duplicate observations. Based on lessons learnt from predicting soil carbon content, we developed a framework suitable for national-scale water quality modeling. We used 82 environmental variables, including soil properties from EstSoil-EH, as features to predict nutrient concentrations in 242 river catchments. The resulting models achieved accuracy comparable to that of models used previously in the Baltic region. Catchment size influenced accuracy: predictions were less accurate in smaller catchments. The models maintained reasonable accuracy even when the number of features was reduced by half, which shows that the relevance of features matters more than their number. This flexibility makes our models applicable in areas that otherwise lack the input data needed for extracting features.
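As a rough illustration of two of the compilation steps named in the abstract (flagging outliers and detecting duplicate observations), the following minimal Python sketch may help. It is not the GRQA procedure: the interquartile-range rule, the duplicate key (site, date, parameter, value), the function names and the sample records are all assumptions made for this example.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences.

    The IQR rule is an assumed choice; the abstract does not name a method.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def find_duplicates(records):
    """Return indices of records whose key was already seen earlier in the list.

    The key (site, date, parameter, value) is a hypothetical definition
    of a duplicate observation.
    """
    seen = set()
    duplicates = []
    for i, rec in enumerate(records):
        key = (rec["site"], rec["date"], rec["parameter"], rec["value"])
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


# Hypothetical observations merged from two source datasets.
records = [
    {"site": "A", "date": "2010-05-01", "parameter": "TN", "value": 1.2, "source": "ds1"},
    {"site": "A", "date": "2010-05-01", "parameter": "TN", "value": 1.2, "source": "ds2"},
    {"site": "A", "date": "2010-06-01", "parameter": "TN", "value": 1.4, "source": "ds1"},
    {"site": "A", "date": "2010-07-01", "parameter": "TN", "value": 1.1, "source": "ds1"},
    {"site": "A", "date": "2010-08-01", "parameter": "TN", "value": 1.3, "source": "ds1"},
    {"site": "A", "date": "2010-09-01", "parameter": "TN", "value": 25.0, "source": "ds1"},
]

# Drop repeated observations first, then flag outliers among the rest.
duplicate_idx = find_duplicates(records)
unique = [r for i, r in enumerate(records) if i not in duplicate_idx]
outlier_flags = flag_outliers([r["value"] for r in unique])
```

In this sketch the second record is reported as a duplicate of the first, and only the value 25.0 is flagged. Flagging rather than deleting keeps the original observation available to downstream users, who can then apply their own criteria.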
Description
The electronic version of the dissertation does not include the publications
Keywords
spatial data, automatic learning, water quality, environment simulation, geographic information systems