Aligning training loss to evaluation metrics in deep learning
Date
Authors
Journal title
Journal ISSN
Volume title
Publisher
Tartu Ülikooli Kirjastus
Abstract
Recent advances in machine learning have accelerated the adoption of machine learning systems across many domains, with results that often surpass algorithmic baselines and, in some cases, even human performance. This progress has been enabled by several interacting factors, including the availability of large and increasingly high-quality datasets, neural network architectures capable of learning rich high-dimensional representations, and optimisation procedures that scale well enough in practice. Among the optimisation choices, the loss function plays a central role: it can be viewed as a measure that assigns a penalty to every mistake made by a predictive model. The loss function defines the optimisation landscape and determines how the model's weights are updated during training. In classification, cross-entropy has been the default choice of loss function. However, when large models are trained with cross-entropy, their predictions can become overconfident, which degrades the model's reliability in applications that rely on those predictions. As machine learning systems become increasingly embedded in decision-making processes, choosing a loss function that improves performance under the evaluation metric relevant to a given application domain has become an important practical challenge.
A further difficulty arises because performance is ultimately assessed with evaluation metrics that may be discontinuous, non-differentiable, or highly task-specific. Such metrics are often not directly suitable for training, so surrogate losses such as cross-entropy are used instead. This, however, can create a mismatch between the objective optimised during training and the objective that matters in the final application. Classical metrics such as accuracy are insufficient for many modern applications, so practitioners increasingly rely on cost-sensitive metrics, which weight errors according to their importance, and calibration metrics, which measure the quality of probabilistic predictions. These metrics capture aspects of reliability that accuracy alone cannot express and have therefore become essential tools for assessing performance in practical applications.
This thesis consists of three interconnected studies that aim to improve the alignment between the loss functions used for training and the evaluation metrics that matter in practice. The first study addresses loss selection in cost-sensitive classification, where the evaluation metric weights each type of error by class-specific costs, typically elicited from domain experts. In practice, these costs are rarely known precisely; instead, they appear as uncertain estimates that may change over time and as deployment conditions shift. To model this uncertainty, we treat class-specific costs as random variables with specified distributions and derive families of losses that are mathematically equivalent to the expected misclassification cost under this uncertainty. We identify distribution families that are convenient in practice, such as Beta distributions over cost proportions and Gamma distributions over raw costs, and show how their parameters govern the mean cost, asymmetry, and uncertainty. Experiments across multiple datasets and cost scenarios show that some of the derived losses consistently achieve strong performance on the corresponding cost-sensitive metrics, providing a principled alternative to ad-hoc weighting schemes.
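As an illustration of the kind of construction involved, the following is a minimal sketch in notation chosen here for exposition rather than the thesis's exact formulation. For a binary classifier that predicts probability $q$ for the positive class, suppose the decision threshold is set equal to the cost proportion $c$, a false positive costs $c$, and a false negative costs $1-c$:
\[
\ell_c(y, q) \;=\; (1-c)\,\mathbb{1}[y{=}1]\,\mathbb{1}[q < c] \;+\; c\,\mathbb{1}[y{=}0]\,\mathbb{1}[q \ge c],
\qquad
L_{\alpha,\beta}(y, q) \;=\; \mathbb{E}_{c \sim \mathrm{Beta}(\alpha,\beta)}\bigl[\ell_c(y, q)\bigr].
\]
In the uniform case $\mathrm{Beta}(1,1)$ this expectation evaluates to $\tfrac{1}{2}(y-q)^2$, i.e. half the Brier score, a known special case; other choices of $\alpha$ and $\beta$ shift the implied mean cost, asymmetry, and uncertainty.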
The second study examines model calibration, which characterises how well predicted probabilities correspond to the true conditional class probabilities. Calibration is essential when a model is deployed in practice and concrete decisions are made on the basis of its predicted probabilities, and it is particularly important in safety-critical or cost-critical settings. Although cross-entropy is a strictly proper loss and theoretically encourages calibrated probabilities, deep neural networks trained with it are often miscalibrated. In contrast, focal loss has repeatedly been observed to produce better-calibrated models even without post-hoc adjustments. The thesis investigates this phenomenon and shows that focal loss can be expressed as a composition of two components: a proper loss that is minimised by the true class probabilities and a fixed calibration map that resembles temperature scaling. This decomposition explains why focal loss is often well calibrated in practice, as it couples a proper loss with a built-in calibration transformation applied during training. The decomposition is extended to a broad class of separable losses, and formulas are derived for the associated proper components and calibration maps. These results make it possible to design new losses with desired calibration and discrimination properties. Several of the new losses derived in the thesis, paired with their associated calibration maps, achieve accuracy and calibration results that are competitive with or better than standard baselines.
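As a brief illustration in notation chosen here (the explicit form of the calibration map is derived in the thesis and is not reproduced): writing $q$ for the predicted probability of the true class, the focal loss with parameter $\gamma \ge 0$ is
\[
\mathrm{FL}_\gamma(q) \;=\; -(1-q)^{\gamma}\,\log q,
\]
and the decomposition result states that it can be written as $\mathrm{FL}_\gamma(q) = \ell_{\mathrm{proper}}\bigl(\varphi_\gamma(q)\bigr)$, where $\ell_{\mathrm{proper}}$ is a proper loss and $\varphi_\gamma$ is a fixed calibration map resembling temperature scaling; for $\gamma = 0$ the focal loss reduces to cross-entropy and the map to the identity.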
The third study addresses the gap between the evaluation metrics used during training and application-specific utility functions, which quantify the domain-specific value of decisions based on model predictions. Application-specific utilities may depend on contextual factors that are unavailable during training, and evaluating them can be costly, sometimes requiring simulation or physical experiments. When the application-specific utility cannot be computed at scale, model selection becomes difficult. To mitigate this problem, the thesis proposes a method in which a small neural network trained on validation data helps align the evaluation metric used during training with the application-specific utility function. The thesis analyses the conditions under which such mappings preserve properness and demonstrates the applicability of the method on several tasks, including an inventory optimisation problem. With this method, a model that is well suited to the application can be selected even when direct evaluation of the application-specific utility is impractical.
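To make the model-selection idea concrete, below is a minimal sketch rather than the implementation from the thesis: a small neural network (here scikit-learn's MLPRegressor, with made-up metric and utility values) is fitted on validation-set pairs of cheap upstream metrics and measured application-specific utility, and is then used to rank candidate models without running the costly downstream evaluation for each one.

# Minimal sketch (not the thesis implementation): learn a proxy that maps
# cheap upstream metrics to an application-specific utility, then use the
# proxy to rank candidate models.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical validation data: for a few candidate models we have upstream
# metrics (e.g. log-loss and error rate) and an expensively measured
# downstream utility (e.g. profit from an inventory simulation).
upstream_metrics = np.array([
    [0.35, 0.12],
    [0.42, 0.15],
    [0.30, 0.10],
    [0.55, 0.21],
    [0.28, 0.09],
])  # shape (n_models, n_metrics)
downstream_utility = np.array([102.0, 95.5, 108.3, 80.1, 110.7])

# Small neural network acting as the metric-to-utility proxy.
proxy = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
proxy.fit(upstream_metrics, downstream_utility)

# Rank new candidates by predicted utility instead of evaluating the costly
# downstream utility for each of them.
candidates = np.array([[0.33, 0.11], [0.47, 0.18]])
predicted = proxy.predict(candidates)
print("predicted utilities:", predicted, "-> pick candidate", int(np.argmax(predicted)))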
Taken together, the results of this thesis demonstrate the importance of aligning the training loss with the evaluation metric. The thesis investigates how different choices of loss function shape the predictive behaviour of the trained model and develops methods for improving model performance on the metrics that matter in the application domain.
Recent advances in machine learning (ML) have accelerated the deployment of ML systems across a wide range of domains, often surpassing classical algorithmic baselines and, in some cases, even human performance. This progress has been driven by several interacting factors, including the availability of large and increasingly high-quality datasets, architectures capable of learning rich high-dimensional representations, and optimisation procedures that scale reliably in practice. Among the optimisation choices, the loss function plays a central role. It can be viewed as a scoring rule that assigns a penalty to each prediction, with larger penalties for more severe mistakes; these scores, aggregated over the data, define the optimisation landscape and determine how the model is updated during training. Historically, standard losses such as cross-entropy have been treated as default choices for classification. However, when large-scale models are trained on massive datasets with cross-entropy, they can exhibit undesirable behaviour, most notably overconfident predictions that degrade reliability in downstream applications. As ML systems become embedded in decision pipelines, selecting losses and associated modifications that improve performance under the evaluation metric of interest has become a core practical challenge.
A further difficulty arises because performance is ultimately assessed by evaluation metrics that may be discontinuous, non-differentiable, or highly task-specific. Such metrics often require surrogate losses for training and can create a mismatch between the optimised objective and the quantities used for evaluation. Classical metrics such as accuracy are insufficient for many modern applications, so practitioners increasingly rely on cost-sensitive metrics that weight errors according to their importance and calibration metrics that measure the quality of predicted probabilities. These metrics capture aspects of reliability that accuracy alone cannot express and have therefore become essential tools for assessing real-world performance.
This thesis comprises three interconnected studies that aim to improve the alignment between training losses and evaluation metrics. The first study addresses loss selection in cost-sensitive classification, where the evaluation metric weights each type of error by class-specific costs, typically elicited from domain experts. In practice, these costs are rarely known precisely; instead, they appear as uncertain estimates that may shift over time as deployment conditions change. To model this uncertainty, we treat class-specific costs as random variables with specified distributions and derive families of losses that are mathematically equivalent to the expected misclassification cost under this uncertainty. We identify distribution families that are convenient in practice, particularly Beta distributions over cost proportions and Gamma distributions over raw costs, and show how their parameters govern mean cost, asymmetry, and uncertainty. Experiments across multiple datasets and cost scenarios demonstrate that some of the induced losses consistently achieve strong performance on the corresponding cost-sensitive metrics, providing a principled alternative to ad-hoc weighting schemes.
The second study examines model calibration, which characterises how well predicted probabilities correspond to true conditional class probabilities. Calibration is essential for reliable downstream decision-making, particularly in safety-critical or cost-critical settings. Although cross-entropy is strictly proper and theoretically encourages calibrated probabilities, deep networks trained with it are often miscalibrated. In contrast, focal loss has repeatedly been observed to produce more calibrated models even without post-hoc adjustments. We investigate this phenomenon and show that focal loss can be expressed as a composition of two components: a proper loss that is minimised by the true class probabilities and a fixed calibration map that closely resembles temperature scaling. This decomposition clarifies why focal loss often exhibits strong calibration in practice, as it couples a proper scoring rule with an embedded calibration transformation applied during training. We extend this decomposition to a broad class of separable losses and provide explicit formulas for the associated proper components and calibration maps. These results enable the design of new losses that inherit desirable calibration and discrimination properties, and several of the newly derived losses, paired with their induced calibration maps, achieve accuracy and calibration performance competitive with or better than standard baselines.
The third study addresses the gap between upstream metrics, which summarise general model behaviour, and downstream utilities, which quantify the domain-specific value of decisions informed by model predictions. Downstream utilities may depend on context-dependent factors unavailable during training and may be costly to evaluate, sometimes requiring simulation or physical experimentation. When downstream utility cannot be computed at scale, selecting models becomes challenging. To mitigate this issue, we propose learning a data-driven proxy that maps upstream metrics to downstream utilities using a small neural network trained on a validation set. We analyse conditions under which such mappings preserve properness and demonstrate feasibility on multiple proof-of-concept tasks, including a simple inventory optimisation problem. This approach enables model selection aligned with application-level performance even when direct evaluation of downstream utility is impractical.
Taken together, the contributions of this thesis highlight the importance of aligning the training loss with the evaluation metric of interest. The results offer theoretical and practical insights, along with methodological guidelines, that deepen our understanding of how loss design influences predictive behaviour and provide tools to improve performance under domain-specific metrics.
Description
The electronic version of this doctoral thesis does not include the publications.
Keywords
doctoral theses