Relationship between attacks and the input data asset

General Image Description

The UML class diagram visualizes a threat model with 13 threats, identified through the conducted systematic literature review, which target the input data for the initial compromise. The compromise of the input data is carried out either through the “ML system input/API” or “Processing hardware running the ML model” system assets, or by targeting the “Machine learning model” itself.

Input data: real-time, user-supplied, external textual data that is ingested by the system and passed to the model for inference and analysis. For LLMs, the input data primarily consists of textual queries submitted to the model to produce a target response, although with the development of multimodal models the input may also incorporate additional modalities, such as images.

List of threats

  1. [IT.T.1] Adversarial attacks are malicious attempts to fool or subvert machine learning (ML) models by exploiting weaknesses in their algorithms or training data. In this paper, these attacks are considered to occur during the testing/inference phase (evasion attacks). The goal of these attacks is to undermine the performance, reliability, or security of ML systems through malicious input data.
    1. [IT.T.2] An evasion attack is a type of adversarial attack where an adversary manipulates input data at the test or inference stage to cause a trained machine learning (ML) model to misclassify it, thus evading correct detection or classification. The core idea is to exploit vulnerabilities or blind spots in the model without altering the training data or the model's parameters (a minimal gradient-based sketch is given after this list).
    2. [IT.T.12] A Man-in-the-Middle (MitM) attack in the context of machine learning is a type of adversarial attack where an attacker stealthily intercepts and alters the communication between two parties (e.g., a data source and a machine learning classifier) to deliver malicious payloads or manipulate the data, with the aim of compromising the integrity or availability of the ML system.
    3. [IT.T.3] A jailbreak attack is a type of security attack that exploits vulnerabilities within a constrained system (such as an aligned LLM) to bypass imposed restrictions and achieve privilege escalation. In the context of LLMs, jailbreaking refers to the practice of circumventing or overriding alignment guardrails that are designed to govern the scope of content the model can produce.
  2. [IT.T.4] A model inversion attack is a type of privacy attack where an adversary aims to reconstruct training samples from a machine learning model by exploiting the model's outputs. The goal is to infer specific features or attributes of the hidden input data used to train the model. This type of attack allows an adversary to directly learn information about the training dataset.
  3. [IT.T.5] A model extraction attack (also known as model stealing) is a type of security attack where an adversary aims to replicate the functionality of a target machine learning model without having direct access to its internal parameters or training data. The attacker interacts with the target model, typically through a prediction API, to gather information and train a substitute model that mimics the behavior of the original. This allows the adversary to gain insights into the training data and potentially launch further attacks, such as evasion or membership inference attacks (a query-based extraction sketch is given after this list).
  4. [IT.T.13] A membership inference attack (MIA) is a type of privacy attack where an adversary tries to determine whether a specific data record was part of the training dataset of a machine learning model. It exploits the tendency of ML models to behave differently on data they have been trained on compared to unseen data. A successful MIA signifies that the privacy of the training data is not sufficiently protected once the trained ML model is released (a confidence-based sketch is given after this list).
  5. [IT.T.11] Cyber-physical attacks against machine learning models refer to attacks that exploit the interaction between the cyber (computing and communication) components and the physical components of a system. These attacks target machine learning models that are integrated into cyber-physical systems (CPS), aiming to cause physical consequences by manipulating the data, the training process, or the model itself.
    1. [IT.T.10] A hardware side-channel attack is a type of attack that exploits vulnerabilities in the physical implementation of a machine learning (ML) model to extract sensitive information, such as model parameters, training data, or the model's architecture. Instead of targeting the ML algorithm directly, these attacks measure and analyze side-channel information that is correlated with the ML assets.
  6. [IT.T.6] A property inference attack is a type of privacy attack that aims to infer confidential information or attributes about the training dataset used to train a machine learning model. The attacker attempts to determine certain properties or characteristics that are present in the training data, which the model provider does not want to reveal. This attack does not directly manipulate the model but extracts private information without disrupting the model’s normal training process.
  7. [IT.T.7] A Denial of Service (DoS) attack aims to disrupt the normal functioning and reduce the availability of a machine learning system, making it unusable for legitimate users. This is typically achieved by overwhelming the system with a high volume of requests or resource-intensive tasks, exhausting its computational resources.
  8. [IT.T.8] A Denial of Wallet (DoW) attack is a type of attack where an adversary exploits the cost-per-use model of cloud-based AI services by generating an excessive number of operations or resource-intensive tasks. This places an unsustainable financial burden on the service provider, potentially causing serious strain or even financial ruin.
  9. [IT.T.9] Poisoning attacks involve an adversary compromising a machine learning (ML) model by manipulating the training data. The attacker injects malicious data into the training dataset or alters the original training data. The high-level goal is to maximize the generalization error in the classification process or reduce the system’s performance. These attacks occur during the training process, aiming to shift the decision boundaries of classifiers (a simple label-flipping sketch is given after this list).
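
As a concrete illustration of the evasion threat [IT.T.2], the following is a minimal sketch in the style of the fast gradient sign method. It assumes a differentiable PyTorch classifier `model`, inputs `x` in the range [0, 1], and integer labels `y`; the function name and the perturbation budget `epsilon` are illustrative choices, not taken from the reviewed literature.

```python
import torch

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Evasion sketch: take one signed-gradient step on the input that
    increases the classification loss, leaving the model parameters and
    training data untouched (inference-time attack)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # Move each input feature in the direction that maximizes the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Assumes inputs are normalized to [0, 1]; keep the perturbed input valid.
    return x_adv.clamp(0.0, 1.0).detach()
```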
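The model extraction threat [IT.T.5] can likewise be sketched as a query-and-retrain loop. The sketch below assumes a hypothetical prediction API `query_api` that returns a class label for a feature vector; the query distribution and the substitute model family are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_substitute(query_api, n_queries=5000, n_features=20, seed=0):
    """Model stealing sketch: label attacker-chosen queries through the
    victim's prediction API, then fit a local substitute that mimics it."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_queries, n_features))   # attacker-chosen inputs
    y = np.array([query_api(x) for x in X])        # victim's predicted labels
    return LogisticRegression(max_iter=1000).fit(X, y)
```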
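A very simple membership inference heuristic [IT.T.13] thresholds the target model's confidence on a candidate record, exploiting the tendency of models to be more confident on data they were trained on. The threshold value is an assumed placeholder; practical attacks calibrate it, for example with shadow models.

```python
import numpy as np

def infer_membership(model_confidences, threshold=0.9):
    """Membership inference sketch: records on which the target model is
    highly confident are guessed to have been part of the training set."""
    return np.asarray(model_confidences) >= threshold  # True -> "member"
```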
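Finally, the poisoning threat [IT.T.9] can be illustrated by the classic label-flipping strategy: a fraction of the training labels is corrupted before training, shifting the learned decision boundary. The flip fraction and the integer class encoding below are assumptions made for the sketch.

```python
import numpy as np

def flip_labels(y, n_classes, flip_fraction=0.1, seed=0):
    """Poisoning sketch: randomly reassign a fraction of training labels
    to a different class before the model is trained."""
    rng = np.random.default_rng(seed)
    y_poisoned = np.asarray(y).copy()
    idx = rng.choice(len(y_poisoned),
                     size=int(flip_fraction * len(y_poisoned)),
                     replace=False)
    # Shift each selected label by a random non-zero offset modulo n_classes,
    # guaranteeing the poisoned label differs from the original.
    y_poisoned[idx] = (y_poisoned[idx]
                       + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_poisoned
```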