Model manipulation attack

Definition

[MP.T.3] A model manipulation attack involves an adversary directly altering the parameters, logic, or architecture of a machine learning model with the intent to compromise its performance, security, or integrity. This type of attack differs from traditional adversarial attacks that focus on crafting malicious input samples or poisoning training data. Instead, the adversary gains access to the model itself and modifies it to achieve specific malicious goals.

Targeted assets

System Asset: machine learning model.

Business Asset: model's parameters.

Security Criterion: integrity.

Attack details

Exploited vulnerabilities

Vulnerabilities:

  1. Lack of model integrity verification (a minimal verification sketch follows this list).
  2. Unauthorized access to the model.
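
For illustration only, the sketch below shows the kind of check whose absence constitutes the first vulnerability: hashing the serialized model file so that any offline manipulation of the stored parameters changes the digest. It is not taken from the target paper, and the file path and recorded digest are hypothetical.

```python
# Illustrative only: a minimal integrity check whose absence is
# vulnerability 1. The path and expected digest are hypothetical.
import hashlib

def model_digest(path: str) -> str:
    """Compute a SHA-256 digest of a serialized model file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path: str, expected_digest: str) -> bool:
    """Compare the current digest with one recorded at training time;
    any manipulation of the stored parameters changes the digest."""
    return model_digest(path) == expected_digest
```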

Threat agent

Threat agent: an adversary in either a white-box or a black-box scenario. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, training data, and learning algorithm. In the black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data and is assumed to be able to interact with the model only by sending it inputs and observing the outputs.

Attack methods

Attack methods:

  1. A loss function is optimized that combines a cross-entropy term, which drives the selected malicious samples toward the benign class, with a weight-regularization term that minimizes parameter deviations from the original model; this ensures the chosen malicious samples are classified as benign while overall behavior is largely preserved. The resulting modifications are applied to the target ML model, and the malicious samples are submitted (see the first sketch after this list).
  2. Using backpropagation, the final layers are optimized to force the target model to assign incorrect labels to the chosen malicious samples while minimizing the impact on overall model accuracy. Only the fully connected layers are modified through gradient descent with an added constraint; the initial convolutional layers are frozen, preserving the model's ability to generalize even as it misclassifies the specific target samples. The resulting modifications are applied to the target ML model, and the malicious samples are submitted (see the second sketch after this list).
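
A minimal PyTorch sketch of the first method follows, assuming white-box access to the victim model. The architecture, sample shapes, and hyperparameters (the regularization weight lam, step count, and learning rate) are illustrative assumptions rather than values from the target paper, and manipulate_model is a hypothetical helper name.

```python
import torch
import torch.nn as nn

def manipulate_model(model, malicious_x, benign_label, lam=1e-2,
                     steps=200, lr=1e-3):
    """Optimize a combined loss: cross-entropy pushing the selected
    malicious samples toward the benign class, plus an L2 penalty that
    keeps every parameter close to its original value."""
    original = [p.detach().clone() for p in model.parameters()]
    target = torch.full((malicious_x.size(0),), benign_label,
                        dtype=torch.long)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        # Term 1: misclassify the chosen malicious samples as benign.
        loss = ce(model(malicious_x), target)
        # Term 2: weight regularization that minimizes parameter
        # deviations from the original model, so overall behavior is
        # largely preserved.
        for p, p0 in zip(model.parameters(), original):
            loss = loss + lam * torch.sum((p - p0) ** 2)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    # Hypothetical stand-ins for the victim model and malicious inputs.
    victim = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    malicious = torch.randn(8, 20)
    manipulate_model(victim, malicious, benign_label=0)
    print(victim(malicious).argmax(dim=1))  # typically all 0 (benign) now
```

The regularization weight lam trades off stealth (small parameter deviation, preserved overall accuracy) against the attack's success on the chosen samples.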
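
The second method can be sketched in the same way. The small CNN below is an assumed stand-in for the victim architecture; the points carried over from the description are that the initial convolutional layers are frozen and only the fully connected layers are updated by gradient descent, with an L2 constraint limiting their drift.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Assumed victim architecture: convolutional feature extractor
    followed by fully connected classification layers."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 7 * 7, 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def manipulate_final_layers(model, malicious_x, benign_label,
                            lam=1e-3, steps=300, lr=0.1):
    """Fine-tune only the fully connected layers so the chosen
    malicious samples receive the benign label."""
    # Freeze the initial convolutional layers: their representations,
    # and hence the model's ability to generalize, are preserved.
    for p in model.features.parameters():
        p.requires_grad_(False)
    fc_params = list(model.classifier.parameters())
    original = [p.detach().clone() for p in fc_params]
    target = torch.full((malicious_x.size(0),), benign_label,
                        dtype=torch.long)
    opt = torch.optim.SGD(fc_params, lr=lr)  # plain gradient descent
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        # Backpropagation forces the fully connected layers to assign
        # the benign label to the chosen malicious samples...
        loss = ce(model(malicious_x), target)
        # ...under an added L2 constraint limiting how far those layers
        # drift from their original weights.
        for p, p0 in zip(fc_params, original):
            loss = loss + lam * torch.sum((p - p0) ** 2)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    victim = SmallCNN()
    malicious = torch.randn(4, 1, 28, 28)  # hypothetical malicious inputs
    manipulate_final_layers(victim, malicious, benign_label=0)
    print(victim(malicious).argmax(dim=1))  # typically all 0 (benign) now
```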

Impact and harm

Impact and harm: Negates the integrity of the targeted machine learning model, leading to a reduction in the model's accuracy.

Security countermeasures

Security requirements

Security requirement: The machine learning system's actions and decisions must be resistant to model manipulation attacks.

Security controls

Security controls:

  1. None stated in the target paper.