Adversarial attack

Definition

[IT.T.1] Adversarial attacks are malicious attempts to fool or subvert machine learning (ML) models by exploiting weaknesses in their algorithms or training data. In this paper, these attacks are considered to occur at the testing/inference phase (evasion attacks). The goal of these attacks is to undermine the performance, reliability, or security of ML systems through malicious input data.

Targeted assets

System Asset: ML system input/API.

Business Asset: input data.

Security Criteria: integrity.

Attack details

Exploited vulnerabilities

Vulnerabilities:

  1. There exist special inputs that are close to correctly classified samples but are completely misclassified by a machine learning model.
  2. There exists inherent ambiguity between the decision boundaries of a machine learning model and the true decision boundaries.
  3. A system flaw that can be accessed and exploited externally.

Threat agent

Threat agent: attackers in white-box and black-box scenarios. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, training data, and learning algorithm. In the black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data, and is assumed to interact with the model only by sending it inputs and observing the outputs.

Attack methods

Attack methods:

  1. The target model is first reverse engineered to create a shadow model. Adversarial samples are generated against the shadow model with gradient-based methods (e.g., an FGSM-like approach) and are then submitted to the target model to cause misclassification.
  2. Submission of a crafted adversarial sample. A set of suitable perturbations is determined by solving a constrained optimization problem; the set is then reduced based on imposed spatial and physical constraints. The selected perturbation is applied in the target setting, resulting in an adversarial sample.
  3. The target model is queried a limited number of times. The retrieved outputs are used to train a substitute model (a feed-forward neural network, FFNN), with the data augmented by a custom augmentation algorithm and the model further retrained. Adversarial samples are then crafted from the trained substitute model; generation accounts for the substitute model's cost gradients, and the degree of perturbation is limited. The produced adversarial samples are submitted to the target model.
  4. An adversarial sample can be produced through one of four algorithms: two gradient-based (Carlini & Wagner, Fast Gradient Method) and two gradient-free (Decision-based attack, Simulated Annealing).
  5. Submission of an adversarial input. The adversarial input can be generated with an algorithm that perturbs an original benign sample. Covered algorithms: L-BFGS method, Fast gradient sign method (FGSM), Universal adversarial perturbations (UAP), UPSET and ANGRI methods, C&W attack method. (A minimal FGSM/BIM sketch follows this list.)
  6. The adversary crafts malicious input that is misclassified by the target model. The malicious input can be encrypted so that it is statistically identical to normal input (a "polymorphic blending attack"), or it can integrate features of a benign input (a "good word attack").
  7. Using Jaccard similarity, a malicious sample is iteratively modified to arrive at the version closest to a benign sample that is misclassified by the target ML model. The final malicious sample is submitted to the target ML model for analysis.
  8. Features of the malicious sample are randomly modified. The produced malicious sample is submitted to the target ML model for analysis.
  9. A sample is iteratively modified with a guidance image to produce an adversarial sample of the target class. Each step takes a seed, mutates the image, and evaluates the result. The final adversarial sample is submitted as input to the target machine learning system.
  10. A feature within the malicious sample is iteratively selected and greedily updated with the aim of increasing classification error. Features are ranked by information gain and then selected bi-directionally: features are both added to and eliminated from the target malicious sample. The produced malicious sample is submitted to the target ML model for analysis.
  11. Addition of perturbations to the input data to be processed by the target AI system.
  12. Requests are made to the target model and its responses are collected. The collected samples can be used to create perturbations for adversarial attacks with the universal adversarial perturbation (UAP) method. Alternatively, a surrogate model is trained on the responses and used to estimate the target's responses; the Basic Iterative Method (BIM) is then used to produce perturbations. The perturbations are used to create adversarial samples.
  13. With knowledge of the target model, adversarial examples can be produced directly with the Basic Iterative Method (BIM).
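
For concreteness, the sketch below illustrates the gradient-based crafting referenced above (the FGSM-like approach in method 1, FGSM in method 5, and BIM in methods 12 and 13). It is a minimal sketch, assuming a differentiable PyTorch classifier with inputs scaled to [0, 1]; the model, eps, alpha, and steps values are illustrative placeholders rather than parameters from any specific attack.

```python
# Minimal FGSM and BIM sketch (assumes a PyTorch classifier whose
# inputs lie in [0, 1]; all hyperparameters are placeholders).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Fast Gradient Sign Method: a single signed-gradient step."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to
    # the valid input range.
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

def bim(model, x, y, eps, alpha, steps):
    """Basic Iterative Method: repeated small FGSM steps, projected
    back into an eps-ball around the original input."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = fgsm(model, x_adv, y, alpha)
        # Project onto the L-infinity ball of radius eps around x.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv
```

In the black-box methods (1, 3, and 12), the same crafting is applied to a shadow or substitute model and the resulting samples are transferred to the target model.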

Impact and harm

Impact and harm: Compromises the integrity of the targeted machine learning model, leading to the misclassification of benign and malicious inputs.

Security countermeasures

Security requirements

Security requirement: The machine learning system must be resistant to adversarial attacks.

Security controls

Security controls:

  1. Adversarial training: using adversarial examples in training (see the sketch after this list).
  2. Data randomization: modification of the input data.
  3. Data compression of the input.
  4. Addition of a masking layer to control dominant weights, thus reducing the model's sensitivity.
  5. Regularization of the model based on its outputs and inputs during training.
  6. Feature squeezing: utilize multiple models to determine whether a sample is adversarial.
  7. Utilize generative adversarial networks to train a model.
  8. Adversarial training: inclusion of adversarial samples in the training dataset, training the model to minimize the loss incurred from those samples.
  9. Rectification: addition of a new "pre-input" layer acting as a perturbation detector.
  10. Ensemble adversarial learning: provides robustness against the Fast Gradient Method, though the Carlini attack remains successful.
  11. Implement a security evaluation mechanism: a reactive defense updates the model based on new attacks; a proactive defense considers possible security deficiencies before deploying the model.
  12. Defense mechanisms during the training phase: enhance the generalization capability of the machine learning model. Possible methods: Bagging (bootstrap aggregating algorithm), RSM (random subspace method) by Biggio et al., and the ANTIDOTE algorithm by Rubinstein et al.
  13. Defense mechanisms at the prediction/test phase. Modify the machine learning model to make it more resistant to adversarial samples through the following methods:
      - Adversarial training: incorporate adversarial data into the training data; this is a non-adaptive approach.
      - Data compression (image specific): a high compression rate may lead to a loss in classification accuracy.
      - Foveation (image specific): apply a foveation method to an image region; effectiveness against more powerful attacks has not been validated.
      - Gradient masking: modify the gradients of the input data and the loss/activation function, training the model by penalizing the degree of input variation.
      - Defensive distillation: transfer the knowledge of the model to a new model.
      - DeepCloak: add a new trained layer before the network's decision layer, removing prominent features by masking dominant weights; this method does not require model retraining.
      Alternatively, append an external model:
      - GAN-based method: conduct GAN training of the target model.
      - Feature squeezing (image specific): modify properties of an image and compare the classification results; if there is a significant difference, the image is considered adversarial.
      - Universal perturbation method (image specific): implant a perturbation rectifying network (PRN) before the input layer; the network is trained separately to rectify input images before feeding them into the target model.
  14. Limit access to information about the training procedure and training data. Keeping the training data secret can be difficult.
  15. Harden classifiers through higher order patterns, such as n-grams or randomized feature selection.
  16. Create an adversary-aware classifier by adjusting the likelihood function to anticipate the attacker's changes.
  17. Introduce randomness into the classification process; this may decrease the utility of the responses.
  18. Limit the feedback that is provided to the attacker or provide intentionally misleading responses. This may reduce the utility of the responses.
  19. Query number restriction.
  20. Leverage the proposed fuzzing attack framework to improve the robustness of the defense mechanisms against bulk-generated adversarial examples.
  21. Adversarial training on examples generated by such attacks to strengthen model robustness.
  22. Robust feature selection: a feature selection method that considers the importance of each feature for classification and the cost of manipulating it, probabilistically selecting features inversely proportional to their attack vulnerability.
  23. Ensemble learning: a combination of multiple classifiers trained on different feature subsets; the classifiers are designed so that all features are integrated and the classifiers differ from each other.
  24. Certified defense: achieving a constant model prediction within a specific bound; a base classifier is trained with noise to average the model's output over noisy samples.
  25. Scrambling: an operation for radio unit (RU) ordering during the training and inference stages that mitigates adversarial attacks by obfuscating AI model input relationships, significantly reducing attack effectiveness. This approach is not possible in other domains, such as image, text, or audio, where the semantics of the data would be lost.
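
As an illustration of the adversarial training controls (1, 8, and 21), the sketch below augments each training batch with adversarial samples crafted on the fly. It is a minimal sketch, assuming a PyTorch classifier and FGSM as the crafting method; the model, optimizer, data loader, and eps are placeholders.

```python
# Minimal adversarial training sketch: each batch is trained on both
# clean samples and FGSM-perturbed counterparts (all names are
# illustrative placeholders).
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps):
    model.train()
    for x, y in loader:
        # Craft adversarial counterparts of the current batch (FGSM).
        x_pert = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_pert), y).backward()
        x_adv = (x_pert + eps * x_pert.grad.sign()).clamp(0.0, 1.0).detach()

        # Minimize the loss on clean and adversarial samples jointly;
        # zero_grad clears gradients accumulated during crafting.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```

As control 10 notes for ensemble adversarial learning, robustness against the Fast Gradient Method does not imply robustness against stronger attacks such as the Carlini attack.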