Definition
[IT.T.2] An evasion attack is a type of adversarial attack where an adversary manipulates input data at the test or inference stage to cause a trained machine learning (ML) model to misclassify it, thus evading correct detection or classification. The core idea is to exploit vulnerabilities or blind spots in the model without altering the training data or the model's parameters.
Targeted assets
System Asset: ML system input/API.
Business Asset: input data.
Security Criteria: integrity, availability.
Attack details
Exploited vulnerabilities
Vulnerabilities:
- There exist special inputs that are close to correctly classified samples but are completely misclassified by a machine learning model.
- The machine learning model may produce unexpected results if the input incorporates features that are not covered by the training dataset's feature space.
- There is inherent ambiguity between the decision boundaries learned by a machine learning model and the true decision boundaries.
- Tighter fitting (overfitting) makes a model more vulnerable to the deviations introduced by an adversary: there is an inverse relationship between a single model's fitting accuracy and its robustness to adversarial evasion.
- The machine learning model depends on features that can be mimicked and manipulated by the adversary.
- Reliance of the model on non-predictive (low-value) features.
- The machine learning model can produce erroneous output when the input incorporates features that exploit the imperfect decision boundary produced by the learning algorithm. This may happen because the training dataset is limited or because the learning algorithm has limited capacity.
Threat agent
Threat agent: black-box scenario. In a black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data and can only interact with the model by sending it inputs and observing the outputs.
Attack methods
Attack methods:
- Iterative probing, gradient estimation, or use of a surrogate model to craft adversarial examples (a finite-difference gradient-estimation sketch follows this list).
- A GAN-based approach generates adversarial samples that resemble benign files in feature space without querying the target model or accessing its internal structure. The GAN consists of a generator network that transforms malicious features into benign-looking distributions and a critic network that assesses how benign these features appear. This method evades detection by state-of-the-art ML detectors, including VirusTotal, without requiring any queries.
- The attacker produces adversarial samples that lie close to incorrect classes and provides them as input to the machine learning system with the aim of causing misclassification. A gradient descent strategy can be used to solve the optimization problem that produces an adversarial sample; the Fast Gradient Sign Method (FGSM) is a more computationally efficient alternative (see the sketch after this list).
- Adversarial samples are created iteratively by modifying an input through a gradient-based algorithm that reduces the classifier's discriminant function value and pushes the sample across the decision boundary (illustrated in the sketch after this list). The produced adversarial samples are then submitted to the machine learning system for processing.
- Malicious (surrogate) machine learning models are trained on the features analyzed by the target ML model and on the target model's observed behavior. These surrogate models are then used to produce adversarial samples, which are submitted to the target model for analysis.
- Influential features are identified algorithmically through feature-attribution methods. Non-essential features are selected and modified via gradient-guided optimization, computing the gradient of the target model's classification function, and the malicious input is modified iteratively until the target model misclassifies it. The resulting malicious input is then submitted to the target model for analysis.
- An adversarial sample is produced that exploits weaknesses in the target ML model's classification capabilities in order to evade correct classification. The produced adversarial sample is then submitted to the target ML model for analysis.
- In the targeted variant, the attacker produces adversarial samples that lie close to a chosen target class and provides them as input to the machine learning system to cause a specific misclassification. As above, a gradient descent strategy can solve the underlying optimization problem, with the Fast Gradient Sign Method as a more computationally efficient alternative (the targeted mode is shown in the sketch below).
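The gradient-based methods above can be illustrated with a minimal sketch. The code below assumes a differentiable PyTorch classifier `model` operating on single samples with inputs scaled to [0, 1]; the function names and the `epsilon`/`alpha` step sizes are illustrative placeholders, not taken from any specific library.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon, targeted=False):
    """Single-step Fast Gradient Sign Method.

    Untargeted mode steps in the direction that increases the loss for the
    true label y; targeted mode steps in the direction that decreases the
    loss for a chosen (incorrect) label y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    step = epsilon * grad.sign()
    x_adv = x_adv - step if targeted else x_adv + step
    return x_adv.clamp(0.0, 1.0).detach()

def iterative_evasion(model, x, y_true, alpha=0.01, max_steps=100):
    """Iteratively nudge a single sample along the loss gradient until the
    classifier's decision changes, i.e. the sample crosses the decision
    boundary."""
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        if logits.argmax(dim=1).item() != int(y_true):
            break  # misclassified: evasion achieved
        loss = F.cross_entropy(logits, y_true)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).clamp(0.0, 1.0).detach()
    return x_adv.detach()
```

The `targeted` flag distinguishes the untargeted variant (push the sample away from its true class) from the targeted variant (pull it toward a chosen class).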
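For the black-box scenario, gradients can be approximated from query responses alone. This is a rough sketch assuming a hypothetical `predict_proba` callable that returns a probability vector for a single flattened feature vector; in practice the approach is query-intensive and is one of the behaviors the stateful-analysis control below is meant to catch.

```python
import numpy as np

def estimate_gradient(predict_proba, x, true_class, delta=1e-3):
    """Estimate the gradient of the model's confidence in the true class by
    finite differences, using only black-box queries."""
    grad = np.zeros_like(x, dtype=float)
    base = predict_proba(x)[true_class]
    for i in range(x.size):
        x_pert = x.astype(float).copy()
        x_pert.flat[i] += delta
        grad.flat[i] = (predict_proba(x_pert)[true_class] - base) / delta
    return grad

def black_box_evasion(predict_proba, x, true_class, step=0.05, max_iters=50):
    """Descend the estimated gradient of the true-class confidence until the
    target model stops predicting the true class."""
    x_adv = np.asarray(x, dtype=float).copy()
    for _ in range(max_iters):
        if int(np.argmax(predict_proba(x_adv))) != true_class:
            break  # evasion achieved
        grad = estimate_gradient(predict_proba, x_adv, true_class)
        x_adv = np.clip(x_adv - step * np.sign(grad), 0.0, 1.0)
    return x_adv
```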
Impact and harm
Impact and harm: Negates the integrity and/or availability of the targeted machine learning model, leading to misclassification of benign and malicious inputs.
Security countermeasures
Security requirements
Security requirement: The machine learning system must be resistant to adversarial attacks.
Security controls
Security controls:
- Randomization - introduce unpredictability into the model's response. Example approach: randomly generate and train multiple classifiers on different subsets of the feature space and aggregate the final output from the predictions of all classifiers (see the first sketch after this list).
- Complexity - increase the complexity of the model's decision function, for example by making the boundary non-linear or fractalized.
- Adversarial re-training - the model is retrained on a training dataset that includes adversarial samples (sketched after this list). The cost of this method may be prohibitive.
- Utilization of regularization terms that promote enclosure of the legitimate class.
- Explainable AI design - use of methods such as LIME and LASSO to learn the decision boundary around specific input points, which can help in designing more robust ML models.
- Gradient masking/defensive distillation - produces a distilled model with a smoothed-out decision surface. Research has shown that this method may not be effective against evasion attacks, and the produced model may be as vulnerable as the original.
- Dimensionality reduction - Principal Component Analysis (PCA) can be used to reduce the number of components the classifier operates on (a pipeline sketch follows this list). This method may trade some of the algorithm's performance for robustness.
- Stateful analysis - query analysis: maintain a query history and analyze it with meta-detectors (a minimal query-monitor sketch follows this list).
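A minimal sketch of the randomization control from the first item above, assuming scikit-learn and non-negative integer class labels; the classifier choice, subset fraction, and ensemble size are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_random_subspace_ensemble(X, y, n_models=10, subset_frac=0.6, seed=0):
    """Train several classifiers, each on a random subset of the features,
    so the attacker cannot easily predict which features drive the final
    decision."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(subset_frac * n_features))
    ensemble = []
    for _ in range(n_models):
        idx = rng.choice(n_features, size=k, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X[:, idx], y)
        ensemble.append((idx, clf))
    return ensemble

def predict_ensemble(ensemble, X):
    """Aggregate the classifiers' predictions by majority vote."""
    votes = np.stack([clf.predict(X[:, idx]) for idx, clf in ensemble]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```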
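Adversarial re-training can be sketched as a training loop that augments each batch with adversarial copies. The example assumes the `fgsm` helper from the attack sketch above and a standard PyTorch `DataLoader` and optimizer; it is illustrative only, and crafting adversarial samples for every batch is what makes the method expensive.

```python
import torch
import torch.nn.functional as F

def adversarial_retraining(model, loader, optimizer, epsilon=0.03, epochs=5):
    """Retrain the model on batches that mix clean samples with adversarial
    copies generated on the fly (here with the fgsm sketch above)."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = fgsm(model, x, y, epsilon)   # adversarial copies of the batch
            x_mix = torch.cat([x, x_adv])        # clean + adversarial inputs
            y_mix = torch.cat([y, y])            # labels stay the same
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_mix), y_mix)
            loss.backward()
            optimizer.step()
    return model
```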
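The dimensionality-reduction control can be as small as a scikit-learn pipeline; the number of components and the downstream classifier below are assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Project inputs onto their leading principal components before classification;
# discarding low-variance directions removes some of the room an adversary has
# to hide perturbations, at a possible cost in clean accuracy.
robust_clf = make_pipeline(PCA(n_components=20), SVC())
# robust_clf.fit(X_train, y_train); robust_clf.predict(X_test)
```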
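The stateful-analysis control can be sketched as a simple query monitor; the window size, distance metric, and thresholds are placeholders, and a production meta-detector would track history per client and use richer features.

```python
from collections import deque
import numpy as np

class QueryMonitor:
    """Keep a sliding window of recent queries and flag a client whose
    queries cluster unusually close together: the typical footprint of
    iterative probing for adversarial examples."""

    def __init__(self, window=200, distance_threshold=0.5, min_close=10):
        self.history = deque(maxlen=window)
        self.distance_threshold = distance_threshold
        self.min_close = min_close

    def observe(self, x):
        """Record a query and return True if it looks like probing."""
        x = np.asarray(x, dtype=float)
        close = sum(
            np.linalg.norm(x - past) < self.distance_threshold
            for past in self.history
        )
        self.history.append(x)
        return close >= self.min_close
```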