Definition
[IT.T.9],[TD.T.5] Poisoning attacks involve an adversary compromising a machine learning (ML) model by manipulating its training data. The attacker injects malicious data into the training dataset or alters the original training data. The high-level goal is to maximize the model's generalization error or otherwise degrade the system's performance. These attacks occur during the training process and aim to shift the decision boundaries of classifiers.
Targeted assets
System Asset: machine learning training system, ML system input/API.
Business Asset: input data, training data.
Security Criteria: integrity, availability.
Attack details
Exploited vulnerabilities
Vulnerabilities:
- The training dataset is susceptible to unauthorized modifications.
- The target model's performance can be skewed through training on malicious samples.
- Training data directly influences the performance of the machine learning model.
- Gathering of public data without validation and sanitization.
- Lack of training data integrity checks.
- Unauthorized access to the training data.
- Lack of sanitization of publicly acquired training data.
- Usage of unverified and unsanitized training data.
Threat agent
Threat agent: white-box scenario. In this scenario, the attacker is assumed to have complete knowledge of the target machine learning model, its architecture, parameters, utilized training data, and the learning algorithm.
Attack methods
Attack methods:
- The adversary introduces spurious features into the positive training samples; this is a "correlated outlier attack". The adversary mimics legitimate traffic as malicious and submits it to the target classifier so that the crafted malicious traffic is used for training the target model; this is an "allergy attack".
- Addition of malicious training samples to the dataset and retraining of the target model on the poisoned dataset. The malicious samples are produced by generating random noise and adding it to original samples (see the noise-addition sketch after this list).
- Unauthorized malicious modification of the training dataset. Labels within the training dataset can be flipped to achieve the poisoning attack through the following specific attacks: random label flipping (RLF) attack (randomly modifies labels within a subset of the training data), nearest-prior label flipping (NPLF) attack (distorts labels of samples near the decision boundary), farthest-prior label flipping (FPLF) attack (distorts labels of samples far from the decision boundary), farthest-rotation label flipping (FRLF) attack, adversarial label flipping (ALF) attacks (attempt to maximize the classification error through distorted examples). See the label-flipping sketch after this list.
- The attacker mislabels targeted samples within the training dataset to minimize the loss function on the relabeled samples.
- Malicious samples are derived and injected into the target model's training dataset. The target ML model is trained on the modified training dataset.
- Poisoning of general (public) training data.
- Poisoning the fine-tuning process.
- Embedding process poisoning (conversion of text into numerical vectors).
- Poisoning of public sources through temporary content modification or domain re-acquisition.
- The adversary carefully inserts generated malicious samples into the training dataset. A possible method is to introduce spurious features in the training dataset to mislead the classifier and then provide malicious input lacking the spurious features, thus bypassing the defenses; this is utilized in a "red herring attack" (see the spurious-feature sketch after this list).
- Label values of half of the target training dataset are iteratively flipped.
- Label values of samples from the dataset are flipped based on their distance from the classifier's decision hyperplane.
- The attacker chooses an initial guess for each poisoned sample. The chosen guess is used to retrain the model, and the resulting change in the model's performance is used to update the guessed poisoned sample through a (sub)gradient-ascent computation. The final set of produced poisoned samples is injected into the target model's dataset (see the gradient-ascent sketch after this list).
- Labels of the samples within the training set of the targeted ML model are mixed. The target ML model trains on the modified training data set.
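A minimal sketch of the spurious-feature idea behind the "correlated outlier" and "red herring" attacks, assuming a tabular binary classification task with labels in {0, 1}; the function name, feature index, and spurious value are illustrative and not taken from any specific toolkit:

```python
import numpy as np

def inject_spurious_feature(X, y, feature_idx=0, spurious_value=10.0, target_class=1):
    """Set a chosen feature to a distinctive value on the positive training samples
    so the classifier learns to rely on it; at test time the attacker simply omits
    the spurious value and the malicious input slips past the learned boundary."""
    X_poisoned = X.copy()
    X_poisoned[y == target_class, feature_idx] = spurious_value
    return X_poisoned
```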
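A noise-addition sketch: copies of original samples are perturbed with Gaussian noise and appended to the training set before the victim model is retrained. The synthetic dataset, the scikit-learn logistic regression victim, and all parameter values are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def poison_with_noise(X, y, n_poison=50, noise_scale=1.0, seed=0):
    """Copy random training samples, perturb them with Gaussian noise,
    and append them (with their original labels) to the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_poison, replace=True)
    X_noise = X[idx] + rng.normal(scale=noise_scale, size=(n_poison, X.shape[1]))
    return np.vstack([X, X_noise]), np.concatenate([y, y[idx]])

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_poisoned, y_poisoned = poison_with_noise(X, y)
victim = LogisticRegression(max_iter=1000).fit(X_poisoned, y_poisoned)  # retraining on the poisoned set
```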
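A label-flipping sketch covering the random (RLF) and boundary-distance (NPLF/FPLF) variants, assuming binary labels in {0, 1} and using a linear SVM as a surrogate to measure distance to the decision hyperplane; all names and parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def random_label_flip(y, flip_fraction=0.5, seed=0):
    """RLF: randomly flip the labels of a chosen fraction of the training set."""
    rng = np.random.default_rng(seed)
    y_flipped = y.copy()
    idx = rng.choice(len(y), size=int(flip_fraction * len(y)), replace=False)
    y_flipped[idx] = 1 - y_flipped[idx]
    return y_flipped

def distance_based_flip(X, y, n_flips=50, nearest=True):
    """NPLF/FPLF: flip labels of the samples nearest to (or farthest from)
    a surrogate classifier's decision hyperplane."""
    surrogate = LinearSVC(dual=False).fit(X, y)
    distance = np.abs(surrogate.decision_function(X))
    order = np.argsort(distance) if nearest else np.argsort(-distance)
    y_flipped = y.copy()
    y_flipped[order[:n_flips]] = 1 - y_flipped[order[:n_flips]]
    return y_flipped
```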
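A simplified gradient-ascent sketch: the candidate poison point is repeatedly evaluated by retraining the victim and following a finite-difference estimate of the gradient of the validation loss. A scikit-learn logistic regression stands in for the victim model; the dataset, step sizes, and function names are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def loss_with_poison(x_p, y_p, X_tr, y_tr, X_val, y_val):
    """Retrain the victim with one candidate poison point and return its validation loss."""
    X_aug = np.vstack([X_tr, x_p.reshape(1, -1)])
    y_aug = np.append(y_tr, y_p)
    model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    return log_loss(y_val, model.predict_proba(X_val))

def craft_poison_point(X_tr, y_tr, X_val, y_val, y_p=1, steps=20, lr=0.5, eps=1e-2, seed=0):
    """Start from a random initial guess and ascend a numerical estimate of
    d(validation loss)/d(poison point) to maximize the victim's error."""
    rng = np.random.default_rng(seed)
    x_p = X_tr[rng.integers(len(X_tr))].astype(float).copy()   # initial guess
    for _ in range(steps):
        base = loss_with_poison(x_p, y_p, X_tr, y_tr, X_val, y_val)
        grad = np.zeros_like(x_p)
        for j in range(len(x_p)):                               # finite-difference (sub)gradient
            x_eps = x_p.copy()
            x_eps[j] += eps
            grad[j] = (loss_with_poison(x_eps, y_p, X_tr, y_tr, X_val, y_val) - base) / eps
        x_p += lr * grad                                        # gradient ascent on the victim's loss
    return x_p

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
poison_point = craft_poison_point(X_tr, y_tr, X_val, y_val)     # to be injected into the victim's training set
```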
Impact and harm
Impact and harm: Compromises the integrity and availability of the targeted machine learning model, leading to the misclassification of malicious input.
Security countermeasures
Security requirements
Security requirement: The machine learning system must be resistant to poisoning attacks.
Security controls
Security controls:
- Quantum Neural Networks (QNNs) are more resistant to the attack than conventional neural networks.
- Utilize robust statistics; robustness can be measured with influence functions and the breakdown point. The aim is to utilize a procedure with a high breakdown point and a bounded influence function.
- Implement a security evaluation mechanism: a reactive defense updates the model based on new attacks; a proactive defense considers possible security deficiencies before deploying the model.
- Defense mechanism during the training phase: data sanitization. A possible method is to attach a new sample to the existing dataset, train the model on the new set, and compare the results to the previous model; if the error rates differ significantly, discard the sample. This method has heavy computational costs.
- Combine multiple classifiers, which may provide different security properties to produce a composite prediction.
- Reject On Negative Impact (RONI) defense detects and discards samples within the training dataset that have a negative impact on the classifier's accuracy. This technique is computationally very expensive, and it may be susceptible to overfitting, reducing its performance when operated on a training dataset that is small compared to the number of features. See the RONI sketch after this list.
- Combine outlier detection with optimization techniques to correlate classifier predictions with labels. This method requires prior knowledge of the fraction of poisoned samples.
- Utilize a small, curated, and verified subset of trusted data points to train outlier detectors for each class. This method requires curation of trusted data (see the outlier pre-filtering sketch after this list).
- Vet data sources, suppliers, terms and conditions, privacy policies.
- Regularly review and audit suppliers' security and terms.
- Vulnerability scanning.
- Vulnerable software patching or virtual patching.
- Remove unused dependencies and unnecessary features.
- Monitor dependencies for their state, versions and vulnerabilities.
- Source components from official sources.
- Conduct red teaming, penetration testing, and integrity checking against third-party models.
- Maintain an inventory of components, a Software Bill of Materials (SBOM).
- Utilize code signing for externally supplied code.
- Monitor and audit collaborative and development environments.
- Utilize integrity checks and vendor attestation APIs against apps and models.
- Relabel potentially malicious data points based on their k-Nearest Neighbors in the feature space. This method is inefficient if malicious samples are close to genuine data (see the k-NN relabeling sketch after this list).
- Set up an influence function that estimates the influence of each training sample on the model's predictions.
- Detect and remove outliers to pre-filter the training dataset.
- Reject On Negative Impact (RONI): assess the empirical effect of each training sample and remove samples that have a significant negative impact on classification accuracy. This method may have a high computational cost.
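A minimal RONI-style sketch, assuming a scikit-learn logistic regression and a clean holdout set; the per-sample retraining loop illustrates why the method is computationally expensive. The function name, threshold, and model choice are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def roni_filter(X, y, X_holdout, y_holdout, threshold=0.01):
    """Reject On Negative Impact: retrain without each sample and discard
    samples whose presence lowers holdout accuracy by more than `threshold`.
    Requires one retraining per sample, hence the high computational cost."""
    base_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X_holdout, y_holdout)
    keep = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        acc_without = LogisticRegression(max_iter=1000).fit(X[mask], y[mask]).score(X_holdout, y_holdout)
        # If accuracy improves noticeably once the sample is removed, the sample
        # has a negative impact and is discarded; otherwise it is kept.
        if acc_without - base_acc <= threshold:
            keep.append(i)
    return X[keep], y[keep]
```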
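A k-NN relabeling sketch, assuming binary labels in {0, 1}: each training point receives the majority label of its k nearest neighbours in feature space, which overwrites isolated flipped labels but, as noted above, fails when malicious samples sit close to genuine data. The function name and the value of k are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_relabel(X, y, k=5):
    """Relabel every training point with the majority label of its k nearest
    neighbours (the point itself is excluded from the vote)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]                 # column 0 is the point itself
    return (neighbour_labels.mean(axis=1) >= 0.5).astype(y.dtype)
```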
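An outlier pre-filtering sketch combining the trusted-subset control with outlier removal: one detector per class is trained on a small verified subset, and untrusted training samples flagged as outliers by the detector for their claimed class are dropped. The use of scikit-learn's IsolationForest and all names and parameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def prefilter_with_trusted_data(X_trusted, y_trusted, X_untrusted, y_untrusted):
    """Train one outlier detector per class on trusted data, then keep only the
    untrusted samples that the detector for their label considers inliers."""
    detectors = {c: IsolationForest(random_state=0).fit(X_trusted[y_trusted == c])
                 for c in np.unique(y_trusted)}
    keep = np.array([detectors[label].predict(x.reshape(1, -1))[0] == 1
                     for x, label in zip(X_untrusted, y_untrusted)])
    return X_untrusted[keep], y_untrusted[keep]
```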