Definition
[TD.T.6] A backdoor attack is a specific type of poisoning attack in which adversaries modify the labels of training samples and inject these mislabeled, trigger-carrying samples into the training dataset. The goal is to force the trained model to assign an attacker-chosen target label to any new sample containing the trigger.
Targeted assets
System Asset: machine learning training system, ML system input/API.
Business Asset: training data.
Security Criteria: integrity.
Attack details
Exploited vulnerabilities
Vulnerabilities:
- The training dataset is susceptible to unauthorized modifications.
- The model's classification boundary can be shifted if the model is trained on malicious data samples.
- Training data directly influences the performance of the machine learning model.
- The model cannot distinguish original from poisoned samples when subtle frequency-domain modifications are introduced.
- The target model's performance can be skewed through training on malicious samples.
Threat agent
Threat agent: attackers operating in white-box and gray-box scenarios. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, training data, and learning algorithm. In the gray-box scenario, the threat agent is assumed to have only partial knowledge of the target model, such as the type of learning algorithm used, the feature set, or access to a surrogate dataset with characteristics similar to the original training data.
Attack methods
Attack methods:
- Frequency-domain trigger injection: the frequency-space representations of the trigger image and the benign image are obtained with the Fast Fourier Transform (FFT), and trigger information is blended from the trigger image's amplitude spectrum into the amplitude spectrum of the original, benign image (see the FFT sketch after this list).
- Procedural noise is crafted using algorithms such as Perlin noise, Gabor noise, or Worley noise. A clean model is trained on the original images to obtain attention maps; the attention maps are fused with the noise, and the result is added to the original images. The target model is then trained on the poisoned data, turning it into a backdoored model (see the procedural-noise sketch after this list).
- A random trigger is sampled from a uniform distribution and added at random locations within original samples, which are relabeled with the target label (see the random-trigger sketch after this list).
- Conditional Backdoor Generating Network (c-BaN): a modified BaN (described in the next item) that generates label-specific triggers. The c-BaN model takes both the target label and a noise vector as inputs, producing triggers that can correspond to any target label and be positioned at any location within the input space. This allows triggers for different target labels to share locations, enhancing the stealthiness of the backdoor. The c-BaN generates a backdoor trigger that is added to a target sample before it is submitted to the target model (see the c-BaN sketch after this list).
- A Backdoor Generating Network (BaN) is created: BaN is a generative (GAN-based) model that algorithmically creates backdoor triggers instead of relying on fixed patterns or random sampling. BaN is trained jointly with the target model, which acts as a discriminator and optimizes the trigger patterns. The BaN produces a trigger that is added to a sample and fed to the target model; the loss from the backdoored sample is combined with losses from benign samples to retrain the BaN. Once trained, the BaN produces a trigger that is added to the target sample at a particular location.
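A minimal sketch of the frequency-domain injection step, assuming grayscale images as NumPy arrays in [0, 1]; the blend ratio `alpha` and the low-frequency window size `beta` are illustrative parameters, not values from a specific paper:

```python
import numpy as np

def inject_frequency_trigger(benign, trigger, alpha=0.15, beta=0.1):
    """Blend the trigger's amplitude spectrum into the benign image's."""
    fft_benign = np.fft.fftshift(np.fft.fft2(benign))
    fft_trigger = np.fft.fftshift(np.fft.fft2(trigger))

    amp_b, phase_b = np.abs(fft_benign), np.angle(fft_benign)
    amp_t = np.abs(fft_trigger)

    # Mix the trigger amplitude into a low-frequency window around the center.
    h, w = benign.shape
    ch, cw = h // 2, w // 2
    bh, bw = int(beta * h), int(beta * w)
    sl = (slice(ch - bh, ch + bh), slice(cw - bw, cw + bw))
    amp_poisoned = amp_b.copy()
    amp_poisoned[sl] = (1 - alpha) * amp_b[sl] + alpha * amp_t[sl]

    # Recombine with the benign phase so the image still looks natural.
    fft_poisoned = amp_poisoned * np.exp(1j * phase_b)
    poisoned = np.real(np.fft.ifft2(np.fft.ifftshift(fft_poisoned)))
    return np.clip(poisoned, 0.0, 1.0)
```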
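A sketch of the procedural-noise method. A smooth "value noise" (a bilinearly upsampled random grid) stands in here for a real Perlin, Gabor, or Worley generator, and `attention` is assumed to be a precomputed HxW attention map (e.g. Grad-CAM) from a clean model, normalized to [0, 1]:

```python
import numpy as np
from scipy.ndimage import zoom

def value_noise(h, w, grid=8, seed=0):
    """Smooth Perlin-like noise from a bilinearly upsampled random grid."""
    rng = np.random.default_rng(seed)
    coarse = rng.random((grid, grid))
    noise = zoom(coarse, (h / grid, w / grid), order=1)[:h, :w]
    return (noise - noise.min()) / (noise.max() - noise.min())

def poison_with_procedural_noise(image, attention, eps=0.1):
    """Fuse noise with the attention map and add it to the clean image."""
    h, w = image.shape
    trigger = value_noise(h, w) * attention  # concentrate noise where the model looks
    return np.clip(image + eps * trigger, 0.0, 1.0)
```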
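The random-trigger method is only a few lines; this sketch stamps a uniformly sampled patch at a random location and returns the sample relabeled with the target class (the patch size and value range are illustrative):

```python
import numpy as np

def poison_random_trigger(image, target_label, size=5, rng=None):
    rng = rng or np.random.default_rng()
    h, w = image.shape
    trigger = rng.uniform(0.0, 1.0, size=(size, size))  # random trigger pattern
    y = rng.integers(0, h - size)
    x = rng.integers(0, w - size)
    poisoned = image.copy()
    poisoned[y:y + size, x:x + size] = trigger          # stamp at a random spot
    return poisoned, target_label                       # mislabeled sample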
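A hedged PyTorch sketch of the c-BaN idea: a small generator conditioned on the target label. The layer sizes, one-hot label conditioning, and fixed patch placement are assumptions for illustration, not the architecture from the original work:

```python
import torch
import torch.nn as nn

class ConditionalBaN(nn.Module):
    """Maps (noise vector, target label) to a small trigger patch."""
    def __init__(self, z_dim=64, n_classes=10, patch=8):
        super().__init__()
        self.z_dim, self.n_classes, self.patch = z_dim, n_classes, patch
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 128),
            nn.ReLU(),
            nn.Linear(128, patch * patch),
            nn.Sigmoid(),  # trigger pixel values in [0, 1]
        )

    def forward(self, z, label):
        onehot = nn.functional.one_hot(label, self.n_classes).float()
        out = self.net(torch.cat([z, onehot], dim=1))
        return out.view(-1, 1, self.patch, self.patch)

# Usage: generate a label-specific trigger and stamp it on a batch of images.
ban = ConditionalBaN()
z = torch.randn(4, ban.z_dim)
labels = torch.full((4,), 3, dtype=torch.long)  # attacker's target class
triggers = ban(z, labels)
images = torch.rand(4, 1, 28, 28)
images[:, :, :8, :8] = triggers                 # shared location across labels
# In the full attack, the BaN/c-BaN is trained jointly with the target model:
# losses on backdoored samples (target label) and benign samples update both.
```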
Impact and harm
Impact and harm: negates the integrity of the targeted machine learning model, causing trigger-carrying malicious inputs to be misclassified into the attacker-chosen target class.
Security countermeasures
Security requirements
Security requirement: The machine learning system must be resistant to backdoor attacks.
Security controls
Security controls:
- Activation clustering: a detection method that splits the feature representations of training samples into a poisoned and a clean cluster. Activations from the last hidden layer expose differences in the clusters' high-level features. Perlin and Gabor noise triggers avoid this detection, while Worley noise gets detected (see the clustering sketch after this list).
- Autoencoder-based reconstruction: an autoencoder trained on clean data encodes (compresses) the training data and then decodes (decompresses) it back. Because the autoencoder learns only the clean data distribution, poisoned samples tend to produce high reconstruction error, which can be used to flag and filter them (see the reconstruction sketch after this list).
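A sketch of activation clustering with scikit-learn, assuming `activations` is an (n_samples, n_features) array of last-hidden-layer activations collected for one class, with n_features >= 10; the PCA dimensionality and imbalance threshold are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def detect_poisoned(activations, max_poison_fraction=0.33):
    # Reduce dimensionality, then force a two-way split of the activations.
    reduced = PCA(n_components=10).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    sizes = np.bincount(labels, minlength=2)
    small = int(np.argmin(sizes))
    # A strongly unbalanced split suggests the small cluster is poisoned.
    if sizes[small] / len(labels) < max_poison_fraction:
        return np.where(labels == small)[0]  # indices of suspect samples
    return np.array([], dtype=int)           # no clear poisoned cluster
```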
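A sketch of the reconstruction-based filter in PyTorch; the architecture, the flattened 784-dimensional input, and the percentile cutoff are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ReconstructionAE(nn.Module):
    def __init__(self, dim=784, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_errors(ae, batch):
    """Per-sample MSE between input and reconstruction."""
    with torch.no_grad():
        recon = ae(batch)
    return ((batch - recon) ** 2).mean(dim=1)

# Usage: flag training samples whose error exceeds a clean-data percentile.
ae = ReconstructionAE()                 # assume trained on clean data only
batch = torch.rand(32, 784)
errs = reconstruction_errors(ae, batch)
threshold = torch.quantile(errs, 0.95)  # illustrative cutoff
suspect = torch.nonzero(errs > threshold).flatten()
```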