Membership inference attack

Definition

[IT.T.13],[TD.T.4] A membership inference attack (MIA) is a type of privacy attack in which an adversary tries to determine whether a specific data record was part of the training dataset of a machine learning model. It exploits the tendency of ML models to behave differently on data they have been trained on compared to unseen data. A successful MIA signifies that the privacy of the training data is not sufficiently protected once the trained ML model is released.
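
To make this behavioral gap concrete, the sketch below shows the simplest possible membership test, thresholding the model's prediction confidence. The scikit-learn-style predict_proba interface and the threshold value are illustrative assumptions, not part of any specific published attack.

```python
def confidence_threshold_mia(model, samples, threshold=0.9):
    """Guess membership from the model's confidence on each sample.

    Assumes a scikit-learn-style classifier whose predict_proba returns
    class probabilities. Samples the model is unusually confident about
    are guessed to be training members, exploiting the member/non-member
    behavioral gap described above. The threshold is a hypothetical value.
    """
    probabilities = model.predict_proba(samples)   # shape: (n_samples, n_classes)
    confidence = probabilities.max(axis=1)         # top-1 confidence per sample
    return confidence >= threshold                 # True -> guessed "member"
```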

Targeted assets

System Asset: machine learning model, ML system input/API.

Business Asset: training data, input data.

Security Criteria: confidentiality.

Attack details

Exploited vulnerabilities

Vulnerabilities:

  1. The model's gradient behavior differs between training (member) and non-training (non-member) records; because of overfitting or gradient convergence during training, this difference reveals membership information (a minimal illustration of the resulting member/non-member gap follows this list).
  2. The more output classes the model has, the more membership information is leaked.
  3. It is possible to learn additional information about the training data sample from the target model's output.
  4. It is possible to infer membership of samples within the target machine learning model.
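
The first vulnerability can be demonstrated empirically: an overfitted model typically assigns lower loss to its training (member) records than to unseen (non-member) records, and the per-sample loss is used here as an easily observable proxy for the gradient gap. The dataset, classifier choice, and hyperparameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical setup: a deliberately overfitted classifier on synthetic data.
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_member, X_nonmember, y_member, y_nonmember = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000, random_state=0)
model.fit(X_member, y_member)

def per_sample_loss(model, X, y):
    """Cross-entropy of the true class for each sample."""
    proba = np.clip(model.predict_proba(X), 1e-12, 1.0)
    return -np.log(proba[np.arange(len(y)), y])

# Members typically incur noticeably lower loss than non-members when the
# model overfits; this gap is exactly what membership inference exploits.
print("mean loss (members):    ", per_sample_loss(model, X_member, y_member).mean())
print("mean loss (non-members):", per_sample_loss(model, X_nonmember, y_nonmember).mean())
```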

Threat agent

Threat agent: white-box and black-box scenarios. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, utilized training data, and the learning algorithm. In the black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data and is assumed to be able to interact with the model only by sending it inputs and observing the outputs.
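
As a small illustration of the black-box assumption, the wrapper below exposes only a query interface and hides everything the attacker is not supposed to see; the class name and the predict_proba-style model it wraps are hypothetical.

```python
class BlackBoxTarget:
    """Black-box view of a target model: inputs go in, predictions come out.

    The attacker has no access to the architecture, parameters, gradients,
    or training data; query is the only permitted interaction.
    """

    def __init__(self, model):
        self._model = model  # hidden from the attacker

    def query(self, samples):
        # Only the prediction vectors (e.g. class probabilities) are exposed.
        return self._model.predict_proba(samples)
```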

Attack methods

Attack methods:

  1. A GAN's discriminator model is used to deduce whether a particular data instance was part of the target model's training dataset. This is achieved by exploiting differences in the model's response to known (trained-on) versus unknown input data.
  2. Shadow models are trained on datasets similar to the target model's dataset to replicate its behavior. An attack model is trained on the shadow models' outputs to distinguish between training and non-training data based on prediction behavior. The attack model is then used to determine whether the target sample is part of the targeted model's training dataset (a simplified sketch follows this list).
  3. Data is synthesized by optimizing a generative model's latent code; the distance between the generated data and the target data is then measured to determine membership.
  4. The target model is queried with perturbed versions of the target sample's features. A local linear regression model is trained on the target model's responses to approximate its gradients. An autoencoder then extracts membership features from the approximated gradients, and these features are used to train a local attack model that classifies whether the target sample is a member of the target model's training dataset.
  5. (1) The attacker trains a shadow model on part of a shadow dataset drawn from the same distribution as the target model's training data, then queries the shadow model with the entire shadow dataset, so that every shadow sample's prediction is labelled as coming from a member or a non-member of the shadow model's training set. (2) Based on the produced predictions and the shadow training dataset, an attack model is trained: a binary membership classifier (if the attacker possesses part of the original training dataset, the attack model can be trained on it directly). (3) The attacker then queries the target model with a target sample and, based on the returned posteriors and predicted label, the attack model decides whether the sample is part of the target model's training dataset. (4) If the attacker has white-box access to the targeted model, the target model's gradients can additionally be incorporated into the training of the attack model.
  6. An attacker with white-box access feeds a noise sample to the targeted model to obtain its posteriors. The input is then optimized via back-propagation through the model's parameters, producing a representative sample of a class.
  7. A generative adversarial network (GAN) can be trained on a shadow dataset. The GAN is then fed optimized inputs with the aim of generating samples that obtain high posteriors from the target model.
  8. The attacker uses the target model's embeddings of the target sample to train a classifier that predicts the sample's target attributes.
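
The shadow-model pipeline from items 2 and 5 can be sketched end to end as follows. The scikit-learn estimators, the single shadow model, the synthetic data, and the random-forest target are simplifying assumptions for illustration, not the exact procedure of any particular paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for "drawn from the same distribution as the target's data".
X, y = make_classification(n_samples=4000, n_features=20, random_state=1)
X_target, X_shadow, y_target, y_shadow = train_test_split(X, y, test_size=0.5, random_state=1)

# Target model (normally controlled by the victim and queried as a black box).
X_tgt_in, X_tgt_out, y_tgt_in, y_tgt_out = train_test_split(
    X_target, y_target, test_size=0.5, random_state=2)
target = RandomForestClassifier(random_state=2).fit(X_tgt_in, y_tgt_in)

# Step 1: train a shadow model on half of the shadow dataset.
X_sh_in, X_sh_out, y_sh_in, y_sh_out = train_test_split(
    X_shadow, y_shadow, test_size=0.5, random_state=3)
shadow = RandomForestClassifier(random_state=3).fit(X_sh_in, y_sh_in)

# Step 2: label the shadow model's outputs as member (1) / non-member (0).
member_out = shadow.predict_proba(X_sh_in)
nonmember_out = shadow.predict_proba(X_sh_out)
attack_X = np.vstack([member_out, nonmember_out])
attack_y = np.concatenate([np.ones(len(member_out)), np.zeros(len(nonmember_out))])

# Step 3: train the binary membership classifier (the "attack model").
attack_model = LogisticRegression(max_iter=1000).fit(attack_X, attack_y)

# Step 4: query the target model and classify its outputs as member / non-member.
member_guess = attack_model.predict(target.predict_proba(X_tgt_in))
nonmember_guess = attack_model.predict(target.predict_proba(X_tgt_out))
print("fraction of true members flagged as members:", member_guess.mean())
print("fraction of non-members flagged as members: ", nonmember_guess.mean())
```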

Impact and harm

Impact and harm: Negates the confidentiality of the targeted machine learning model's training data. This leads to privacy leakage and potential legal repercussions.

Security countermeasures

Security requirements

Security requirement: The machine learning system must be resistant to membership inference attacks.

Security controls

Security controls:

  1. Differential Privacy (DP) applied to Generative Adversarial Networks (GANs) protects against privacy-invasion attacks, specifically membership inference attacks. Differential privacy introduces controlled noise into the model training process to obscure the presence or absence of any single data point in the training set, thereby protecting individual data privacy.
  2. Differential privacy: calibrated noise is added to the gradients during training, and gradient clipping is applied to bound the sensitivity of the training algorithm.
  3. Regularization: a technique that reduces a model's overfitting. It improves the model's generalization and reduces information leakage about the training dataset.
  4. Differential privacy: adds mathematically bounded noise to the training process or to the final model parameters, limiting the influence of any single data record on the model’s outcomes.
  5. Utilize Differential Privacy via Differentially Private Stochastic Gradient Descent (DP-SGD). This method adds Gaussian noise to the gradients during the target model's training and is capable of mitigating inference attacks without significantly deteriorating the model's utility (a minimal sketch follows this list).
  6. The Knowledge Distillation (KD) method can reduce membership inference attack risks. KD transfers knowledge from the larger target model to a smaller distilled model, which can also be more resource-efficient than the original model.
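
A minimal sketch of the DP-SGD idea from item 5, assuming PyTorch and a naive per-sample loop; the clipping norm, noise multiplier, learning rate, and model are illustrative placeholders, and a real deployment would rely on a vetted implementation such as Opacus or TensorFlow Privacy rather than this hand-rolled update.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: per-sample gradient clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # Clip each per-sample gradient to clip_norm to bound the sensitivity
    # of the update to any single training record.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for acc, g in zip(summed_grads, grads):
            acc += g * scale

    # Add calibrated Gaussian noise to the summed clipped gradients,
    # average over the batch, and apply the update.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, acc in zip(params, summed_grads):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p -= lr * (acc + noise) / batch_size

# Hypothetical usage:
#   model = torch.nn.Linear(20, 3)
#   dp_sgd_step(model, torch.nn.functional.cross_entropy, xb, yb)
```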