Model extraction attack

Definition

[IT.T.5] A model extraction attack (also known as model stealing) is a type of security attack where an adversary aims to replicate the functionality of a target machine learning model without having direct access to its internal parameters or training data. The attacker interacts with the target model, typically through a prediction API, to gather information and train a substitute model that mimics the behavior of the original. This allows the adversary to gain insights into the training data and potentially launch further attacks, such as evasion or membership inference attacks.

Targeted assets

System Asset: ML system input/API.

Business Asset: input data.

Security Criteria: confidentiality.

Attack details

Exploited vulnerabilities

Vulnerabilities:

  1. It is possible to replicate the target model's properties in a new model based on the target model's output.
  2. It is possible to reconstruct the target model's functionality from the target model's output.
  3. There are no restrictions on the number and frequency of inference requests.
  4. Training and operating a machine learning model carries high computational demands and costs, which makes stealing a trained model attractive.

Threat agent

Threat agent: white-box, gray-box, and black-box scenarios. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, training data, and learning algorithm. In the gray-box scenario, the threat agent is assumed to have partial knowledge of the target model, such as the type of learning algorithm used or the feature set, or the agent may have access to a surrogate dataset with characteristics similar to the original training data. In the black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data, and can only interact with the model by sending it inputs and observing the outputs.

Attack methods

Attack methods:

  1. The target model is queried with surrogate data derived from public data, creating a dataset of surrogate input-target output pairs. A publicly available pre-trained Data-efficient image Transformer (DeiT) model is then fine-tuned on this dataset.
  2. One possible method: a set number of requests is issued against the target model, producing a set of responses; a surrogate model is then trained on the request-response pairs (a minimal sketch of this approach follows this list). Other possible techniques: linear least squares, malicious sample queries, building universal thief datasets, query-synthesis active learning, autoregressive generation, algebraic attacks on a fine-tuned encoder, direct extraction by recreating the projection head, and fuzzy gray correlation.
  3. Querying the model's input API with malicious input samples to collect data sufficient for model replication.
  4. The target model is probed and the received outputs are used to build a clone model. The clone can then be used for further attacks, such as an evasion attack.
  5. The attacker queries the target model with a partial training dataset or a shadow dataset to obtain its posteriors. The retrieved posteriors, together with the query samples, are used to train the adversary's new model.
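
A minimal sketch of the basic query-and-train approach from method 2, in Python. The victim model here is local only to keep the example self-contained; in a real attack it would sit behind a remote prediction API. The dataset sizes, query budget, and substitute architecture are illustrative assumptions:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Victim: stands in for a model behind a prediction API.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    target = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                           random_state=0).fit(X, y)

    # Attacker: label surrogate queries through the "API" and train a clone
    # on the resulting request/response pairs.
    queries = rng.normal(size=(5000, 10))   # surrogate query set
    labels = target.predict(queries)        # responses observed by the attacker
    substitute = DecisionTreeClassifier(random_state=0).fit(queries, labels)

    # Fidelity: how often the clone agrees with the victim on fresh inputs.
    probe = rng.normal(size=(1000, 10))
    agreement = (substitute.predict(probe) == target.predict(probe)).mean()
    print(f"clone/target agreement: {agreement:.2%}")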

Impact and harm

Impact and harm: Negates the confidentiality of the targeted machine learning model. This may lead to the loss of intellectual property.

Security countermeasures

Security requirements

Security requirement: The machine learning system must be resistant to model extraction attacks.

Security controls

Security controls:

  1. Prediction poisoning: addition of bounded noise to the model's output, maximizing the angular deviation (MAD) between the gradients induced by the original and poisoned predictions while maintaining the rank of the most confident prediction, to prevent accuracy loss (a simplified sketch of this style of output perturbation follows this list).
  2. Reverse Sigmoid: addition of an output activation layer that controls the amount of perturbation added to the target model's posterior probabilities, to prevent loss of the model's accuracy.
  3. Differential privacy: addition of noise to shift the outputs away from the original values.
  4. Secure multi-party computation: joint computations are conducted within a confidential environment.
  5. Homomorphic encryption: calculations are conducted confidentially, allowing operations on encrypted data without revealing the original data.
  6. Adversarial machine learning: incorporation of data about adversarial techniques into the model's training process.
  7. Watermarking techniques: embedding of watermarks into the model's parameters, algorithmic analysis of the model exploiting over-parametrization, or entanglement between the watermark and training data features.
  8. Rate limiting: restriction of the number and frequency of requests from a particular user (see the rate-limiter sketch below).
  9. Adversarial training: training the model to detect and resist extraction attempts.
  10. Vulnerability detection: risk assessment of machine learning models and their evaluation before release.
  11. Randomization: introduction of unpredictability into the model's responses. Example approach: randomly generate and train multiple classifiers on different subsets of the feature space; the final output is aggregated from the predictions of all classifiers.
  12. Complexity: the model's decision function is made more complex; for example, the decision boundary can be made non-linear or fractalized.
  13. Stateful analysis: maintenance of a query history and its analysis with meta-detectors.
  14. Differentially Private Stochastic Gradient Descent (DP-SGD): per-example gradients are clipped and Gaussian noise is added to them during the target model's training. The method can mitigate inference attacks without significantly degrading the model's utility (see the DP-SGD sketch below).
  15. Knowledge Distillation (KD): knowledge is transferred from the larger target model to a smaller distilled model, which can reduce membership inference attack risks; the distilled model can also be more resource-efficient than the original (see the distillation sketch below).
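
A simplified sketch of the output-perturbation idea behind controls 1-3: bounded noise is added to the returned posteriors, but the top-1 class is restored afterwards so plain-label accuracy is preserved. This is not the exact MAD objective from prediction poisoning, only the general mechanism; the noise bound eps is an illustrative parameter:

    import numpy as np

    def perturb_posteriors(probs, eps=0.1, rng=None):
        """Return a noisy probability vector with the same argmax as probs."""
        if rng is None:
            rng = np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        top = int(np.argmax(probs))
        noisy = np.clip(probs + rng.uniform(-eps, eps, size=probs.shape),
                        1e-6, None)
        noisy /= noisy.sum()                 # renormalize to a distribution
        j = int(np.argmax(noisy))
        if j != top:                         # restore the top-1 rank
            noisy[top], noisy[j] = noisy[j], noisy[top]
        return noisy

    print(perturb_posteriors([0.7, 0.2, 0.1]))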
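
For control 8, a minimal in-memory sliding-window rate limiter keyed by API user. The thresholds are illustrative, and a production deployment would typically back this with a shared store rather than process memory:

    import time
    from collections import defaultdict, deque

    class RateLimiter:
        def __init__(self, max_requests=100, window_seconds=60.0):
            self.max_requests = max_requests
            self.window = window_seconds
            self.history = defaultdict(deque)   # api_key -> request timestamps

        def allow(self, api_key: str) -> bool:
            now = time.monotonic()
            q = self.history[api_key]
            while q and now - q[0] > self.window:   # drop expired timestamps
                q.popleft()
            if len(q) >= self.max_requests:
                return False                        # over budget: reject query
            q.append(now)
            return True

    limiter = RateLimiter(max_requests=3, window_seconds=1.0)
    print([limiter.allow("user-1") for _ in range(5)])  # 3x True, then False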
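
A hand-rolled single DP-SGD step for control 14, assuming PyTorch: each example's gradient is clipped to norm C, then Gaussian noise is added before the parameter update. Libraries such as Opacus automate and optimize this; the model, clip norm, and noise multiplier here are illustrative:

    import torch

    model = torch.nn.Linear(10, 2)
    loss_fn = torch.nn.CrossEntropyLoss()
    C, noise_multiplier, lr = 1.0, 1.1, 0.1

    def dp_sgd_step(batch_x, batch_y):
        summed = [torch.zeros_like(p) for p in model.parameters()]
        for x, y in zip(batch_x, batch_y):     # per-example gradients
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            grads = [p.grad.detach().clone() for p in model.parameters()]
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = torch.clamp(C / (norm + 1e-12), max=1.0)  # clip to norm C
            for s, g in zip(summed, grads):
                s += g * scale
        with torch.no_grad():
            for p, s in zip(model.parameters(), summed):
                noise = torch.randn_like(s) * noise_multiplier * C
                p -= lr * (s + noise) / len(batch_x)  # noisy averaged gradient

    dp_sgd_step(torch.randn(8, 10), torch.randint(0, 2, (8,)))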
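
And a compact distillation loop for control 15, also assuming PyTorch: a small student is trained to match the teacher's temperature-softened output distribution. The architectures, temperature, and random transfer set are illustrative placeholders:

    import torch
    import torch.nn.functional as F

    teacher = torch.nn.Sequential(torch.nn.Linear(10, 128), torch.nn.ReLU(),
                                  torch.nn.Linear(128, 2))
    student = torch.nn.Linear(10, 2)           # smaller distilled model
    opt = torch.optim.Adam(student.parameters(), lr=1e-2)
    T = 4.0                                    # softening temperature

    x = torch.randn(256, 10)                   # stand-in transfer set
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)

    for _ in range(100):
        opt.zero_grad()
        log_probs = F.log_softmax(student(x) / T, dim=1)
        # KL between softened teacher and student distributions (scaled by T^2)
        loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
        loss.backward()
        opt.step()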