Jailbreak attack

Definition

[IT.T.12] A jailbreak attack is a type of security attack that exploits vulnerabilities within a constrained system (such as an aligned LLM) to bypass imposed restrictions and achieve privilege escalation. In the context of LLMs, jailbreaking refers to the practice of circumventing or overriding alignment guardrails that are designed to govern the scope of content the model can produce.

Targeted assets

System Asset: ML system input/API.

Business Asset: input data.

Security Criteria: confidentiality, integrity.

Attack details

Exploited vulnerabilities

Vulnerabilities:

  1. It is possible to bypass the built-in output restrictions of a VLM by providing adversarial image or textual data.
  2. There is an inherent mismatch between the learned decision boundaries of a machine learning model and the true decision boundaries.
  3. Missing or insufficient defensive measures.
  4. Vulnerability of deep neural networks to small, almost imperceptible perturbations to benign examples.
  5. Transferability of adversarial examples from surrogate to target models.
  6. Increased vulnerability in multi-modal setups.
  7. Misconfiguration of the model's API.
  8. Stochastic nature of the machine learning model.

Threat agent

Threat agent: attacks occur in both white-box and black-box scenarios. In the white-box scenario, the attacker is assumed to have complete knowledge of the target machine learning model: its architecture, parameters, training data, and learning algorithm. In the black-box scenario, the attacker has no knowledge of the target model's architecture, parameters, or training data and is assumed to be able to interact with the model only by sending it inputs and observing the outputs.

Attack methods

Attack methods:

  1. Submission of an adversarial image alongside a textual prompt requesting malicious output. The adversarial image is generated from examples of malicious content via Projected Gradient Descent (PGD) and is then paired with the malicious textual prompt (a minimal PGD sketch follows this list).
  2. Submission of adversarial text alongside a textual prompt requesting malicious output. The discrete optimization algorithm of Shin et al., an improved version of the HotFlip attack, can be used for adversarial text generation (a first-order token-scoring sketch also follows this list).
  3. A surrogate-model pipeline: fine-tune a surrogate LLM; optimize an adversarial distribution using the Gumbel-Softmax relaxation; apply a constraint language model (CLM) for perplexity and semantic regularization; use a geometric loss to balance the objectives; sample adversarial examples with semantic filtering; and submit the generated adversarial data as input to the target LLM system.
  4. Direct input of a malicious prompt that bypasses the safeguards in place.
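
As a rough illustration of method 1, the sketch below runs PGD against a hypothetical white-box VLM. The `model(adv, prompt_ids, labels=target_ids)` interface, the tensor names, and the hyperparameter values are illustrative assumptions, not a real API.

```python
# Minimal PGD sketch (method 1), assuming a differentiable white-box VLM
# wrapper that returns the loss of producing a chosen target completion.
import torch

def pgd_adversarial_image(model, image, prompt_ids, target_ids,
                          epsilon=8 / 255, alpha=2 / 255, steps=100):
    """Perturb `image` within an L-infinity ball of radius `epsilon` so
    that the (hypothetical) model becomes likely to emit `target_ids`."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Assumed interface: loss of generating the malicious target
        # completion given the image and the textual prompt.
        loss = model(adv, prompt_ids, labels=target_ids).loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()              # descend on the loss
            delta = torch.clamp(adv - image, -epsilon, epsilon)
            adv = torch.clamp(image + delta, 0.0, 1.0)   # stay a valid image
        adv = adv.detach()
    return adv
```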
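
For method 2, one first-order scoring step in the spirit of HotFlip can be sketched as below. The embedding table and the per-position gradient are assumed to come from a white-box backward pass through the target's input embeddings; the full algorithm of Shin et al. adds further steps not shown here.

```python
# First-order token-swap scoring (method 2), a HotFlip-style step.
# `embedding_matrix` is the token-embedding table [vocab, dim];
# `grad_at_position` is the gradient of the adversarial loss w.r.t. the
# input embedding at one prompt position (white-box access assumed).
import torch

def topk_token_swaps(embedding_matrix, grad_at_position,
                     current_token, k=10):
    """Swapping embedding e_old for e_new changes the loss by roughly
    (e_new - e_old) . grad, so tokens with the smallest e . grad are the
    most promising replacements for lowering the adversarial loss."""
    scores = embedding_matrix @ grad_at_position   # [vocab_size]
    scores[current_token] = float("inf")           # rule out the no-op swap
    return torch.topk(-scores, k).indices          # k best candidates
```

In a full attack, each candidate swap would be re-scored with a forward pass before being committed.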

Impact and harm

Impact and harm: Compromises the integrity of the targeted machine learning model, leading to misclassification of benign and malicious inputs.

Security countermeasures

Security requirements

Security requirement: The machine learning system must be resistant to jailbreak attacks.

Security controls

Security controls:

  1. Conduct adversarial training; however, its cost is prohibitive, and the bounds on the perturbation can be much wider than generally assumed.
  2. Utilize common harmfulness-detection APIs such as the Perspective API and the Moderation API. Their accuracy is limited; they may introduce bias, cause harm, and reduce a model's helpfulness; and they are not applicable to offline models in an adversary's possession.
  3. Conduct robustness certification, although the cost is prohibitive.
  4. Pre-process the input. One possible method is DiffPure, which introduces noise into the input image and then diffuses it back onto the learned data manifold, restoring a clean image. This method cannot be applied to an offline, local model in an adversary's possession.
  5. Constrain model behavior: define system prompts that restrict the model's capabilities, role, accepted outputs, actions, and topics.
  6. Validate the output deterministically.
  7. Filter input and output: assess the context, consider semantic filtering, and sanitize the input data.
  8. Restrict the LLM application's privileges.
  9. Keep a human in the loop to control and validate privileged operations triggered by the LLM application.
  10. Separate external input data and limit its influence.
  11. Conduct penetration testing.
  12. Restrict the model's access to sensitive and external data sources.
  13. Apply federated learning, which reduces the need for centralized data collection and, by extension, the data-exposure risk.
  14. Apply differential privacy: adding calibrated noise to the output reduces the data-exposure risk (a minimal sketch follows this list).
  15. Educate users on the risks of entering sensitive data.
  16. Define a clear and transparent policy on data retention, usage, and removal.
  17. Conceal the system prompt and configuration.
  18. Configure the model's input API properly and securely so that error feedback does not reveal configuration details.
  19. Use homomorphic encryption to secure data analysis and the training process.
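
Control 14 is most concrete for numeric or aggregate outputs. The sketch below shows the classic Laplace mechanism; `sensitivity` and `epsilon` are deployment-specific parameters the operator must choose, and the values in the example are illustrative assumptions.

```python
# Laplace mechanism sketch for control 14: release a numeric result with
# epsilon-differential privacy by adding noise scaled to the query's
# sensitivity.
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float) -> float:
    """Add Laplace noise with scale = sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a count query has sensitivity 1; a smaller epsilon means
# stronger privacy and a noisier released value.
noisy_count = laplace_mechanism(true_value=42.0, sensitivity=1.0, epsilon=0.5)
```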