Anti-Personalized Generative Models
Introduction
We frame the problem as a game between a defender and an attacker:
- Defender’s goal: generate a perturbation that can be added to the input images so that they can no longer be learned by a personalization method.
- Attacker’s goal: get the perturbed images and use them to personalize the model.
What are the differences between this idea and the Anti-Dreambooth idea?
- We focus on a specific set of keywords or concepts.
- These images can still be learned by a personalization method, but the resulting model cannot generate meaningful images if a prohibited keyword/concept appears in the prompt.
Setup
Data to learn:
- Just 4 images collected from Huggingface
Version 2
Recap Textual Inversion
Tokenization and Text Embedding
Given a prompt \(y\), we tokenize it into a sequence of tokens \(y = (y_1, y_2, ..., y_n)\), where \(y_i\) is the \(i\)-th token in the sequence.
Then, we use an embedding lookup table (i.e., an embedding matrix) to get the embedding vector \(v_i\) of each token \(y_i\) in the sequence. \(v = T(y) = (v_1, v_2, ..., v_n)\).
Finally, we use a contextualized text embedding model (e.g., CLIP text embedding) to get the text embedding \(\tau(T(y))\) of the prompt \(y\). (i.e., denoted by \(c_\theta (y)\) in the figure above).
In textual inversion, we lock the contextualized text embedding model \(\tau\) and only learn the embedding lookup table (more precisely, we only change the row of the embedding matrix that corresponds to a specific placeholder token, e.g., \(sks\)).
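As a concrete illustration, here is a minimal sketch of this pipeline using Hugging Face transformers; the checkpoint name is only an example (the CLIP text encoder commonly paired with Stable Diffusion v1.x).

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Example checkpoint; any CLIP text encoder used by the diffusion model works the same way.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of sks"

# Tokenization: y -> (y_1, ..., y_n)
tokens = tokenizer(prompt, return_tensors="pt")

# Embedding lookup T(y): each token id indexes a row of the embedding matrix.
token_embeds = text_encoder.get_input_embeddings()(tokens.input_ids)       # v = T(y)

# Contextualized text embedding tau(T(y)) from the (frozen) CLIP text encoder.
with torch.no_grad():
    context_embeds = text_encoder(tokens.input_ids).last_hidden_state      # tau(T(y))

print(token_embeds.shape, context_embeds.shape)
```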
Textual Inversion
High-level idea: We will learn an embedding vector \(v^*\) that can represent the visual concept of a given set of images \(X\).
The objective function:
\[v^* = \underset{v}{\text{argmin}} \; \mathbb{E}_{x \sim X, z \sim \mathcal{N}, c \sim C} \; \mathcal{L}(G(z, \tau( T(c)+v)), x)\]where \(G\) is the generator, \(z\) is the latent code, \(c\) is the prompt sampled from a set of templates \(C\), \(x\) is the input image, and \(v\) is the embedding vector.
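A minimal sketch of this optimization: \(\tau\) (and the generator) stay frozen, and only the embedding-matrix row of the placeholder token receives gradient updates. reconstruction_loss below is a hypothetical stand-in for \(\mathcal{L}(G(z, \tau(T(c)+v)), x)\); in the real setup it is the latent diffusion model's denoising loss on the training images \(X\).

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder token and locate its row in the embedding matrix.
tokenizer.add_tokens(["sks"])
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_id = tokenizer.convert_tokens_to_ids("sks")

embeddings = text_encoder.get_input_embeddings()    # the lookup table T
text_encoder.requires_grad_(False)                  # freeze tau ...
embeddings.weight.requires_grad_(True)              # ... but keep T trainable

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4, weight_decay=0.0)

def reconstruction_loss(context_embeds):
    # Hypothetical stand-in for L(G(z, tau(T(c)+v)), x); the real loss is the
    # diffusion model's denoising loss on the training images X.
    return context_embeds.pow(2).mean()

for step in range(100):
    prompt = "a photo of sks"                        # c sampled from the template set C
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    context = text_encoder(ids).last_hidden_state    # tau(T(c) + v)
    loss = reconstruction_loss(context)
    loss.backward()
    # Keep gradients only for the placeholder row, so only v* is learned.
    mask = torch.zeros_like(embeddings.weight.grad)
    mask[placeholder_id] = 1.0
    embeddings.weight.grad *= mask
    optimizer.step()
    optimizer.zero_grad()
```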
Version 2.1
Generate perturbed images by maximizing the reconstruction loss.
\[\delta^* = \underset{\delta}{\text{argmax}} \; \mathbb{E}_{z \sim \mathcal{N}, c \sim C_d} \; \mathcal{L}(G(z, \tau( T(c) + v_d)), (x+\delta))\]where \(G\) is the generator, \(z\) is the latent code, \(c\) is a prompt sampled from the defender's set of templates \(C_d\), \(x\) is the input image, and \(\delta\) is the perturbation. \(C_d\) is a list of neutral prompts, e.g., "a photo of sks", where sks is the placeholder token.
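A minimal PGD-style sketch of this maximization under an \(\ell_\infty\) budget. The budget, step size, and recon_loss stand-in are assumptions; in practice the loss is the frozen diffusion model's denoising loss evaluated on \(x+\delta\) with prompts from \(C_d\) and the learned \(v_d\).

```python
import torch

def craft_perturbation(x, recon_loss, eps=8 / 255, step_size=1 / 255, num_steps=50):
    """PGD-style sketch: gradient *ascent* on the reconstruction loss.

    x          : clean images in [0, 1], shape (B, 3, H, W)
    recon_loss : callable(perturbed) -> scalar; hypothetical stand-in for
                 L(G(z, tau(T(c) + v_d)), x + delta) with c ~ C_d
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(num_steps):
        loss = recon_loss(x + delta)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()       # ascent: maximize the loss
            delta.clamp_(-eps, eps)                      # stay inside the l_inf budget
            delta.copy_((x + delta).clamp(0, 1) - x)     # keep x + delta a valid image
        delta.grad = None
    return delta.detach()

# Toy usage; the real recon_loss plugs in the frozen diffusion model and v_d.
x = torch.rand(4, 3, 64, 64)
delta = craft_perturbation(x, lambda imgs: imgs.pow(2).mean(), num_steps=5)
perturbed = (x + delta).clamp(0, 1)
```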
In this particular setting:
- Defender uses the same prompts (i.e., the same placeholder token and templates, \(C_a = C_d\)) as the attacker, which is a rather unrealistic assumption.
Results: Attacker cannot use the perturbed images to personalize the model. The output images are not meaningful.
Version 2.2
Generate perturbed images by maximizing the reconstruction loss.
\[\delta^* = \underset{\delta}{\text{argmax}} \; \mathbb{E}_{z \sim \mathcal{N}, c \sim C_d} \; \mathcal{L}(G(z, \tau( T(c) + v_d)), (x+\delta))\]where \(G\) is the generator, \(z\) is the latent code, \(c\) is a prompt sampled from the defender's set of templates \(C_d\), \(x\) is the input image, and \(\delta\) is the perturbation. \(C_d\) is a list of neutral prompts, e.g., "a photo of sks", where sks is the placeholder token.
In this setting:
- Defender uses different prompts from the attacker (i.e., a different placeholder token and templates, \(C_a \neq C_d\)). For example, defender_token = person while placeholder_token = advmersh_diff.
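Concretely, the asymmetric setup can be pictured as follows; the template lists are illustrative assumptions, and placeholder_token is assumed to be the attacker's token as in the example above.

```python
# Defender side: crafts the perturbation with its own token and templates (C_d).
defender_token = "person"
defender_templates = ["a photo of {}", "a close-up photo of {}"]

# Attacker side: runs textual inversion on the perturbed images with a different
# token and templates (C_a != C_d).
placeholder_token = "advmersh_diff"
attacker_templates = ["a photo of {}", "a rendering of {}"]

defender_prompts = [t.format(defender_token) for t in defender_templates]
attacker_prompts = [t.format(placeholder_token) for t in attacker_templates]
```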
Results: Attacker cannot use the perturbed images to personalize the model. The output images are not meaningful.
Next step:
- Develop a better attack method that can use the perturbed images to personalize the model; try different transformations. There is a tool called Adverse Cleaner, which employs computer-vision filtering to remove adversarial perturbations. In the Anti-Dreambooth paper, the authors already test against this tool (as well as a Gaussian filter and JPEG compression), and their method shows some robustness to these filters. A sketch of the two simple filters is given below.
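For reference, a minimal sketch of the Gaussian-blur and JPEG re-compression baselines using Pillow; Adverse Cleaner applies its own computer-vision filtering and is not reproduced here.

```python
from io import BytesIO
from PIL import Image, ImageFilter

def gaussian_purify(img: Image.Image, radius: float = 2.0) -> Image.Image:
    # Low-pass filtering smooths out high-frequency adversarial noise.
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def jpeg_purify(img: Image.Image, quality: int = 75) -> Image.Image:
    # JPEG re-compression also discards high-frequency components.
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage: purify a perturbed image before running personalization on it.
# cleaned = jpeg_purify(gaussian_purify(Image.open("perturbed.png")))
```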
Version 2.3
Three parties: defender, attacker, and adversarial cleaner.
- Defender: generate perturbed images by maximizing the reconstruction loss.
- Attacker: use the perturbed images to personalize the model.
- Adversarial cleaner: clean the perturbed images before the attacker can use them to personalize the model.
In this version:
- Develop an adversarial cleaner to clean the perturbed images. The adversarial cleaner tries to remove the adversarial perturbation from the input image by applying another perturbation, where the loss minimizes the distance between the input images (the perturbed ones) and images generated from a prompt (see the sketch below).
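A hedged sketch of the cleaner's optimization as described above. generate() is a hypothetical stand-in for sampling images with the cleaner's prompt, and the MSE distance and \(\ell_\infty\) budget are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def clean_images(x_pert, generate, num_steps=100, lr=1e-2, eps=8 / 255):
    """Sketch of the cleaner's optimization (assumed form, not the exact implementation).

    x_pert   : perturbed images from the defender, shape (B, 3, H, W), values in [0, 1]
    generate : callable() -> images sampled with the cleaner's prompt
               (e.g. "a photo of human"); a hypothetical stand-in for G(z, tau(T(c)))
    """
    delta = torch.zeros_like(x_pert, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(num_steps):
        cleaned = (x_pert + delta).clamp(0, 1)
        # Minimize the distance between the cleaned images and the generated ones.
        loss = F.mse_loss(cleaned, generate())
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # keep the cleaning perturbation small
    return (x_pert + delta).clamp(0, 1).detach()

# Toy usage; in practice generate() samples from the diffusion model.
x_pert = torch.rand(4, 3, 64, 64)
cleaned = clean_images(x_pert, generate=lambda: torch.rand(4, 3, 64, 64), num_steps=5)
```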
Results: There are two versions of the adversarial cleaner. In advmersh_clean, the prompt templates of the defender and the cleaner are the same and differ only in the placeholder token (i.e., person vs. human). In advmersh_clean2, both the prompt templates and the placeholder tokens of the defender and the cleaner are different. The results show that in the first version, after cleaning, the attacker still cannot use these cleaned perturbed images to personalize the model; however, in one generated sample, we can recall the original person. In the second version, the attacker cannot generate any meaningful images at all.
advmerch_clean
advmerch_clean2
Conclusion
- A simple perturbation method can work well.
- It is surprising that even when the defender and the attacker use different placeholder tokens (i.e., version 2.2), the attacker cannot use the perturbed images to personalize the model.
- The adversarial cleaner can reduce the adversarial effect somewhat, but it is not strong enough.
Next step
- Develop a stronger adversarial cleaner that can clean the perturbed images completely.
- Generate perturbed images by maximizing the embedding distance between the perturbed images and the text embeddings of the prompts.
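For the second direction, a hedged sketch using CLIP image/text embeddings; the checkpoint name and the cosine distance are assumptions, and the perturbed images are assumed to be already resized/normalized to CLIP's input format.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; ideally the same CLIP as the diffusion model's text encoder.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip.requires_grad_(False)

def clip_similarity(perturbed_pixels, prompts):
    """Mean cosine similarity between CLIP image embeddings of the perturbed images
    and CLIP text embeddings of the prompts. The defender would *minimize* this
    (i.e., maximize the embedding distance) with a PGD loop as in Version 2.1."""
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_embeds = clip.get_text_features(**text_inputs)                       # (P, D)
    image_embeds = clip.get_image_features(pixel_values=perturbed_pixels)     # (B, D)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    return (image_embeds @ text_embeds.T).mean()

# Toy usage with random tensors standing in for preprocessed perturbed images.
pixels = torch.rand(4, 3, 224, 224, requires_grad=True)
sim = clip_similarity(pixels, ["a photo of sks"])
sim.backward()   # gradients w.r.t. the perturbation are available in pixels.grad
```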
Version 3
Questions
- How can we completely remove the adversarial perturbation from the input images, so that the attacker can still use them to personalize the model?
References
- https://huggingface.co/spaces/multimodalart/latentdiffusion/blame/5ea000aef2990e60c6cf2ebed428268cff8ae465/app.py
- Code that can get text and image embeddings from CLIP and calculate SSIM to detect NSFW data