Tutorial on Adversarial Machine Learning - Part 2

Adversarial Attacks

Reference: This post is inspired by the tutorial “Adversarial Machine Learning for Good” by Dr. Pin-Yu Chen (IBM Research) at the AAAI 2022 conference. Link to the tutorial: https://sites.google.com/view/advml4good

Trustworthy Machine Learning vs Adversarial Machine Learning

Trustworthy Machine Learning (TML) is a broad area of research that focuses on developing and deploying reliable, transparent, fair, and secure machine learning systems. Adversarial Machine Learning (AML) is a subfield of TML that focuses specifically on defending machine learning systems against malicious attacks. Let’s compare these two concepts in more detail:

Trustworthy Machine Learning: concerned with the broader goals of reliability, transparency, fairness, and security across the whole machine learning lifecycle.

Adversarial Machine Learning: concerned specifically with attacks that exploit model vulnerabilities and with defenses against those attacks.

In this post, we will focus on AML research and discuss four main topics: adversarial attacks, adversarial defenses, certified robustness, and AML for good.

Categories of Adversarial Attacks

Adversarial attacks aim to manipulate machine learning models by exploiting their vulnerabilities. They can be categorized based on the following criteria:

Figure: Categories of adversarial attacks, based on the attacker's access to the data, the training process, or the inference process of a target model. Adapted from Pin-Yu Chen's tutorial (2022).

Based on the above criteria, there are many types of adversarial attacks. In this post, we will focus on the most common ones: poisoning attacks, backdoor attacks, model extraction, evasion attacks, and privacy attacks.

Poisoning Attacks

Aim: Poisoning attacks aim to manipulate the training data or the training process of a target model to corrupt its parameters. The attacker’s goal is to make the target model misbehave on future test data.

Threat scenario: StabilityAI and OpenAI are two competing companies that develop image generation models. StabilityAI has a better model, but OpenAI wants to win the competition, so it hires a group of hackers to break into StabilityAI’s training process and corrupt its model. As a result, StabilityAI’s model starts to generate images that are worse than before, and OpenAI wins the competition.
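To make the mechanics concrete, below is a minimal sketch of a label-flipping poisoning attack. It assumes the attacker can corrupt a fraction of the training labels before the victim trains a simple scikit-learn classifier; the dataset, model, and poison rates are illustrative and not taken from any specific published attack.

```python
# Minimal label-flipping poisoning sketch (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Victim's clean data and train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(y, rate, rng):
    """Attacker flips the labels of a random `rate` fraction of samples."""
    y_poisoned = y.copy()
    n_flip = int(rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # binary labels: 0 <-> 1
    return y_poisoned

# The victim trains on (possibly poisoned) data; clean test accuracy drops
# as the poison rate increases.
for rate in [0.0, 0.1, 0.3]:
    y_poisoned = flip_labels(y_train, rate, rng)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    acc = model.score(X_test, y_test)
    print(f"poison rate {rate:.0%}: clean test accuracy {acc:.3f}")
```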


Backdoor Attacks

Aim: Backdoor attacks manipulate the training data or the training process of a target model to embed a backdoor into it. The attacker’s goal is to make the model misbehave on future test inputs that contain a trigger, a specific pattern that does not appear in clean data (recent work has even shown backdoors with imperceptible triggers). Without the trigger, the model behaves normally; when the trigger is present in the input, the model makes wrong predictions. To keep the threat model practical, the attacker is usually assumed to control only the training data, not the training process. That said, it is also possible to embed a backdoor by manipulating the training process or the model architecture itself.

Threat scenario: A bank wants to use a machine learning model to predict whether a loan applicant will default. The bank hires a data scientist to build the model, but the data scientist is dishonest and wants the model to misbehave on their own future loan application, so they embed a backdoor into it. As a result, the model makes wrong predictions on future inputs that contain the trigger, and the data scientist benefits from the attack.
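Below is a minimal BadNets-style sketch of how such a trigger can be planted through data poisoning alone. The image shapes, the white corner patch, the target class, and the poison rate are illustrative assumptions rather than the setup of any specific paper.

```python
# Minimal BadNets-style backdoor poisoning sketch (illustrative assumptions).
import torch

def add_trigger(images: torch.Tensor, patch_size: int = 3) -> torch.Tensor:
    """Stamp a white square in the bottom-right corner of each image (N, C, H, W)."""
    triggered = images.clone()
    triggered[:, :, -patch_size:, -patch_size:] = 1.0
    return triggered

def poison_dataset(images, labels, target_class=0, poison_rate=0.1):
    """Apply the trigger to a random fraction of samples and relabel them."""
    n_poison = int(poison_rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images = images.clone()
    labels = labels.clone()
    images[idx] = add_trigger(images[idx])
    labels[idx] = target_class
    return images, labels

# Toy data standing in for a real image dataset (e.g. 32x32 RGB images).
images = torch.rand(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
poisoned_images, poisoned_labels = poison_dataset(images, labels)

# At inference time, the attacker stamps the same trigger on any input to
# steer a backdoored model toward the target class; clean inputs behave normally.
test_input = add_trigger(torch.rand(1, 3, 32, 32))
```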


Model Extraction

Aim: Model extraction attacks aim to steal a functional copy of a target model by querying it. The attacker’s goal is to obtain a surrogate model with similar behaviour or performance, without access to the target’s parameters or training data.

Threat scenario: A bank has a machine learning model that can predict whether a loan applicant will be accepted. The bank wants to keep this model secret, but a competitor wants to obtain it. The competitor hires a hacker who submits a large number of queries and replicates the model from the observed outputs. The competitor can then use the replica for its own benefit.
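A minimal sketch of this query-and-replicate loop is shown below: the attacker treats the victim model as a black box, collects its predictions on attacker-chosen queries, and fits a surrogate on those input-output pairs. The victim and surrogate models and the synthetic data are illustrative assumptions.

```python
# Minimal model-extraction sketch (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Victim trains a private model on its own data.
X_private, y_private = make_classification(n_samples=2000, n_features=20, random_state=0)
victim = RandomForestClassifier(random_state=0).fit(X_private, y_private)

# Attacker only sees the predicted labels for inputs it submits.
rng = np.random.default_rng(1)
X_queries = rng.normal(size=(5000, 20))   # attacker-chosen query inputs
y_observed = victim.predict(X_queries)    # black-box responses

# Train a surrogate that mimics the victim's input-output behaviour.
surrogate = LogisticRegression(max_iter=1000).fit(X_queries, y_observed)
agreement = np.mean(surrogate.predict(X_private) == victim.predict(X_private))
print(f"surrogate agrees with victim on {agreement:.1%} of the victim's data")
```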


Privacy Attacks

Aim: Privacy attacks aim to extract sensitive information from a target model, such as its training data or information about individual training samples.

Threat scenario: A bank trains a chatbot on its clients’ data and releases it for public use. A competitor wants to obtain the clients’ data, so it hires a hacker who interacts with the chatbot and extracts that data. The competitor can then use the data for its own benefit.

Some notable works: Recent work from Carlini et al. demonstrates that it is possible to extract training data from Large Language Models (LLMs) such as GPT-2 and GPT-3. Another line of work shows that generative models can memorize and reproduce copyrighted or otherwise unauthorized training images.
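As a small, concrete illustration of a related privacy attack (membership inference, a different technique from the LLM data extraction mentioned above), here is a standard loss-threshold baseline: samples on which the model has unusually low loss are guessed to be part of its training set. The data, model, and threshold choice are illustrative assumptions.

```python
# Minimal loss-threshold membership-inference sketch (illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_member, X_nonmember, y_member, y_nonmember = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Victim trains (and somewhat overfits) on the "member" half only.
victim = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_member, y_member)

def per_sample_loss(model, X, y):
    """Negative log-likelihood of the true label under the model."""
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(probs, 1e-12, None))

loss_member = per_sample_loss(victim, X_member, y_member)
loss_nonmember = per_sample_loss(victim, X_nonmember, y_nonmember)

# Attacker guesses "member" whenever the loss falls below a threshold.
threshold = np.median(np.concatenate([loss_member, loss_nonmember]))
tpr = np.mean(loss_member < threshold)     # members correctly identified
fpr = np.mean(loss_nonmember < threshold)  # non-members wrongly flagged
print(f"membership inference: TPR={tpr:.2f}, FPR={fpr:.2f}")
```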

Evasion Attacks (Adversarial Examples)

Aim: Evasion attacks aim to manipulate the input data to cause a target model to make an incorrect prediction. The common goal is either to make the model predict a specific class (targeted attack) or any class other than the correct one (untargeted attack). The perturbation is usually small and imperceptible to human perception.

Threat scenario: An eKYC system uses a machine learning model to verify a person’s identity. A hacker wants to bypass the system, so they manipulate their ID card to fool the model. As a result, the system accepts the manipulated ID card and grants the hacker access. The consequences are even worse if such a system is used in warfare.

Some notable works: Szegedy et al. first demonstrated the existence of adversarial examples, and Madry et al. proposed the PGD (Projected Gradient Descent) attack.
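To make this concrete, here is a minimal L-infinity PGD sketch in PyTorch, following the general recipe of Madry et al. The toy model, input shapes, and hyperparameters (epsilon, step size, number of steps) are illustrative assumptions, not a reproduction of the original paper’s setup.

```python
# Minimal untargeted L-infinity PGD sketch (illustrative assumptions).
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Maximize the loss within an L-infinity ball of radius eps around x."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Take a signed gradient ascent step, then project back into the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Toy victim model and inputs standing in for a real image classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))

x_adv = pgd_attack(model, x, y)
print("max perturbation:", (x_adv - x).abs().max().item())  # stays within eps
```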

Some thoughts on the practicality of adversarial attacks

While adversarial attacks are definitely a big issue when it comes to deploying machine learning models, in my opinion the situation isn’t as bad as it seems. That’s probably why many companies, even though they know about these attacks, don’t take proper action to defend against them. (Except for a few like Google, Microsoft, and Facebook, but they’re not the majority. I’m exhausted from searching for job opportunities in the industry, and let me tell you, most companies just don’t care about adversarial attacks.)

When it comes to poisoning attacks and backdoor attacks, attackers need access to the training data or the training process, which isn’t always possible; they can’t control how a model is trained. In model extraction attacks, attackers have to submit a large number of queries to the target model, which isn’t always doable either. As for privacy attacks: current research successfully extracts some training data, but that data is often already out there on the internet, so it’s not really a big deal. Adversarial examples, on the other hand, are a real headache when deploying machine learning models. But white-box attacks aren’t practical, because the attacker needs to know everything about the target model. Black-box attacks are more realistic, but even then they either require a ton of queries or rely on transferability.

So what’s the point of all this?