Machine learning classifiers trained via gradient descent are vulnerable to arbitrary misclassification attack

ID VU:425163
Type cert
Reporter CERT
Modified 2020-06-04T19:14:00



Machine learning models trained using gradient descent can be forced to make arbitrary misclassifications by an attacker that can influence the items to be classified. The impact of a misclassification varies widely depending on the ML model's purpose and of what systems it is a part.


This vulnerability results from using gradient descent to determine classification of inputs via a neural network. As such, it is a vulnerability in the algorithm. In plain terms, this means that the currently-standard usage of this type of machine learning algorithm can always be fooled or manipulated if the adversary can interact with it. What kind or amount of interaction an adversary needs is not always clear, and some attacks can be successful with only minor or indirect interaction. However, in general more access or more interaction options reduce the effort required to fool the machine learning algorithm. If the adversary has information about some part of the machine learning process (training data, training results, model, or operational/testing data), then with sufficient effort the adversary can craft an input that will fool the machine learning tool to yield a result of the adversary's choosing. In instantiations of this vulnerability that we are currently aware of, "sufficient effort" ranges widely, between seconds and weeks of commodity compute time.

Within the taxonomy by Kumar et al., such misclassifications are either perturbation attacks or adversarial examples in the physical domain. There are other kinds of failures or attacks related to ML systems, and other ML systems besides those trained via gradient descent. However, this note is restricted to this specific algorithm vulnerability. Formally, the vulnerability is defined for the following case of classification.

Let \(x\) be a feature vector and \(y\) be a class label. Let \(L\) be a loss function, such as cross entropy loss. We wish to learn a parameterization vector \(\theta\) for a given class of functions \(f\) such that the expected loss is minimized. Specifically, let

\[\theta_{\star} = min_\theta \mathop{\mathbb{E}}_{x,y} L\left(f\left(\theta, x\right), y\right) \]

In the case where \(f\left(\theta,x\right)\) is a neural network, finding the global minimizer \(\theta_{\star}\) is often computationally intractable. Instead, various methods are used to find \(\hat{\theta}\) which is a "good enough" approximation. We refer to \(f\left(\hat{\theta}, .\right)\) as the fitted neural network.

If stochastic gradient descent is used to find \(\hat{\theta}\) for the broadly defined set of \(f\left(\theta,x\right)\) representing neural networks, then the fitted neural network \(f\left(\hat{\theta}, .\right)\) is vulnerable to adversarial manipulation.

Specifically, it is possible to take \(f\left(\hat{\theta}, .\right)\) and find an \(x'\) such that the difference between \(x\) and \(x'\) is smaller than some arbitrary \(\epsilon\) and yet \(f\left(\hat{\theta}, x\right)\) has the label \(y\) and \(f\left(\hat{\theta}, x'\right)\) has an arbitrarily different label \(y'\).

The uncertainty of the impact of this vulnerability is compounded because practitioners and vendors do not tend to disclose what machine learning algorithms they use. However, training neural networks by gradient descent is a common technique. See also the examples in the impact section.


An attacker can interfere with a system which uses gradient descent to change system behavior. As an algorithm vulnerability, this flaw has a wide-ranging but difficult-to-fully-describe impact. The precise impact will vary with the application of the ML system. We provide three illustrative examples; these should not be considered exhaustive.

  • Automatic speech recognition can be forced to erroneously transcribe an audio clip, and the attacker can freely pick the target transcription if they can pick the audio clip.
  • Facial recognition can be forced to erroneously identify faces_ _across many photographs, in different lightings, by use of physical eyeglass frames or manipulation of the source photo. The attacker can choose an arbitrary target identity of the misclassification.
  • Tesla autopilot in March 2019 had a demonstrated vulnerability (as in, on a closed road course, the attack works) where the car can be forced to change lanes arbitrarily based on stickers placed on the road surface. In January 2020, similar attacks were demonstrated using projections, such as from airborne drones, rather than stickers.


The CERT/CC is currently unaware of a specific practical solution to this problem. To defend generally, do both of:
1.*** Adversarial training and testing of ML models*. If a model must be exposed to adversarial input, the only well-tested defense against this kind of adversarial attack is adversarial training: using adversarially perturbed examples as part of the neural network's training regimen. When used for training, these examples increase the model's robustness against adversarial attack. Such training significantly increases the difficulty of attacking the model, but it does not guarantee the model is not vulnerable. Test your machine learning algorithm against known attacks. Libraries featuring reference implementations of popular attacks and defenses include Cleverhans, Foolbox, and the Adversarial Robustness Toolkit (ART).

2. Standard defense in depth. A machine learning tool is not different from other software in this regard. Any tool should be deployed in an ecosystem that supports and defends it from adversarial manipulation. For machine learning tools specifically designed to serve a cybersecurity purpose, this is particularly important, as they are exposed to adversarial input as part of their designed tasking. See CMU/SEI-2019-TR-005 for more information on evaluating machine learning tools for cybersecurity.

Other proposed solutions, which rely on either pre-processing the data or simply obfuscating the gradient of the loss, do not work when your adversary is aware that you are attempting those mitigations.


See Papernot et al. (2016) Towards the Science of Security and Privacy in Machine Learning or Biggio and Roli (2018) Wild patterns: Ten years after the rise of adversarial machine learning for a brief history.

This document was written by Allen Householder, Jonathan M. Spring, Nathan VanHoudnos, and Oren Wright.

Vendor Information

This advisory information is generic and does not describe any specific instance of this type of problem, so no vendors have been notified or listed here.

CVSS Metrics

Group | Score | Vector
Base | 0 | AV:--/AC:--/Au:--/C:--/I:--/A:--
Temporal | 0 | E:ND/RL:ND/RC:ND
Environmental | 0 | CDP:ND/TD:ND/CR:ND/IR:ND/AR:ND


  • <>
  • <>
  • <>
  • <>
  • <>
  • <>
  • <>
  • <>
  • <>
  • <>
  • <>

Other Information

Date Public: | 2020-03-19
Date First Published: | 2020-03-19
Date Last Updated: | 2020-06-04 19:14 UTC
Document Revision: | 25