Five Essential Machine Learning Security Papers

We recently published “Practical Attacks on Machine Learning Systems”, which has a very large references section – possibly too large – so we’ve boiled down the list to five papers that are absolutely essential in this area. If you’re beginning your journey in ML security, and have the very basics down, these papers are a great next step.

We’ve chosen papers that explain landmark techniques but also describe the broader security problem, discuss countermeasures and provide comprehensive and useful references themselves.

  1. Stealing Machine Learning Models via Prediction APIs, 2016, by Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter and Thomas Ristenpart

ML models can be expensive to train, may be trained on sensitive data, and represent valuable intellectual property, yet they can be stolen – surprisingly efficiently – by querying them.

From the paper: “We demonstrate successful model extraction attacks against a wide variety of ML model types, including decision trees, logistic regressions, SVMs, and deep neural networks, and against production ML-as-a-service (MLaaS) providers, including Amazon and BigML.1 In nearly all cases, our attacks yield models that are functionally very close to the target. In some cases, our attacks extract the exact parameters of the target (e.g., the coefficients of a linear classifier or the paths of a decision tree).”

  1. Extracting Training Data from Large Language Models, 2020, by Nicholas Carlini, Florian Tramer, Eric Wallace, et. al.

Language models are often trained on sensitive datasets; transcripts of telephone conversations, personal emails and messages… since ML models tend to perform better when trained on more data, the amount of sensitive information involved can be very large indeed. This paper describes a relatively simple attack technique to extract verbatim training samples from large language models.

From the paper: “We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128 bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.”

  1. Model inversion attacks that exploit confidence information and basic countermeasures, 2015, by Matt Fredrikson, Somesh Jha and Thomas Ristenpart

Model Inversion attacks enable the attacker to generate samples that accurately represent each of the classes in a training dataset, for example, an image of a person in a facial recognition system or a picture of a signature.

From the paper: “We experimentally show attacks that are able to estimate whether a respondent in a lifestyle survey admitted to cheating on their significant other and, in the other context, show how to recover recognizable images of people’s faces given only their name and access to the ML model.”

  1. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning, 2017, by Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song

Obtaining training data is a major problem in Machine Learning, and it’s common for training data to be drawn from multiple sources; user-generated content, open datasets and datasets shared by third parties. This attack applies to a scenario where an attacker is able to supplement the training set of a model with a small amount of data of their own, resulting in a model with a “backdoor” – a hidden, yet specifically targeted behaviour that will change the output of the model when presented with some specific type of input.

From the paper: “The face recognition system is poisoned to have backdoor with a physical key, i.e., a pair of commodity reading glasses. Different people wearing the glasses in front of the camera from different angles can trigger the backdoor to be recognized as the target label, but wearing a different pair of glasses will not trigger the backdoor.”

  1. Explaining and harnessing adversarial examples, 2014, by Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy

Neural networks classifiers are surprisingly “brittle”; a small change to an input can cause a surprisingly large change in the output classification. Classifiers are now a matter of life and death; the difference between a “STOP” sign and a “45 MPH” sign, a gun and a pen, or the classification of a medical scan are extremely important decisions that are increasingly automated by these systems, so this odd behaviour is an extremely important security problem.

This paper is an exploration of the phenomenon, with several suggested explanations, discussion around generation of adversarial examples, and defences.

The paper also poses several interesting questions. From the paper: “An intriguing aspect of adversarial examples is that an example generated for one model is often misclassified by other models, even when they have different architecures or were trained on disjoint training sets. Moreover, when these different models misclassify an adversarial example, they often agree with each other on its class.”