Project Ava: On the Matter of Using Machine Learning for Web Application Security Testing – Part 3: Understanding Existing Approaches and Attempts

Last week, our research team explored the capabilities of IBM’s Natural Language Processing (NLP) tool and how we might be able to apply it to social engineering or phishing campaigns.

In this phase of the research, the team talk us through the existing approaches and attempts to harness machine learning for pentesting scenarios.


The aim of this phase of research was to look at what has already been achieved by others who have utilised ML in penetration testing scenarios. The questions we set out to answer in this phase included:

  • What’s been done already?
  • How successful were attempts?
  • Who are the main players in this space?
    • Is the main work coming from academia?
    • How long has it taken them to get where they are?
  • What open source tools exist in this domain and can we reuse anything for Project Ava?

Ultimately this phase helped us understand how much work was required in later phases of development, with a heads-up on likely stumbling blocks.

Academic research and existing attempts at complete systems

Our open source research helped to confirm that complete systems using AI and ML to identify and exploit vulnerabilities in web applications are still in the early stages of development. However, when the AI involved in a complete system is broken down into its individual components, there is a large body of work covering each component.

However, attempts at creating complete systems are far scarcer and still in their infancy. With the exception of the DARPA Cyber Grand Challenge [1] (discussed later), our research did not find any universities that had attempted to create complete systems. All of the attempts we found were by professional security researchers presenting their work at security conferences such as Black Hat.

What follows is a summary of key approaches at complete systems and of relevant academic research.

Bharadwaj Machiraju

Bharadwaj Machiraju presented his work at Nullcon Goa 2017 on training machines for web application scanning [2]. Machiraju attempted to build his own system from scratch and highlighted some key issues he faced as well as the strategies he used to overcome issues along the way.

Although he did not get near a complete system, his approach and the work achieved are good examples of why this is a challenging area and what needs to be done to overcome these challenges. Machiraju notes that login and sign-up sequences are the most complicated web sequences that will ever be performed and that training a machine to intuitively recognise and carry out these sequences is a hard task. Machiraju breaks down web browsing as a human into three core components:

  1. Identifying inputs
  2. Understanding feedback
  3. Performing sequences

To label inputs, he first used a Naïve Bayes classifier to label input placeholders. However, this approach was too laborious and was restrictive in the classes of label that could be applied – for example, how would foreign-language placeholders be labelled?

His second approach data-mined a large quantity of input forms from real users via a browser plugin he created for the task. The plugin collected the placeholder values of input fields, along with an anonymised suggestion of valid input data based on what users typed. He then used Term Frequency-Inverse Document Frequency (TF-IDF) to represent all placeholders in a multi-dimensional space, which groups similar placeholders together and removes the need to label them.
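
As an illustration of the technique (our own sketch, not Machiraju's code), the following shows how TF-IDF vectors over placeholder text let similar placeholders group together by cosine similarity, with no labels required. The placeholder strings are invented examples:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each document as a smoothed TF-IDF vector over word tokens."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenised for t in set(doc))
    n = len(docs)
    vocab = sorted(df)
    vecs = []
    for doc in tokenised:
        tf = Counter(doc)
        # Smoothed IDF keeps weights positive even for ubiquitous terms.
        vecs.append([tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab])
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

placeholders = [
    "enter your email address",   # 0
    "email address",              # 1
    "your email address",         # 2
    "first name",                 # 3
    "enter first name",           # 4
    "password",                   # 5
]
vecs = tfidf_vectors(placeholders)
```

With this representation, the three email-like placeholders score highly against one another and near zero against "first name" or "password", which is exactly the grouping effect described above.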

Machiraju highlighted the importance of understanding feedback in order for a system to understand previously unencountered placeholders. This is an area that lends itself to Natural Language Processing (NLP).

In his first attempt, Machiraju used Part-of-Speech (PoS) tagging to break down sentences into parts of meaning. This approach required lots of complex logic that would need to differ for each language, and would not cope with misspellings and bad grammar. He went on to use seq2seq, a sequence-to-sequence model built on long short-term memory (LSTM) networks. This approach required lots of training data and some initial manual classification, but yielded good results.

Q-value learning was used for Reinforcement Learning (RL). To perform sequences, Machiraju first tried using Doc2Vec (also known as Paragraph2Vec) to concatenate all the link texts, form placeholders and labelled input data as a string and represent that data in vector format. A Least Squares Policy Iteration (LSPI) [3] State-Action-Reward-State-Action (SARSA) [4] agent was used, with Radial Basis Functions (RBF) for storing the value.
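
To illustrate the reinforcement learning side, here is a deliberately tiny tabular SARSA sketch of our own (not Machiraju's LSPI/RBF agent) on a toy five-page "site" where only the final page yields a reward, standing in for a navigation sequence that ends in a successful form submission:

```python
import random

random.seed(0)

N_STATES, ACTIONS = 5, (0, 1)          # pages 0..4; action 0 = back, 1 = forward
GOAL = N_STATES - 1                    # the page carrying the positive reward
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def policy(s):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    """Move one page back or forward; reward 1 only on reaching the goal."""
    s2 = max(0, min(GOAL, s + (1 if a else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(200):                   # training episodes
    s, a = 0, policy(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = policy(s2)
        # SARSA update: on-policy temporal-difference learning.
        Q[(s, a)] += ALPHA * (r + GAMMA * (0.0 if done else Q[(s2, a2)]) - Q[(s, a)])
        s, a = s2, a2

# Inspect the learned greedy policy per page (1 = keep moving forward).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```

The point of the sketch is the update rule: the agent learns, from reward alone, which action to prefer on each page. Real web pages have far larger state spaces, which is exactly why Machiraju needed function approximation and feature selection.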

Machiraju found that this arrangement performed very well for simple web applications but fell short when dealing with larger, more complex websites. One of the main challenges was e-commerce websites, where following each link leads to one of many near-identical product pages.

A human can easily see the similarity between these journeys and will know not to proceed along every product path. The system outlined above, however, would not quickly recognise the similarity in these paths, because the state space becomes huge and complex, which makes achieving convergence within a reasonable period very difficult. His solution was to use an unsupervised feature selector to select features and then convert those features into a state vector, reducing the feature space to a manageable size. He used RL to train his feature selector: if a path ended with a positive reward, then the number of links and forms was stored, and for the next path the required number of links and forms were labelled as per the stored path.

The labels were used as the state vector – the feature selector replaced Doc2Vec, and all links and forms were fed into it. The output of the feature selector was then a state vector in a discrete space of N links and M forms. To improve his model, Machiraju tried a semi-supervised approach in which new sequences were taught to the model once. He found that if he first walked the machine through two or three websites, his training time was greatly reduced, so he collected some seed data for this purpose and dramatically improved the time required for training.

Now that his machine could crawl through web pages at approximately 80% of the efficiency of a human, Machiraju focused it on reflected XSS detection. He broke the representation of HTML down into two factors: the tag and its context (e.g. div and class). He then fed these into an RL machine. For payload generation, he presented the machine with a quantity of valid mark-up and set the positive condition to a pop-up being generated. This enabled the machine to generate its own valid payloads for a variety of different scenarios.
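
A minimal sketch of this kind of positive condition follows. This is our simplification – Machiraju's actual condition was a pop-up firing in a browser – using a reward function that checks whether a probe survives into the rendered page unescaped. The two render functions are hypothetical stand-ins for real pages:

```python
import html
import secrets

def xss_reward(render, payload):
    """Return 1.0 if the payload reaches the page unescaped, else 0.0.

    `render` is any callable mapping an input string to resulting HTML;
    a unique random marker ties the probe to its reflection point.
    """
    marker = "prb" + secrets.token_hex(4)
    page = render(marker + payload)
    if marker + payload in page:
        return 1.0                    # reflected verbatim: candidate XSS
    if marker + html.escape(payload) in page:
        return 0.0                    # reflected but encoded: sanitised
    return 0.0                        # not reflected at all

# Toy sinks standing in for real pages (hypothetical, for illustration):
def vulnerable(s):
    return f"<p>You searched for {s}</p>"

def sanitised(s):
    return f"<p>You searched for {html.escape(s)}</p>"
```

Plugged into an RL loop, a signal like this rewards only payloads that survive output encoding, which is what drives the machine towards valid, working mark-up.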

Although the demonstrated attempt took several minutes to find a valid XSS payload in a vulnerable web application, it was successful. While Machiraju's system is in its infancy, it highlights some of the key challenges faced when building complete systems and demonstrates some useful approaches for tackling them.

Isao Takaesu: Spider Artificial Intelligence Vulnerability Scanner (SAIVS)

In 2016 at Code Blue and Black Hat Asia, Isao Takaesu of Mitsui Bussan Secure Directions presented the Spider Artificial Intelligence Vulnerability Scanner (SAIVS) [5]. Takaesu is aiming to bridge the large skills gap that exists in Japan. He hopes that AI/ML solutions will help make up for the lack of available security professionals.

Takaesu states the importance of crawling intelligently and highlights the fact that many vulnerabilities only reveal themselves once a user has come to the end of a journey. This is something that conventional scanners miss.

He discusses how humans intrinsically follow flows by recognising keywords and the meaning associated with them. Humans are also able to understand what input is likely to be valid for a given field and what needs to be completed to progress to the next step. In addition, humans are able to process error messages and feed that information back into their workflows; for example, “Email is not valid” means the email address has been entered incorrectly.
Takaesu summarises these as three requirements for an intelligent crawler:

  1. Recognise the page type
  2. Recognise the success or failure of a page transition
  3. Learn optimal parameter values

For the first two requirements, he used Naïve Bayes and for the third, he used a Multi-layer Perceptron (MLP) [6] with Q-Learning. He used Naïve Bayes with predefined categories and probability of keyword occurrence to recognise page types. His approach categorises a presented page based on the keywords found within the page and the probability of those keywords existing in one of the predefined categories.

To create his table of keywords in predefined categories he manually sifted through approximately 50 web applications. For the page transitioning, Takaesu created a table of success words and failure words and again used Naïve Bayes. He used an MLP with Q-Learning to teach his crawler optimal parameters for input fields.
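
The keyword-table approach can be sketched as a small Naïve Bayes classifier. This is our own illustration with invented keyword data, not Takaesu's implementation:

```python
import math
from collections import Counter

# Toy keyword tables per page type (invented, standing in for the tables
# Takaesu built from ~50 real applications).
TRAINING = {
    "login":    ["username password sign in", "login password remember me"],
    "register": ["create account email password confirm", "sign up new account"],
    "search":   ["search query results filter", "search products keyword"],
}

def train(data):
    """Build log-priors, per-class keyword counts and the shared vocabulary."""
    priors, likelihoods, vocab = {}, {}, set()
    total_docs = sum(len(docs) for docs in data.values())
    for label, docs in data.items():
        priors[label] = math.log(len(docs) / total_docs)
        counts = Counter(w for d in docs for w in d.split())
        vocab |= set(counts)
        likelihoods[label] = counts
    return priors, likelihoods, vocab

def classify(page_words, priors, likelihoods, vocab):
    """Pick the page type with the highest Laplace-smoothed log-probability."""
    best, best_score = None, -math.inf
    for label in priors:
        counts = likelihoods[label]
        total = sum(counts.values())
        score = priors[label] + sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in page_words if w in vocab)
        if score > best_score:
            best, best_score = label, score
    return best

priors, likelihoods, vocab = train(TRAINING)
```

A page containing "username" and "password" keywords scores highest against the login class, mirroring how SAIVS recognises page types from keyword occurrence probabilities.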

The inputs for the MLP were the current page and the desired next page. The outputs were input values to transition to the next page. Q-Learning was then used to supervise the learning of the MLP and optimise the outputs. After 300 rounds of optimisation, the efficiency of his MLP was still poor.

To improve this, he used word2vec with cosine similarity to calculate the similarity of previously seen words. Word2vec represents a word as a vector, and cosine similarity measures the angle between two vectors to give a similarity rating: words with similar meanings produce vectors separated by a small angle, and therefore a cosine similarity close to 1. He used a Long Short-Term Memory (LSTM) network to detect reflected XSS.
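
A minimal sketch of this similarity lookup, with tiny hand-made vectors standing in for real word2vec embeddings:

```python
import math

# Toy 3-dimensional "embeddings" (invented; real word2vec vectors typically
# have hundreds of dimensions and are learned from large corpora).
embeddings = {
    "email":    [0.90, 0.10, 0.00],
    "e-mail":   [0.85, 0.15, 0.05],
    "password": [0.10, 0.90, 0.20],
    "search":   [0.00, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def most_similar(word, table):
    """Return the previously seen word whose vector is closest to `word`'s."""
    return max((w for w in table if w != word),
               key=lambda w: cosine(table[word], table[w]))
```

This is the mechanism that lets a crawler treat "e-mail" as the already-understood "email" rather than as an entirely new input.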

LSTMs can generate output from a starting seed. Again, Takaesu highlights the importance of understanding context: he uses the example of a text area to demonstrate how a human would know to break out of the enclosing tag before injecting a payload such as alert(1). He used 20,000 pages of HTML syntax and 10,000 pages of JavaScript to train his machine, after which it was able to generate valid syntax. To overcome sanitisation of input, he used an MLP with Q-Learning for supervision; approximately 100 attempts were required to train the machine to overcome sanitisation, and he used pre-training to improve the efficiency of this process. The videos [5] of his final creation are quite impressive. Although he tests it on basic web applications, the ability to intelligently browse and discover XSS is an impressive feat.

Existing approaches at partial systems

Bishop Fox – DeepHack

Bishop Fox developed DeepHack, a tool that uses ML to perform SQL injection [7]. It uses a simple neural network with supervised learning. They bootstrapped their model with known data and highlight that obtaining a good sample of labelled training data is the key to solving problems with AI/ML.

DeepHack is agnostic to database type – it just needs to be trained on a new dataset. They caution that the training ruleset must be set out carefully, as one may inadvertently reward behaviours one does not actually want. They also found significant speed gains when training on GPUs.

DARPA Cyber Grand Challenge

In 2016 DARPA held the Cyber Grand Challenge (CGC) [1], a challenge for teams to build autonomous systems that simultaneously attack and defend network services. The systems then competed against each other on an isolated network with no outside interaction. The attack scenarios were based on finding zero-day vulnerabilities in services: competitors intelligently fuzzed networked services to find zero-days, and many of the exposed services contained flaws inspired by famous bugs such as Heartbleed, Shellshock and MS08-067. All of these services were written from scratch and were subtly different from their inspirations. The machines then had to patch their own services and hack other teams' services; points were awarded for maintaining service availability and for successfully exploiting competitors' services.

The winning team was Shellphish with their machine Mechaphish [1]. They are the authors of the angr framework and have open-sourced their winning machine.

The walkthrough talk that DARPA produced for the challenge is very interesting and well worth a watch. Beyond this, however, the DARPA CGC output was not deemed applicable to Project Ava.

While the CGC was a very impressive feat, Project Ava's focus is on web applications and, where possible, on emulating the actions of a human consultant. In addition, during the usual course of web application assessments, our clients would not expect zero-days to be found. Of course, we do sometimes find zero-days during web application penetration tests; however, our focus is fundamentally on reporting against known classes of vulnerability within those applications, such as SQLi and XSS.

Related Academic Works

In their 2018 paper [8], Young et al. present the state of NLP in academia. They highlight the shift from shallow models based on high-level, sparse features, such as SVMs and logistic regression, to neural networks based on dense vector representations – a shift that has produced marked improvements. Deep neural networks are capable of automatic, multi-level feature representation learning, whereas traditional methods relied heavily on manual feature engineering. Their paper goes into great detail on the state of NLP today and the advancements made in the field.

The Hierarchical Attention Network (HAN) was presented by Yang et al. in their 2016 paper [9]. It derives a document vector: a high-level vector that summarises all of the information in the sentences of a document. It is essentially built from two levels of attention – one that attends to individual words and another that attends to sentences as a whole. This gives it a much deeper understanding of meaning, since the context in which words appear changes their meaning. Yang et al. performed experiments on large datasets and found that HAN significantly outperformed other widely used machine-learning NLP approaches including SVMs, CNNs and LSTMs. For example, HAN achieved 71.0% document-classification accuracy on a Yelp 2015 dataset, compared to 58.2% for an LSTM and 62.4% for SVM + bigrams. Bofin Babu of CloudSEK gives a slightly more layman's view of the paper on their blog [10].
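
A much-simplified sketch of the two-level attention idea follows. This is our illustration only – the real HAN also uses GRU encoders and learned projections, which are omitted here, and the random vectors below are stand-ins for trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(vectors, context):
    """Weighted average of rows, where the weights are a softmax over each
    row's similarity to a context vector (randomly initialised here; in HAN
    the context vector is learned)."""
    scores = vectors @ context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vectors, weights

D = 8                                      # embedding dimension
word_ctx = rng.normal(size=D)              # word-level "what is informative?"
sent_ctx = rng.normal(size=D)              # sentence-level equivalent

# A toy "document": 3 sentences of 4, 6 and 3 word embeddings respectively.
sentences = [rng.normal(size=(n, D)) for n in (4, 6, 3)]

# Level 1: attend over words to build one vector per sentence.
sent_vecs = np.stack([attention_pool(s, word_ctx)[0] for s in sentences])

# Level 2: attend over sentence vectors to build the document vector.
doc_vec, sent_weights = attention_pool(sent_vecs, sent_ctx)
```

The hierarchy is the key design point: informative words dominate their sentence vector, and informative sentences dominate the document vector, which is why HAN outperforms flat models on document classification.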

Commercial works in progress or existing offerings

During this phase of research we inspected public sources for information on existing commercial offerings in the realm of ML applied to web application security testing, in addition to any public information on works in progress by commercial entities in this space. Our research here is by no means exhaustive and did not involve any hands-on usage of any of the products or systems identified from public literature.

Many patents filed by large tech organisations were identified in this space [11], [12]. A summary of some of the commercial offerings potentially leveraging ML in web application security assessment includes:

  • Micro Focus – Fortify Web Inspect [13]
  • High-Tech Bridge – ImmuniWeb [14]
  • IBM – AppScan [15] – the cloud-based version, "Application Security on Cloud", uses Intelligent Finding Analytics, an ML analyser that sifts through results to reduce false positives
  • CloudSEK – In their 2017 blog post [16], CloudSEK present some detail about their system and discuss the vulnerabilities they found in LinkedIn stemming from insecure direct object references. They have named their AI platform Cloud-AI.
  • Cronus – CyBot [17] provides continuous, automatic, global penetration testing

Existing frameworks

While many of the projects already discussed have released parts of their solutions as open source, the following frameworks, not yet discussed, were also identified during our research:


Kurgan

Kurgan [18] was created in April 2017 by Leon Nash. It is a Python-based, multi-agent AI web application hacking framework that uses Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs), and it is still in its early stages. It uses pre-existing tools (SQLMap and XSSSnipper) for vulnerability discovery, and Selenium and PyExecJS to interact with web pages. There is a web GUI, or it can be run from the command line.

Yet Another Web Application Testing Toolkit (YAWATT)

First presented in 2006, YAWATT [19] is a Ruby-based framework that uses reinforcement learning with a Bayesian network for classification. The project started well, but has seen little further development since it was first presented.


Conclusions

This blog post explored existing works and approaches in the problem domain of ML applied to web application security testing. We drew the following conclusions from this phase of research:

What has already been done?

A great deal has been done on the individual components, but the space as a whole is still in its infancy – though it is unlikely to remain so for long. There are many players in this field, with limited information available on the techniques being developed by commercial entities. Most of the published works are from security researchers, some of whom have spun their research off into businesses.

The common threads running through the published works are the importance of training data, the necessity of some form of NLP, and the fact that intelligently crawling a web application is one of the hardest parts of automating web application assessments. The methods and solutions presented gave us a good head start and a useful roadmap to follow as we looked to create our own solution.

How successful were attempts?

Attempts have reached varying degrees of success. No one has yet presented a complete working system, but many have created reasonable crawlers, and a few have created ML solutions for identifying specific vulnerability classes such as SQLi and XSS. Takaesu's XSS solution stands out in particular, and Bishop Fox's solution shows how easy it is to achieve some degree of success with even simple ML.

Scanners that are currently on sale do not yet offer a strong ML-driven solution. The lack of information on some vendors' usage of ML makes it difficult to assess how far they have progressed towards working solutions, but it also reinforces the conjecture that no one yet has a complete, strong ML solution.

How long has it taken for solutions to get to where they are?

Most researchers seem to have taken a year or so to develop their initial products, indicating that there are a number of challenges to being quick to market in this space, likely due to a combination of:

  • Skillset – the need for deep technical understanding in the realm of data science and how different techniques can be applied to a technical specialist area such as web application testing.
  • Data – the need for real-world data, and lots of it, for model training.
  • Refinement – the need for experimentation and refinement of models to understand best-fit algorithms and approaches.

What open source tools exist in this domain and can we reuse anything that is already out there?

Most people and entities have released only parts of their work, while the existing frameworks are in their infancy and/or have been abandoned by their creators. There are, however, many well-known libraries for C/C++, CUDA and Python that take the pain out of writing AI code, as seen in Part 1 of this blog series.

From this phase of research, we concluded that while we can take inspiration for many useful directions from the work of others, it seemed unlikely that we could directly use much of the existing code of others, meaning that we would be creating our own solution using existing ML frameworks and libraries.

From observation of many of our client needs in this space we also observed a demand for a capability such as Project Ava that can fit easily within an existing testing strategy, enhancing shift-left strategies (particularly within agile approaches) and in ways that reduce (or ideally remove) false positives.

Alas, Project Ava was not going to be a cakewalk…


[1] DARPA Cyber Grand Challenge, DARPA, 2016. Challenge materials and the Mechaphish GitHub repository available online.
[2] Bharadwaj Machiraju, "Training Machines for Application Scanning in Retro Style", Nullcon Goa 2017. Presentation recording, slides, GitHub repository and blog available online.
[5] Isao Takaesu, Spider Artificial Intelligence Vulnerability Scanner (SAIVS), 2016. Presented at Code Blue Tokyo and Black Hat Asia 2016; slides, demo videos and MBSD blog posts available online.
[7] DeepHack, Bishop Fox, 2017. GitHub repository and DEF CON 25 presentation available online.
[8] Young, T., Hazarika, D., Poria, S. and Cambria, E., 2017. Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709.
[9] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. and Hovy, E., 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1480-1489).
[10] Bofin Babu, CloudSEK, 2018. Hierarchical Attention Neural Networks: Beyond the traditional approaches for text classification.
[18] Leo Nash, 2017, Kurgan. Confraria0day slides, project website and GitHub repository available online.
[19] Fyodor and Meder, 2006. Yet Another Web Application Testing Toolkit (YAWATT). HITB conference slides and website available online.

Written by NCC Group
First published on 10/06/19
