Project Ava: On the Matter of Using Machine Learning for Web Application Security Testing – Part 9: Adventures with Expert Systems

In the penultimate blog of the Project Ava series, our research team take a look at expert systems to test for Cross-Site Scripting (XSS) vulnerabilities, develop a proof of concept, and discuss whether machine learning could ever be harnessed to complement current pentesting capabilities.


Penetration testing can sometimes be repetitive and tedious. This suggests automating certain processes within an assessment could provide efficiencies. But how hard is it to automate those repetitive tasks while not introducing false negatives?

As part of Project Ava, we investigated an expert-system-style approach to testing for Cross-Site Scripting (XSS) vulnerabilities and developed a proof of concept. Here, with tongue in cheek, we discuss the potential next steps on our journey towards understanding how machine learning could effectively complement pentesting methods in the future.


Web application penetration testers perform very similar basic checks over and over. Testing can be repetitive and requires sustained focus for long periods of time. The repetitive nature of the tests indicates that there is potential for automation in those areas. Benefits might include:

  • Efficiencies that result in the same or similar outcome in a shorter period of time.
  • Less time spent by a consultant on repetitive tasks, allowing them to focus more of their time on harder-to-identify bug classes.
  • A repeatable uniform approach might provide assurance that a human lapse in attention does not miss a result.

Project Ava has focused on the use of AI (specifically ML) in testing the security of web applications. AI is of course a broad term, and to most people it refers to ML and neural networks, but AI can also be realised by building expert systems [1].

An expert system is a computer program that emulates the decision-making process of a human expert. Expert systems were the first truly successful AI applications and were used in areas such as medical diagnosis, dam safety and mortgage loan applications.

At its heart, an expert system is a series of if-then rules which operate on facts. An inference engine uses the knowledge base of facts and the rules to make deductions, creating new facts so that the cycle can continue. This post describes the application of an expert system to the detection of XSS vulnerabilities.
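To make the rule-and-fact cycle concrete, here is a minimal forward-chaining sketch in Python. It is not how CLIPS works internally; the `Rule` class, the `infer` function and the fact names are all hypothetical and exist only to illustrate the idea of rules producing new facts until nothing more can be deduced.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple

Fact = Tuple[str, ...]


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: FrozenSet[Fact]  # every one of these facts must be known
    conclusion: Fact             # the fact asserted when the rule fires


def infer(facts: Iterable[Fact], rules: Iterable[Rule]) -> Set[Fact]:
    """Fire rules repeatedly until a full pass adds no new fact."""
    known = set(facts)
    rules = list(rules)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


# Hypothetical facts and rules, for illustration only.
RULES = [
    Rule("reflected-is-candidate",
         frozenset({("input-reflected",)}),
         ("xss-candidate",)),
    Rule("candidate-with-raw-brackets",
         frozenset({("xss-candidate",), ("lt-gt", "not-encoded")}),
         ("suggest-payload",)),
]

derived = infer({("input-reflected",), ("lt-gt", "not-encoded")}, RULES)
```

The second rule can only fire because the first one has already added a new fact, which is the "cycle" described above.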


The first step was to develop a proof of concept. To narrow the problem space, we chose to build a system that would identify a very simple stored cross-site scripting vulnerability.

In order to achieve this, the decision was made to write a plug-in for Burpsuite which used an expert system called CLIPS [2] - developed in the mid-80s by NASA [3]. It is widely used and is now in the public domain. It also has a Java native interface, making it slightly easier to integrate with Burpsuite.

The system needed to perform the following functions:

  1. Identify a potential parameter value in a web page
  2. Confirm that the value was controllable in the request
  3. Test the value to see whether it was vulnerable to XSS

Parts one and two do not require an expert system, so to speed up the process, they were written as a straight Burpsuite plugin. A listener monitored all requests and responses and when it saw a value in a response which had been passed as a parameter in a previous request, it repeated the request with a randomly-generated value using the same character set as the original request to ensure that the random value was also present in the response. This confirms supplied input is returned in the response - an indication of the potential for XSS.
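The listener's core check can be sketched roughly as follows. This is plain Python rather than the Burpsuite API, and the function name, the parameter names in the example and the `min_length` cut-off are all assumptions made for illustration; the real plugin may well work differently.

```python
from typing import List
from urllib.parse import parse_qsl


def reflected_parameters(request_body: str, response_body: str,
                         min_length: int = 4) -> List[str]:
    """Return the names of form parameters whose values appear verbatim
    in a later response. Very short values are skipped (an assumption),
    because they would match almost any page by accident."""
    hits = []
    for name, value in parse_qsl(request_body):
        if len(value) >= min_length and value in response_body:
            hits.append(name)
    return hits


# Hypothetical request and response bodies.
request_body = "name=a&message=test123"
response_body = '<div id="guestbook_comments">Name: a<br />Message: test123<br /></div>'
candidates = reflected_parameters(request_body, response_body)
```

Here only `message` is reported, since the one-character name is below the assumed length cut-off.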

Part three is where the expert system comes in. We need to examine the data and generate facts about it. From these facts we can derive tests to determine if certain payloads will work. In the case of XSS, for example, we need to determine the context of the target value. Is it in plain text, part of an HTML tag or in JavaScript? Once this is identified, we can test to see if it is possible to break out of this context and execute a payload.

Due to the nature of CLIPS (and a lack of familiarity with the language and how to integrate it with Burpsuite), this was separated into the functional and logical components. The Burpsuite plugin performed the functional aspects and collected the results, and the CLIPS system took the results and applied the logic to determine what the next step was.

To test the system, Damn Vulnerable Web App’s [4] stored cross-site scripting vulnerability was used. It consists of a message board as shown below. The user adds their name and a message, and the message board shows a list of such messages. The screenshot below shows a message ‘test123’ has been entered. A very short name of ‘a’ was used so that only one possible vulnerable field would be exposed.


The Burpsuite component is shown in the screenshot below. Here we can see that it has identified an input which was later observed on a response page, which corresponds to the screenshot from above. This is our potential stored XSS vulnerability, so we need to test it.

Right clicking on the input gives the option to test it. Behind the scenes, this passes both the request which creates the value, and the request which retrieves the response with our input in it to a confirmer.

The confirmer generates a random test case as shown below. This was captured using the Logger++ [5] Burpsuite plugin. The test case is the same length and uses the same character classes found in the original message in an effort to not have the value rejected. It then confirms that the value is represented by retrieving the page which shows the result.
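One way the confirmer's test value could be produced is sketched below. The function name `random_like` is hypothetical, and so is the exact definition of "character class" used here (letters, digits, and any other characters the original happened to contain).

```python
import random
import string
from typing import Optional


def random_like(original: str, rng: Optional[random.Random] = None) -> str:
    """Return a random string with the same length as `original`,
    drawn from the character classes that `original` uses."""
    rng = rng or random.Random()
    pool = ""
    if any(ch.isalpha() for ch in original):
        pool += string.ascii_letters
    if any(ch.isdigit() for ch in original):
        pool += string.digits
    # Keep any other characters (spaces, punctuation) that the original used.
    pool += "".join(sorted({ch for ch in original if not ch.isalnum()}))
    return "".join(rng.choice(pool) for _ in original)


probe = random_like("test123")
```

For an input such as 'test123' this yields a seven-character alphanumeric value, which should be no more likely to be rejected by input validation than the original was.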

The result in the source is shown below. It is in plain text, i.e. not inside an HTML tag, JavaScript or some other entity.

<div id="guestbook_comments">Name: a<br />Message: oSNC54B<br /></div>
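Working out the context of a reflected value could be done along these lines, using Python's standard `html.parser` module. This is only a sketch: `ContextFinder` and `find_contexts` are hypothetical names, and of the context labels only `in-text` matches a fact used in this post; `in-attribute` and `in-script` are invented for the example.

```python
from html.parser import HTMLParser
from typing import Set


class ContextFinder(HTMLParser):
    """Record the contexts in which a marker value appears."""

    def __init__(self, marker: str) -> None:
        super().__init__(convert_charrefs=True)
        self.marker = marker
        self.contexts: Set[str] = set()
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for _name, value in attrs:
            if value and self.marker in value:
                self.contexts.add("in-attribute")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self.marker in data:
            self.contexts.add("in-script" if self._in_script else "in-text")


def find_contexts(html: str, marker: str) -> Set[str]:
    finder = ContextFinder(marker)
    finder.feed(html)
    finder.close()
    return finder.contexts


page = '<div id="guestbook_comments">Name: a<br />Message: oSNC54B<br /></div>'
contexts = find_contexts(page, "oSNC54B")
```

For the guestbook response the only context found is plain text, which is what allows the (in-text) fact to be asserted.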

People more familiar with CLIPS should probably look away now, as this may offend. At this point, some facts can be established about the potential exploit:

  1. A controllable value is stored in the backend and returned in a response
  2. The stored value occurs in plain text, i.e. not inside an HTML tag or JavaScript
  3. The length of the input is seven characters

These facts are asserted into the CLIPS environment as the following:

  1. (stored-value)
  2. (in-text)
  3. (input-length 7)

One additional fact is also inserted. This fact asserts that we don’t know what the encoding of < and > characters is. In a real world example, there would likely be quite a few of these things we don’t know, but for this proof of concept, just one is sufficient. The reason for this fact will become apparent in the next step.

  1. (lt-gt unknown)

These facts are now applied to a set of rules. The first rule is as follows:

(defrule test-length
  (stored-value)
  (in-text)
  (input-length ?length)
  (test (< ?length 30))
  =>
  (assert (next-step check-length)))

This rule is called ‘test-length’ and is triggered when three facts are present in the system: (stored-value), (in-text) and an input-length of less than 30. When it triggers, it asserts a new fact:

  1. (next-step check-length)

When this rule has run, and the new fact has been asserted, the Burpsuite plugin will extract the next-step fact and see that it needs to check for length. It will then generate a request with a longer payload and ensure that it is returned in the response. The result of this can be seen below:

If the stored response contains the longer value, another fact is asserted:

  1. (input-length 31)

This does not overwrite the original (input-length 7) fact, so that fact needs to be retracted, i.e. removed from the fact store.

A second rule follows:

(defrule test-angle-brackets
  (stored-value)
  (in-text)
  (lt-gt unknown)
  =>
  (assert (next-step check-lt-gt)))

This rule is called ‘test-angle-brackets’ and is triggered when three facts are present in the system: (stored-value), (in-text) and (lt-gt unknown). When it triggers, it asserts a new fact:

  1. (next-step check-lt-gt)

When this rule has run, and the new fact has been asserted, the Burp plugin will extract the next-step fact and see that it needs to check for < and > encoding. A request with angle brackets is generated to check if the returned value is encoded.

If the angle brackets are returned unencoded in the stored response, then the (lt-gt unknown) fact is retracted and a further fact is asserted:

  1. (lt-gt not-encoded)

At this stage, we know enough about our exploit to suggest a payload. This is discovered using the following rule:

(defrule exploit-plain
  (stored-value)
  (in-text)
  (lt-gt not-encoded)
  (input-length ?l)
  (test (> ?l 30))
  =>
  (assert (payload "//")))

This rule triggers when a stored value in the plain text context, which does not encode < and > and accepts an input length of more than 30, is present. It asserts a new fact:

  1. (payload "//")

The Burpsuite plugin then notices that a payload has been discovered and provides it to the user, who can then use it to test for XSS.
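The whole exchange between the rules and the plugin can be simulated in a few lines of Python. This sketch does not use CLIPS or Burpsuite at all: `run_rules` mirrors the three rules as described in the text, while `simulate_plugin`, its two flags and the (lt-gt encoded) fact are hypothetical stand-ins for the real requests the plugin would send.

```python
from typing import Set, Tuple

Fact = Tuple


def run_rules(facts: Set[Fact]) -> Set[Fact]:
    """Apply the three rules once and return only the newly asserted facts."""
    new = set()
    base = ("stored-value",) in facts and ("in-text",) in facts
    lengths = [f[1] for f in facts if f[0] == "input-length"]
    if base and any(n < 30 for n in lengths):            # test-length
        new.add(("next-step", "check-length"))
    if base and ("lt-gt", "unknown") in facts:           # test-angle-brackets
        new.add(("next-step", "check-lt-gt"))
    if (base and ("lt-gt", "not-encoded") in facts
            and any(n > 30 for n in lengths)):           # exploit-plain
        new.add(("payload", "//"))
    return new - facts


def simulate_plugin(facts: Set[Fact], accepts_long_input: bool = True,
                    encodes_angle_brackets: bool = False) -> Set[Fact]:
    """Stand-in for the plugin: act on next-step facts, feed results back."""
    facts = set(facts)
    while True:
        new = run_rules(facts)
        if not new:
            return facts
        facts |= new  # next-step facts are kept so that no step is repeated
        if ("next-step", "check-length") in new and accepts_long_input:
            # Retract the old length fact before asserting the new one.
            facts = {f for f in facts if f[0] != "input-length"}
            facts.add(("input-length", 31))
        if ("next-step", "check-lt-gt") in new:
            facts.discard(("lt-gt", "unknown"))
            state = "encoded" if encodes_angle_brackets else "not-encoded"
            facts.add(("lt-gt", state))


initial = {("stored-value",), ("in-text",),
           ("input-length", 7), ("lt-gt", "unknown")}
final = simulate_plugin(initial)
```

With the default flags the final fact set contains the payload fact. If the simulated target encodes angle brackets or rejects the longer input, no payload is suggested.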

The simplicity of this example merely shows the concept of how an expert system could be used in a situation such as XSS detection, but building a fully working web application testing system would involve significantly more effort.


Given the complexity of integrating the active request generation within the Burpsuite plugin with the CLIPS expert system, a decision was made to further develop the ruleset and generate the facts manually to see how effective the AI parts would be.

This led very quickly to several observations. The first is that a lack of familiarity with CLIPS was a hindrance in building more complex systems, although the work required to produce a prototype that was useful enough to describe in a blog post has helped with that.

In order to build an expert system, you really need at least one expert, and probably more than one. This makes the construction of the system more difficult, but the end result is more effective. Converting your expert’s knowledge into rules is very time consuming and labour intensive. The good news is that once the initial work has been done, it should take far less effort to maintain the expertise and introduce new techniques as they are discovered.

Covering just XSS requires careful planning of rules. Covering all potential web application vulnerabilities is likely to be exponentially more complex.

It is unclear what the appropriate way to test is. In the system used, the vulnerability was identified, and then individual tests were made to determine the properties of it before a payload was suggested. It may be more effective to attempt the most basic payload first, and then examine the results and generate the facts from that. This would result in potentially fewer requests being sent, and hence be more efficient.

The question a lot of people are probably asking is, how is a system like this better than Burpsuite’s active scanner? There are several answers to this. The first is that it has the potential to be a lot more efficient. When Burp’s active scanner is scanning, it tries almost every single payload it knows about. An expert system can use knowledge about the application to target payloads to the specific vulnerability.

The second reason is that the active scanner either finds a vulnerability or fails with no information. This system, upon failing to find a vulnerability, can give you the reasons why it couldn’t. The simple example here, for instance, would tell you that < and > were encoded, or that the accepted input length was too short.
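Because the facts remain in the fact store after the rules have run, turning them into an explanation is straightforward. The sketch below assumes facts are held as Python tuples; the function name, the wording of the reasons and the (lt-gt encoded) fact are hypothetical.

```python
from typing import List, Set, Tuple


def explain_failure(facts: Set[Tuple]) -> List[str]:
    """Return human-readable reasons why no payload was suggested.
    An empty list means a payload fact is present, so nothing failed."""
    if any(f[0] == "payload" for f in facts):
        return []
    reasons = []
    if ("lt-gt", "encoded") in facts:
        reasons.append("< and > are encoded in the response")
    if ("lt-gt", "unknown") in facts:
        reasons.append("encoding of < and > has not been tested yet")
    lengths = [f[1] for f in facts if f[0] == "input-length"]
    if lengths and max(lengths) <= 30:
        reasons.append("no input longer than 30 characters was accepted")
    return reasons


report = explain_failure({("stored-value",), ("in-text",),
                          ("input-length", 7), ("lt-gt", "encoded")})
```

For the fact set in the example, the report lists both the encoded angle brackets and the short accepted length.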


We think an expert system is a good way forward for automated web application testing, and probably also infrastructure testing; but for more complete coverage, other AI technologies would need to be incorporated as well.

A lot of web application testing is methodical and repetitive, which is perfect for an expert system. However, there are also a lot of tasks which a human tester performs, and which we take for granted as being easy, that are very difficult for a standard program to perform. These areas are perhaps those where ML can be applied.



Written by NCC Group
First published on 25/06/19
