Some Musings on Common (eBPF) Linux Tracing Bugs

Having been in the game of auditing kprobe-based tracers for the past couple of years, and in light of the upcoming DEF CON talk on eBPF tracer race conditions (which you should go watch) being given by a friend of mine from the NYU(-Poly) (OSIR)IS(IS) lab, I figured I would wax poetic on some of the more amusing issues that tracee, Aqua Security’s “Runtime Security and Forensics” framework for Linux, used to have, and on other general issues that anyone attempting to write a production-ready tracer should be aware of. These come up frequently whenever we’re looking at Linux tracers. This post assumes some familiarity with writing eBPF-based tracing tools; if you haven’t played with eBPF yet, consider poking around your kernel and system processes with bpftrace or bcc.

tl;dr In this post, we discuss an insecure coding pattern commonly used in system observability and program analysis tools, and several techniques that enable one to evade observation by tools that use that pattern, especially when they are used for security event analysis. We also discuss several ways in which such software can be written so as not to enable such evasion, and the current limitations that make it more difficult than necessary to write such code correctly.

fork(2)/clone(2) et al Considered Harmful

As we’ve mentioned before,1 one does not simply trace fork(2) or clone(2), because the child process is actually started (from a CoW snapshot of the caller process) before the syscall has returned to the caller. This is a problem: any tracer that waits for the return value of fork(2)/clone(2)/etc. to start watching the PID will invariably lose some of the initial operations of the child >99% of the time. While this is not a “problem” for most applications’ behavior, it becomes troublesome for monitoring systems that follow individual process hierarchies live, instead of all processes globally and retroactively, as anyone can simply “double-fork” in rapid succession to throw off the yoke of inspection; the second fork(2) will be missed ~100% of the time (even when implementing the bypass in C).

// $ gcc -std=c11 -Wall -Wextra -pedantic -o double-fork double-fork.c
// $ ./double-fork <iterations> </path/to/binary>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char** argv, char** envp) {
  if (argc < 3) {
    return 1;
  }

  int loop = atoi(argv[1]);

  for (int i=0; i < loop; i++) {
    pid_t p = fork();
    if (p != 0) {
      return 0;
    }
  }

  return execve(argv[2], &argv[2], envp);
}
/tracee/dockerer/tracee.main/dist # ./tracee --trace process:follow --filter pid=48478 -e execve -e clone
TIME(s)        UID    COMM             PID     TID     RET              EVENT                ARGS
111506.067379  0      bash             0       0       50586            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F56929630BF, tls: 0
111506.069569  0      bash             0       0       0                execve               pathname: ./double-fork, argv: [./double-fork 100 /usr/bin/id]
111506.077553  0      double-fork      0       0       50590            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
111506.079220  0      double-fork      0       0       50592            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
...
111506.142778  0      double-fork      0       0       50690            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
111506.143236  0      double-fork      0       0       0                execve               pathname: /usr/bin/id, argv: [/usr/bin/id]
...
111514.289461  0      bash             0       0       50699            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F56929630BF, tls: 0
111514.293312  0      bash             0       0       0                execve               pathname: ./double-fork, argv: [./double-fork 100 /usr/bin/id]
111514.303955  0      double-fork      0       0       50700            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
111514.304240  0      double-fork      0       0       50701            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
...
111514.356522  0      double-fork      0       0       50799            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
111514.356949  0      double-fork      0       0       0                execve               pathname: /usr/bin/id, argv: [/usr/bin/id]
...
111519.410500  0      double-fork      0       0       50836            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F533D0A10BF, tls: 0
111519.411117  0      double-fork      0       0       50837            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F533D0A10BF, tls: 0

The start of child execution is triggered from the wake_up_new_task() function called by _do_fork() (now kernel_clone()), which is the internal kernel function powering all of the fork(2)/clone(2) alikes.

pid_t kernel_clone(struct kernel_clone_args *args)
{
  ...
  /*
   * Do this prior waking up the new thread - the thread pointer
   * might get invalid after that point, if the thread exits quickly.
   */
  trace_sched_process_fork(current, p);

  pid = get_task_pid(p, PIDTYPE_PID);
  nr = pid_vnr(pid);

  if (clone_flags & CLONE_PARENT_SETTID)
    put_user(nr, args->parent_tid);

  if (clone_flags & CLONE_VFORK) {
    p->vfork_done = &vfork;
    init_completion(&vfork);
    get_task_struct(p);
  }

  wake_up_new_task(p);
  ...

In our talk (slides 24-25), we gave a bpftrace example of a fork-exec tracer that would not lose to race conditions.

kprobe:wake_up_new_task {
  $chld_pid = ((struct task_struct *)arg0)->pid;
  @pids[$chld_pid] = nsecs;
}

tracepoint:syscalls:sys_enter_execve {
  if (@pids[pid]) {
    $time_diff = (nsecs - @pids[pid]) / 1000000;
    if ($time_diff <= 10) {
      printf("%s => ", comm);
      join(args->argv);
    }
  }
  delete(@pids[pid]);
}

In general, we prefer to hook wake_up_new_task() with a kprobe since it’s fairly stable and gives raw access to the entire fully-configured child struct task_struct* right before it is started. However, if one does not care about the other metadata accessible from that pointer, nor needs it to be fully initialized (i.e. if they just want the PID), they can hook the sched_process_fork tracepoint event, which is triggered by the trace_sched_process_fork(current, p) call shown above. This is what tracee currently opts to do as of commit 8c944cf07f15045f395f7754f92b7809316c681c/tag v0.5.4.

Additionally, the problems of tracing the fork(2)/clone(2)/etc. syscalls directly have led to (and still lead to, in any tracer not hooking wake_up_new_task()/sched_process_fork) other issues that can present bypasses in the scenario of live child process observation.

PID Namespaces

The most interesting of these issues is that fork(2)/clone(2)/etc. return PIDs within the context of the PID namespace of the calling process (thread). As a result, the return values of these syscalls cannot meaningfully be used by a kernel-level tracer without also accounting for child pidns PID-to-host PID mappings. On distros that allow unprivileged user namespaces to be created, this allows arbitrary processes to create nested PID namespaces by first creating a nested user namespace. This can be done in a number of ways, such as via unshare(2), setns(2), or even clone(2) with CLONE_NEWUSER and CLONE_NEWPID.

root@box:~# su -s /bin/bash nobody
nobody@boc:/root$ unshare -Urpf --mount-proc
root@box:/root# nano &
[1] 18
root@box:/root# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  0.0   8264  5156 pts/0    S    01:37   0:00 -bash
root          17  0.0  0.0   7108  4132 pts/0    T    01:38   0:00 nano
root          18  0.0  0.0   8892  3344 pts/0    R+   01:38   0:00 ps aux
root@box:/root# unshare -pf --mount-proc
root@box:/root# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  1.2  0.0   8264  5200 pts/0    S    01:38   0:00 -bash
root          15  0.0  0.0   8892  3332 pts/0    R+   01:38   0:00 ps aux
// $ gcc -std=c11 -Wall -Wextra -pedantic -o userns-clone-fork userns-clone-fork.c
// $ ./userns-clone-fork </path/to/binary>

#define _GNU_SOURCE
#include <sched.h>

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int clone_child(void *arg) {
  char** argv = (char**)arg;

  printf("clone pid: %u\n", getpid());

  pid_t p = fork();
  if (p != 0) {
    return 0;
  }

  printf("fork pid: %u\n", getpid());

  return execve(argv[1], &argv[1], NULL);
}

static char stack[1024*1024];

int main(int argc, char **argv) {
  if (argc < 2) {
    return 1;
  }

  printf("parent pid: %u\n", getpid());

  pid_t p = clone(clone_child, &stack[sizeof(stack)], CLONE_NEWUSER|CLONE_NEWPID, argv);
  if (p == -1) {
    perror("clone");
    exit(1);
  }

  return 0;
}
/tracee/dockerer/tracee.main/dist # ./tracee --trace process:follow --filter pid=54519 -e execve -e clone
TIME(s)        UID    COMM             PID     TID     RET              EVENT                ARGS
117174.563477  0      bash             0       0       55395            clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F99617200BF, tls: 0
117174.566597  0      bash             0       0       0                execve               pathname: ./userns-clone-fork, argv: [./userns-clone-fork /usr/bin/id]
117174.578037  0      userns-clone-fo  0       0       55396            clone                flags: CLONE_NEWUSER|CLONE_NEWPID, stack: 0x5621B2DBB030, parent_tid: 0x0, child_tid: 0x7F7C130B6285, tls: 18
117174.579600  0      userns-clone-fo  0       0       2                clone                flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F7C1307A0BF, tls: 0

However, more interestingly, this means that such tracers will not work by default on containers unless the containers run within the host PID namespace, a dangerous configuration. This was the behavior we observed with tracee prior to the aforementioned commit 8c944cf07f15045f395f7754f92b7809316c681c.

clone3(2) also Considered Harmful

Prior to tracee 0.5.4, fork(2)/clone(2)-alike syscall PID return value processing was handled with the following code:

SEC("raw_tracepoint/sys_exit")
int tracepoint__raw_syscalls__sys_exit(struct bpf_raw_tracepoint_args *ctx)
{
    long ret = ctx->args[1];
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    struct pt_regs *regs = (struct pt_regs*)ctx->args[0];
    int id = READ_KERN(regs->orig_ax);

    ...

    // fork events may add new pids to the traced pids set
    // perform this check after should_trace() to only add forked childs of a traced parent
    if (id == SYS_CLONE || id == SYS_FORK || id == SYS_VFORK) {
        u32 pid = ret;
        bpf_map_update_elem(&traced_pids_map, &pid, &pid, BPF_ANY);
        if (get_config(CONFIG_NEW_PID_FILTER)) {
            bpf_map_update_elem(&new_pids_map, &pid, &pid, BPF_ANY);
        }
    }

In the above snippet, the syscall ID is compared against those of clone(2), fork(2), and vfork(2). However, it is not compared against the ID of clone3(2). While tracee does separately log clone3(2) events (since commit f44eb206bf8e80efeb1da68641cb61f3f00c522c/tag v0.4.0), this omission resulted in clone3(2)-created child processes not being followed prior to commit 8c944cf07f15045f395f7754f92b7809316c681c.

// $ gcc -std=c11 -Wall -Wextra -pedantic -o clone3 clone3.c
// $ ./clone3 </path/to/binary>

#define _GNU_SOURCE
#include <sched.h>

#include <linux/sched.h>
#include <linux/types.h>

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>

int clone_child(void *arg) {
  char** argv = (char**)arg;

  return execve(argv[1], &argv[1], NULL);
}

int main(int argc, char **argv) {
  if (argc < 2) {
    return 1;
  }

  printf("parent pid: %u\n", getpid());

  struct clone_args args = {0};
  pid_t p = syscall(__NR_clone3, &args, sizeof(struct clone_args));

  if (p == -1) {
    perror("clone3");
    return 1;
  }

  if (p != 0) {
    printf("clone pid: %u\n", p);
  } else {
    clone_child(argv);
  }

  return 0;
}

Since that commit, which introduces the change to use sched_process_fork, tracee now obtains both the host PID and in-namespace PID:

SEC("raw_tracepoint/sched_process_fork")
int tracepoint__sched__sched_process_fork(struct bpf_raw_tracepoint_args *ctx)
{
    // Note: we don't place should_trace() here so we can keep track of the cgroups in the system
    struct task_struct *parent = (struct task_struct*)ctx->args[0];
    struct task_struct *child = (struct task_struct*)ctx->args[1];

    int parent_pid = get_task_host_pid(parent);
    int child_pid = get_task_host_pid(child);

    ...

    if (event_chosen(SCHED_PROCESS_FORK) && should_trace()) {
       ... 
        int parent_ns_pid = get_task_ns_pid(parent);
        int child_ns_pid = get_task_ns_pid(child);

        save_to_submit_buf(submit_p, (void*)&parent_pid, sizeof(int), INT_T, DEC_ARG(0, *tags));
        save_to_submit_buf(submit_p, (void*)&parent_ns_pid, sizeof(int), INT_T, DEC_ARG(1, *tags));
        save_to_submit_buf(submit_p, (void*)&child_pid, sizeof(int), INT_T, DEC_ARG(2, *tags));
        save_to_submit_buf(submit_p, (void*)&child_ns_pid, sizeof(int), INT_T, DEC_ARG(3, *tags));

        events_perf_submit(ctx);
    }

    return 0;
}

TOCTTOU Issues Endemic to Lightweight Tracers

In our CCC talk,2,3 we discussed how there exists a significant time-of-check-to-time-of-use (TOCTTOU) race condition when hooking a syscall entrypoint (e.g. via a kprobe, but also more generally), as userland-supplied data that is copied/processed by the hook may change by the time the kernel accesses it as part of the syscall’s implementation.

The main way to get around this issue is to hook internal kernel functions, tracepoints, or LSM hooks to access syscall inputs after they have already been copied into kernel memory (and probe the in-kernel copy). However, this approach is not universally applicable; it only works in the presence of such internal anchor points. Failing that, one has to rely on the Linux Auditing System (aka auditd), which, in addition to simple raw syscall argument dumps, has its calls directly interleaved within the kernel’s codebase to process and log inputs after they have been copied from user memory for processing by the kernel. auditd’s calls are very carefully (read: fragilely) placed to ensure that the values used for filtering and logging are not subject to race conditions, even in the cases where data is being read from user memory.

auditd: The “d” Stands for Dancing

For example, auditd’s execve(2) logging takes the following form for a simple ls -lht /:

type=EXECVE msg=audit(...): argc=4 a0="ls" a1="--color=auto" a2="-lht" a3="/"

This log line is generated by audit_log_execve_info() from apparent userspace memory:

  const char __user *p = (const char __user *)current->mm->arg_start;
  ...
      len_tmp = strncpy_from_user(&buf_head[len_buf], p,
                                  len_max - len_buf);

However, we can observe that the execve(2) argument handling of auditd is “safe” with the following bpftrace script, which hooks some of the functions called during an execve(2) syscall that have symbols:

kprobe:__audit_bprm { printf("__audit_bprm called\n"); }
kprobe:setup_arg_pages { printf("setup_arg_pages called\n"); }
kprobe:do_open_execat { printf("do_open_execat called\n"); }
kprobe:open_exec { printf("open_exec(\"%s\") called\n", str(arg0)); }
kprobe:security_bprm_creds_for_exec { printf("security_bprm_creds_for_exec called\n"); }
# bpftrace trace.bt 
Attaching 5 probes...
do_open_execat called
security_bprm_creds_for_exec called
open_exec("/lib64/ld-linux-x86-64.so.2") called
do_open_execat called
setup_arg_pages called
__audit_bprm called

The first do_open_execat() call is that from bprm_execve(), which is called from do_execveat_common(), right after argv is copied into the struct linux_binprm. setup_arg_pages is called from within a struct linux_binfmt implementation and sets current->mm->arg_start to bprm->p. And then lastly, __audit_bprm() is called (from exec_binprm(), itself called from bprm_execve()), which sets the auditd context type to AUDIT_EXECVE, resulting in audit_log_execve_info() being called from audit_log_exit() (via show_special()) to generate the above type=EXECVE log line.

It goes without saying that this is not really something that eBPF code could hope to do in any sort of stable manner. One could try to use eBPF to hook a bunch of the auditd-related functions in the kernel, but that probably isn’t very stable either, and any such code would essentially need to re-implement just the useful parts of auditd that extract inputs, process state, and system state, and not the cruft (slow filters, string formatting, and who knows what else) that results in auditd somehow having a syscall overhead upwards of 245%.4

eBPF Doesn’t Like to Share

Instead of trying to hook onto __audit_* symbols called only when auditd is enabled, we should probably try to find relevant functions or tracepoints in the same context to latch onto, such as trace_sched_process_exec in the case of execve(2).

static int exec_binprm(struct linux_binprm *bprm)
{
  ...
  audit_bprm(bprm);
  trace_sched_process_exec(current, old_pid, bprm);
  ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
  proc_exec_connector(current);
  return 0;
}

As it turns out, trace_sched_process_exec is even more necessary than one might initially think. While race conditions when hooking syscalls via kprobes and tracepoints are troublesome, it turns out that userspace can flat out block eBPF from reading syscall inputs if they reside in MAP_SHARED pages. It is worth noting that such tomfoolery is somewhat limited, as it only works against bpf_probe_read{,_user,_kernel}() calls made before a page is read by the kernel in a given syscall handler. As a result, a quick “fix” for tracers is to perform such reads when the syscall returns. However, such a “fix” would increase the feasibility of race condition abuse whenever the syscall implementation takes longer than the syscall context switch.

Given that this limitation doesn’t appear to be that well known, it could be a bug in the kernel, but it only presents an issue when one is already writing their eBPF tracers in the wrong manner. tracee is not generally vulnerable to MAP_SHARED abuse because it mostly dumps syscall arguments from a raw tracepoint hook on sys_exit. However, for syscalls that don’t normally return, such as execve(2), it resorts to dumping the arguments in its sys_enter raw tracepoint hook, enabling the syscall event to be fooled. Ultimately, this is also not an issue for tracee, as it implements a hook for the sched_process_exec tracepoint as of commit 6166346e7479bc3b4b417a67a92a2493a30b949e/tag v0.6.0.

// $ gcc -std=c11 -Wall -Wextra -pedantic -o clobber clobber.c -lpthread

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <pthread.h>

static char* checker;
static char* key;

static int mode = 1;

// force byte-by-byte copies
void yoloncpy(volatile char* dst, volatile char* src, size_t n, int r) {
  if (r == 0) {
    for (size_t i = 0; i < n; i++) {
      dst[i] = src[i];
    }
  } else {
    for (size_t i = n; i > 0; i--) {
      dst[i-1] = src[i-1];
    }
  }
}

void* thread1(void* arg) {
  int rev = (int)(uintptr_t)arg;
  uint64_t c = 0;
  while (1) {
    switch (c%2) {
      case 0: {
        yoloncpy(key, "supergood", 10, rev);
        break;
      }
      case 1: {
        yoloncpy(key, "reallybad", 10, rev);
        break;
      }
    }
    c += 1;
  }
  return NULL;
}

void* thread2(void* arg) {
  (void)arg;
  uint64_t c = 0;
  while(1) {
    switch (c%2) {
      case 0: {
        memcpy(key, "supergood", 10);
        break;
      }
      case 1: {
        memcpy(key, "reallybad", 10);
        break;
      }
    }
    c += 1;
  }
  return NULL;
}

int main(int argc, char** argv, char** envp) {

  if (argc < 2) {
    printf("usage: %s <count> [mode]\n", argv[0]);
    return 1;
  }

  int count = atoi(argv[1]);

  if (argc >= 3) {
    mode = atoi(argv[2]);
    if (mode != 1 && mode != 2) {
      printf("invalid mode: %s\n", argv[2]);
      return 1;
    }
  }

  key = mmap(NULL, 32, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (key == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  checker = mmap(NULL, 32, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (checker == MAP_FAILED) {
    perror("mmap2");
    return 1;
  }

  strcpy(key, "supergood");
  strcpy(checker, "./checker");

  char count_str[32] = "0";

  char* nargv[] = {checker, key, count_str, NULL};

  pthread_t t_a;
  pthread_t t_b;
  if (mode == 1) {
    pthread_create(&t_a, NULL, &thread1, (void*)0);
    pthread_create(&t_b, NULL, &thread1, (void*)1);
  } else if (mode == 2) {
    pthread_create(&t_a, NULL, &thread2, NULL);
  }

  int c = 0;
  while(c < count) {
    snprintf(count_str, sizeof(count_str), "%d", c);

    int r = fork();
    if (r == 0) {
      int fd = open(key, 0);
      if (fd >= 0) {
        close(fd);
      }
      execve(checker, nargv, envp);
    } else {
      sleep(1);
    }
    c += 1;
  }
  return 0;
}
# ./dist/tracee-ebpf --trace event=execve,sched_process_exec
UID    COMM             PID     TID     RET              EVENT                ARGS
0      bash             7662    7662    0                execve               pathname: ./clobber, argv: [./clobber 5]
0      clobber          7662    7662    0                sched_process_exec   cmdpath: ./clobber, pathname: /root/clobber, argv: [./clobber 5], dev: 264241153, inode: 5391, invoked_from_kernel: 0
0      clobber          7665    7665    0                execve               argv: []
0      checker          7665    7665    0                sched_process_exec   cmdpath: ./checker, pathname: /root/checker, argv: [./checker rupergood 0], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0      clobber          7666    7666    0                execve               argv: []
0      checker          7666    7666    0                sched_process_exec   cmdpath: ./checker, pathname: /root/checker, argv: [./checker reallybad 1], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0      clobber          7667    7667    0                execve               argv: []
0      checker          7667    7667    0                sched_process_exec   cmdpath: ./checker, pathname: /root/checker, argv: [./checker supergood 2], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0      clobber          7668    7668    0                execve               argv: []
0      checker          7668    7668    0                sched_process_exec   cmdpath: ./checker, pathname: /root/checker, argv: [./checker supergbad 3], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0      clobber          7669    7669    0                execve               argv: []
0      checker          7669    7669    0                sched_process_exec   cmdpath: ./checker, pathname: /root/checker, argv: [./checker reallgood 4], dev: 264241153, inode: 5393, invoked_from_kernel: 0

Note: Interestingly enough, I only stumbled across this behavior because it would have been less effective to use in-process threads to clobber inputs to execve(2), since execve(2) kills all threads other than the one issuing the syscall. The open() call above exists primarily to trigger an example for the below test code, showing how probes from sys_enter fail (with error -14, bad address) but succeed in the sys_exit hook.

SEC("raw_tracepoint/sys_enter")
int sys_enter_hook(struct bpf_raw_tracepoint_args *ctx) {
  struct pt_regs _regs;
  bpf_probe_read(&_regs, sizeof(_regs), (void*)ctx->args[0]);
  int id = _regs.orig_ax;
  char buf[128];
  if (id == 257) {
    char* const pathname = (char* const)_regs.si;
    bpf_printk("sys_enter -> openat %p\n", pathname);
    bpf_probe_read_str(buf, sizeof(buf), (void*)pathname);
    bpf_printk("sys_enter -> openat %s\n", buf);
  } else if (id == 59) {
    char* const f = (char* const)_regs.di;
    bpf_printk("sys_enter -> execve %p\n", f);
    bpf_probe_read_str(buf, sizeof(buf), (void*)f);
    bpf_printk("sys_enter -> execve %s\n", buf);
  }

  return 0;
}

SEC("raw_tracepoint/sys_exit")
int sys_exit_hook(struct bpf_raw_tracepoint_args *ctx) {
  struct pt_regs _regs;
  bpf_probe_read(&_regs, sizeof(_regs), (void*)ctx->args[0]);
  int id = _regs.orig_ax;
  char buf[128];
  if (id == 257) {
    char* const pathname = (char* const)_regs.si;
    bpf_printk("sys_exit -> openat %p\n", pathname);
    bpf_probe_read_str(buf, sizeof(buf), (void*)pathname);
    bpf_printk("sys_exit -> openat %s\n", buf);
  } else if (id == 59) {
    char* const f = (char* const)_regs.di;
    bpf_printk("sys_exit -> execve %p\n", f);
    bpf_probe_read_str(buf, sizeof(buf), (void*)f);
    bpf_printk("sys_exit -> execve %s\n", buf);
  }

  return 0;
}
# cat /sys/kernel/tracing/trace_pipe
...
<...>-215084  [000] .... 2266209.468617: 0: sys_enter -> openat 000000005ec00ae4
<...>-215084  [000] .N.. 2266209.468645: 0: sys_enter -> openat
<...>-215084  [000] .... 2266209.469091: 0: sys_exit -> openat 000000005ec00ae4
<...>-215084  [000] .N.. 2266209.469114: 0: sys_exit -> openat supelybad
<...>-215084  [000] .... 2266209.469199: 0: sys_enter -> execve 0000000031d15ade
<...>-215084  [000] .N.. 2266209.469222: 0: sys_enter -> execve
<...>-215084  [000] .... 2266209.470178: 0: sys_exit -> execve 0000000000000000
<...>-215084  [000] .N.. 2266209.470224: 0: sys_exit -> execve
<...>-215084  [000] .... 2266209.472093: 0: sys_enter -> openat 000000008edac6ac
<...>-215084  [000] .N.. 2266209.472138: 0: sys_enter -> openat /etc/ld.so.cache
<...>-215084  [000] .... 2266209.472205: 0: sys_exit -> openat 000000008edac6ac
<...>-215084  [000] .N.. 2266209.472248: 0: sys_exit -> openat /etc/ld.so.cache
<...>-215084  [000] .... 2266209.472345: 0: sys_enter -> openat 000000007671a9c9
<...>-215084  [000] .N.. 2266209.472366: 0: sys_enter -> openat /lib/x86_64-linux-gnu/libc.so.6
<...>-215084  [000] .... 2266209.472420: 0: sys_exit -> openat 000000007671a9c9
<...>-215084  [000] .N.. 2266209.472440: 0: sys_exit -> openat /lib/x86_64-linux-gnu/libc.so.6
...

Current Thoughts

If you want accurate tracing for syscall events, you probably shouldn’t be hooking the actual syscalls, and especially not the syscall tracepoints. Instead, your only real option is to figure out how to dump the arguments from the internals of a given syscall’s implementation. Depending on whether there are proper hook-points (e.g. tracepoints, LSM hooks, etc.), and whether they provide access to all arguments, it may be necessary to hook internal kernel functions with kprobes for absolute correctness, if that is at all possible in the first place. For what it’s worth, this is mostly a problem with Linux itself and not something that kprobe-ing kernel modules can fix; though kernel modules can properly handle kernel structs beyond basic complexity, unlike eBPF.

In the case of security event auditing, correctness supersedes ease of development, but vendors may not be making that choice, at least not initially. Due to this, auditors must be aware of how their analysis tools actually work and how (and from where) they source event information, so that they can treat the output with a sizable hunk of salt where necessary; while the tools are likely not lying, they may not be capable of telling the truth either.


  1. Olsen, Andy. “Fast and Easy pTracing with eBPF (and not ptrace)” NCC Group Open Forum, NCC Group, September 2019, New York, NY. Presentation.↩︎

  2. Dileo, Jeff; Olsen, Andy. “Kernel Tracing With eBPF: Unlocking God Mode on Linux” 35th Chaos Communication Congress (35C3), Chaos Computer Club (CCC), 30 December 2018, Leipziger Messe, Leipzig, Germany. Conference Presentation.↩︎

  3. https://media.ccc.de/v/35c3-9532-kernel_tracing_with_ebpf↩︎

  4. https://capsule8.com/blog/auditd-what-is-the-linux-auditing-system/↩︎