Sandbox Escape: Linux PI futex self-requeue bug

2014-05-26 05:00:49

comex

hackerone.com

$10000

117

CVSS2: 7.2 High

Access Vector: LOCAL
Access Complexity: LOW
Authentication: NONE
Confidentiality Impact: COMPLETE
Integrity Impact: COMPLETE
Availability Impact: COMPLETE

AV:L/AC:L/Au:N/C:C/I:C/A:C

EPSS: 0.001 Low (percentile 47.4%)

I hope I haven’t messed something up…

The issue arises when, after blocking in futex_wait_requeue_pi, q.rt_waiter is NULL but &rt_waiter (on the stack) has already been added to various waiter lists by rt_mutex_start_proxy_lock.

This is not supposed to be possible, because setting rt_waiter to NULL indicates atomic acquisition. This is done by requeue_pi_wake_futex, which is called by futex_requeue (FUTEX_CMP_REQUEUE_PI) in two cases where the lock could be acquired immediately on behalf of some waiter rather than blocking. Meanwhile, rt_mutex_start_proxy_lock is only called from the bottom of futex_requeue, and only enqueues rt_waiter if the lock could not be acquired immediately. Since any particular FUTEX_WAIT_REQUEUE_PI is only supposed to be requeued once, those two possibilities should be mutually exclusive.

The requeue-once rule is enforced by only allowing requeueing to the futex previously passed to futex_wait_requeue_pi as uaddr2, so it’s not possible to requeue from A to B, then from B to C - but it is possible to requeue from B to B.
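To make the gap in the requeue-once rule concrete, here is a toy model in Python. It is not kernel code: `Waiter` and `try_requeue` are hypothetical names made up for illustration, and the model assumes only what is described above, namely that a requeue is accepted whenever the target matches the uaddr2 the waiter originally passed.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    """Toy stand-in for a task blocked in futex_wait_requeue_pi."""
    queued_on: str       # futex the waiter is currently queued on
    requeue_target: str  # the uaddr2 it passed when it began waiting


def try_requeue(waiter: Waiter, source: str, target: str) -> bool:
    """Model the rule as described: the target must equal the waiter's uaddr2."""
    if waiter.queued_on != source:
        return False  # waiter is not on the source futex, nothing to move
    if waiter.requeue_target != target:
        return False  # requeue to any futex other than uaddr2 is refused
    waiter.queued_on = target  # note: nothing checks that source != target
    return True


w = Waiter(queued_on="A", requeue_target="B")
first = try_requeue(w, "A", "B")    # the intended, one-time requeue
to_c = try_requeue(w, "B", "C")     # refused: C is not uaddr2
to_self = try_requeue(w, "B", "B")  # accepted: B is uaddr2, so a second requeue happens
print(first, to_c, to_self)         # prints: True False True
```

The point of the sketch is that comparing only the target against uaddr2 says nothing about the source, so a waiter that already sits on B can be "requeued" to B again.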

When this happens, if (!q.rt_waiter) passes, so rt_mutex_finish_proxy_lock is never called. (Also, AFAIK, free_pi_state is never called, which is true even without this weird requeue; in the case where futex_requeue calls requeue_pi_wake_futex directly, pi_state will sit around until it gets cleaned up in exit_pi_state_list when the thread exits. This is not a vulnerability.) futex_wait_requeue_pi exits, and various pointers to rt_waiter become dangling.
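The sequence can also be sketched as a small state model. Again, this is a simplification in Python and not the kernel's logic: `Lock`, `WaitCall`, `requeue`, `return_from_wait`, and `run` are hypothetical names, and the `lock_is_free` flag is an assumption standing in for whether the lock can be acquired immediately at the time of each requeue. The model assumes the first requeue finds the lock held and the second (B to B) finds it free.

```python
class Lock:
    """Toy rt_mutex: only tracks which waiter objects are linked into it."""
    def __init__(self):
        self.waiters = []


class WaitCall:
    """Toy stand-in for one futex_wait_requeue_pi call."""
    def __init__(self):
        self.rt_waiter = object()          # stands in for the on-stack rt_waiter
        self.q_rt_waiter = self.rt_waiter  # stands in for q.rt_waiter


def requeue(lock, call, lock_is_free):
    """One requeue: either atomic acquisition or proxy enqueue, never both."""
    if lock_is_free:
        call.q_rt_waiter = None              # role of requeue_pi_wake_futex
    else:
        lock.waiters.append(call.rt_waiter)  # role of rt_mutex_start_proxy_lock


def return_from_wait(lock, call):
    """Wake-up path: cleanup runs only when q.rt_waiter is still set."""
    if call.q_rt_waiter is not None:
        lock.waiters.remove(call.rt_waiter)  # role of rt_mutex_finish_proxy_lock
    return call.rt_waiter in lock.waiters    # True: lock still references a dead frame


def run(steps):
    """Apply a sequence of requeues, then return whether a stale waiter remains."""
    lock, call = Lock(), WaitCall()
    for lock_is_free in steps:
        requeue(lock, call, lock_is_free)
    return return_from_wait(lock, call)


print(run([False]))        # one requeue, lock held: False
print(run([True]))         # one requeue, lock free: False
print(run([False, True]))  # two requeues (A to B, then B to B): True
```

With a single requeue, either branch leaves the lock's waiter list clean. Only the two-requeue case combines an enqueued waiter with a cleared q.rt_waiter, which is the state the report describes.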

I haven’t actually tested this in a sandbox, but from reading the code, I believe most or all of the syscalls used in the exploit are allowed by the Chromium renderer, GPU, NaCl, etc. sandboxes; in particular, futex, setpriority, and prctl are allowed without restrictions. (setpriority is explicitly allowed by an override in those policies; the others are in the baseline policy.) Also, the exploit should be able to defeat KASLR, although KASLR was not actually enabled in the kernel I was testing on (see comments).

I have attached an exploit for the Debian 3.14.4-1 Linux image on amd64, which manages to run some code in kernel mode and return. As discussed in the comments, it may be nontrivial to port to other kernel builds due to unpredictable compiler decisions, but I hope it demonstrates exploitability.