This is the DRAFT branch of this blog. Articles already published will not appear here.
Go to the main site
mw casual engineeringA five-panel comic featuring the Linux mascot Tux with a bandage placed above it, and a box with a question-mark.

Panel 1: The box says “Am I being traced??” and Linux says “No trust me.”
Panel 2: The box says “Ok now guess the number”.
Panel 3: Linux says “1” and the box says “lol no, now get lost”.
Panel 4: The background has 2 clocks in random rotation and shear, and also has some lines which says “ick_revert_proc: Restoring CoW'd page at 0x00000000005cxxxx”, and the box is wrapped in a squiggly blue circle, and is itself distorted.
Panel 5: The box thinks “(hmm what was that must have been the wind)” and Linux says “jk i mean 2”.

Using the Linux kernel to help me crack an executable quickly

Linux systems programming reverse engineering memory management deep dives

Truth be told, I've never been really good at CTFs, but I do enjoy solving challenges at my own pace, and exploring perhaps less conventional methods, learning more about programming in the process. A few weeks ago I found a reverse engineering problem which basically boiled down to running a Linux binary and entering the correct number to get it to print a flag. The program was heavily obfuscated, has anti-debugging techniques, and potentially utilized self-modifying code, but strace shows that, aside from those, it did not try to do anything fancy with system calls, attempt to save files, or communicate via network.

Fortunately, the interesting bits of the program ran quite fast – after reading the input, it spends around 5us (0.005ms) before printing out whether the guess was correct or not. The input space was also very managable – between 0 to 3,000,000. This means that even a brute-force search of the possible inputs could finish in a reasonable time, and there is no need to expend much effort on actual reverse engineering if we don't have to. The only tricky part is, how do we convince it to try different inputs as fast as this?

After considering some alternatives, my approach for solving this eventually ended up being a self-taught lesson in kernel hacking.

Table of Contents
  1. Environment
  2. Initial investigation
  3. What if we get the kernel to try numbers for us?
    1. Identify and mark the target process
    2. Some ground work for our new ‘feature’
    3. Patching read, inject our guess
    4. Patching write to call our revert function
    5. Save and restore registers
      1. What about the other ones?
    6. Save and restore memory
      1. What does fork do?
      2. Kgdb fun
      3. How does the kernel know what to do with write faults?
      4. Ok, so how do we write-protect the pages?
      5. Let's actually copy the pages, then restore them!
    7. Blocking off other syscalls
  4. Final thoughts
  5. Addendum: Hacking it with ptrace

    Environment

    Before we get into more investigation, let's quickly explain what my test environment for this looks like. Since I'm not a fan of running random binaries from the Internet on my host system, we will only ever run the target binary in a VM. In fact, since there will be a good amount of kernel hacking today, we need a VM for which which we can easily boot our hacked kernel in. Previously I made some shell scripts which comes in handy. It:

    Once you've checked that you have Docker and QEMU installed, you can get started with the same environment by cloning micromaomao/linux-dev, running make -j$(nproc) to compile the kernel, then .dev/startvm.sh to build the rootfs (if not already present) and start the VM (your user account might need to be in the kvm group). If you are on ARM, remove the -enable-kvm and -cpu host flag in startvm.sh to use emulation instead.

    A screenshot of a terminal running the startvm.sh script. The last line is a bash prompt.

    Initial investigation

    A straightforward, first approach would be to just run it repeatedly. However, with such a short runtime between getting the input and printing the result, any significant overhead in either the initialization of the executable itself, or the time spent in the script used to repeatedly run it quickly starts to dominate. A quick test would be to run it under strace and see how long it takes from the initial execve to the first read. That would be a good starting point to understand more about the program and catch any potentially surprising behavior, so let's do that first.

    Quick reminder again that you should not run or even interact with untrusted stuff directly on your host system. This include running them under strace, gdb, or even using ldd on them.

    $ strace -o strace.log --timestamps=ns ./hackme
    ...
    $ cat strace.log
    21:00:00.076244725 execve("./hackme", ["./hackme"], 0x7ffc71a76088 /* 13 vars */) = 0
    21:00:00.082059770 open("/proc/self/exe", O_RDONLY) = 3
    21:00:00.085364610 mmap(NULL, 663446, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7865402000
    // mapping itself as (non-persistent) read-write
    21:00:00.087883357 mmap(0x7f7865402000, 663048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x7f7865402000
    21:00:00.090062941 mprotect(0x7f78654a3000, 3990, PROT_READ|PROT_EXEC) = 0
    21:00:00.092506797 readlink("/proc/self/exe", "/hackme", 4095) = 7
    21:00:00.095318319 mmap(0x400000, 1855488, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400000
    21:00:00.097642653 mmap(0x400000, 1536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400000
    21:00:00.099838908 mprotect(0x400000, 1536, PROT_READ) = 0
    // rwx... I later realized this executable has gone through a packer
    21:00:00.102363996 mmap(0x401000, 1420157, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x1000) = 0x401000
    // ... more stuff follows which aren't too interesting for now ...
    21:00:00.129039501 mprotect(0x401000, 1420157, PROT_READ|PROT_EXEC) = 0
    21:00:00.132023341 mmap(0x55c000, 346488, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x15c000) = 0x55c000
    21:00:00.138600815 mprotect(0x55c000, 346488, PROT_READ) = 0
    21:00:00.141142324 mmap(0x5b1000, 49832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x1b1000) = 0x5b1000
    21:00:00.143710481 mprotect(0x5b1000, 49832, PROT_READ|PROT_WRITE) = 0
    21:00:00.145928557 mmap(0x5be000, 27880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x5be000
    21:00:00.148246519 mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7865401000
    21:00:00.150516772 close(3)             = 0
    21:00:00.153247606 munmap(0x7f7865402000, 663446) = 0
    21:00:00.155881587 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffce824f170) = -1 EINVAL (Invalid argument)
    21:00:00.158451970 brk(NULL)            = 0x5c5000
    21:00:00.160858466 brk(0x5c5d80)        = 0x5c5d80
    21:00:00.163373055 arch_prctl(ARCH_SET_FS, 0x5c5400) = 0
    21:00:00.165862687 set_tid_address(0x5c56d0) = 184
    21:00:00.168069772 set_robust_list(0x5c56e0, 24) = 0
    21:00:00.170348551 rseq(0x5c5d20, 0x20, 0, 0x53053053) = 0
    21:00:00.172586062 prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
    21:00:00.174898664 readlinkat(AT_FDCWD, "/proc/self/exe", "/hackme", 4096) = 7
    21:00:00.177229490 getrandom("\xb8\xaa\xbf\x47\x80\xa4\x0a\x0e", 8, GRND_NONBLOCK) = 8
    21:00:00.179682734 brk(NULL)            = 0x5c5d80
    21:00:00.181862768 brk(0x5e6d80)        = 0x5e6d80
    21:00:00.183994583 brk(0x5e7000)        = 0x5e7000
    21:00:00.186285765 mprotect(0x5b1000, 40960, PROT_READ) = 0
    21:00:00.188569383 futex(0x5bd9dc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    // Detecting debugger by looking for TracerPid?
    21:00:00.190868059 openat(AT_FDCWD, "/proc/self/status", O_RDONLY) = 3
    21:00:00.193172756 read(3, "Name:\thackme\nUmask:\t0022\nState:\t"..., 8191) = 1076
    // ... more strace output ...
    

    I suspect it is detecting debuggers and potentially changing its behavior. After writing this I later found out that actually it will pretend to run as normal even when it has detected a tracer, but actually will not print the flag even with the right input. For now, I found the code that puts that number in /proc/self/status by simply searching for “TracerPid”:

    diff --git a/fs/proc/array.c b/fs/proc/array.c
    index 34a47fb0c57f..141121505a37 100644
    --- a/fs/proc/array.c
    +++ b/fs/proc/array.c
    @@ -185,7 +185,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
    seq_put_decimal_ull(m, "\nNgid:\t", ngid);
    seq_put_decimal_ull(m, "\nPid:\t", pid_nr_ns(pid, ns));
    seq_put_decimal_ull(m, "\nPPid:\t", ppid);
    - seq_put_decimal_ull(m, "\nTracerPid:\t", tpid);
    + seq_put_decimal_ull(m, "\nTracerPid:\t", 0);
    seq_put_decimal_ull(m, "\nUid:\t", from_kuid_munged(user_ns, cred->uid));
    seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->euid));
    seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->suid));

    With the linux-dev environment I mentioned earlier, we can simply make the change, then run make -j16 (assuming your PC has 16 CPU threads) to produce a new kernel image, then kill the old VM (with Ctrl-D on the console or .dev/stopvm.sh if it's stuck) and run startvm.sh again.

    You can apply the change above by copying the entire patch, then do git apply, paste it in, and hit Ctrl-D.

    It then proceeded to do this, which I assume is a second layer of debugger detection:

    ptrace(PTRACE_TRACEME) = -1 EPERM (Operation not permitted)
    

    It is probably looking for whether this call returns a -EPERM, which would indicate that the process is (already) being traced. But since we're already making kernel changes, this is not difficult to work around either. We can find the relevant function by searching for PTRACE_TRACEME, which will quickly surface the function ptrace_traceme:

    diff --git a/kernel/ptrace.c b/kernel/ptrace.c
    index d5f89f9ef29f..017d25c00fa3 100644
    --- a/kernel/ptrace.c
    +++ b/kernel/ptrace.c
    @@ -486,7 +486,7 @@ static int ptrace_attach(struct task_struct *task, long request,
    */
    static int ptrace_traceme(void)
    {
    - int ret = -EPERM;
    + int ret = 0;
    write_lock_irq(&tasklist_lock);
    /* Are we already being traced? */

    Quick note: instead of making such changes to the kernel, there are alternative to strace which relies on seccomp-unotify instead of ptrace, which also gets around anti-debugging techniques that targets ptrace, and may be more convenient to use in other situations.

    Now let's try strace it again:

    $ strace -o strace.log --timestamps=ns ./hackme
    ...
    $ cat strace.log
    21:01:00.368065520 execve("./hackme", ["./hackme"], 0x7ffe48cd5cb8 /* 6 vars */) = 0
    ...
    21:01:00.486692889 openat(AT_FDCWD, "/proc/self/status", O_RDONLY) = 3
    21:01:00.489438989 read(3, "Name:\thackme\nUmask:\t0022\nState:\t"..., 8191) = 1066
    21:01:00.491976961 close(3)             = 0
    21:01:00.494425997 getrandom("\x75\x79\x93\x7f\x58\x99\xe3\x3e", 8, 0) = 8
    21:01:00.496887787 getrandom("\x21\x5f\xfa\x83\x6c\x93\xef\x97", 8, 0) = 8
    21:01:00.499471244 ptrace(PTRACE_TRACEME) = 0   // yay
    21:01:00.501847234 getrandom("\x32\x5b\x46\xf2\x18\x84\xee\x3e", 8, 0) = 8
    // ... more getrandom ...
    21:01:00.530190685 getrandom("\x25\x69\xff\xa5\x61\x13\x9d\x82", 8, 0) = 8
    // Detecting if stdin/stdout is a terminal before read/write...
    // Could be using this to decide if it will buffer IO, or actual anti-scripting measure?
    21:01:00.532301961 newfstatat(1, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0x5, 0x1), ...}, AT_EMPTY_PATH) = 0
    21:01:00.535515472 ioctl(1, TCGETS, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B115200|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
    21:01:00.539257429 write(1, startup output..., 48) = 48
    // ...
    21:01:00.544755153 newfstatat(0, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0x5, 0x1), ...}, AT_EMPTY_PATH) = 0
    21:01:00.547057135 ioctl(0, TCGETS, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B115200|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
    21:01:00.549331495 read(0, "1\n", 4096) = 2
    // Note slight pause here due to input from serial terminal.
    21:01:00.551503575 write(1, "Nope"..., 43) = 43
    21:01:00.554241515 exit_group(0)        = ?
    21:01:00.556739011 +++ exited with 0 +++
    

    If we use the kernel tracing tool trace-cmd (which interacts with ftrace and related APIs designed for both kernel debugging and performance tracing), we can find out how long the program spends computing the result after reading the input, not including time spent in the actual I/O, and also without any strace overhead (which was actually quite significant when you're looking at sub-ms level).

    To do this, we can trace the syscalls family of events. We're specifically interested in read and write, and each syscall has two events we can trace: sys_enter_... and sys_exit_.... We specify these events with -e, and -F -c to only trace events from children processes.

    trace-cmd record -e 'syscalls/sys_*_read'  -e 'syscalls/sys_*_write' -F -c ./hackme
    trace-cmd report
    
    cpus=2
              hackme-82    [000] .....     4.364735: sys_exit_write:       0x1
              hackme-82    [000] .....     4.431209: sys_enter_read:       fd: 0x00000003, buf: 0x005c6920, count: 0x00001fff
              hackme-82    [000] .....     4.431220: sys_exit_read:        0x42c
              hackme-82    [000] .....     4.431236: sys_enter_write:      fd: 0x00000001, buf: 0x005c6920, count: 0x00000030
              hackme-82    [000] .....     4.431243: sys_exit_write:       0x30
              hackme-82    [000] .....     4.431244: sys_enter_write:      fd: 0x00000001, buf: 0x005c6920, count: 0x00000026
              hackme-82    [000] .....     4.431244: sys_exit_write:       0x26
              hackme-82    [000] .....     4.431246: sys_enter_read:       fd: 0x00000000, buf: 0x005c8960, count: 0x00001000
              hackme-82    [001] .....     5.737281: sys_exit_read:        0x2
              hackme-82    [001] .....     5.737292: sys_enter_write:      fd: 0x00000001, buf: 0x005c6920, count: 0x0000002b
              hackme-82    [001] .....     5.738621: sys_exit_write:       0x2b
    

    We can ignore everything that is not reading from stdin (fd = 0) or writing to stdout (fd = 1) after that. Subtracting the two green numbers, that's 11us from returning from read to attempting write! On the other hand, getting to the first read from the start of the program took almost 200ms, and so if we have to run the whole thing repeatedly (for example, in a bash loop), it will be 200ms 3,000,000 7 days! Now, you can of course use more CPU cores and run multiple instances of this loop at once, but that's not a very interesting solution. In this case the challenge is also time-sensitive and so we need something better.

    Notice that the program doesn't really try to do anything fancy (at least not externally visible) between read and write – all it does is spin some numbers around in its own memory, then spits out ‘Correct’ plus the flag, or ‘Nope’ if the guess was wrong. This means that, in theory, if we can perfectly restore the program's memory and register states1, we can instantly start trying a different input, then another one in 10us (when I ran it fast enough often it finishes within 5), and so on until we find the right number. Without the overhead of even forking another process, this has the potential to be a lot faster – we're basically turning the whole input validation process into a for loop! It's just that the ‘for’ in this case is outside the program. Hehe

    What if we get the kernel to try numbers for us?

    Recently I have also been getting very interested in Linux kernel development (from debugging weird kernel panics at work 🥲), so I decided why not turn this into a kernel programming exercise, and try to get Linux to automatically try all the possible inputs for me, and restore the program state in between inputs, as suggested earlier?

    The core idea is:

    1. We can identify the process we're interested in from its name (“hackme”).
    2. When a target process attempts to read from stdin, we:
      a) Save off its register states;
      b) Write-protect all its memory pages;
      c) Inject the next input into the read buffer.
    3. When a write protection fault happens on our target process, we:
      a) check if it is one of the mapping we've previously write-protected;
      b) if so, we save off an ‘original’ copy of the page being written to, and allow the write to continue.
    4. When the target process attempts a write to stdout, we:
      a) Check if the output indicate a wrong guess (whether it contains the word “Nope!”);
      b) If the output contains Nope, restore all modified memory pages to their original content, restore register states, and re-run the read syscall, which will cause the next input to be injected;
      c) Otherwise, we just let the process continue. If it prints the flag out, we will see it in the console.

    Now, this might seem like massive overkill, and it probably is, but hear me out:

    I will now walk through each step one by one, as laid out above. If you as the reader don't have a lot of kernel experience (I don't claim to have), hopefully by going like this, this will not be too difficult to follow:

    Identify and mark the target process

    The first step (step 1) is to identity and mark the process we're interested in. In Linux, each thread is represented by a struct task_struct, which contains things like the PID, process name, memory mappings, and a thousand other things. We can of course add our own data to this struct – for example, we can have a bool to indicate whether a process4 is our hack target, and use it to decide if we should do special things in our modified read/write syscall handlers (we don't want to break unrelated processes like the shell, for example).

    You probably know that when you run an executable, the shell forks a subprocess then run exec with the command arguments. If we assume our target binary is always named “hackme”, we can check for this in the handler for exec, and set the bool we added previously to true.

    With some searching around and perhaps tracing function calls with trace-cmd (which juse uses ftrace), we find that there is a common function for execve and execveatdo_execveat_common, and so we can add our code there, and add the additional field to the task_struct:

    diff --git a/fs/exec.c b/fs/exec.c
    index 9c349a74f385..926c1b60a706 100644
    --- a/fs/exec.c
    +++ b/fs/exec.c
    @@ -1891,6 +1891,12 @@ static int do_execveat_common(int fd, struct filename *filename,
    if (IS_ERR(filename))
    return PTR_ERR(filename);
    + if (strstr(filename->name, "hackme") != NULL) {
    + pr_info("execveat: %s[%u] to be hacked\n", filename->name, current->pid);
    + current->hack_target.hack = true;
    + current->hack_target.next_number = 1;
    + }
    +
    /*
    * We move the actual failure in case of RLIMIT_NPROC excess from
    * set*uid() to execve() because too many poorly written programs
    diff --git a/include/linux/sched.h b/include/linux/sched.h
    index 02eaf84c8626..9f58f1db4632 100644
    --- a/include/linux/sched.h
    +++ b/include/linux/sched.h
    @@ -1599,6 +1599,16 @@ struct task_struct {
    struct user_event_mm *user_event_mm;
    #endif
    + struct {
    + /* This indicates whether a process is our target */
    + bool hack;
    + /*
    + * And if the above is true, this is the next number we will
    + * send to it
    + */
    + u32 next_number;
    + } hack_target;
    +
    /*
    * New fields for task_struct should be added above here, so that
    * they are included in the randomized portion of task_struct.
    diff --git a/init/init_task.c b/init/init_task.c
    index 136a8231355a..7aeb9882d171 100644
    --- a/init/init_task.c
    +++ b/init/init_task.c
    @@ -219,6 +219,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
    #ifdef CONFIG_SECCOMP_FILTER
    .seccomp = { .filter_count = ATOMIC_INIT(0) },
    #endif
    + .hack_target = { .hack = false, .next_number = 0 },
    };
    EXPORT_SYMBOL(init_task);

    Note that current is just a macro for a pointer to the current thread's task_struct, and pr_info prints to the syslog which in our case appears on the console.

    We also added the correct initialization of .hack_target for unrelated threads to init_task, which is what every other processes are forked from. When a process forks or clones, the entire task struct is first memcpy'd across (see arch_dup_task_struct (arch/x86/kernel/process.c)), and so there's no other place we need to initialize this (except for in execve when we detected a hack target).

    Let's test this out:

    > make -j$(nproc)
    ...
    > ./.dev/startvm.sh
    ...
    [    0.000000][    T0] Linux version 6.12.1-dev-00019-gc33353343963-dirty ...
    ...
    root@feef72fcd655:/# ./hackme
    [    5.050260][   T79] execveat: ./hackme[79] to be hacked
    (program prints startup message normally)
    root@feef72fcd655:/# cp hackme dont-hack-me
    root@feef72fcd655:/# ./dont-hack-me
    (program prints startup message normally)
    root@feef72fcd655:/#
    

    Nice :)

    Some ground work for our new ‘feature’

    Our next step (step 2) is to figure out a way to save off the state of the target process when it calls read – in some sense, ‘checkpoint’ it, and then inject our number guess. I'm going to give my special kernel feature that does this a slick name: ick, which stands for Instant ChecKpoint. We will create some utility functions which we can call in our patched read/write to checkpoint and restore the process, as well as a way to clean up the saved state should the process exits unexpectedly. Let's start by creating our header and C file for this feature, adding a basic struct ick_checked_process* pointer in task_struct for us to hold various data (like the saved off memory pages) later, and declare our functions.

    While this is not really necessary, let's also add a proper config option for our silly little feature – it's not difficult to do, and follows the rest of Linux. We will then gate all ick-related code within #ifdef CONFIG_ICK blocks.

    diff --git a/.config b/.config
    index c0d1a7505177..651bd63becee 100644
    --- a/.config
    +++ b/.config
    @@ -3454,3 +3454,5 @@ CONFIG_ARCH_USE_MEMTEST=y
    #
    # end of Rust hacking
    # end of Kernel hacking
    +
    +CONFIG_ICK=y
    diff --git a/Kconfig b/Kconfig
    index 745bc773f567..9b13ab8e4f29 100644
    --- a/Kconfig
    +++ b/Kconfig
    @@ -30,3 +30,9 @@ source "lib/Kconfig"
    source "lib/Kconfig.debug"
    source "Documentation/Kconfig"
    +
    +config ICK
    + bool "ick"
    + default y
    + help
    + instant checkpoint
    diff --git a/include/linux/ick.h b/include/linux/ick.h
    new file mode 100644
    index 000000000000..640d46e3a223
    --- /dev/null
    +++ b/include/linux/ick.h
    @@ -0,0 +1,32 @@
    +/* SPDX-License-Identifier: GPL-2.0 */
    +
    +#ifndef _LINUX_ICK_H
    +#define _LINUX_ICK_H
    +
    +#include <linux/types.h>
    +#include <linux/sched.h>
    +
    +#ifdef CONFIG_ICK
    +struct ick_checked_process {
    + // ???
    +};
    +
    +/**
    + * Initialize the ick data structures on the current task and checkpoint it.
    + */
    +int ick_checkpoint_proc(void);
    +
    +/**
    + * Revert the current task to the state checkpointed by ick_checkpoint_proc.
    + */
    +int ick_revert_proc(void);
    +
    +/**
    + * Clean up any saved ick checkpoints from the current task, if there is any.
    + * Call before a task exits.
    + */
    +void ick_cleanup(struct task_struct *task);
    +
    +#endif
    +
    +#endif
    diff --git a/include/linux/sched.h b/include/linux/sched.h
    index 9f58f1db4632..b20a4ced71f8 100644
    --- a/include/linux/sched.h
    +++ b/include/linux/sched.h
    @@ -1599,6 +1599,10 @@ struct task_struct {
    struct user_event_mm *user_event_mm;
    #endif
    +#ifdef CONFIG_ICK
    + struct ick_checked_process *ick_data;
    +#endif
    +
    struct {
    /* This indicates whether a process is our target */
    bool hack;
    diff --git a/init/init_task.c b/init/init_task.c
    index 7aeb9882d171..aefe6778df78 100644
    --- a/init/init_task.c
    +++ b/init/init_task.c
    @@ -218,6 +218,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
    #endif
    #ifdef CONFIG_SECCOMP_FILTER
    .seccomp = { .filter_count = ATOMIC_INIT(0) },
    +#endif
    +#ifdef CONFIG_ICK
    + .ick_data = NULL,
    #endif
    .hack_target = { .hack = false, .next_number = 0 },
    };
    diff --git a/kernel/Makefile b/kernel/Makefile
    index 87866b037fbe..d129c357e521 100644
    --- a/kernel/Makefile
    +++ b/kernel/Makefile
    @@ -159,3 +159,5 @@ $(obj)/kheaders_data.tar.xz: FORCE
    $(call cmd,genikh)
    clean-files := kheaders_data.tar.xz kheaders.md5
    +
    +obj-$(CONFIG_ICK) += ick.o
    diff --git a/kernel/ick.c b/kernel/ick.c
    new file mode 100644
    index 000000000000..5d0146cbae2a
    --- /dev/null
    +++ b/kernel/ick.c
    @@ -0,0 +1,19 @@
    +/* SPDX-License-Identifier: GPL-2.0 */
    +
    +#include <linux/ick.h>
    +#include <linux/errno.h>
    +
    +int ick_checkpoint_proc(void) {
    + // TODO: implement
    + return -EPERM;
    +}
    +
    +int ick_revert_proc(void) {
    + // TODO: implement
    + return -EPERM;
    +}
    +
    +void ick_cleanup(struct task_struct *task) {
    + // TODO: implement
    + return;
    +}

    In case you didn't know, the proper way to have a function without parameters in C is to use void in the parameter list, like void myfunc(void).

    Patching read, inject our guess

    Next, we can finish off the relatively easier part of step 2 – handling read, calling the checkpoint function, and injecting our next guess. Again, we can use ftrace to figure out the best function to change. There are more ‘complicated’ variant of the read syscall that takes a struct iovec, but that's not what our target binary uses, so we won't worry about that.

    diff --git a/fs/read_write.c b/fs/read_write.c
    index befec0b5c537..43f403ef9476 100644
    --- a/fs/read_write.c
    +++ b/fs/read_write.c
    @@ -25,6 +25,8 @@
    #include <linux/uaccess.h>
    #include <asm/unistd.h>
    +#include <linux/ick.h>
    +
    const struct file_operations generic_ro_fops = {
    .llseek = generic_file_llseek,
    .read_iter = generic_file_read_iter,
    @@ -700,9 +702,51 @@ static inline loff_t *file_ppos(struct file *file)
    ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
    {
    - struct fd f = fdget_pos(fd);
    + struct fd f;
    ssize_t ret = -EBADF;
    + if (fd == 0 && current->hack_target.hack) {
    + /*
    + * Our hacked process is now trying to read stdin...
    + * If we have not already checkpointed it, do it now.
    + * (if we _have_ already checkpointed, then maybe we've reverted to this
    + * point, so just inject next number)
    + */
    + if (!current->ick_data) {
    + trace_printk("ick checkpoint on hacked process %s[%u]\n",
    + current->comm, current->pid);
    + ret = ick_checkpoint_proc();
    + if (ret) {
    + // %pe with ERR_PTR gets us nice error output like -EINVAL
    + pr_err("sys_read: ick checkpoint failed: %pe\n", ERR_PTR(ret));
    + return ret;
    + }
    + }
    +
    + // And now, inject the next guess.
    + u32 number = current->hack_target.next_number++;
    + trace_printk("Providing number %u to hacked process %s[%u]\n",
    + number, current->comm, current->pid);
    + char data[64];
    + int len = snprintf(data, sizeof(data), "%u\n", number);
    + if (len < 0) {
    + pr_err("sys_read: snprintf failed: %pe\n", ERR_PTR(len));
    + return len;
    + }
    + if (count < len) {
    + pr_err("sys_read: user provided buffer too small: %zu < %d\n", count, len);
    + return -EINVAL;
    + }
    + ret = copy_to_user(buf, data, len);
    + if (ret) {
    + pr_err("sys_read: copy_to_user failed: %pe\n", ERR_PTR(ret));
    + return ret;
    + }
    + return len;
    + }
    +
    + f = fdget_pos(fd);
    +
    if (fd_file(f)) {
    loff_t pos, *ppos = file_ppos(fd_file(f));
    if (ppos) {
    diff --git a/kernel/ick.c b/kernel/ick.c
    index 5d0146cbae2a..16a1b41418cb 100644
    --- a/kernel/ick.c
    +++ b/kernel/ick.c
    @@ -5,7 +5,8 @@
    int ick_checkpoint_proc(void) {
    // TODO: implement
    - return -EPERM;
    + // We just return 0 for now to make the code in `read` work.
    + return 0;
    }
    int ick_revert_proc(void) {

    While we can use strace to see whether we're sending back the right thing to the user-space, for ease of debugging we can write our own test binary which prints out the number received, and also, for now, loops back itself, so that we can see if the guess increments correctly: my-hackme-looped.cpp

    make; ./dev/startvm.sh
    root@feef72fcd655:/# ./my-hackme-looped | head
    [   34.542186][   T83] execveat: ./my-hackme-looped[83] to be hacked
    Enter number: Nope! 1 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 2 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 3 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 4 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 5 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 6 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 7 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 8 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 9 was a wrong guess. The correct number is 574165.
    Enter number: Nope! 10 was a wrong guess. The correct number is 574165.
    root@feef72fcd655:/# ./my-hackme-looped | tail
    [   36.012463][   T85] execveat: ./my-hackme-looped[85] to be hacked
    Enter number: Nope! 494491 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494492 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494493 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494494 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494495 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494496 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494497 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494498 was a wrong guess. The correct number is 494500.
    Enter number: Nope! 494499 was a wrong guess. The correct number is 494500.
    Enter number: Correct!
    root@feef72fcd655:/#
    

    Great success! Note that at this point we haven't even touched write yet – this is only looping because our test program has a deliberate loop inside, but we will eventually not need that once we implement the checkpointing.

    Also note that for very frequent output, I have used trace_printk which doesn't print to the console, but can be seen in the kernel trace. As opposed to pr_info/printk, this is just writing to a ringbuffer in memory, and so is also much faster and more appropriate to use in tight loops.

    root@feef72fcd655:/# tail /sys/kernel/tracing/trace
     my-hackme-loope-90      [001] .....   447.571690: ksys_read: ick checkpoint on hacked process my-hackme-loope[90]
     my-hackme-loope-90      [001] .....   447.571690: ksys_read: Providing number 27 to hacked process my-hackme-loope[90]
     my-hackme-loope-90      [001] .....   447.571691: ksys_read: ick checkpoint on hacked process my-hackme-loope[90]
     my-hackme-loope-90      [001] .....   447.571691: ksys_read: Providing number 28 to hacked process my-hackme-loope[90]
     ...                     ...           ...         ...
    

    Patching write to call our revert function

    Let's also do the same for write – we will get the print output from user-space, and depending on whether we see a “Nope” we will either restore checkpoint, or let the process continue (handling the write as normal).

    diff --git a/fs/read_write.c b/fs/read_write.c
    index 43f403ef9476..f5a10ff795aa 100644
    --- a/fs/read_write.c
    +++ b/fs/read_write.c
    @@ -768,9 +768,36 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
    {
    - struct fd f = fdget_pos(fd);
    + struct fd f;
    ssize_t ret = -EBADF;
    + if (fd == 1 && current->hack_target.hack) {
    + char data[64];
    + int ret = copy_from_user(data, buf, min_t(size_t, count, sizeof(data)));
    + if (ret) {
    + pr_err("sys_write: copy_from_user failed: %pe\n", ERR_PTR(ret));
    + return ret;
    + }
    + data[sizeof(data) - 1] = '\0';
    + trace_printk("hacked process attempted write with data %s\n", data);
    + if (strnstr(data, "Nope!", sizeof(data))) {
    + ret = ick_revert_proc();
    + if (ret) {
    + pr_err("sys_write: ick_revert_proc failed: %pe\n", ERR_PTR(ret));
    + return ret;
    + }
    +
    + // Hmm... what do we do now? Let's just return for now.
    + return count;
    + } else {
    + pr_info("hack: %u gave different output\n", current->hack_target.next_number - 1);
    + // No indication that the number was incorrect - let's just take this
    + // `write` as normal (so e.g. we will see the flag on the console)
    + }
    + }
    +
    + f = fdget_pos(fd);
    +
    if (fd_file(f)) {
    loff_t pos, *ppos = file_ppos(fd_file(f));
    if (ppos) {
    diff --git a/kernel/ick.c b/kernel/ick.c
    index 16a1b41418cb..ceae3ca4be33 100644
    --- a/kernel/ick.c
    +++ b/kernel/ick.c
    @@ -11,7 +11,8 @@ int ick_checkpoint_proc(void) {
    int ick_revert_proc(void) {
    // TODO: implement
    - return -EPERM;
    + // We just return 0 for now to make the code in `write` work.
    + return 0;
    }
    void ick_cleanup(struct task_struct *task) {

    Since we haven't implemented checkpoint restore yet, the overall behaviour currently will not change. However, since we're effectively ‘consuming’ any output with “Nope!” (by not returning from ksys_write instead of letting it continue), we should not see them in the console, but we should still see our trace_printk printing them out:

    root@feef72fcd655:/# ./my-hackme-looped
    [    7.461863][   T84] execveat: ./my-hackme-looped[84] to be hacked
    [    7.487957][   T84] hack: 0 gave different output
    Enter number: [    7.492040][   T84] hack: 1 gave different output
    Enter number: [    7.493428][   T84] hack: 2 gave different output
    Correct!
    root@feef72fcd655:/# cat /sys/kernel/tracing/trace
    # tracer: nop
     ...                     ...             ...       ...
     my-hackme-loope-84      [000] d....     7.461871: console: execveat: ./my-hackme-looped[84] to be hacked
     my-hackme-loope-84      [000] .....     7.487955: ksys_write: hacked process attempted write with data Enter number:
     my-hackme-loope-84      [000] d....     7.487961: console: hack: 0 gave different output
     my-hackme-loope-84      [000] .....     7.490135: ksys_read: ick checkpoint on hacked process my-hackme-loope[84]
     my-hackme-loope-84      [000] .....     7.490136: ksys_read: Providing number 1 to hacked process my-hackme-loope[84]
     my-hackme-loope-84      [000] .....     7.492037: ksys_write: hacked process attempted write with data Nope! 1 was a wrong guess. The correct number is 2.
    
     my-hackme-loope-84      [000] .....     7.492039: ksys_write: hacked process attempted write with data Enter number:
     my-hackme-loope-84      [000] d....     7.492042: console: hack: 1 gave different output
     my-hackme-loope-84      [000] .....     7.493423: ksys_read: ick checkpoint on hacked process my-hackme-loope[84]
     my-hackme-loope-84      [000] .....     7.493424: ksys_read: Providing number 2 to hacked process my-hackme-loope[84]
     my-hackme-loope-84      [000] .....     7.493427: ksys_write: hacked process attempted write with data Correct!
    
     my-hackme-loope-84      [000] d....     7.493430: console: hack: 2 gave different output
    

    You do not need trace-cmd nor even enabling tracing in tracefs to see trace_printk outputs.

    Note that the “_ gave different output” prints are because the program was writing “Enter number: ” every loop iteration, which doesn't contain “Nope!”. Once our checkpoint restore is working, the program should end up back to when it first issues the read, which means that any prompt to enter number would no longer be printed again.

    Save and restore registers

    If you had a look at struct task_struct and concluded that there is nothing in there that looks like it stores the thread's registers, you're correct. So, what are we supposed to save/restore?

    In Linux, each thread has a kernel stack which is used when it is executing kernel code (for example, when it is handling a syscall). This is separate from the user-space stack, and in fact lives in kernel memory not accessible from user-space. When a thread is running user-space code, it is empty. Whenever the thread enters the kernel, for example due to a syscall (or other reasons like interrupts), kernel entry code (arch/x86/entry/entry_64.S) pushes all the (general-purpose) registers from user-space onto this stack before calling into any C functions, which will start messing up the registers (especially the caller-saved ones, since the compiler isn't required to restore them at the end of a function):

    SYM_CODE_START(entry_SYSCALL_64)
        ...
        PUSH_AND_CLEAR_REGS rax=$-ENOSYS
        ...
        call    do_syscall_64        /* returns with IRQs disabled */
    
    .macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
        .if \save_ret
        pushq   %rsi            /* pt_regs->si */
        movq    8(%rsp), %rsi    /* temporarily store the return address in %rsi */
        movq    %rdi, 8(%rsp)    /* pt_regs->di (overwriting original return address) */
        .else
        pushq   %rdi        /* pt_regs->di */
        pushq   %rsi        /* pt_regs->si */
        .endif
        pushq   \rdx        /* pt_regs->dx */
        pushq   \rcx        /* pt_regs->cx */
        pushq   \rax        /* pt_regs->ax */
        pushq   %r8         /* pt_regs->r8 */
        pushq   %r9         /* pt_regs->r9 */
        pushq   %r10        /* pt_regs->r10 */
        pushq   %r11        /* pt_regs->r11 */
        pushq   %rbx        /* pt_regs->rbx */
        pushq   %rbp        /* pt_regs->rbp */
        pushq   %r12        /* pt_regs->r12 */
        pushq   %r13        /* pt_regs->r13 */
        pushq   %r14        /* pt_regs->r14 */
        pushq   %r15        /* pt_regs->r15 */
    
        .if \unwind_hint
        UNWIND_HINT_REGS
        .endif
    
        .if \save_ret
        pushq    %rsi        /* return address on top of stack */
        .endif
    .endm
    
    .macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
        PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
        CLEAR_REGS clear_bp=\clear_bp
    .endm
    

    The result of the register pushing code is that the top (i.e. ‘outer-most’ frame) of the task stack effectively contains a struct pt_regs5 after this. On kernel exit, this is popped back into the actual registers before returning to user-space. Therefore, it's reasonable to think that if we save and restore this struct pt_regs, we can effectively save and restore the register states of the calling thread.

    Assuming our hypothesis is correct, we can start implementing the register saving part of ick. Some searching around the kernel code would reveal that we can use the current_pt_regs macro to get a pointer to this struct for our current task. See for example arch/x86/kernel/process.c:231 (in copy_thread). This function is called when a thread calls fork, and it copies the pt_regs from the current task (the parent that called fork) to the new child task.

    diff --git a/include/linux/ick.h b/include/linux/ick.h
    index 640d46e3a223..1fa2a0f4a3b7 100644
    --- a/include/linux/ick.h
    +++ b/include/linux/ick.h
    @@ -8,7 +8,8 @@
    #ifdef CONFIG_ICK
    struct ick_checked_process {
    - // ???
    + struct pt_regs saved_regs;
    + // ... more to come ...
    };
    /**
    diff --git a/kernel/ick.c b/kernel/ick.c
    index ceae3ca4be33..c242185f519f 100644
    --- a/kernel/ick.c
    +++ b/kernel/ick.c
    @@ -2,20 +2,79 @@
    #include <linux/ick.h>
    #include <linux/errno.h>
    +#include <linux/ptrace.h>
    +#include <linux/sched/task_stack.h>
    int ick_checkpoint_proc(void) {
    - // TODO: implement
    - // We just return 0 for now to make the code in `read` work.
    + struct ick_checked_process *ick_data;
    + struct pt_regs *regs;
    +
    + if (current->ick_data) {
    + pr_alert("ick: %s[%d] already has a checkpoint\n",
    + current->comm, current->pid);
    + return -EEXIST;
    + }
    +
    + ick_data = kzalloc(sizeof(*ick_data), GFP_KERNEL);
    + if (!ick_data) {
    + return -ENOMEM;
    + }
    +
    + // Save registers
    +#if defined(__x86_64__)
    + regs = current_pt_regs();
    + memcpy(&ick_data->saved_regs, regs, sizeof(struct pt_regs));
    +#else
    +#error "Unsupported architecture"
    +#endif
    +
    + // TODO: implement rest
    +
    + current->ick_data = ick_data;
    + trace_printk("ick: Checkpointed %s[%d]\n", current->comm, current->pid);
    +
    return 0;
    }
    int ick_revert_proc(void) {
    - // TODO: implement
    - // We just return 0 for now to make the code in `write` work.
    + struct ick_checked_process *ick_data;
    + struct pt_regs *regs;
    +
    + ick_data = current->ick_data;
    + if (!ick_data) {
    + pr_alert("ick: ick_revert_proc called on %s[%d] which is not under ick checkpoint\n",
    + current->comm, current->pid);
    + return -EINVAL;
    + }
    +
    + // Restore registers
    +#if defined(__x86_64__)
    + regs = current_pt_regs();
    + memcpy(regs, &ick_data->saved_regs, sizeof(struct pt_regs));
    +#else
    +#error "Unsupported architecture"
    +#endif
    +
    + // TODO: implement rest
    +
    + trace_printk("Restored process %s[%d]\n",
    + current->comm, current->pid);
    +
    return 0;
    }
    void ick_cleanup(struct task_struct *task) {
    - // TODO: implement
    + struct ick_checked_process *ick_data = task->ick_data;
    +
    + if (!ick_data) {
    + return;
    + }
    +
    + trace_printk("Cleaning up ick data for %s[%d]\n", task->comm, task->pid);
    +
    + // TODO: add more clean-up code here
    +
    + kfree(ick_data);
    + task->ick_data = NULL;
    return;
    }

    Let's try it out:

    root@c8c5a2904008:/# ./my-hackme-looped
    [    4.294402][   T79] execveat: ./my-hackme-looped[79] to be hacked
    [    4.323670][   T79] hack: 0 gave different output
    Enter number: [    4.327423][   T79] my-hackme-loope[79]: segfault at 3a ip 000000000000003a sp 00007ffd0a4a49f0 error 14 likely on CPU 1 (core 1, socket 0)
    [    4.329196][   T79] Code: Unable to access opcode bytes at 0x10.
    

    Hmm… The program crashed because it tried to jump to some nonsensical instruction address (0x0000003a). Now, remember that we haven't implemented memory restore yet, and ‘memory’ includes the (user-space) stack as well. Consider what happens when you call a C function like read: the return address gets pushed to the stack, the function executes, and when it finishes the ret instruction pops that address from the stack and jumps to the caller. The caller may then call other functions, which will push and pop more data and return addresses from the stack, overwriting the previous one. Therefore, until we implements checkpoint/restore of memory, we will never be able to return to the original call side anyway, and it is perfectly possible that if the call stack for the read and write functions are at different depths, restoring the registers would result in the stack pointer pointing at garbage, rather than a valid return address.

    We can confirm what's going on with gdb. First we tell it to stop on the first read, so that we can inspect the registers:

    root@c8c5a2904008:/# gdb my-hackme-looped
    GNU gdb (Debian 13.1-3) 13.1
    ...
    Reading symbols from my-hackme-looped...
    (gdb) catch syscall read
    Catchpoint 1 (syscall 'read' [0])
    (gdb) condition 1 $rdi == 0
    (gdb) r
    Starting program: /my-hackme-looped
    [   32.349018][   T94] execveat: /my-hackme-looped[94] to be hacked
    [   32.393885][   T94] hack: 0 gave different output
    Enter number:
    Catchpoint 1 (call to syscall read), 0x000000000053eed1 in read ()
    (gdb)
    

    The instruction condition 1 $rdi == 0 makes the catchpoint only trigger on read from stdin ($rdi holds the first argument, and that is the file descriptor in the case of read/write). The catchpoint will trigger before the syscall is actually executed by the kernel (although because it uses ptrace to trap on syscalls, $rip will point to the instruction after the syscall, looking as-if the syscall has already executed).

    (gdb) disas $rip
    Dump of assembler code for function read:
       0x000000000053eec0 <+0>:	endbr64
       0x000000000053eec4 <+4>:	cmpb   $0x0,0xa668d(%rip)        # 0x5e5558 <__libc_single_threaded>
       0x000000000053eecb <+11>:	je     0x53eee0 <read+32>
       0x000000000053eecd <+13>:	xor    %eax,%eax
       0x000000000053eecf <+15>:	syscall
    => 0x000000000053eed1 <+17>:	cmp    $0xfffffffffffff000,%rax
       0x000000000053eed7 <+23>:	ja     0x53ef28 <read+104>
       0x000000000053eed9 <+25>:	ret
       0x000000000053eeda <+26>:	nopw   0x0(%rax,%rax,1)
       0x000000000053eee0 <+32>:	push   %rbp
    

    Let's note down the registers:

    (gdb) info reg
    rax            0xffffffffffffffda  -38
    rbx            0x5e4a00            6179328
    rcx            0x53eed1            5500625
    rdx            0x1000              4096
    rsi            0x6038a0            6305952
    rdi            0x0                 0
    rbp            0x7fffffffd800      0x7fffffffd800
    rsp            0x7fffffffd7c8      0x7fffffffd7c8
    r8             0x0                 0
    r9             0x4                 4
    r10            0x405b0f            4217615
    r11            0x246               582
    r12            0x5e1e10            6168080
    r13            0x5e1cc0            6167744
    r14            0x5e4820            6178848
    r15            0x5e1e10            6168080
    rip            0x53eed1            0x53eed1 <read+17>
    eflags         0x246               [ PF ZF IF ]
    cs             0x33                51
    ss             0x2b                43
    ds             0x0                 0
    es             0x0                 0
    fs             0x0                 0
    gs             0x0                 0
    (gdb)
    

    Ok, let's allow the program to continue to write:

    (gdb) catch syscall write
    Catchpoint 2 (syscall 'write' [1])
    (gdb) c
    Continuing.
    
    Catchpoint 1 (returned from syscall read), 0x000000000053eed1 in read ()
    (gdb) c
    Continuing.
    
    Catchpoint 2 (call to syscall write), 0x000000000053ef74 in write ()
    (gdb)
    

    Let's see the registers again, before we enter the syscall:

    (gdb) info reg
    rax            0xffffffffffffffda  -38
    rbx            0x3a                58
    rcx            0x53ef74            5500788
    rdx            0x3a                58
    rsi            0x602890            6301840
    rdi            0x1                 1
    rbp            0x7fffffffd7b0      0x7fffffffd7b0
    rsp            0x7fffffffd788      0x7fffffffd788
    r8             0xcccccccccccccccd  -3689348814741910323
    r9             0x7fffffffd7a0      140737488344992
    r10            0x1                 1
    r11            0x202               514
    r12            0x3a                58
    r13            0x602890            6301840
    r14            0x5e4820            6178848
    r15            0x5e1cc0            6167744
    rip            0x53ef74            0x53ef74 <write+20>
    eflags         0x202               [ IF ]
    cs             0x33                51
    ss             0x2b                43
    ds             0x0                 0
    es             0x0                 0
    fs             0x0                 0
    gs             0x0                 0
    (gdb)
    

    Looks quite different from before! Now we let it continue, then we shall see the registers reverting to the state before the read:

    (gdb) c
    Continuing.
    
    Catchpoint 1 (returned from syscall read), 0x000000000053eed1 in read ()
    (gdb) info reg
    rax            0x3a                58
    rbx            0x5e4a00            6179328
    rcx            0x53eed1            5500625
    rdx            0x1000              4096
    rsi            0x6038a0            6305952
    rdi            0x0                 0
    rbp            0x7fffffffd800      0x7fffffffd800
    rsp            0x7fffffffd7c8      0x7fffffffd7c8
    r8             0x0                 0
    r9             0x4                 4
    r10            0x405b0f            4217615
    r11            0x246               582
    r12            0x5e1e10            6168080
    r13            0x5e1cc0            6167744
    r14            0x5e4820            6178848
    r15            0x5e1e10            6168080
    rip            0x53eed1            0x53eed1 <read+17>
    eflags         0x246               [ PF ZF IF ]
    cs             0x33                51
    ss             0x2b                43
    ds             0x0                 0
    es             0x0                 0
    fs             0x0                 0
    gs             0x0                 0
    (gdb)
    

    Seems pretty successful - we've even confused gdb (it says “returned from syscall read”). Note that rax is the syscall return value, and therefore it is reasonable for it to be different before and after the syscall. Careful readers might have noticed something unusual here, but let's save that for later :)

    We know that if we let it continue it will just crash, but let's see what happens anyway, just for fun. ni basically lets it run one more instruction, then stop it again:

    (gdb) disas $rip
    Dump of assembler code for function read:
       0x000000000053eec0 <+0>:	endbr64
       0x000000000053eec4 <+4>:	cmpb   $0x0,0xa668d(%rip)        # 0x5e5558 <__libc_single_threaded>
       0x000000000053eecb <+11>:	je     0x53eee0 <read+32>
       0x000000000053eecd <+13>:	xor    %eax,%eax
       0x000000000053eecf <+15>:	syscall
    => 0x000000000053eed1 <+17>:	cmp    $0xfffffffffffff000,%rax
       0x000000000053eed7 <+23>:	ja     0x53ef28 <read+104>
       0x000000000053eed9 <+25>:	ret
       ...
    
    (gdb) ni
    0x000000000053eed7 in read ()
    (gdb)
    0x000000000053eed9 in read ()
    (gdb)
    0x000000000000003a in ?? ()
    (gdb)
    

    Ok, that didn't take very long to break.

    What about the other ones?

    Now, some of you might have noticed the lack of the more ‘unusual’ registers in the PUSH_REGS code above (or in struct pt_regs), and if you're questioning whether just saving the general-purpose registers in pt_regs is enough, you're right. We have not saved any of xmm0-xmm31 (or their y/z extensions) which are used for floating point operations and various generations of SIMD, or the FS and GS segment registers which can be set by user-space (most notably, libc uses them to point to start of thread-local storage, which obviously shouldn't change in the lifetime of a thread, but really they can be used in whatever ways by an obfuscated program).

    However, in a sensible program linking to a ‘normal’ libc, neither of these ought to matter. This is because, as noted earlier, the segment registers shouldn't normally change inside a thread, and the SIMD registers are caller-saved6, which means that the compiler isn't supposed to rely on them suriving across a call to libc's read() or write() wrappers (although technically syscall is supposed to perserve those).

    For performance reasons the kernel does not save these registers to memory on every kernel entry, but only when it is about to context-switch to another task.7 After all, the ?mm registers, on a new-enough CPU with AVX-512, totals to 512 bits 32 = 2 KiB (not counting the vector mask registers, etc). This makes our life harder if we want to save/restore those (which is technically the correct thing to do for a fully working save/restore), but since they are unlikely to break a ‘reasonable’ program, we will leave them untouched for now.

    Do you want to learn more about how Linux manages processes, does context switches, and similar low-level topics, or just learn more about the kernel in general? Check out these highly informative lecture notes: Operating Systems 2 (maybe after you're done with this article 😆).

    Save and restore memory

    Ok. now finally the exciting part! This part is going to be a bit more difficult if you aren't familiar with how ‘memory paging’ or page faults work. There are two videos which I found quite informative:

    However, don't be alarmed by this – understanding how page tables work helps, but we won't need to do any manual manipulation to these tables. Let's start by looking at how Linux manages a process's memory.

    There are multiple ways we could go about this – one thing we can do is to follow what happens on a fork, and try to replicate that (but only the memory part). We can again use the fork-test.c program for this.

    What does fork do?

    Starting with a simple trace-cmd record, it gives us a bunch of garbage, probably because it's recording the executable initialization as well:

    root@f8b9ed144f52:/# trace-cmd record -p function_graph --max-graph-depth 4 -F -c ./fork-test
      plugin 'function_graph'
    I am the parent. Fork took 222414 ns to return
    I am the child. It took 1151499 ns to fork
    CPU0 data recorded at offset=0xbc000
        54349 bytes in size (294912 uncompressed)
    root@f8b9ed144f52:/# trace-cmd report > a.log
    root@f8b9ed144f52:/# less a.log
           fork-test-92    [000]   153.874423: funcgraph_entry:        1.623 us   |  mutex_unlock();
           fork-test-92    [000]   153.874425: funcgraph_entry:                   |  __f_unlock_pos() {
           fork-test-92    [000]   153.874425: funcgraph_entry:        0.311 us   |    mutex_unlock();
           fork-test-92    [000]   153.874426: funcgraph_exit:         0.961 us   |  }
           fork-test-92    [000]   153.874426: funcgraph_entry:        0.391 us   |  fpregs_assert_state_consistent();
           fork-test-92    [000]   153.874429: funcgraph_entry:                   |  x64_sys_call() {
           fork-test-92    [000]   153.874430: funcgraph_entry:                   |    __x64_sys_execve() {
           fork-test-92    [000]   153.874430: funcgraph_entry:                   |      getname() {
           fork-test-92    [000]   153.874431: funcgraph_entry:        0.612 us   |        getname_flags();
           fork-test-92    [000]   153.874432: funcgraph_exit:         1.212 us   |      }
           fork-test-92    [000]   153.874432: funcgraph_entry:                   |      do_execveat_common.isra.0() {
           fork-test-92    [000]   153.874433: funcgraph_entry:                   |        alloc_bprm() {
              <idle>-0     [000]   153.874798: funcgraph_entry:        0.452 us   |  _raw_spin_lock_irqsave();
              <idle>-0     [000]   153.874798: funcgraph_entry:        0.100 us   |  _raw_spin_unlock_irqrestore();
              <idle>-0     [000]   153.874799: funcgraph_entry:        0.141 us   |  switch_mm_irqs_off();
           ...             ...     ...         ...                     ...           ...
           fork-test-92    [000]   153.878550: funcgraph_entry:                   |  handle_mm_fault() {
           fork-test-92    [000]   153.878550: funcgraph_entry:                   |    __handle_mm_fault() {
           fork-test-92    [000]   153.878550: funcgraph_entry:                   |      pte_offset_map_nolock() {
           fork-test-92    [000]   153.878551: funcgraph_entry:        0.140 us   |        __pte_offset_map();
           fork-test-92    [000]   153.878551: funcgraph_exit:         0.351 us   |      }
           fork-test-92    [000]   153.878551: funcgraph_entry:        0.110 us   |      __rcu_read_unlock();
           fork-test-92    [000]   153.878551: funcgraph_entry:        0.100 us   |      __rcu_read_lock();
           fork-test-92    [000]   153.878551: funcgraph_entry:                   |      filemap_map_pages() {
           fork-test-92    [000]   153.878552: funcgraph_entry:        0.109 us   |        __rcu_read_lock();
           fork-test-92    [000]   153.878552: funcgraph_entry:        0.431 us   |        next_uptodate_folio();
           fork-test-92    [000]   153.878552: funcgraph_entry:        0.230 us   |        __pte_offset_map_lock();
    ...
    

    Thankfully, there is a useful flag for us for exactly this situation:

    -g function-name
        This option is for the function_graph plugin. It will graph the given
        function. That is, it will only trace the function and all functions that it
        calls. You can have more than one -g on the command line.
    

    You can search through the list of available functions by looking at available_filter_functions under tracefs:

    root@f8b9ed144f52:/# cat /sys/kernel/tracing/available_filter_functions | grep fork
    ret_from_fork
    __do_sys_fork
    __do_sys_vfork
    tsk_fork_get_node
    fork_usermode_driver
    

    Let's try __do_sys_fork:

    root@f8b9ed144f52:/# trace-cmd record -p function_graph --max-graph-depth 4 -g __do_sys_fork -F -c ./fork-test
      plugin 'function_graph'
    I am the parent. Fork took 174195 ns to return
    I am the child. It took 1053852 ns to fork
    CPU0 data recorded at offset=0xbc000
        1619 bytes in size (8192 uncompressed)
    root@f8b9ed144f52:/# trace-cmd report
    cpus=1
           fork-test-107   [000]   766.823819: funcgraph_entry:                   |  __do_sys_fork() {
           fork-test-107   [000]   766.823820: funcgraph_entry:                   |    kernel_clone() {
           fork-test-107   [000]   766.823820: funcgraph_entry:                   |      copy_process() {
           fork-test-107   [000]   766.823820: funcgraph_entry:        0.361 us   |        _raw_spin_lock_irq();
           fork-test-107   [000]   766.823821: funcgraph_entry:        0.181 us   |        recalc_sigpending();
           fork-test-107   [000]   766.823821: funcgraph_entry:        0.100 us   |        _raw_spin_unlock_irq();
           fork-test-107   [000]   766.823822: funcgraph_entry:        0.120 us   |        tsk_fork_get_node();
           ...             ...     ...         ...                     ...                 ...
           fork-test-107   [000]   766.823841: funcgraph_entry:        0.120 us   |        posix_cputimers_group_init();
           fork-test-107   [000]   766.823841: funcgraph_entry:        0.100 us   |        __mutex_init();
           fork-test-107   [000]   766.823842: funcgraph_entry:        0.101 us   |        __init_rwsem();
           fork-test-107   [000]   766.823842: funcgraph_entry:      + 44.784 us  |        copy_mm();
           fork-test-107   [000]   766.823887: funcgraph_entry:        0.211 us   |        copy_namespaces();
           fork-test-107   [000]   766.823887: funcgraph_entry:        0.751 us   |        copy_thread();
           fork-test-107   [000]   766.823888: funcgraph_entry:        1.172 us   |        alloc_pid();
           fork-test-107   [000]   766.823889: funcgraph_entry:        0.111 us   |        __mutex_init();
           fork-test-107   [000]   766.823890: funcgraph_entry:        0.130 us   |        user_disable_single_step();
           fork-test-107   [000]   766.823890: funcgraph_entry:        0.151 us   |        clear_posix_cputimers_work();
    

    Hey, isn't this much nicer! Now, one of the function here looks quite interesting: copy_mm. There is a lot of stuff going on underneath that function, but if you follow the function calls, you will eventually find this bit in dup_mmap:

        for_each_vma(vmi, mpnt) {
            ...
            if (!(tmp->vm_flags & VM_WIPEONFORK))
                retval = copy_page_range(tmp, mpnt);
    

    This is executed under a for_each_vma. VMA here means ‘Virtual Memory Area’, and it represents a memory mapping, which can be either a file, other exotic stuff like devices, or anonymous (i.e. ‘normal’ RAM), which is what we care the most about.

    From copy_page_range, after descending through the copy_p?d_range (which we can guess is dealing with lower and lower page tables), ignoring code which handles huge pages, we end up in copy_present_ptes. In this and related memory management functions, we see the concept of a ‘folio’. They are basically consecutive pages that are to be allocated, copied, and freed together.8 For example, when you ask for a large chunk of anonymous memory via mmap, the kernel may allocate in 16 KiB chunks, instead of individual page-by-page.

    copy_present_ptes has branches for both the single-page case and batching copying for multiple consecutive pages, but in both case it does two crucial things which might not be obvious at first glance. First we have:

    folio_ref_add(folio, nr);
    // or, in the single-page case:
    folio_get(folio);
    

    xxx_get is a common naming scheme in the kernel for functions that e.g. increases a reference count. Similarly xxx_put will usually decrement a reference count and free resources. While I don't completely get the naming (I guess you can get stuff and ‘put’ stuff back??), this pattern is also somewhat consistently used for e.g. files, tasks, namespaces and other resources that can have shared ownership.

    Note that in this case, the reference count of this folio is used to determine how many processes9 have this page mapped (potentially in a write-protected state) – we shall see where this is used later.

    It then calls __copy_present_ptes, which contains this bit that should seem very interesting for what we're trying to do here:

    static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
            struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
            pte_t pte, unsigned long addr, int nr)
    {
        struct mm_struct *src_mm = src_vma->vm_mm;
    
        /* If it's a COW mapping, write protect it both processes. */
        if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
            wrprotect_ptes(src_mm, addr, src_pte, nr);
            pte = pte_wrprotect(pte);
        }
    
        /* If it's a shared mapping, mark it clean in the child. */
    

    Hmm, ok. This seems to just be setting the page table entries (PTEs) to be read-only (write-protect) if this page is supposed to CoW. If you drill down into either wrprotect_ptes or pte_wrprotect, none of those functions does anything fancy aside from just manipulating or setting the PTE flags.

    Note that __copy_present_ptes does 2 things at once – copy the PTEs to the child process's page table, and setting them as read-only in both the parent and the child if it is Copy-on-Write (CoW). This is slightly different from what we want to do, because we don't want to create a new page table – we just want to write-protect all the existing pages. How would we do that? We will find out in a bit – for now let's try to find out a bit more about memory management.

    Kgdb fun

    Have you ever wanted to run an actual kernel debugger? We can try breaking in the kernel when our test program makes the fork syscall. First we need a debug break just so that gdb can connect to the kernel and we can set our breakpoint:

    root@f8b9ed144f52:/# echo g > /proc/sysrq-trigger
    [   21.811502][   T71] sysrq: DEBUG
    
    Entering kdb (current=0xffff888003898e80, pid 71) on processor 0 due to NonMaskable Interrupt @ 0xffffffff8115fb08
    [0]kdb>
    

    Now, in another terminal, open the linux-dev directory and do:

    > ./.dev/kgdb.sh
        # sometimes this would fail to connect. If you don't see a backtrace, do "q" then try again
    GNU gdb (GDB) 15.2
    ...
    Reading symbols from ../vmlinux...
    Remote debugging using kgdb.sock
    kgdb_breakpoint () at kernel/debug/debug_core.c:1222
    1222		wmb(); /* Sync point after breakpoint */
    #0  kgdb_breakpoint () at kernel/debug/debug_core.c:1222
    ...
    (gdb) b __do_sys_fork
    Breakpoint 1 at 0xffffffff810836c0: file kernel/fork.c, line 2888.
    (gdb) c
    Continuing.
    

    Now, in the VM terminal, run our test program:

    root@f8b9ed144f52:/# exec ./fork-test # Use exec so that we don't break on the shell's fork10
    

    And back to gdb:

    [Thread 8 exited]
    
    Thread 41 hit Breakpoint 1, __do_sys_fork (__unused=0xffffc9000014bf58) at kernel/fork.c:2888
    2888	{
    (gdb) bt
    #0  __do_sys_fork (__unused=0xffffc9000014bf58) at kernel/fork.c:2888
    #1  0xffffffff81002b79 in x64_sys_call (regs=regs@entry=0xffffc9000014bf58, nr=<optimized out>) at ./arch/x86/include/generated/asm/syscalls_64.h:58
    #2  0xffffffff81869db0 in do_syscall_x64 (regs=0xffffc9000014bf58, nr=<optimized out>) at arch/x86/entry/common.c:52
    #3  do_syscall_64 (regs=0xffffc9000014bf58, nr=<optimized out>) at arch/x86/entry/common.c:83
    #4  0xffffffff81a000b0 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
    #5  0x00007fd58ea30020 in ?? ()
    #6  0x000055b95d166dd8 in ?? ()
    #7  0x00007ffd638ff2c8 in ?? ()
    #8  0x0000000000000000 in ?? ()
    (gdb)
    

    Maybe let's just break on that __copy_present_ptes function mentioned earlier and see how it's called:

    (gdb) b __copy_present_ptes
    Breakpoint 2 at 0xffffffff8129b0d8: __copy_present_ptes. (2 locations)
    (gdb) c
    Continuing.
    
    Thread 41 hit Breakpoint 2.1, __copy_present_ptes (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_pte=0xffff88800392bb30, src_pte=0xffff88800398fb30, pte=..., addr=94254619058176, nr=1) at mm/memory.c:956
    956		if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
    (gdb) bt
    #0  __copy_present_ptes (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_pte=0xffff88800392bb30, src_pte=0xffff88800398fb30, pte=..., addr=94254619058176, nr=1) at mm/memory.c:956
    #1  copy_present_ptes (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_pte=0xffff88800392bb30, src_pte=0xffff88800398fb30, pte=..., addr=94254619058176, max_nr=<optimized out>, rss=0xffffc9000014bc80,
        prealloc=<synthetic pointer>) at mm/memory.c:1052
    #2  copy_pte_range (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_pmd=<optimized out>, src_pmd=<optimized out>, addr=94254619058176, end=<optimized out>) at mm/memory.c:1167
    #3  copy_pmd_range (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_pud=<optimized out>, src_pud=<optimized out>, addr=<optimized out>, end=<optimized out>) at mm/memory.c:1255
    #4  copy_pud_range (dst_vma=<optimized out>, src_vma=0xffff88800391c428, dst_p4d=<optimized out>, src_p4d=<optimized out>, addr=<optimized out>, end=<optimized out>) at mm/memory.c:1292
    #5  copy_p4d_range (dst_vma=<optimized out>, src_vma=<optimized out>, dst_pgd=<optimized out>, src_pgd=<optimized out>, addr=<optimized out>, end=<optimized out>) at mm/memory.c:1316
    #6  copy_page_range (dst_vma=dst_vma@entry=0xffff8880039194c0, src_vma=src_vma@entry=0xffff88800391c428) at mm/memory.c:1414
    #7  0xffffffff81080b27 in dup_mmap (mm=0xffff88800284ca00, oldmm=0xffff88800284cf00) at kernel/fork.c:749
    #8  dup_mm (tsk=0xffff888003898000, oldmm=0xffff88800284cf00) at kernel/fork.c:1691
    #9  copy_mm (clone_flags=clone_flags@entry=0, tsk=tsk@entry=0xffff888003898000) at kernel/fork.c:1743
    #10 0xffffffff8108260a in copy_process (pid=pid@entry=0x0 <fixed_percpu_data>, trace=trace@entry=0, node=node@entry=-1, args=args@entry=0xffffc9000014be98) at kernel/fork.c:2393
    #11 0xffffffff810830a0 in kernel_clone (args=args@entry=0xffffc9000014be98) at kernel/fork.c:2805
    #12 0xffffffff81083705 in __do_sys_fork (__unused=<optimized out>) at kernel/fork.c:2894
    

    The copy_p?d_range functions here are indeed just descending down the page table levels (p4d = level 4, pud = page ‘upper’ directory aka. level 3, pmd = page ‘middle’ directory aka. level 2, and copy_pte_range copies the first level table). In this case, given the backtrace above, the folio is just one page.

    Ok, that's great, but later on if we get a write page fault, how does the kernel know this page is supposed to be CoW'd (rather than, say, treating that fault as an error)? What would happen if our own code manages to set the pages on a process to be write-protected, even when they're not in a CoW state?

    How does the kernel know what to do with write faults?

    Because our fork-test.c program (and really all the programs in general) uses a stack, we can expect that once fork returns it will immediately triggers a page fault due to trying to push new stuff onto the stack. Maybe looking at what happens then could answer our questions about how the kernel knows when to CoW a page? Let's continue to trace our program, after the fork:

    root@9ba6f25d9dac:/# trace-cmd record -p function_graph --max-graph-depth 4 -F -c -e sys_exit_fork ./fork-test
      plugin 'function_graph'
    I am the parent. Fork took 226508 ns to return
    I am the child. It took 1146400 ns to fork
    CPU0 data recorded at offset=0xa9000
        57232 bytes in size (307200 uncompressed)
    root@9ba6f25d9dac:/# trace-cmd report > a.log
    root@9ba6f25d9dac:/# grep -A 20 'sys_exit_fork' a.log # find the point where we return from fork
           fork-test-73    [000]    51.507748: sys_exit_fork:        0x4a
           fork-test-73    [000]    51.507749: funcgraph_exit:         0.702 us   |  }
           fork-test-73    [000]    51.507749: funcgraph_entry:        0.120 us   |  fpregs_assert_state_consistent();
           fork-test-73    [000]    51.507749: funcgraph_entry:                   |  lock_vma_under_rcu() {
           fork-test-73    [000]    51.507750: funcgraph_entry:        0.111 us   |    __rcu_read_lock();
           fork-test-73    [000]    51.507750: funcgraph_entry:        0.110 us   |    down_read_trylock();
           fork-test-73    [000]    51.507750: funcgraph_entry:        0.100 us   |    __rcu_read_unlock();
           fork-test-73    [000]    51.507750: funcgraph_exit:         0.811 us   |  }
           fork-test-73    [000]    51.507750: funcgraph_entry:                   |  handle_mm_fault() {
           fork-test-73    [000]    51.507750: funcgraph_entry:                   |    __handle_mm_fault() {
           fork-test-73    [000]    51.507751: funcgraph_entry:                   |      pte_offset_map_nolock() {
           fork-test-73    [000]    51.507751: funcgraph_entry:        0.140 us   |        __pte_offset_map();
           fork-test-73    [000]    51.507751: funcgraph_exit:         0.361 us   |      }
           fork-test-73    [000]    51.507751: funcgraph_entry:        0.120 us   |      _raw_spin_lock();
           fork-test-73    [000]    51.507751: funcgraph_entry:                   |      do_wp_page() {
           fork-test-73    [000]    51.507751: funcgraph_entry:        0.100 us   |        vm_normal_page();
           fork-test-73    [000]    51.507752: funcgraph_entry:        0.101 us   |        _raw_spin_unlock();
           fork-test-73    [000]    51.507752: funcgraph_entry:        0.100 us   |        __rcu_read_unlock();
           fork-test-73    [000]    51.507752: funcgraph_entry:        0.100 us   |        __vmf_anon_prepare();
           fork-test-73    [000]    51.507752: funcgraph_entry:        0.291 us   |        __folio_alloc_noprof();
           fork-test-73    [000]    51.507753: funcgraph_entry:        0.371 us   |        copy_mc_to_kernel();
    --
           fork-test-74    [000]    51.507790: sys_exit_fork:        0x0
           fork-test-74    [000]    51.507790: funcgraph_exit:         0.251 us   |    }
           fork-test-74    [000]    51.507790: funcgraph_entry:                   |    __rseq_handle_notify_resume() {
           fork-test-74    [000]    51.507791: funcgraph_entry:                   |      lock_mm_and_find_vma() {
           fork-test-74    [000]    51.507791: funcgraph_entry:        0.111 us   |        down_read_trylock();
           fork-test-74    [000]    51.507791: funcgraph_entry:        0.381 us   |        find_vma();
           fork-test-74    [000]    51.507791: funcgraph_exit:         0.811 us   |      }
           fork-test-74    [000]    51.507791: funcgraph_entry:                   |      handle_mm_fault() {
           fork-test-74    [000]    51.507792: funcgraph_entry:        1.713 us   |        __handle_mm_fault();
           fork-test-74    [000]    51.507793: funcgraph_exit:         1.924 us   |      }
           fork-test-74    [000]    51.507793: funcgraph_entry:        0.100 us   |      up_read();
           fork-test-74    [000]    51.507794: funcgraph_exit:         3.647 us   |    }
           fork-test-74    [000]    51.507794: funcgraph_entry:        0.110 us   |    fpregs_assert_state_consistent();
           fork-test-74    [000]    51.507794: funcgraph_entry:                   |    switch_fpu_return() {
           fork-test-74    [000]    51.507794: funcgraph_entry:                   |      restore_fpregs_from_fpstate() {
           fork-test-74    [000]    51.507794: funcgraph_entry:        0.110 us   |        xfd_validate_state();
           fork-test-74    [000]    51.507795: funcgraph_exit:         0.410 us   |      }
           fork-test-74    [000]    51.507795: funcgraph_exit:         0.641 us   |    }
           fork-test-74    [000]    51.507795: funcgraph_exit:         6.993 us   |  }
           fork-test-74    [000]    51.507795: funcgraph_entry:                   |  lock_vma_under_rcu() {
           fork-test-74    [000]    51.507795: funcgraph_entry:        0.111 us   |    __rcu_read_lock();
    

    Remember earlier when we did the ftrace on our little fork program without filtering, there was a bunch of handle_mm_fault? (I've omitted many repeats of it in that trace dump)

    Now we see this again, and we can reasonably guess that it is the entry point to page faults (well, for valid user space addresses that permits the access). In fact, if you look at its signature, it looks like by the time we get there, we've already got the struct vm_area_struct reference:

    /*
     * By the time we get here, we already hold the mm semaphore
     *
     * The mmap_lock may have been released depending on flags and our
     * return value.  See filemap_fault() and __folio_lock_or_retry().
     */
    vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                   unsigned int flags, struct pt_regs *regs)
    

    Let's try running it again, but this time after the fork, we will break on it. We will first use GDB to find out the process's memory map, for our reference later. In our VM:

    root@9ba6f25d9dac:/# gdb ./fork-test
    GNU gdb (Debian 13.1-3) 13.1
    ...
    Reading symbols from ./fork-test...
    (No debugging symbols found in ./fork-test)
    (gdb) info proc mappings11
    No current process: you must name one.
    Oops, let's try again after the program gets to right before the fork
    (gdb) q
    Re-build in the VM with source debugging ability - could also ‘catch syscall fork’
    (although note that catch traps on syscall exit too)
    root@9ba6f25d9dac:/# gcc -o fork-test fork-test.c -Og -g
    root@9ba6f25d9dac:/# gdb fork-test
    GNU gdb (Debian 13.1-3) 13.1
    ...
    Reading symbols from fork-test...
    (gdb) l 40
    35	int main(int argc, char const *argv[])
    36	{
    37	  uint64_t start = monotonic_ns();
    38	  // libc's fork() actually uses `clone`, but if we're writing a ptrace
    39	  // supervisor we're probably using `fork` directly.
    40	  int ret = syscall(SYS_fork);
    41	  sched_yield();
    42	  uint64_t end = monotonic_ns();
    43	  uint64_t elapsed = end - start;
    44	  if (ret == 0) {
    (gdb) b fork-test.c:40
    Breakpoint 1 at 0x119f: file fork-test.c, line 40.
    (gdb) r
    Starting program: /fork-test
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    
    Breakpoint 1, main (argc=<optimized out>, argv=<optimized out>) at fork-test.c:40
    40	  int ret = syscall(SYS_fork);
    (gdb) info proc mappings
    process 79
    Mapped address spaces:
    
              Start Addr           End Addr       Size     Offset  Perms  objfile
          0x555555554000     0x555555555000     0x1000        0x0  r--p   /fork-test
          0x555555555000     0x555555556000     0x1000     0x1000  r-xp   /fork-test
          0x555555556000     0x555555557000     0x1000     0x2000  r--p   /fork-test
          0x555555557000     0x555555558000     0x1000     0x2000  r--p   /fork-test
          0x555555558000     0x555555559000     0x1000     0x3000  rw-p   /fork-test
          0x7ffff7dd9000     0x7ffff7ddc000     0x3000        0x0  rw-p
          0x7ffff7ddc000     0x7ffff7e02000    0x26000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
          0x7ffff7e02000     0x7ffff7f57000   0x155000    0x26000  r-xp   /usr/lib/x86_64-linux-gnu/libc.so.6
          0x7ffff7f57000     0x7ffff7faa000    0x53000   0x17b000  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
          0x7ffff7faa000     0x7ffff7fae000     0x4000   0x1ce000  r--p   /usr/lib/x86_64-linux-gnu/libc.so.6
          0x7ffff7fae000     0x7ffff7fb0000     0x2000   0x1d2000  rw-p   /usr/lib/x86_64-linux-gnu/libc.so.6
          0x7ffff7fb0000     0x7ffff7fbd000     0xd000        0x0  rw-p
          0x7ffff7fc3000     0x7ffff7fc5000     0x2000        0x0  rw-p
          0x7ffff7fc5000     0x7ffff7fc9000     0x4000        0x0  r--p   [vvar]
          0x7ffff7fc9000     0x7ffff7fcb000     0x2000        0x0  r-xp   [vdso]
          0x7ffff7fcb000     0x7ffff7fcc000     0x1000        0x0  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
          0x7ffff7fcc000     0x7ffff7ff1000    0x25000     0x1000  r-xp   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
          0x7ffff7ff1000     0x7ffff7ffb000     0xa000    0x26000  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
          0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x30000  r--p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
          0x7ffff7ffd000     0x7ffff7fff000     0x2000    0x32000  rw-p   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
          0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
      0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]
    (gdb) shell echo g > /proc/sysrq-trigger
    [   27.650816][   T77] sysrq: DEBUG
    
    Entering kdb (current=0xffff88800389c880, pid 77) on processor 0 due to NonMaskable Interrupt @ 0xffffffff8115f7c8
    [0]kdb>
    

    Now, again, in another terminal in the linux-dev directory:

    > ./.dev/kgdb.sh
    GNU gdb (GDB) 15.2
    ...
    We first want to reach the point where the kernel gets the fork call from the program,
    before actually breaking on handle_mm_fault, otherwise we will just be spammed when the
    program loads
    (gdb) b __do_sys_fork
    Breakpoint 1 at 0xffffffff810833a0: file kernel/fork.c, line 2888.
    (gdb) c
    Continuing.
    

    Now go back to the VM terminal and do a ‘c’, which will immediately break in the kernel gdb again as syscall(SYS_fork) is executed. We then go back to kgdb:

    [Thread 8 exited]
    [Thread 77 exited]
    [Switching to Thread 74]
    
    Thread 43 hit Breakpoint 1, __do_sys_fork (__unused=0xffffc90000163f58) at kernel/fork.c:2888
    2888	{
    

    Great, but are we sure this is the right fork we're looking for? Ignore the “Thread 43” – it is gdb confusingly numbering the threads in its own way. We can ask the kernel what task are we currently on:

    (gdb) monitor ps
    36 sleeping system daemon (state [ims]) processes suppressed,
    use 'ps A' to see all.
    Task Addr               Pid   Parent [*] cpu State Thread             Command
    0xffff88800389ab80       74       71  1    0   R  0xffff88800389b580 *fork-test
    
    0xffff888002b58000        1        0  0    0   S  0xffff888002b58a00  init.sh
    0xffff888003898000       70        1  0    0   S  0xffff888003898a00  bash
    0xffff888003898e80       71       70  0    0   S  0xffff888003899880  gdb
    0xffff888003899d00       73       70  0    0   S  0xffff88800389a700  gdb worker
    0xffff88800389ab80       74       71  1    0   R  0xffff88800389b580 *fork-test
    

    Nice, now we can break on handle_mm_fault, which could trigger in either the parent or the child.

    (gdb) b handle_mm_fault
    Breakpoint 2 at 0xffffffff81296e90: file mm/memory.c, line 6044.
    (gdb) c
    Continuing.
    [New Thread 78]
    [Switching to Thread 71]
    
    Thread 41 hit Breakpoint 2, handle_mm_fault (vma=vma@entry=0xffff88800391ee40, address=address@entry=94342972527488, flags=flags@entry=4948, regs=regs@entry=0xffffc90000153f58) at mm/memory.c:6044
    6044	{
    Hmm... this doesn't look right. PID 71 is gdb...
    (gdb) monitor ps
    36 sleeping system daemon (state [ims]) processes suppressed,
    use 'ps A' to see all.
    Task Addr               Pid   Parent [*] cpu State Thread             Command
    0xffff888003898e80       71       70  1    0   R  0xffff888003899880 *gdb
    ...
    

    Oops, that's not what we want. Let's keep doing c until we get to handle_mm_fault in the fork-test process, either child or parent (the first one would copy memory). The confusing ‘two numbering system’ for threads in kgdb definitely isn't helpful, but you can always do either a monitor ps or info threads to check. The breakpoint message uses the ‘gdb numbering’, which is the first column of info threads.

    (gdb) info threads
      Id   Target Id                      Frame
      ...
    * 41   Thread 71 (gdb)                handle_mm_fault (vma=vma@entry=0xffff888003912e40, address=address@entry=94440112313216, flags=flags@entry=4948, regs=regs@entry=0xffffc9000015bf58) at mm/memory.c:6044
      42   Thread 73 (gdb worker)         0x0000000000000000 in fixed_percpu_data ()
      43   Thread 74 (fork-test)          0x0000000000000000 in fixed_percpu_data ()
      45   Thread 78 (fork-test)          0x0000000000000000 in fixed_percpu_data ()
    (gdb) c
    Continuing.
    [Thread 8 exited]
    
    Thread 41 hit Breakpoint 2, handle_mm_fault (vma=vma@entry=0xffff888003912e40, address=address@entry=94440112313216, flags=flags@entry=884, regs=regs@entry=0xffffc9000015bf58) at mm/memory.c:6044
    6044	{
    (gdb) c
    Continuing.
    
    ... more repetations omitted ...
    
    (gdb) c
    Continuing.
    [Switching to Thread 78]
    
    Thread 45 hit Breakpoint 2, handle_mm_fault (vma=vma@entry=0xffff888003a451c8, address=address@entry=140737351884904, flags=flags@entry=533, regs=regs@entry=0xffffc9000016bda8) at mm/memory.c:6044
    6044	{
    (gdb) monitor ps
    36 sleeping system daemon (state [ims]) processes suppressed,
    use 'ps A' to see all.
    Task Addr               Pid   Parent [*] cpu State Thread             Command
    0xffff88800389c880       78       71  1    0   R  0xffff88800389d280 *fork-test
    
    0xffff888002b58000        1        0  0    0   S  0xffff888002b58a00  init.sh
    0xffff888003898000       70        1  0    0   S  0xffff888003898a00  bash
    0xffff888003898e80       71       70  0    0   R  0xffff888003899880  gdb
    0xffff888003899d00       73       70  0    0   S  0xffff88800389a700  gdb worker
    0xffff88800389ab80       74       71  0    0   t  0xffff88800389b580  fork-test
    0xffff88800389c880       78       71  1    0   R  0xffff88800389d280 *fork-test
    ok nice, let's look at the faulted address/VMA:
    (gdb) print/x *vma
    $2 = {{{vm_start = 0x7ffff7dd9000, vm_end = 0x7ffff7ddc000}, vm_rcu = {next = 0x7ffff7dd9000, func = 0x7ffff7ddc000}}, vm_mm = 0xffff88800284cf00, vm_page_prot = {pgprot = 0x8000000000000025}, {vm_flags = 0x100073,
        __vm_flags = 0x100073}, detached = 0x0, vm_lock_seq = 0x0, vm_lock = 0xffff88800392cd70, shared = {rb = {__rb_parent_color = 0x0, rb_right = 0x0, rb_left = 0x0}, rb_subtree_last = 0x0}, anon_vma_chain = {next = 0xffff888003a4b210,
        prev = 0xffff888003a4b1d0}, anon_vma = 0xffff888003a37af8, vm_ops = 0x0, vm_pgoff = 0x7ffff7dd9, vm_file = 0x0, vm_private_data = 0x0, swap_readahead_info = {counter = 0x0}, vm_userfaultfd_ctx = {<No data fields>}}
    (gdb) print/x address
    $3 = 0x7ffff7dda068
    

    Hmm… This is not the area previously marked [stack] in info proc mappings 🤔

    Remember earlier in the trace-cmd result, there was a __rseq_handle_notify_resume right after fork, even before the first handle_mm_fault? After searching around I realized that this is likely here to handle this thing called restartable sequences, which after reading some Stack Overflow I came to the conclusion that it is a feature that will result in the kernel doing something to the user-space (like writing to some memory) after a process switches CPU, or in this case apparently also after a fork. In that case, this fault is caused by the kernel itself writing to that CoW'd memory. I guess that's pretty cool – the kernel is handling a fault caused by itself. Let's see if we can get a backtrace of the thing that caused the fault:

    (gdb) bt
    #0  handle_mm_fault (vma=vma@entry=0xffff888003a3e260, address=address@entry=140737351884904, flags=flags@entry=533, regs=regs@entry=0xffffc9000004bda8) at mm/memory.c:6044
    #1  0xffffffff8106f63d in do_user_addr_fault (regs=regs@entry=0xffffc9000004bda8, error_code=error_code@entry=3, address=address@entry=140737351884904) at arch/x86/mm/fault.c:1389
    #2  0xffffffff81857850 in handle_page_fault (regs=0xffffc9000004bda8, error_code=3, address=140737351884904) at arch/x86/mm/fault.c:1481
    #3  exc_page_fault (regs=0xffffc9000004bda8, error_code=3) at arch/x86/mm/fault.c:1539
    #4  0xffffffff81a0125b in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:623
    #5  0x00007ffff7edd799 in ?? ()
    #6  0xffffc9000004bf58 in ?? ()
    #7  0x0000000000000000 in ?? ()
    

    The gdb backtrace here doesn't tell us the full picture (although it does tell us that there are functions above handle_mm_fault which wern't traced). In fact, the kernel's own bt does a better job when things cross an exception or interrupt boundary (which in this case would be the page fault):

    (gdb) monitor bt
    Stack traceback for pid 78
    0xffff88800389c880       78       71  1    0   R  0xffff88800389d280 *fork-test
    CPU: 0 UID: 0 PID: 78 Comm: fork-test Not tainted 6.12.9-dev-00028-ga175abc44fb9-dirty #164
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x53/0x70
     asm_exc_int3+0x3e/0x50
    RIP: 0010:handle_mm_fault+0x0/0x210
    Code: 78 08 ba 01 00 00 00 48 89 de e8 8b 68 fe ff 65 ff 0d 4c 4e d9 7e 0f 85 dd fb ff ff 0f 1f 44 00 00 e9 d3 fb ff ff 0f 1f 40 00 <f3> 0f 1e fa 0f 1f 44 00 00 55 89 d0 83 e0 01 48 89 e5 41 57 41 56
    RSP: 0018:ffffc9000004bd28 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffc9000004bda8
    RDX: 0000000000000215 RSI: 00007ffff7dda068 RDI: ffff888003a3e260
    RBP: ffffc9000004bd68 R08: ffff888003a3e260 R09: ffff888003a3e260
    R10: ffff88800397fe08 R11: ffff88800397fe0c R12: ffffc9000004bda8
    R13: 00007ffff7dda068 R14: ffff88800284cf00 R15: 0000000000000215
     ? handle_mm_fault+0x1/0x210
     ? do_user_addr_fault+0x16d/0x5f0
     exc_page_fault+0x80/0x160
     asm_exc_page_fault+0x2b/0x30
    RIP: 0010:__put_user_8+0x11/0x20
    Code: ca c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 0f 01 cb 89 01 31 c9 0f 01 ca c3 90 f3 0f 1e fa 48 89 cb 48 c1 fb 3f 48 09 d9 0f 01 cb <48> 89 01 31 c9 0f 01 ca c3 66 0f 1f 44 00 00 f3 0f 1e fa 0f 01 cb
    RSP: 0018:ffffc9000004be58 EFLAGS: 00050202
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007ffff7dda068
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffffc9000004bef8 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff88800389c880
    R13: 0000000000000000 R14: ffffc9000004bf58 R15: 00007ffff7edd799
     ? __rseq_handle_notify_resume+0x225/0x470
     ? arch_do_signal_or_restart+0x4a/0x270
     syscall_exit_to_user_mode+0xe8/0x140
     ret_from_fork+0x2d/0x60
     ret_from_fork_asm+0x11/0x20
    RIP: 0033:0x7ffff7edd799
    Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 06 0d 00 f7 d8 64 89 01 48
    RSP: 002b:00007fffffffebb8 EFLAGS: 00000306 ORIG_RAX: 0000000000000039
    RAX: 0000000000000000 RBX: 00007fffffffed08 RCX: 00007ffff7edd799
    RDX: 00007ffff7fc9b7f RSI: 0000000000000002 RDI: 00007fffffffebc0
    RBP: 0000000dde6dc87b R08: 00007ffff7fc5080 R09: 000000000000003b
    R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
    R13: 00007fffffffed18 R14: 0000555555557dd8 R15: 00007ffff7ffd020
     </TASK>
    

    Anyway, it's not terribly important to understand the exact working of these stuff. Let's continue and see if we get another handle_mm_fault on the child's stack.

    (gdb) c
    Continuing.
    ... the next one is the same thing in the parent, continuing again ...
    [Switching to Thread 78]
    
    Thread 45 hit Breakpoint 2, handle_mm_fault (vma=vma@entry=0xffff888003956be0, address=address@entry=140737488350168, flags=flags@entry=4693, regs=regs@entry=0xffffc90000173f58) at mm/memory.c:6044
    6044	{
    (gdb) print/x vma->vm_start
    $4 = 0x7ffffffde000
    Previously:
          0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
    
    

    Nice! Now, this whole process was a bit fiddly. With some clever if statements you can probably make the kernel automatically break when the right handle_mm_fault happens (e.g. first time after a process called ‘fork-test’ returns from fork). Alternatively you can place a bunch of trace_printks to study its behaviour.

    Before we continue, remember that the function we're in, handle_mm_fault, is not the first thing that the kernel enters when a page fault happens – by this point a number of checks has already been done. Most importantly for us, there is this bit in the function just one level up – do_user_addr_fault:

        if (unlikely(access_error(error_code, vma))) {
            bad_area_access_error(regs, error_code, address, NULL, vma);
            count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
            return;
        }
        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
        ...
    

    And in access_error:

        if (error_code & X86_PF_WRITE) {
            /* write, present and write, not present: */
            if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
                return 1;
            if (unlikely(!(vma->vm_flags & VM_WRITE)))
                return 1;
            return 0;
        }
    

    This means that even though our page was write-protected, the fact that we got to handle_mm_fault means that our VMA is marked as writable, and we learn that vm_flags stores the permission bits like VM_WRITE. While the fact that our VMA is still writable (even when it is write-protected in the actual page table) might feel like a somewhat obvious fact (after all, the user did map the area as writable, they are just write-protected for CoW magic to work), it is interesting because it highlights the fact that the VMA permissions and page table permissions are decoupled. This tells us that, to achieve our initial goal of write-protecting the process's memory for checkpointing, we should not be changing any VMA permissions.

    Let's see how far we get in that function:

    (gdb) n
    47			return this_cpu_read_const(const_pcpu_hot.current_task);
    (gdb) bt
    #0  handle_mm_fault (vma=vma@entry=0xffff888003956be0, address=address@entry=140737488350168, flags=flags@entry=4693, regs=regs@entry=0xffffc90000173f58) at ./arch/x86/include/asm/current.h:47
    ...
    Note that this is not in fact returning from handle_mm_fault, it's inlining something in asm/current.h.
    (gdb) n
    6052		ret = sanitize_fault_flags(vma, &flags);
    (gdb)
    6056		if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
    (gdb)
    6063		is_droppable = !!(vma->vm_flags & VM_DROPPABLE);
    (gdb) n
    11		return !!(vma->vm_flags & VM_HUGETLB);
    (gdb) bt
    #0  handle_mm_fault (vma=vma@entry=0xffff888003956be0, address=address@entry=140737488350168, flags=flags@entry=4693, regs=regs@entry=0xffffc90000173f58)
        at ./include/linux/hugetlb_inline.h:11
    ...
    (gdb) n
    6077			ret = __handle_mm_fault(vma, address, flags);
    (gdb)
    

    This seems worthy of stepping into – there isn't much left in handle_mm_fault itself anyway.

    (gdb) s
    __handle_mm_fault (vma=vma@entry=0xffff888003956be0, address=address@entry=140737488350168, flags=flags@entry=4693) at mm/memory.c:5818
    5818	{
    (gdb) n
    5819		struct vm_fault vmf = {
    (gdb)
    3133		struct file *vm_file = vma->vm_file;
    (gdb)
    5825			.gfp_mask = __get_fault_gfp_mask(vma),
    (gdb)
    5833		pgd = pgd_offset(mm, address);
    (gdb)
    5824			.pgoff = linear_page_index(vma, address),
    (gdb)
    5833		pgd = pgd_offset(mm, address);
    (gdb)
    5834		p4d = p4d_alloc(mm, pgd, address);
    (gdb)
    5835		if (!p4d)
    (gdb)
    5838		vmf.pud = pud_alloc(mm, p4d, address);
    (gdb)
    5839		if (!vmf.pud)
    (gdb)
    5869		vmf.pmd = pmd_alloc(mm, vmf.pud, address);
    (gdb)
    5870		if (!vmf.pmd)
    (gdb)
    5884			vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
    (gdb) n
    5909		return handle_pte_fault(&vmf);
    (gdb)
    

    Okay, so it's just travelling down the page table levels, and allocating any non-existent ones (these alloc functions doesn't do anything if the entry is already present). We then end up at handle_pte_fault

    (gdb) s
    0xffffffff8129633e in handle_pte_fault (vmf=<optimized out>) at mm/memory.c:5752
    5752			vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
    

    This function turned out to be quite confusing to step through due to heavily inlining and gdb not being helpful. Skipping over a bunch of things that has to do with getting the correct pte, we arrive at this check:

    ...
    (gdb) n
    5780		if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
    

    This is probably the part we want! It's checking if this is a write fault, and will do interesting write-fault specific things immediately afterwards if it is:

        if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
            if (!pte_write(entry))
                return do_wp_page(vmf);
            else if (likely(vmf->flags & FAULT_FLAG_WRITE))
                entry = pte_mkdirty(entry);
        }
    

    Some confusing gdb stepping output follows, again due to heavy inlining, but we do enter do_wp_page:

    ...
    (gdb) print vmf->flags
    $6 = (FAULT_FLAG_WRITE | FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE | FAULT_FLAG_ORIG_PTE_VALID | FAULT_FLAG_VMA_LOCK)
    (gdb) s
    do_wp_page (vmf=vmf@entry=0xffffc90000173df0) at mm/memory.c:3657
    3657	{
    (gdb) n
    3659		struct vm_area_struct *vma = vmf->vma;
    (gdb)
    3663		if (likely(!unshare)) {
    (gdb)
    3664			if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) {
    (gdb)
    3694		vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
    (gdb) n
    3696		if (vmf->page)
    (gdb)
    3697			folio = page_folio(vmf->page);
    (gdb)
    3703		if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
    (gdb)
    689		return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
    (gdb)
    3723		if (folio && folio_test_anon(folio) &&    (added for completeness)
    3724		    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
    (gdb)
    (Note: if branch not taken)
    3738			folio_get(folio);
    (gdb) n
    3740		pte_unmap_unlock(vmf->pte, vmf->ptl);
    (gdb)
    3745		return wp_page_copy(vmf);
    (gdb)
    

    Hmm, the PageAnonExclusive || wp_can_reuse_anon_folio check didn't pass. Going by the code and comments in the function, a reasonable guess is that it is checking if we're the only owner of the page (see wp_can_reuse_anon_folio which contains refcount checks), and ‘reuse’ i.e. not copy the page. Since we've just done folio_get/folio_ref_add earlier during the fork, this expectedly wasn't the case. However, we expect that when the parent process hits this point again (assuming the pte is still write-protected on the parent's page table), it will actually do the ‘reuse’.

    For now, we arrived at wp_page_copy. It is also quite long, but it basically boils down to allocating a new page, and copying the content over. It has checks against multiple threads running this on the same page, which complicates the code a bit. For our purpose we don't really need to go deeper at this point. It might be more interesting now to test our assumptions on the ‘resue’ case. We can do that by just waiting for the parent to also fault. In fact we get it pretty quick:

    (gdb) info threads
      Id   Target Id                      Frame
      ...
      40   Thread 76 (gdb)                0x0000000000000000 in fixed_percpu_data ()
      41   Thread 78 (gdb worker)         0x0000000000000000 in fixed_percpu_data ()
      42   Thread 79 (fork-test)          0x0000000000000000 in fixed_percpu_data ()
    * 44   Thread 83 (fork-test)          handle_mm_fault (vma=vma@entry=0xffff888003956be0, address=address@entry=140737488350168, flags=flags@entry=4693, regs=regs@entry=0xffffc90000173f58)
        at mm/memory.c:6088
    (gdb) c
    Continuing.
    [Switching to Thread 79]
    
    Thread 42 hit Breakpoint 2, handle_mm_fault (vma=vma@entry=0xffff8880039627b8, address=address@entry=140737488350168, flags=flags@entry=4693, regs=regs@entry=0xffffc9000016bf58)
        at mm/memory.c:6044
    6044	{
    (gdb) print/x vma->vm_start
    $12 = 0x7ffffffde000
    (gdb)
    

    This is again the stack VMA! At this point I realized that trying to set line number breakpoints in this function is just not reliable. So let's do a breakpoint on both wp_page_copy and wp_page_reuse – this way we can see which way that if branch goes:

    (gdb) b wp_page_reuse
    Breakpoint 3 at 0xffffffff8129017a: wp_page_reuse. (3 locations)
    (gdb) b wp_page_copy
    Breakpoint 4 at 0xffffffff812937cd: file mm/memory.c, line 3333.
    (gdb) c
    Continuing.
    
    Thread 42 hit Breakpoint 3.3, wp_page_reuse (folio=<optimized out>, vmf=0xffffc90000163df0) at mm/memory.c:3240
    3240		struct vm_area_struct *vma = vmf->vma;
    (gdb)
    

    Great success! This also confirms that it is completely normal for a process to hit a write fault on a page which already completely belongs to it. The kernel determines if copying is needed by checking if the reference count is 1 (usually), and if yes, basically does nothing (except making the pte writable before returning to user, so we don't get more faults). If we find a way to make the process's page write-protected, we can then simply add our own logic to this do_wp_page function that will copy off the page content somewhere else.

    There is another case, relevant to us, where the kernel will make a page writable for the user following a write fault. This is when an anonymous page is first accessed (either being written to or being read from). This is due to memory overcommitment, in which pages are not actually ‘allocated’ until the user program tries to use it. In this case, we will hit this if branch in handle_pte_fault, rather than going to do_wp_page:

        if (!vmf->pte)
            return do_pte_missing(vmf);
    
    static vm_fault_t do_pte_missing(struct vm_fault *vmf)
    {
        if (vma_is_anonymous(vmf->vma))
            return do_anonymous_page(vmf);
        else
            return do_fault(vmf);   (this is for files and stuff)
    }
    

    And for an anonymous mappings, the kernel will allocate a new zero-filed page on write. We therefore need our own logic there to also track newly allocated pages, and either deallocate them or revert them to be zero-filled on reverting the checkpoint.

    In summary, ignoring the problem of how do we write-protect these pages in the first place, there are two places we need to hook into, in order to copy off page content:

    This doesn't deal with shared pages at all, or file-mapped pages, or huge pages, but we will set that aside for now. With the knowledge we gaind through all this, we should be in a good position to support those, if we need to in the future.

    Since we don't want to deal with huge pages, we will disable them in further testing (otherwise our code will miss write faults on huge pages). This can be done by disabling the kernel feature CONFIG_TRANSPARENT_HUGEPAGE. You can do that with the make nconfig tool, or with the following patch:

    diff --git a/.config b/.config
    index 651bd63becee..64634d6749a0 100644
    --- a/.config
    +++ b/.config
    @@ -729,7 +729,6 @@ CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
    CONFIG_HAVE_ARCH_HUGE_VMAP=y
    CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
    CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
    -CONFIG_ARCH_WANT_PMD_MKWRITE=y
    CONFIG_HAVE_ARCH_SOFT_DIRTY=y
    CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
    CONFIG_MODULES_USE_ELF_RELA=y
    @@ -912,7 +911,6 @@ CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
    # CONFIG_PAGE_REPORTING is not set
    CONFIG_MIGRATION=y
    CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
    -CONFIG_ARCH_ENABLE_THP_MIGRATION=y
    CONFIG_PCP_BATCH_SCALE_MAX=5
    CONFIG_PHYS_ADDR_T_64BIT=y
    # CONFIG_KSM is not set
    @@ -921,16 +919,8 @@ CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
    # CONFIG_MEMORY_FAILURE is not set
    CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
    CONFIG_ARCH_WANTS_THP_SWAP=y
    -CONFIG_TRANSPARENT_HUGEPAGE=y
    -CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
    -# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
    -# CONFIG_TRANSPARENT_HUGEPAGE_NEVER is not set
    -CONFIG_THP_SWAP=y
    -# CONFIG_READ_ONLY_THP_FOR_FS is not set
    +# CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_PGTABLE_HAS_HUGE_LEAVES=y
    -CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP=y
    -CONFIG_ARCH_SUPPORTS_PMD_PFNMAP=y
    -CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y
    CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
    CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
    CONFIG_HAVE_SETUP_PER_CPU_AREA=y

    Let's add some trace_printks and see how often they are being called now, again using our test program:

    diff --git a/mm/memory.c b/mm/memory.c
    index d322ddfe6791..81874aaa72b5 100644
    --- a/mm/memory.c
    +++ b/mm/memory.c
    @@ -3660,6 +3660,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
    struct folio *folio = NULL;
    pte_t pte;
    +#ifdef CONFIG_ICK
    + if (current->hack_target.hack) {
    + trace_printk("faulting on write-protected page %lx\n", vmf->address);
    + // TODO: add ick checkpoint code here
    + }
    +#endif
    +
    if (likely(!unshare)) {
    if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) {
    if (!userfaultfd_wp_async(vma)) {
    @@ -4741,6 +4748,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
    int nr_pages = 1;
    pte_t entry;
    +#ifdef CONFIG_ICK
    + if (current->hack_target.hack) {
    + trace_printk("faulting on non-present anonymous page %lx\n", addr);
    + // TODO: add ick checkpoint code here
    + }
    +#endif
    +
    /* File mapping without ->vm_ops ? */
    if (vma->vm_flags & VM_SHARED)
    return VM_FAULT_SIGBUS;

    And don't forget to recompile the kernel.

    For useful reference, I will use gdb to launch the test program, break on main, then print out the memory mappings before continuing. We can then cross-reference this with the trace output to see if the pages being write-faulted make sense:

    root@ca88e916e28d:/# trace-cmd record -p nop \
      -e console /usr/bin/gdb -q --batch ./my-hackme \
        -ex 'starti' -ex 'b main' -ex 'c' \
        -ex 'info proc mappings' \
        -ex 'c'
    [    7.211177][   T90] execveat: /my-hackme[90] to be hacked
    
    Program stopped.
    0x00000000004061c0 in _start ()
    Breakpoint 1 at 0x4062e5: file my-hackme.cpp, line 7.
    
    Breakpoint 1, main (argc=1, argv=0x7fffffffecf8) at my-hackme.cpp:7
    7	int main(int argc, char const *argv[]) {
    process 90
    
    Mapped address spaces:
    
              Start Addr           End Addr       Size     Offset  Perms  objfile
                0x400000           0x401000     0x1000        0x0  r--p   /my-hackme
                0x401000           0x583000   0x182000     0x1000  r-xp   /my-hackme
                0x583000           0x5d8000    0x55000   0x183000  r--p   /my-hackme
                0x5d8000           0x5e4000     0xc000   0x1d7000  r--p   /my-hackme
                0x5e4000           0x5e6000     0x2000   0x1e3000  rw-p   /my-hackme
                0x5e6000           0x5ef000     0x9000        0x0  rw-p
                0x5ef000           0x611000    0x22000        0x0  rw-p   [heap]
          0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0  r--p   [vvar]
          0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0  r-xp   [vdso]
          0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
      0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]
    [    7.310599][   T90] hack: 0 gave different output
    Enter number:
    Program received signal SIGSEGV, Segmentation fault.
    0x000000000000003a in ?? ()
    CPU0 data recorded at offset=0xaa000
        470 bytes in size (4096 uncompressed)
    CPU1 data recorded at offset=0xab000
        0 bytes in size (0 uncompressed)
    

    (-e console makes Linux include our console prints in the trace)

    And now the trace output:

    root@ca88e916e28d:/# trace-cmd report
    cpus=2
           my-hackme-90    [000]     7.211182: console:              execveat: /my-hackme[90] to be hacked
           my-hackme-90    [000]     7.212482: bprint:               do_anonymous_page: faulting on non-present anonymous page 7fffffffe000
           my-hackme-90    [000]     7.254486: bprint:               do_wp_page: faulting on write-protected page 5ed000
           my-hackme-90    [000]     7.255538: bprint:               do_wp_page: faulting on write-protected page 5e2000
           my-hackme-90    [000]     7.255672: bprint:               do_anonymous_page: faulting on non-present anonymous page 5e8000
           my-hackme-90    [000]     7.257253: bprint:               do_anonymous_page: faulting on non-present anonymous page 5ef000
           my-hackme-90    [000]     7.257777: bprint:               do_anonymous_page: faulting on non-present anonymous page 5ee000
           my-hackme-90    [000]     7.258034: bprint:               do_anonymous_page: faulting on non-present anonymous page 5e9000
           my-hackme-90    [000]     7.258194: bprint:               do_anonymous_page: faulting on non-present anonymous page 7fffffffd000
           my-hackme-90    [000]     7.259049: bprint:               do_anonymous_page: faulting on non-present anonymous page 5f0000
           my-hackme-90    [000]     7.261284: bprint:               do_wp_page: faulting on write-protected page 5e9000
           my-hackme-90    [000]     7.261595: bprint:               do_anonymous_page: faulting on non-present anonymous page 5e6000
           my-hackme-90    [000]     7.261596: bprint:               do_wp_page: faulting on write-protected page 5e6000
           my-hackme-90    [000]     7.262312: bprint:               do_anonymous_page: faulting on non-present anonymous page 5e7000
           my-hackme-90    [000]     7.267858: bprint:               do_anonymous_page: faulting on non-present anonymous page 602000
           my-hackme-90    [000]     7.310312: bprint:               do_anonymous_page: faulting on non-present anonymous page 603000
           my-hackme-90    [000]     7.310598: bprint:               ksys_write: hacked process attempted write with data Enter number:
           my-hackme-90    [000]     7.310603: console:              hack: 0 gave different output
           my-hackme-90    [000]     7.311935: bprint:               do_anonymous_page: faulting on non-present anonymous page 604000
           my-hackme-90    [000]     7.311938: bprint:               ksys_read: ick checkpoint on hacked process my-hackme[90]
           my-hackme-90    [000]     7.311939: bprint:               ick_checkpoint_proc: ick: Checkpointed my-hackme[90]
           my-hackme-90    [000]     7.311939: bprint:               ksys_read: Providing number 1 to hacked process my-hackme[90]
           my-hackme-90    [000]     7.313393: bprint:               ksys_write: hacked process attempted write with data Nope! 1 was a wrong guess. The correct number is 2037191.
           my-hackme-90    [000]     7.313394: bprint:               ick_revert_proc: Restored process my-hackme[90]
    

    Since we haven't done anything to write-protect the data, nothing happens after ick_checkpoint_proc, but we do see our trace_printks in do_wp_page / do_anonymous_page in the initial phase of the execution!

    Ok, so how do we write-protect the pages?

    We can't exactly re-use the copy_???_range code used by fork, since that creates a copy of the page table, while we want to modify it ‘in-place’. However, there is another syscall which we can reasonably guess will do something very similar to what we want: mprotect. It just changes the current process's memory mapping, and so must contain code which alters the page table without also copying it. Let's have a look.

    To get started, you can again use trace-cmd with the appropriate function / function_graph tracer flags. However this time, given we have already read a good amount of memory management code, let's just see if we can figure it out ‘statically’ — just by reading the code.

    The first thing to do is to find the entrypoint for mprotect. We can search for the regex SYSCALL_DEFINE.\(mprotect (there is a number in the macro name after SYSCALL_DEFINE specifying the number of arguments), which lead us to mm/mprotect.c:856:

    SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
            unsigned long, prot)
    {
        return do_mprotect_pkey(start, len, prot, -1);
    }
    

    Some man-page browsing would tell us that ‘pkey’ is a hardware feature to allow ‘switching’ memory protections within the same process, and we can ignore any code that has to do with it in do_mprotect_pkey. Skipping some error checking, we have this (with my own comments in boxes):

    static int do_mprotect_pkey(unsigned long start, size_t len,
            unsigned long prot, int pkey)
    {
        ...
        if (mmap_write_lock_killable(current->mm))
            
    We might need to call this too in our own code!
    return -EINTR; ... tlb_gather_mmu(&tlb, current->mm);
    Initalize a structure which will track page table modifications to flush the TLB. We will need this too.
    nstart = start; tmp = vma->vm_start; for_each_vma_range(vmi, vma, end) {
    Previously we saw for_each_vma, now we have one that takes an end point (earlier in the function it passed the start point to vma_iter_init)
    This is likely the interesting part as it's doing the mprotect thing for each VMA.
    unsigned long mask_off_old_flags; unsigned long newflags; int new_vma_pkey; ... /* * Each mprotect() call explicitly passes r/w/x permissions. * If a permission is not passed to mprotect(), it must be * cleared from the VMA. */ mask_off_old_flags = VM_ACCESS_FLAGS | VM_FLAGS_CLEAR; new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey); newflags = calc_vm_prot_bits(prot, new_vma_pkey);
    Simply turns the PROT_READ/PROT_WRITE/PROT_EXEC passed in by user-space into VM_READ/VM_WRITE/VM_EXEC
    Contains some hand-‘optimized’ bit manipulation C code from before the first Linux git commit, and is basically nop since PROT_... == VM_....
    newflags |= (vma->vm_flags & ~mask_off_old_flags); /* newflags >> 4 shift VM_MAY% in place of VM_% */ if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) { error = -EACCES; break; } ... tmp = vma->vm_end; if (tmp > end) tmp = end; if (vma->vm_ops && vma->vm_ops->mprotect) {
    vma->vm_ops is (only) defined for file/device mappings, and these drivers/fs code might want to do something special.
    Not relevant to us.
    error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags); if (error) break; } error = mprotect_fixup(&vmi, &tlb, vma, &prev, nstart, tmp, newflags);
    This is the bit that actually does the thing. Why is it called ..._fixup??
    if (error) break; tmp = vma_iter_end(&vmi); nstart = tmp; prot = reqprot; } tlb_finish_mmu(&tlb); ... mmap_write_unlock(current->mm);

    Now, inside mprotect_fixup:

    int
    mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
               struct vm_area_struct *vma, struct vm_area_struct **pprev,
               unsigned long start, unsigned long end, unsigned long newflags)
    {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long oldflags = vma->vm_flags;
        long nrpages = (end - start) >> PAGE_SHIFT;
        unsigned int mm_cp_flags = 0;
        unsigned long charged = 0;
        int error;
    
        if (!can_modify_vma(vma))
            return -EPERM;
    
        if (newflags == oldflags) {
            *pprev = vma;
            return 0;
        }
    
        ...
    
        vma = vma_modify_flags(vmi, *pprev, vma, start, end, newflags);
        
    This merges adjacent VMAs when possible, potentially also changing the flag?
    if (IS_ERR(vma)) { error = PTR_ERR(vma); goto fail; } *pprev = vma; /* * vm_flags and vm_page_prot are protected by the mmap_lock * held in write mode. */ vma_start_write(vma);
    Earlier in do_mprotect_pkey we have taken the mmap_lock by calling mmap_write_lock_killable.
    This call increments a sequence number on the VMA so that readers trying to opportunistically avoid taking the
    mmap_lock when reading will actually fallback to taking it (and hence waiting until this mprotect call is done).
    This means that if we want to modify vma->vm_flags or vma->vm_page_prot, we need to both take the
    mmap_lock, and call this function.
    vm_flags_reset(vma, newflags);
    This changes the VMA flags.
    if (vma_wants_manual_pte_write_upgrade(vma)) mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
    Hmm… Interesting flag. If you read into it, it suggests that change_protection will ‘try’ to change the page to writable,
    which implies that it might not always do it. We will find out more (although you might have guessed what's going on already)
    vma_set_page_prot(vma);
    “Update vma->vm_page_prot to reflect vma->vm_flags.
    vma->vm_page_prot is basically the actual page protection flags that will get put in the page table.
    More on this later
    change_protection(tlb, vma, start, end, mm_cp_flags);
    Ok, this is basically changing the actual page table to reflect vma->vm_page_prot. Inside it:
    pgprot_t newprot = vma->vm_page_prot;
    ...
    pages = change_protection_range(tlb, vma, start, end, newprot,
              cp_flags);
    

    and inside change_protection_range, we descend down the page table again (like in copy_page_range earlier when forking),
    and eventually changes the PTE:

    ptent = pte_modify(oldpte, newprot);
    

    in change_pte_range


    if ((oldflags & VM_ACCOUNT) && !(newflags & VM_ACCOUNT)) vm_unacct_memory(nrpages);

    How does vma_set_page_prot derives the actual page protection flags? You might have thought that it would simply do a ‘translation’ of vma->vm_flags, turning e.g. read-only pages to read-only, and read-write pages to RW. However, that's not the case. Let's look at the function:

    /* Update vma->vm_page_prot to reflect vma->vm_flags. */
    void vma_set_page_prot(struct vm_area_struct *vma)
    {
        unsigned long vm_flags = vma->vm_flags;
        pgprot_t vm_page_prot;
    
        vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
        
    Gets the correct vma->vm_page_prot for this vma->vm_flags
    if (vma_wants_writenotify(vma, vm_page_prot)) {
    Hmm… vma_wants_writenotify — That's a very interesting function name!
    vm_flags &= ~VM_SHARED;
    If we want to be ‘notified’ of writes, we pretend it's a private mapping…?
    vm_page_prot = vm_pgprot_modify(vm_page_prot, vm_flags);
    …and get the correct pgprot_t for this vma->vm_flags again.
    } /* remove_protection_ptes reads vma->vm_page_prot without mmap_lock */ WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
    …and assign it here to our vma
    }

    In this case vm_pgprot_modify is the inner “calculate what's the pgprot_t for this vm_flags_t” function (it doesn't actually ‘modify’ anything, but it preserves unchanged bits in the old flag), which calls vm_get_page_prot which has this code:

    unsigned long val = pgprot_val(protection_map[vm_flags &
                      (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]);
    

    And what is in protection_map?

    static pgprot_t protection_map[16] __ro_after_init = {
        [VM_NONE]                                    = PAGE_NONE,
        [VM_READ]                                    = PAGE_READONLY,
        [VM_WRITE]                                   = PAGE_COPY,
        [VM_WRITE | VM_READ]                         = PAGE_COPY,
        [VM_EXEC]                                    = PAGE_READONLY_EXEC,
        [VM_EXEC | VM_READ]                          = PAGE_READONLY_EXEC,
        [VM_EXEC | VM_WRITE]                         = PAGE_COPY_EXEC,
        [VM_EXEC | VM_WRITE | VM_READ]               = PAGE_COPY_EXEC,
        [VM_SHARED]                                  = PAGE_NONE,
        [VM_SHARED | VM_READ]                        = PAGE_READONLY,
        [VM_SHARED | VM_WRITE]                       = PAGE_SHARED,
        [VM_SHARED | VM_WRITE | VM_READ]             = PAGE_SHARED,
        [VM_SHARED | VM_EXEC]                        = PAGE_READONLY_EXEC,
        [VM_SHARED | VM_EXEC | VM_READ]              = PAGE_READONLY_EXEC,
        [VM_SHARED | VM_EXEC | VM_WRITE]             = PAGE_SHARED_EXEC,
        [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]   = PAGE_SHARED_EXEC
    };
    
    

    PAGE_COPY is a suspicious name given what we know already. It is defined as:

    #define PAGE_COPY	     __pg(__PP|   0|_USR|___A|__NX|   0|   0|   0)
    

    whereas for a shared writable mapping, PAGE_SHARED is defined as:

    #define PAGE_SHARED	     __pg(__PP|__RW|_USR|___A|__NX|   0|   0|   0)
    

    Note that PAGE_COPY doesn't have _RW! Remember earlier there is some code which pretends our VMA is not VM_SHARED if we want to be ‘notified’ of writes? This is why — non-shared mappings don't get to have write permission set. I also wish this code is not so convoluted.

    Ok, what does this leave us with? We can find a way to make vma_wants_writenotify return true, then somehow get the VMA to ‘re-sync’ its page permissions. Alternatively, even easier, we can override vma->vm_page_prot ourselves, then call change_protection without passing the MM_CP_TRY_CHANGE_WRITABLE flag. (That flag basically tells it to make the page writable for exclusively-held, i.e. refcount=0, pages)

    diff --git a/kernel/ick.c b/kernel/ick.c
    index c242185f519f..6e45ce5c0608 100644
    --- a/kernel/ick.c
    +++ b/kernel/ick.c
    @@ -4,6 +4,52 @@
    #include <linux/errno.h>
    #include <linux/ptrace.h>
    #include <linux/sched/task_stack.h>
    +#include <asm/tlb.h>
    +
    +// Mark pages as write-protected
    +static void mark_pages(void) {
    + struct mm_struct *mm = current->mm;
    + // Kernel threads don't have their own `mm`, and we can't checkpoint a kernel
    + // thread anyway
    + BUG_ON(!mm);
    + struct vm_area_struct *vma;
    + struct mmu_gather tlb;
    + VMA_ITERATOR(vmi, mm, 0);
    + pgprotval_t vm_page_prot;
    +
    + if (mmap_write_lock_killable(mm)) {
    + // If we return from here, we're getting killed anyway
    + return;
    + }
    + tlb_gather_mmu(&tlb, mm);
    +
    + for_each_vma(vmi, vma) {
    + if (!(vma->vm_flags & VM_WRITE)) {
    + // The VMA can still have VM_MAYWRITE, which means that a future mprotect
    + // call can make it writable (for example because the underlying file is
    + // opened as writable). For now we don't care - we will block all
    + // memory-related syscalls anyway.
    + trace_printk("Skipping non-writable VMA %lx-%lx (%s)\n",
    + vma->vm_start, vma->vm_end,
    + vma->vm_file ? (char *)vma->vm_file->f_path.dentry->d_iname : "anon");
    + continue;
    + }
    +
    + trace_printk("Marking VMA %lx-%lx (%lu KiB) as read-only\n",
    + vma->vm_start, vma->vm_end, (vma->vm_end - vma->vm_start) / 1024);
    +
    + vma_start_write(vma);
    + // Maybe there's a more standard way to do this? But various useful
    + // functions in mm/memory.c are static
    + vm_page_prot = vma->vm_page_prot.pgprot;
    + vm_page_prot &= ~_PAGE_RW;
    + WRITE_ONCE(vma->vm_page_prot.pgprot, vm_page_prot);
    + change_protection(&tlb, vma, vma->vm_start, vma->vm_end, 0);
    + }
    +
    + tlb_finish_mmu(&tlb);
    + mmap_write_unlock(mm);
    +}
    int ick_checkpoint_proc(void) {
    struct ick_checked_process *ick_data;
    @@ -28,6 +74,8 @@ int ick_checkpoint_proc(void) {
    #error "Unsupported architecture"
    #endif
    + mark_pages();
    +
    // TODO: implement rest
    current->ick_data = ick_data;

    Using the same command we used in the last section we can print the memory maps, and verify that the correct areas are getting write-protected:

    root@42bd2e1ffd95:/# trace-cmd record -p nop \
      -e console /usr/bin/gdb -q --batch ./my-hackme \
        -ex 'starti' -ex 'b main' -ex 'c' \
        -ex 'info proc mappings' \
        -ex 'c'
    [    5.489927][   T90] execveat: /my-hackme[90] to be hacked
    
    Program stopped.
    0x00000000004061c0 in _start ()
    Breakpoint 1 at 0x4062e5: file my-hackme.cpp, line 7.
    
    Breakpoint 1, main (argc=1, argv=0x7fffffffecf8) at my-hackme.cpp:7
    7	int main(int argc, char const *argv[]) {
    process 90
    Mapped address spaces:
    
              Start Addr           End Addr       Size     Offset  Perms  objfile
                0x400000           0x401000     0x1000        0x0  r--p   /my-hackme
                0x401000           0x583000   0x182000     0x1000  r-xp   /my-hackme
                0x583000           0x5d8000    0x55000   0x183000  r--p   /my-hackme
                0x5d8000           0x5e4000     0xc000   0x1d7000  r--p   /my-hackme
                0x5e4000           0x5e6000     0x2000   0x1e3000  rw-p   /my-hackme
                0x5e6000           0x5ef000     0x9000        0x0  rw-p
                0x5ef000           0x611000    0x22000        0x0  rw-p   [heap]
          0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0  r--p   [vvar]
          0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0  r-xp   [vdso]
          0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
      0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]
    [    5.546769][   T90] hack: 0 gave different output
    Enter number:
    Program received signal SIGSEGV, Segmentation fault.
    0x000000000000003a in ?? ()
    CPU0 data recorded at offset=0xa4000
        675 bytes in size (4096 uncompressed)
    CPU1 data recorded at offset=0xa5000
        0 bytes in size (0 uncompressed)
    root@42bd2e1ffd95:/# trace-cmd report
    cpus=2
           my-hackme-90    [000]     5.498556: console:              execveat: /my-hackme[90] to be hacked
           ...
           my-hackme-90    [000]     5.555393: bprint:               ksys_write: hacked process attempted write with data Enter number:
           my-hackme-90    [000]     5.555397: console:              hack: 0 gave different output
           my-hackme-90    [000]     5.556869: bprint:               __handle_mm_fault: faulting on non-present anonymous page 604000
           my-hackme-90    [000]     5.556871: bprint:               ksys_read: ick checkpoint on hacked process my-hackme[90]
           my-hackme-90    [000]     5.556872: bprint:               mark_pages: Skipping non-writable VMA 400000-401000 (my-hackme)
           my-hackme-90    [000]     5.556873: bprint:               mark_pages: Skipping non-writable VMA 401000-583000 (my-hackme)
           my-hackme-90    [000]     5.556873: bprint:               mark_pages: Skipping non-writable VMA 583000-5d8000 (my-hackme)
           my-hackme-90    [000]     5.556873: bprint:               mark_pages: Skipping non-writable VMA 5d8000-5e4000 (my-hackme)
           my-hackme-90    [000]     5.556874: bprint:               mark_pages: Marking VMA 5e4000-5e6000 (8 KiB) as read-only
           my-hackme-90    [000]     5.556875: bprint:               mark_pages: Marking VMA 5e6000-5ef000 (36 KiB) as read-only
           my-hackme-90    [000]     5.556876: bprint:               mark_pages: Marking VMA 5ef000-611000 (136 KiB) as read-only
           my-hackme-90    [000]     5.556877: bprint:               mark_pages: Skipping non-writable VMA 7ffff7ff9000-7ffff7ffd000 (anon)
           my-hackme-90    [000]     5.556877: bprint:               mark_pages: Skipping non-writable VMA 7ffff7ffd000-7ffff7fff000 (anon)
           my-hackme-90    [000]     5.556877: bprint:               mark_pages: Marking VMA 7ffffffde000-7ffffffff000 (132 KiB) as read-only
           my-hackme-90    [000]     5.556878: bprint:               ick_checkpoint_proc: ick: Checkpointed my-hackme[90]
           my-hackme-90    [000]     5.556878: bprint:               ksys_read: Providing number 1 to hacked process my-hackme[90]
           my-hackme-90    [000]     5.556880: bprint:               do_wp_page: faulting on write-protected page 603000
           my-hackme-90    [000]     5.556881: bprint:               do_wp_page: faulting on write-protected page 5e4000
           my-hackme-90    [000]     5.556882: bprint:               do_wp_page: faulting on write-protected page 7fffffffd000
           my-hackme-90    [000]     5.556957: bprint:               do_wp_page: faulting on write-protected page 5ef000
           my-hackme-90    [000]     5.557387: bprint:               do_wp_page: faulting on write-protected page 5e6000
           my-hackme-90    [000]     5.557388: bprint:               do_wp_page: faulting on write-protected page 602000
           my-hackme-90    [000]     5.557390: bprint:               do_wp_page: faulting on write-protected page 5e9000
           my-hackme-90    [000]     5.557457: bprint:               do_wp_page: faulting on write-protected page 5e7000
           my-hackme-90    [000]     5.557457: bprint:               do_wp_page: faulting on write-protected page 5e8000
           my-hackme-90    [000]     5.557719: bprint:               ksys_write: hacked process attempted write with data Nope! 1 was a wrong guess. The correct number is 1118849.
           my-hackme-90    [000]     5.557719: bprint:               ick_revert_proc: Restored process my-hackme[90]
    

    It's still segfaulting as we aren't actually copying off and restoring the memory, but we are seeing prints from our earlier modified do_wp_page function after the checkpoint. We're now excitingly close to our final goal!

    Let's actually copy the pages, then restore them!

    The kernel includes definition for some very useful data structures, one of which is a red-black tree — a type of binary search tree commonly used to implement key-value stores. In our case, we can use it to map the user space's modified page addresses to (pointer to) the page's original content, at the time of the checkpoint. There is very good kernel documentation for using this data structure, with code examples.

    Blocking off other syscalls

    Final thoughts

    When I was a lot younger, there was this book which attempted to help the reader understand Linux by listing a bunch of source codes and make commentaries on it. It started from bootup. I couldn't even get through a chapter. I think for me, I needed to play with stuff to understand. Try to implement something (like in this case a checkpoint feature), play with kgdb, etc.

    Also learned about stuff incrementally by reading code, which would otherwise be overwhelming if I tried to read up on them at once - like folio etc

    Addendum: Hacking it with ptrace

    Earlier I mentioned there are alternative approaches, and specifically ones that rely on ptrace. Indeed, we didn't have to do this in the kernel, and we don't even have to write code to save/restore state ourselves either.

    A natural and simpler thing to try might be to somehow force the process to fork before reading the input, and check it only in the child (while the parent is paused). This way, the parent effectively always has the ‘clean’ state. If an input fails, we can simply kill the child, let the parent fork again, and try another input, without having it go through the 200ms initialization each time.

    While this does sounds quite tricky, the ptrace API, which is what debuggers uses behind the scene, allows a tracer to do a lot of interesting things to its child, like hooking and injecting syscalls, rewriting memory, etc. And so it is not difficult to imagine a C program which uses ptrace to do this, feeding the child input by writing directly to the buffer provided to read (and setting the right register to return the correct value).

    A program like this could also fake /proc/self/status and the result of ptrace(PTRACE_TRACEME), thus defeating the debugger detection.

    Now, doing all this forking isn't free. With such short runtime per iteration (5-10us), the overhead of fork and the context switch between the supervisor, the parent to be forked, and the forked child process quickly dominates the time spent. I created a simple C program fork-test.c which measures time elapsed between just calling fork and it returning in the child, and from my testing, forking even this simple process usually takes ~50-150us. This is without the involvement of any ptrace supervisor, the process has less memory mappings than the real binary we're testing, and we're not writing or reading its memory from a remote process. It is reasonable to assume that if we were to actually do the proposed ptrace-based experiment, one iteration including handling read/write will take at least 100us on average (probably more).

    With 100us per iteration, searching through all the inputs will take 5 minutes, which is not bad at all. The result does depends on time though, and for certain time periods there won't always be a valid input. This means that if we don't feel like sitting behind the computer waiting for potentially tens of minutes, we need to either parallelize it by a lot (with different processes trying different ranges of input), or advance our system clock to the future, so that when we come back from our half-hour coffee break the solution is not expired already.

    I have tried but have not been successful in coercing ChatGPT (o1) to give me an out-of-the-box working solution for this – it is tripping over getting the child to behave correctly and I ran out of patience debugging.

    Saving the state and restoring, without forking a new process, is also doable with ptrace, since we can read and write the memory of our traced process, and get/set registers. One way we can do this is to save off the content of all writable memory regions (which is not very large in this case), and restore them along with the registers after each iteration. If we want to go even further, we can use userfaultfd to monitor which pages are written to after we inject our test input, and only restore those pages. For a program which has a lot more state than a few MB, this could be quite worthwhile.

    Since ChatGPT already failed me once, I'm not going to try and convince it to make this work either. In principle it should all work, but this is not the approach I ended up using.

    Discuss on GitHub

    Footnotes

    « home