The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki (pi3@pi3.com.pl)
Date:     May 27, 2020

First things first – short history:

In 2019 Tejun Heo discovered a race condition in the lifetime management of cgroup_bpf which could result in a double-free and other memory corruptions. The bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to a use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/20191227215034.3169624-1-guro@fb.com/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and testing, one of my VMs kept generating a kernel crash during the shutdown procedure. This specific machine was running the newest kernel at that time (5.7.x), which I had compiled with full debug information as well as the SLAB debugging features. When I analyzed the crash, it had nothing to do with LKRG. Later I confirmed that kernels without LKRG always hit the same issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

PID 1060499 is an sshd child:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: pi3@pts/3
...

The crash happens in the function “__cgroup_bpf_run_filter_skb”, exactly at this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX is 0000000000000000. However, when I was playing with the repro under SLAB_DEBUG, I was often getting RAX: 6b6b6b6b6b6b6b6b:

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
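
The 0x6b byte pattern is the slab poison value written over freed objects when slab debugging is enabled, so a register full of 0x6b is a strong hint that we are dereferencing already freed memory. For reference, the relevant constants (an excerpt from include/linux/poison.h, quoted from memory, so treat it as illustrative):

/* Poison values used by the slab allocators when debugging is enabled */
#define POISON_INUSE	0x5a	/* for use-uninitialised poisoning */
#define POISON_FREE	0x6b	/* for use-after-free poisoning */
#define POISON_END	0xa5	/* end-byte of poisoning */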

So we are dealing with some kind of Use-After-Free bug, and it is triggerable from user mode. I looked at the binary under IDA:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code references the cgroup attached to the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
    ...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
				cgrp->bpf.effective[type], skb,
				__bpf_prog_run_save_cb);
    ...
	}
    ...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS
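
For reference, the first few values of enum bpf_attach_type (an excerpt from include/uapi/linux/bpf.h, quoted from memory; the remaining values are elided):

enum bpf_attach_type {
	BPF_CGROUP_INET_INGRESS,	/* == 0 */
	BPF_CGROUP_INET_EGRESS,		/* == 1, our case (R15) */
	BPF_CGROUP_INET_SOCK_CREATE,	/* == 2 */
	/* ... */
};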

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

points to a freed object: the load at +392 (mov 0x3e8(%rbp),%rax) reads cgrp->bpf.effective[BPF_CGROUP_INET_EGRESS] from the freed cgroup (0x3e0 + 1*8 = 0x3e8), and the faulting instruction then dereferences whatever garbage was stored there. However, the kernel still keeps eBPF rules attached to the socket via its cgroup. When the process (sshd) dies (the do_exit() call) and cleanup is executed, all of its sockets are closed. If such a socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with this logic and code path. The real problem is that the cgroup disappeared while it still had active clients. How is that even possible? Just before the crash I can see the following entry in the kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I tested the same scenario a few times and got the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to release the bpf cgroup data. It looks like it is being called while there are still active sockets attached to the cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So when a cgroup dies, all potential clients are auto-detached. However, they might not be aware of that situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is automatically executed when cgrp->bpf.refcnt drops to zero. However, in the warning printed just before the kernel crashed, we saw that this reference counter went below 0. The cgroup was already freed.
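
For context, the percpu_ref API contract works roughly like this (a minimal sketch of how I understand it; the example_* names are mine, not code taken from this bug):

#include <linux/gfp.h>
#include <linux/percpu-refcount.h>

static struct percpu_ref example_ref;

/* Called exactly once, when the counter finally drops to zero. */
static void example_release(struct percpu_ref *ref)
{
	/* last user is gone; it is now "safe" to tear down the owning object */
}

static int example_init(void)
{
	/* counter starts at 1 (the initial reference held by the owner) */
	return percpu_ref_init(&example_ref, example_release, 0, GFP_KERNEL);
}

static void example_user(void)
{
	percpu_ref_get(&example_ref);	/* one get per active user (e.g. per socket) */
	/* ... use the protected object ... */
	percpu_ref_put(&example_ref);	/* release callback fires when the count hits zero */
}

static void example_owner_teardown(void)
{
	/* drops the initial reference and switches the counter to atomic mode;
	 * this is why the warning above comes from percpu_ref_switch_to_atomic_rcu() */
	percpu_ref_kill(&example_ref);
}

A large negative value like the ones in the warnings above therefore suggests that percpu_ref_put() was called more times than percpu_ref_get(), i.e. the reference counting around cgrp->bpf got unbalanced.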

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which Alexei had pointed out earlier. I prepared a patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to generate the kernel crash every time I rebooted the machine (100% repro). After applying the patch, the crash ratio dropped to around 30%. However, I was still able to hit the same code path and generate a kernel dump. The patch indeed helps, but it does not seem to address the root cause, since I can still hit the crash (just much less often).

I stepped back and looked again at where the bug really is. The corrupted pointer (struct cgroup *) comes from this line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

This code is related to CONFIG_SOCK_CGROUP_DATA. The Linux source has an interesting comment about it in the “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information about whether net_prio or net_cls has started being used; in that case sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. In our crash we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}
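
In our dump the lowest bit of val is clear (is_data == 0), so the accessor returns the raw pointer stored in val, i.e. the already freed struct cgroup. For reference, a simplified sketch of the pre-5.8 accessor from include/linux/cgroup-defs.h, reconstructed from memory, so the exact code may differ slightly:

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
	unsigned long v;

	/* the lower word is enough; it is read and written atomically */
	v = READ_ONCE(skcd->val);

	if (v & 1)	/* lowest bit set: (prioidx, classid) mode */
		return &cgrp_dfl_root.cgrp;

	/* lowest bit clear: val is a raw struct cgroup pointer */
	return (struct cgroup *)v ?: &cgrp_dfl_root.cgrp;
#else
	return (struct cgroup *)(unsigned long)skcd->val;
#endif
}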

The described socket keeps, in its “sk_cgrp_data” field, the information that it is “attached” to a cgroup v2. However, that cgroup has already been destroyed.
Now we have all the information needed to solve the mystery of this bug:

  1. A process creates a socket and both of them live inside some cgroup v2 (non-root)
    • cgroup BPF is cgroup v2 only
  2. At some point net_prio or net_cls starts being used:
    • this operation disables cgroup2 socket matching
    • from now on, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
  3. The socket is cloned, but the reference to the cgroup (ref: point 1) is not
    • this essentially moves the socket to the new cgroup
  4. All tasks in the old cgroup (ref: point 1) die, and when this happens, that cgroup dies as well
  5. When the original process starts to “use” the socket, it may attempt to access the cgroup which is already “dead”. This essentially generates a Use-After-Free condition
    • in my specific case, the process was killed or invoked exit()
    • during the execution of the do_exit() function, all file descriptors and all sockets are closed
    • one of the sockets still points to the previously destroyed cgroup2 BPF (OpenSSH may install BPF programs)
    • __cgroup_bpf_run_filter_skb() runs the attached BPF program and we have a Use-After-Free

To confirm that scenario, I’ve modified some of the Linux kernel sources:

  1. Function cgroup_sk_alloc_disable():
    • I’ve added dump_stack(); (see the sketch below)
  2. Function cgroup_bpf_release():
    • I’ve moved the mutex so that it also guards the code walking through the cgroup hierarchy
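
The instrumentation from point 1 boils down to something like this (cgroup_sk_alloc_disable() from kernel/cgroup/cgroup.c as I recall it in v5.7, with my debugging change marked; treat the exact body as approximate):

static void cgroup_sk_alloc_disable(void)
{
	if (cgroup_sk_alloc_disabled)
		return;

	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
	dump_stack();	/* <-- added for debugging: who disabled the matching? */
	cgroup_sk_alloc_disabled = true;
}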

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn() is executed multiple times. All cgroup BPF entries have been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  <IRQ>
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  </IRQ>
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

We have our crash referencing freed memory. 

First CVE – CVE-2020-14356:

I decided to report this issue to the Linux kernel security mailing list around mid-July 2020. Roman Gushchin replied to my report and suggested verifying whether I could still reproduce the issue with commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) applied. This commit had been merged into the Linux kernel git tree just a few days before my report. I carefully verified it and it indeed fixed the problem. However, commit ad0f75e5f57c is not fully complete and the follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.


After this conversation, Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I decided to apply for a CVE number (through RedHat) to track this issue:

  1. CVE-2020-14356 was allocated to track this issue
  2. For some unknown reason, this bug was classified as a NULL pointer dereference 🙂

RedHat correctly acknowledged this issue as a Use-After-Free and described it as such in their own advisory and Bugzilla entry.

However, in the CVE MITRE portal we can see a very inaccurate description:

  • “A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not a NULL pointer dereference but a Use-After-Free bug. Maybe it was misclassified based on this open bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of a NULL-pointer-dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I’ve raised that concern with CVE MITRE, but the invalid description is still there. There is also a short Twitter discussion about it here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During the analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to the LTS kernels, Brad noticed that it conflicted with his pre-existing backport and that the upstream backport looked incorrect. I was surprised, since I had reviewed the original commit for the mainline kernel (5.7) and it was fine. With this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, the backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in a reference counting bug and, in the end, in a Use-After-Free again. It looks like the backported patch for the stable kernels is still buggy.

I contacted RedHat again and they started to provide correct patches for their own kernels. However, the LTS kernels were still buggy. I also asked for a separate CVE to be assigned to this issue, but RedHat suggested that I request it myself.

After that, I went on vacation and forgot about this issue 🙂 Recently, I decided to apply for a CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth pointing out that at some point someone from Huawei realized that the patch was wrong, and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

It is worth mentioning that the grsecurity backport was never affected by CVE-2020-25220.

Summary:

The original issue, tracked by CVE-2020-14356, affects kernels from 4.5 up to 5.7.10.

  • RedHat correctly fixed all their kernels and has a proper description of the bug
  • CVE MITRE still has an invalid and misleading description

The badly backported patch, tracked by CVE-2020-25220, affects the following LTS kernels:

  • 4.19 until version 4.19.140 (exclusive)
  • 4.14 until version 4.14.194 (exclusive)
  • 4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by CVE-2020-25220


Best regards,
Adam ‘pi3’ Zabrocki
