The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)
Name: Linux kernel Cgroup BPF Use-After-Free
Author: Adam Zabrocki (pi3@pi3.com.pl)
Date: May 27, 2020
First things first – short history:
In 2019, Tejun Heo discovered a race condition affecting the lifetime of cgroup_bpf which could result in a double-free and other memory corruption. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:
https://lore.kernel.org/patchwork/patch/1094080/
Roman Gushchin discovered another problem with the newly fixed code which could lead to a use-after-free vulnerability. His report and fix can be found here:
https://lore.kernel.org/bpf/20191227215034.3169624-1-guro@fb.com/
During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:
https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/
However, Roman and Alexei concluded that it shouldn’t be a problem:
https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/
Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.
The “new” bug – details (a lot of details ;-)):
During LKRG development and testing, one of my VMs kept generating a kernel crash during the shutdown procedure. This specific machine was running the newest kernel at that time (5.7.x), which I had compiled with full debug information and the SLAB debugging features enabled. When I analyzed the crash, it had nothing to do with LKRG. Later I confirmed that kernels without LKRG always hit that issue as well:
KERNEL: linux-5.7/vmlinux
DUMPFILE: /var/crash/202006161848/dump.202006161848 [PARTIAL DUMP]
CPUS: 1
DATE: Tue Jun 16 18:47:40 2020
UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
TASKS: 234
NODENAME: oi3
RELEASE: 5.7.0-g4
VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
MACHINE: x86_64 (3694 Mhz)
MEMORY: 8 GB
PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
PID: 1060499
COMMAND: "sshd"
TASK: ffff9d8c36b33040 [THREAD_INFO: ffff9d8c36b33040]
CPU: 0
STATE: (PANIC)
crash> bt
PID: 1060499 TASK: ffff9d8c36b33040 CPU: 0 COMMAND: "sshd"
#0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
#1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
#2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
#3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
#4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
[exception RIP: __cgroup_bpf_run_filter_skb+401]
RIP: ffffffff9423e801 RSP: ffffb0fc41b1fb88 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9d8d56ae1ee0 RCX: 0000000000000028
RDX: 0000000000000000 RSI: ffff9d8e25c40b00 RDI: ffffffff9423e7f3
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
#6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
#7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
#8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
#9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
RIP: 00007fe54ea30136 RSP: 00007fff33413468 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 00007fff334134e0 RCX: 00007fe54ea30136
RDX: 00000000000000ff RSI: 000000000000003c RDI: 00000000000000ff
RBP: 00000000000000ff R8: 00000000000000e7 R9: fffffffffffffdf0
R10: 000055a091a22d09 R11: 0000000000000202 R12: 000055a091d67f20
R13: 00007fe54ea5afa0 R14: 000055a091d7ef70 R15: 000055a091d70a20
ORIG_RAX: 00000000000000e7 CS: 0033 SS: 002b
PID 1060499 is a child of sshd:
...
root 5462 0.0 0.0 12168 7276 ? Ss 04:38 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root 1060499 0.0 0.1 13936 9056 ? Ss 17:51 0:00 \_ sshd: pi3 [priv]
pi3 1062463 0.0 0.0 13936 5852 ? S 17:51 0:00 \_ sshd: pi3@pts/3
...
The crash happens in the function “__cgroup_bpf_run_filter_skb”, specifically in this piece of code:
0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq 0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq 0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi
where RAX is 0000000000000000. However, when I was playing with the repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b (0x6b is the slab poison byte written over freed objects):
[exception RIP: __cgroup_bpf_run_filter_skb+401]
RIP: ffffffff9123e801 RSP: ffffb136c16ffb88 RFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: ffff9ce3e5a0e0e0 RCX: 0000000000000028
RDX: 0000000000000000 RSI: ffff9ce3de26b280 RDI: ffffffff9123e7f3
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
So we have some kind of Use-After-Free bug, and it is triggerable from user mode. I’ve looked at the binary under IDA:
.text:FFFFFFFF8123E7EE skb = rbx ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15 ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi ; sock *
.text:FFFFFFFF8123E7EE call near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3 call near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8 mov ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF xor ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp ; u32
.text:FFFFFFFF8123E801 mov rdi, [ret+10h] ; prog
.text:FFFFFFFF8123E805 lea r14, [ret+10h]
and this code references the cgroup taken from the socket. Source code:
int __cgroup_bpf_run_filter_skb(struct sock *sk,
                                struct sk_buff *skb,
                                enum bpf_attach_type type)
{
        ...
        struct cgroup *cgrp;
        ...
        ...
        cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
        ...
        if (type == BPF_CGROUP_INET_EGRESS) {
                ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
                        cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
        ...
        ...
}
Debugger:
crash> x/4i 0xffffffff9423e7f8
0xffffffff9423e7f8: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff: xor %ebp,%ebp
0xffffffff9423e801: mov 0x10(%rax),%rdi
0xffffffff9423e805: lea 0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
struct bpf_prog_array *effective[28];
struct list_head progs[28];
u32 flags[28];
struct bpf_prog_array *inactive;
struct percpu_ref refcnt;
struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
{
{
is_data = 0x0,
padding = 0x68,
prioidx = 0xe241,
classid = 0xffff9ce3
},
val = 0xffff9ce3e2416800
}
}
We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS
crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800: 0x6b6b6b6b6b6b016b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890: 0x6b6b6b6b6b6b6b6b 0x6b6b6b6b6b6b6b6b
crash>
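Putting the offsets together (my annotation, based purely on the structure layout and crash data shown above), the faulting sequence is simply the compiled form of cgrp->bpf.effective[type] followed by the first program lookup:

/*
 * RBP = cgrp (returned by sock_cgroup_ptr())
 * offsetof(struct cgroup, bpf)  == 0x3e0           (from crash above)
 * effective[] is the first member of struct cgroup_bpf
 * type == BPF_CGROUP_INET_EGRESS == 1, sizeof(void *) == 8
 *
 * mov 0x3e8(%rbp),%rax    ; rax = cgrp->bpf.effective[1]   (0x3e0 + 1*8)
 * mov 0x10(%rax),%rdi     ; rdi = effective[1]->items[0].prog
 *                         ;       (items[] start right behind the 0x10-byte
 *                         ;        bpf_prog_array header, see sizeof above)
 *
 * With the cgroup freed and poisoned, effective[1] reads back as
 * 0x6b6b6b6b6b6b6b6b, and dereferencing it at +0x10 generates the fault.
 */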
This pointer (struct cgroup *)
cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
points to a freed object. However, the kernel still keeps the eBPF rules attached to the socket under that cgroup. When the process (sshd) dies (the do_exit() call) and cleanup is executed, all of its sockets are closed. If such a socket has “pending” packets, the following code path is executed:
do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb
However, there is nothing wrong with this logic or code path itself. The real problem is that the cgroup disappeared while it still had active clients. How is that even possible? Just before the crash I can see the following entry in the kernel logs:
[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G OE 5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS: 0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577] rcu_core+0x1df/0x530
[190820.457598] ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609] __do_softirq+0xfc/0x331
[190820.457629] ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630] run_ksoftirqd+0x21/0x30
[190820.457649] smpboot_thread_fn+0x195/0x230
[190820.457660] kthread+0x139/0x160
[190820.457670] ? __kthread_bind_mask+0x60/0x60
[190820.457671] ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---
I tested the same scenario a few times and got the following results:
percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic
Let’s look at this function:
/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
        struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

        INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
        queue_work(system_wq, &cgrp->bpf.release_work);
}
So that’s the callback used to schedule releasing of the bpf cgroup data. It looks like it is being called while there may still be active sockets attached to the cgroup:
/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
        struct cgroup *p, *cgrp = container_of(work, struct cgroup,
                                               bpf.release_work);
        struct bpf_prog_array *old_array;
        unsigned int type;

        mutex_lock(&cgroup_mutex);

        for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
                struct list_head *progs = &cgrp->bpf.progs[type];
                struct bpf_prog_list *pl, *tmp;

                list_for_each_entry_safe(pl, tmp, progs, node) {
                        list_del(&pl->node);
                        if (pl->prog)
                                bpf_prog_put(pl->prog);
                        if (pl->link)
                                bpf_cgroup_link_auto_detach(pl->link);
                        bpf_cgroup_storages_unlink(pl->storage);
                        bpf_cgroup_storages_free(pl->storage);
                        kfree(pl);
                        static_branch_dec(&cgroup_bpf_enabled_key);
                }
                old_array = rcu_dereference_protected(
                                cgrp->bpf.effective[type],
                                lockdep_is_held(&cgroup_mutex));
                bpf_prog_array_free(old_array);
        }

        mutex_unlock(&cgroup_mutex);

        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
}
while:
static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
        cgroup_put(link->cgroup);
        link->cgroup = NULL;
}
So when a cgroup dies, all of its potential clients are auto-detached. However, they might not be aware of that. When is cgroup_bpf_release_fn() executed?
/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
        ...
        ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
                              GFP_KERNEL);
        ...
}
The release callback is executed automatically once cgrp->bpf.refcnt drops to zero. However, in the warning logs printed before the kernel crashed, we saw that this reference counter went far below 0. The cgroup had already been freed.
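For context, bpf.refcnt is a regular percpu_ref, and the get/put helpers around it are trivial. The sketch below is my paraphrase of the 5.7 sources (include/linux/cgroup.h and kernel/bpf/cgroup.c), shown only to illustrate when the release callback can fire; consult the actual tree for the exact code:

/* Paraphrased from the 5.7 sources -- details may differ slightly. */
static inline void cgroup_bpf_get(struct cgroup *cgrp)
{
        percpu_ref_get(&cgrp->bpf.refcnt);
}

static inline void cgroup_bpf_put(struct cgroup *cgrp)
{
        percpu_ref_put(&cgrp->bpf.refcnt);
}

/* Called when the cgroup goes offline: kills the percpu_ref, dropping the
 * initial reference. Once every cgroup_bpf_get() has been balanced by a
 * cgroup_bpf_put(), the counter reaches zero and percpu_ref invokes
 * cgroup_bpf_release_fn(). */
void cgroup_bpf_offline(struct cgroup *cgrp)
{
        cgroup_get(cgrp);
        percpu_ref_kill(&cgrp->bpf.refcnt);
}

A counter ending up at -70581 therefore means tens of thousands of puts without a matching get: something kept dropping references to the cgroup BPF data without ever having taken them, so the release callback fired while users were still around.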
Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which is exactly what Alexei pointed out. I prepared a patch and recompiled the kernel:
$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c 2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c 2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                 bpf_prog_array_free(old_array);
         }
-        mutex_unlock(&cgroup_mutex);
-
         for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                 cgroup_bpf_put(p);
+        mutex_unlock(&cgroup_mutex);
+
         percpu_ref_exit(&cgrp->bpf.refcnt);
         cgroup_put(cgrp);
 }
Interestingly, without this patch I was able to generate this kernel crash every time I rebooted the machine (100% repro). After applying this patch, the crash ratio dropped to around 30%. However, I was still able to hit the same code path and generate a kernel dump. The patch indeed helps, but it is clearly not the real problem, since I could still hit the crash (just much less often).
I stepped back and looked again at where the bug really is. The corrupted pointer (struct cgroup *) comes from this line:
cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
This code is related to CONFIG_SOCK_CGROUP_DATA. The Linux source has an interesting comment about it in the “cgroup-defs.h” file:
/*
* sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
* per-socket cgroup information except for memcg association.
*
* On legacy hierarchies, net_prio and net_cls controllers directly set
* attributes on each sock which can then be tested by the network layer.
* On the default hierarchy, each sock is associated with the cgroup it was
* created in and the networking layer can match the cgroup directly.
*
* To avoid carrying all three cgroup related fields separately in sock,
* sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
* On boot, sock_cgroup_data records the cgroup that the sock was created
* in so that cgroup2 matches can be made; however, once either net_prio or
* net_cls starts being used, the area is overriden to carry prioidx and/or
* classid. The two modes are distinguished by whether the lowest bit is
* set. Clear bit indicates cgroup pointer while set bit prioidx and
* classid.
*
* While userland may start using net_prio or net_cls at any time, once
* either is used, cgroup2 matching no longer works. There is no reason to
* mix the two and this is in line with how legacy and v2 compatibility is
* handled. On mode switch, cgroup references which are already being
* pointed to by socks may be leaked. While this can be remedied by adding
* synchronization around sock_cgroup_data, given that the number of leaked
* cgroups is bound and highly unlikely to be high, this seems to be the
* better trade-off.
*/
and later:
/*
* There's a theoretical window where the following accessors race with
* updaters and return part of the previous pointer as the prioidx or
* classid. Such races are short-lived and the result isn't critical.
*/
This means that sock_cgroup_data “carries” the information of whether net_prio or net_cls has started being used; in that case the (prioidx, classid) pair overrides the cgroup pointer inside sock_cgroup_data. From our crash we can extract this information:
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
{
{
is_data = 0x0,
padding = 0x68,
prioidx = 0xe241,
classid = 0xffff9ce3
},
val = 0xffff9ce3e2416800
}
}
The described socket keeps in “sk_cgrp_data” a pointer recording that it is “attached” to cgroup2. However, that cgroup has already been destroyed.
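For reference, the accessor doing this decoding distinguishes the two modes by the lowest bit of skcd->val; the sketch below is paraphrased from the 5.7 include/linux/cgroup.h (the exact code may differ slightly):

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
        unsigned long v;

        /* Lowest bit set: (prioidx, classid) mode -- fall back to the default
         * root cgroup. Lowest bit clear: val is a raw struct cgroup pointer. */
        v = READ_ONCE(skcd->val);

        if (v & 1)
                return &cgrp_dfl_root.cgrp;

        return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
#else
        return (struct cgroup *)(unsigned long)skcd->val;
#endif
}

In our dump is_data == 0 (the lowest bit is clear), so val == 0xffff9ce3e2416800 is returned as the struct cgroup pointer, and that is exactly the freed, 0x6b-poisoned object shown above.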
Now we have all the information to solve the mystery of this bug:
- A process creates a socket, and both of them live in some non-root cgroup v2
- cgroup BPF works only with cgroup v2
- At some point net_prio or net_cls starts being used:
  - this operation disables cgroup2 socket matching
  - from now on, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
- The socket is cloned, but the reference to the cgroup (ref: point 1) is not taken (see the sketch after this list)
  - this essentially moves the socket to the new cgroup
- All tasks in the old cgroup (ref: point 1) eventually die, and when this happens, that cgroup dies as well
- When the original process starts to “use” the socket, it may access a cgroup which is already “dead”. This essentially creates a Use-After-Free condition
  - in my specific case, the process was killed or invoked exit()
  - during execution of the do_exit() function, all file descriptors and all sockets are closed
  - one of the sockets still points to the previously destroyed cgroup2 BPF (OpenSSH may install BPF programs)
  - __cgroup_bpf_run_filter_skb() runs the attached BPF program and we have a Use-After-Free
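The key step is the cloning one. Before the fix discussed below (commit ad0f75e5f57c), the clone path went through cgroup_sk_alloc(), which returns early once cgroup2 socket matching has been disabled. A heavily abbreviated sketch of the pre-fix kernel/cgroup/cgroup.c logic (my reconstruction, details may differ):

void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
        /* Once net_prio/net_cls flipped this switch, we return early ... */
        if (cgroup_sk_alloc_disabled)
                return;

        /* ... so a cloned socket (sk_clone_lock() copies sk_cgrp_data and then
         * calls us) keeps the inherited cgroup pointer in skcd->val WITHOUT
         * taking the references below. */

        /* Socket clone path */
        if (skcd->val) {
                cgroup_get(sock_cgroup_ptr(skcd));
                cgroup_bpf_get(sock_cgroup_ptr(skcd));
                return;
        }

        /* ... allocation path for brand-new sockets elided ... */
}

Closing such a cloned socket still drops a reference on the socket-release path, so the cgroup BPF refcount goes negative, cgroup_bpf_release() frees the effective[] arrays, and any socket still pointing at that cgroup is left with a dangling pointer.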
To confirm that scenario, I’ve modified some of the Linux kernel sources:
- Function cgroup_sk_alloc_disable():
  - I’ve added a dump_stack() call (see the sketch below)
- Function cgroup_bpf_release():
  - I’ve moved the mutex so it also guards the code walking through the cgroup hierarchy (the patch shown earlier)
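The first modification looked roughly like this (kernel/cgroup/cgroup.c; the pr_info() text is the one visible in the logs below, and the exact placement of dump_stack() is only illustrative):

void cgroup_sk_alloc_disable(void)
{
        if (cgroup_sk_alloc_disabled)
                return;
        pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
        cgroup_sk_alloc_disabled = true;
        dump_stack();   /* added for debugging: who disabled the matching? */
}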
I’ve managed to reproduce this bug again and this is what I can see in the logs:
...
[ 72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[ 72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[ 72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[ 72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 72.121575] Call Trace:
[ 72.121580] dump_stack+0x50/0x70
[ 72.121582] cgroup_sk_alloc_disable.cold+0x11/0x25
^^^^^^^^^^^^^^^^^^^^^^^
[ 72.121584] net_prio_attach+0x22/0xa0
^^^^^^^^^^^^^^^
[ 72.121586] cgroup_migrate_execute+0x371/0x430
[ 72.121587] cgroup_attach_task+0x132/0x1f0
[ 72.121588] __cgroup1_procs_write.constprop.0+0xff/0x140
^^^^^^^^^^^^^^^^^^^^^^
[ 72.121590] kernfs_fop_write+0xc9/0x1a0
[ 72.121592] vfs_write+0xb1/0x1a0
[ 72.121593] ksys_write+0x5a/0xd0
[ 72.121595] do_syscall_64+0x47/0x190
[ 72.121596] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 72.121598] RIP: 0033:0x48abdb
[ 72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[ 72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[ 72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[ 72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[ 72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[ 72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000
As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:
[ 287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[ 287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[ 287.497536] Modules linked in:
[ 287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[ 287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
cgroup_bpf_release_fn() is executed multiple times. All cgroup BPF entries have been deleted and freed. Next:
[ 287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[ 287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G W 5.7.0-g6 #32
[ 287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[ 287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[ 287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[ 287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[ 287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[ 287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[ 287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[ 287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[ 287.544753] FS: 00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[ 287.544833] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[ 287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 287.545129] Call Trace:
[ 287.545167] <IRQ>
[ 287.545204] sk_filter_trim_cap+0x10c/0x250
[ 287.545253] ? nf_ct_deliver_cached_events+0xb6/0x120
[ 287.545308] ? tcp_v4_inbound_md5_hash+0x47/0x160
[ 287.545359] tcp_v4_rcv+0xb49/0xda0
[ 287.545404] ? nf_hook_slow+0x3a/0xa0
[ 287.545449] ip_protocol_deliver_rcu+0x26/0x1d0
[ 287.545500] ip_local_deliver_finish+0x50/0x60
[ 287.545550] ip_sublist_rcv_finish+0x38/0x50
[ 287.545599] ip_sublist_rcv+0x16d/0x200
[ 287.545645] ? ip_rcv_finish_core.constprop.0+0x470/0x470
[ 287.545701] ip_list_rcv+0xf1/0x115
[ 287.545746] __netif_receive_skb_list_core+0x249/0x270
[ 287.545801] netif_receive_skb_list_internal+0x19f/0x2c0
[ 287.545856] napi_complete_done+0x8e/0x130
[ 287.545905] e1000_clean+0x27e/0x600
[ 287.545951] ? security_cred_free+0x37/0x50
[ 287.545999] net_rx_action+0x133/0x3b0
[ 287.546045] __do_softirq+0xfc/0x331
[ 287.546091] irq_exit+0x92/0x110
[ 287.546133] do_IRQ+0x6d/0x120
[ 287.546175] common_interrupt+0xf/0xf
[ 287.546219] </IRQ>
[ 287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10
We have our crash referencing freed memory.
First CVE – CVE-2020-14356:
I decided to report this issue to the Linux kernel security mailing list around mid-July 2020. Roman Gushchin replied to my report and suggested verifying whether I could still reproduce the issue with commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) applied. This commit had been merged into the Linux kernel git tree just a few days before my report. I carefully verified it, and indeed it fixed the problem. However, commit ad0f75e5f57c is not fully complete on its own, and the follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.
After this conversation, Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I decided to apply for a CVE number (through RedHat) to track this issue:
- CVE-2020-14356 was allocated to track this issue
- For some unknown reason, this bug was classified as a NULL pointer dereference 🙂
RedHat correctly acknowledged this issue as a Use-After-Free, and their own description and Bugzilla entry state:
- “A use-after-free flaw was found in the Linux kernel’s cgroupv2 (…)”
https://access.redhat.com/security/cve/cve-2020-14356
- “It was found that the Linux kernel’s use after free issue (…)”
https://bugzilla.redhat.com/show_bug.cgi?id=1868453
However, in CVE MITRE portal we can see a very inaccurate description:
- “A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356
First, it is not a NULL pointer dereference but a Use-After-Free bug. Maybe it was misclassified because of this open bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=208003
where people have started hitting this Use-After-Free bug in the form of a NULL pointer dereference “kernel panic”.
Additionally, the entire description of the bug is wrong. I raised that concern with CVE MITRE, but the invalid description is still there. There is also a short Twitter discussion about it here:
https://twitter.com/Adam_pi3/status/1296212546043740160
Second CVE – CVE-2020-25220:
During the analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to the LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised, since I had reviewed the original commit for the mainline kernel (5.7) and it was fine. With this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e
and it really looks incorrect. Part of the original fix is the following code:
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+        if (skcd->val) {
+                if (skcd->no_refcnt)
+                        return;
+                /*
+                 * We might be cloning a socket which is left in an empty
+                 * cgroup and the cgroup might have already been rmdir'd.
+                 * Don't use cgroup_get_live().
+                 */
+                cgroup_get(sock_cgroup_ptr(skcd));
+                cgroup_bpf_get(sock_cgroup_ptr(skcd));
+        }
+}
However, the backported patch has the following logic:
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+        /* Socket clone path */
+        if (skcd->val) {
+                /*
+                 * We might be cloning a socket which is left in an empty
+                 * cgroup and the cgroup might have already been rmdir'd.
+                 * Don't use cgroup_get_live().
+                 */
+                cgroup_get(sock_cgroup_ptr(skcd));
+        }
+}
There is a missing check:
+ if (skcd->no_refcnt)
+ return;
which could result in a reference counter bug and, in the end, another Use-After-Free. It looks like the backported patch for the stable kernels was still buggy.
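Combining the two snippets above, a correct backport simply needs that early return restored, which is presumably what the eventual LTS fix amounts to (sketch only):

void cgroup_sk_clone(struct sock_cgroup_data *skcd)
{
        /* Socket clone path */
        if (skcd->val) {
                /* Sockets marked no_refcnt do not hold a reference on the
                 * cgroup -- do not take one for the clone either. */
                if (skcd->no_refcnt)
                        return;
                /*
                 * We might be cloning a socket which is left in an empty
                 * cgroup and the cgroup might have already been rmdir'd.
                 * Don't use cgroup_get_live().
                 */
                cgroup_get(sock_cgroup_ptr(skcd));
        }
}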
I contacted RedHat again and they started to provide correct patches for their own kernels. However, the LTS kernels were still buggy. I also asked for a separate CVE to be assigned for that issue, but RedHat suggested that I do it myself.
After that, I went on vacation and forgot about this issue 🙂 Recently, I decided to apply for a CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth pointing out that someone from Huawei realized at some point that the patch was wrong, and LTS got a correct fix as well:
https://www.spinics.net/lists/stable/msg405099.html
It is also worth mentioning that the grsecurity backport was never affected by CVE-2020-25220.
Summary:
The original issue, tracked by CVE-2020-14356, affects kernels starting from 4.5 up to 5.7.10.
- RedHat correctly fixed all their kernels and has a proper description of the bug
- CVE MITRE still has an invalid and misleading description
The badly backported patch, tracked by CVE-2020-25220, affects kernels:
- 4.19 until version 4.19.140 (exclusive)
- 4.14 until version 4.14.194 (exclusive)
- 4.9 until version 4.9.233 (exclusive)
*grsecurity kernels were never affected by CVE-2020-25220
Best regards,
Adam ‘pi3’ Zabrocki