I'm testing out k8s debugging features including debug pods and ephemeral containers, and I just can't work out how to properly map a "target" pod's file system into the debug container.
I want to link two disjoint mount namespaces with a recursive bind mount so container A sees container B's root as /containerB or vice versa, including all volumes and other mounts.
Goal: Access to both debug and target container file systems at the same time
The goal is to have the target pod's full filesystem tree, including volumes and other mounts, mapped to a subdir of the debug container, e.g. /run/target. If the target container mounts persistent volumes, those mount points should be mapped too, so if the target container has /data then the debug container should have a mounted /run/target/data.
Alternately, it'd be ok to "inject" the debug container file system tree into the target container, so there's e.g. a /run/debug that exposes the debug container root when nsentering the debug container, including its mounts like procfs, so it's fully functional.
I want to be able to e.g. gdb -p $target_pid where gdb is provided by the debug container. gdb has to be able to find the process executables from the target container for this.
I've explored a few workaround approaches. But what I really want to do is mount --rbind the target container FS tree onto the guest or vice versa. Given a custom-built privileged debug container like:
apiVersion: v1
kind: Pod
metadata:
  name: debugcontainer
  namespace: default
spec:
  nodeName: TARGET_NODE_NAME_HERE
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  hostPID: true
  restartPolicy: Never
  containers:
  - image: DIAG_CONTAINER_IMAGE_HERE # you can experiment using something like ubuntu:20.04
    name: debugger
    stdin: true
    tty: true
    volumeMounts:
    - mountPath: /target
      name: target
    #- mountPath: /host
    #  mountPropagation: None
    #  name: host-root
    securityContext:
      privileged: true
      runAsGroup: 0
      runAsUser: 0
  volumes:
  - emptyDir: {}
    name: target
  #- hostPath:
  #    path: "/"
  #    type: ""
  #  name: host-root
where the debug container is launched into the same node as the target container, I can:
- See target container processes in ps
- Attach to processes with strace, gdb etc. because the privileged debug container has CAP_SYS_PTRACE
- nsenter -t $some_target_container_pid --all to "become" a proc in the target container, as if I'd done kubectl exec. I can no longer "see" or access the debug container files/tools.
- nsenter -t $some_target_container_pid -m --root=/ --wd=/ to enter the target proc's mount namespace, but retain the privs of the debug container. I can no longer "see" or access the debug container files/tools.
But I cannot:
- See files in the target container at the same time as having access to the tools in the debug container - e.g. gdb can't find the executables being debugged
- See contents of volumes in the target container and apply debug container tools to them
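Before chasing namespace problems it can help to confirm that the debug shell really holds the capability that makes strace and gdb attach work. The following is a minimal sketch, assuming a Linux /proc and that CAP_SYS_PTRACE is capability number 19 (as in linux/capability.h). The has_cap helper is a hypothetical name introduced here, not part of any tool.

```shell
#!/bin/sh
# CAP_SYS_PTRACE is capability number 19 in <linux/capability.h>.
CAP_SYS_PTRACE=19

# has_cap MASK BIT: succeed if bit number BIT is set in the hex mask MASK.
# (hypothetical helper, named here for illustration only)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capability mask as reported in /proc/self/status.
# awk inherits this shell's capabilities; the result is empty without /proc.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)

if [ -n "$cap_eff" ] && has_cap "$cap_eff" "$CAP_SYS_PTRACE"; then
    echo "CAP_SYS_PTRACE is in the effective set"
else
    echo "CAP_SYS_PTRACE is absent or could not be determined"
fi
```

Holding the capability does not fix the file visibility problem described above; it only rules out one reason for attach failures.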
Is there any recognised way to do this?
It's not totally k8s specific: the same issue applies with Docker, containerd, runc, etc.
You might expect this to be possible by using mount --rbind to "inject" the debug container into the target container via the host container namespace using a hostPath volume with mountPropagation: Bidirectional. But containerd mounts the container root image, sets mount propagation to private, then mounts inner volumes. So the host mount namespace doesn't see the mounts made inside the container root image, and procs in the container don't see new mounts added by the host after the container's first process starts. See https://man7.org/linux/man-pages/man7/mount_namespaces.7.html for details.
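The propagation state described above can be read back from a process's mountinfo file, which may help confirm which mounts are private. This is a rough sketch, assuming the mountinfo layout documented in proc(5), where optional fields such as shared:N and master:N sit between field 6 and the "-" separator. The mount_propagation function name is hypothetical.

```shell
#!/bin/sh
# mount_propagation: read mountinfo-format lines on stdin and print
# "mount_point propagation" for each one.
# Optional fields (between field 6 and the "-" separator):
#   shared:N -> shared, master:N -> slave, unbindable; none -> private.
mount_propagation() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         prop = prop (prop ? "," : "") "shared"
            else if ($i ~ /^master:/)    prop = prop (prop ? "," : "") "slave"
            else if ($i == "unbindable") prop = prop (prop ? "," : "") "unbindable"
        }
        print $5, (prop ? prop : "private")
    }'
}

# Show the first few mounts of this shell; for a container process,
# substitute /proc/$some_target_container_pid/mountinfo.
if [ -r /proc/self/mountinfo ]; then
    mount_propagation < /proc/self/mountinfo | head -n 5
fi
```

Run against a pid inside the target container, mounts reported as private are the ones that will neither send nor receive mount events from the host.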
I've tried using nsenter to "cross" mount namespaces, but I can't get a bind mount to work. E.g. in the debug container I can
nsenter -t $some_target_container_pid --root=/ -m /bin/bash
which gives me a shell in which . (CWD) is the debug container rootfs, and / is the target container rootfs. But I can't seem to bind-mount them:
$ mkdir /run/debug
$ mount --rbind . /run/debug
mount: /run/debug: wrong fs type, bad option, bad superblock on ., missing codepage or helper program, or other error.
The same occurs if I use nsenter --wd=/ without --root, and try to mount --rbind / ./run/debug.
I've tried using unshare -m to create a new inner mount namespace first. And I've tried mount --make-rprivate / on the debug container tree before the bind mount. Same deal.
I can't work out why: there's nothing in dmesg and the error is very generic. I'm guessing it's due to the disjoint roots and/or disjoint mount namespaces. It doesn't seem to be due to the kernel's protection against bind mount circularity. And I'm using recursive binds, so it shouldn't be due to the protection against mount tree escapes in linux user namespaces.
An alternative to --rbinding a FS tree would be if I had a way to mount --bind by mount id as shown in /proc/$target_pid/mountinfo. I could then clone all the mounts from the target pid into the debug container's mount namespace. But I can't mount --bind using a normal absolute path, because the target and debug container's mount namespaces are disjoint, and both have subtrees of mounts with private propagation.
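The mount ids mentioned above can at least be enumerated, even though this does not make binding by id possible. Here is a small sketch, assuming the field layout documented in proc(5) (mount id, parent id, major:minor, root, mount point, options, optional fields, "-", filesystem type, ...). The list_mounts name is hypothetical.

```shell
#!/bin/sh
# list_mounts: read mountinfo-format lines on stdin and print
# "mount_id parent_id mount_point fstype" for each mount.
list_mounts() {
    awk '{
        # The filesystem type is the first field after the "-" separator.
        fstype = "?"
        for (i = 7; i <= NF; i++)
            if ($i == "-") { fstype = $(i + 1); break }
        print $1, $2, $5, fstype
    }'
}

# Demonstrate on this shell's own mounts; for the target, read
# /proc/$target_pid/mountinfo instead.
if [ -r /proc/self/mountinfo ]; then
    list_mounts < /proc/self/mountinfo | head -n 5
fi
```

The parent id column shows how the target's mounts nest, which is the tree that would have to be reproduced if a clone-by-id mechanism existed.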
I've tried using a target process's /proc/$pid/ns/mnt mount namespace, as I've seen reference to bind-mounting using it. But on my kernel 5.16 it's a tree of fake symlinks, not a fs tree:
$ readlink /proc/self/ns/mnt
mnt:[4026531840]
$ ls /proc/self/ns/mnt/
ls: cannot access '/proc/self/ns/mnt/': Not a directory
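Although /proc/$pid/ns/mnt is not a directory tree, the identifier that readlink prints is still useful for telling whether two processes share a mount namespace. A minimal sketch follows, assuming a Linux /proc; mnt_ns, same_mnt_ns and the TARGET_PID variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# mnt_ns PID: print the mount namespace identifier of PID,
# in the same form as the readlink output above (mnt:[...]).
mnt_ns() {
    readlink "/proc/$1/ns/mnt"
}

# same_mnt_ns PID1 PID2: succeed only if both identifiers can be read
# and are equal, i.e. both pids are in the same mount namespace.
same_mnt_ns() {
    a=$(mnt_ns "$1") && b=$(mnt_ns "$2") && [ -n "$a" ] && [ "$a" = "$b" ]
}

# TARGET_PID is a hypothetical variable; it defaults to this shell's pid.
target_pid=${TARGET_PID:-$$}
if same_mnt_ns "$$" "$target_pid"; then
    echo "pid $target_pid shares this shell's mount namespace"
else
    echo "pid $target_pid is in a different (or unreadable) mount namespace"
fi
```

With hostPID: true in the debug pod, a target container pid should report a different identifier from the debug shell's own.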
The closest thing I have to a workaround at the moment is the nsenter hack with the working directory. This offers very limited tooling injection into the target container. Where pid 1055 is a pid in the target container:
# nsenter -t 1055 -p -m --wd=/ /bin/bash
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
# ls /
...target container rootfs contents here...
# ls .
...debug container rootfs here...
# ls ..
...debug container rootfs here too because . is a root...
# pwd
pwd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
# ls usr/bin/gdb
usr/bin/gdb
# ls /usr/bin/gdb
ls: cannot access '/usr/bin/gdb': No such file or directory
but I can't bind mount like I want, from within the same nsenter session:
# mkdir /run/debug
# mount --rbind . /run/debug
mount: /run/debug: wrong fs type, bad option, bad superblock on ., missing codepage or helper program, or other error.
Hints?
Reference links:
- https://unix.stackexchange.com/q/473717/45708
- https://medium.com/kokster/kubernetes-mount-propagation-5306c36a4a2d
- https://man7.org/linux/man-pages/man1/nsenter.1.html
- https://man7.org/linux/man-pages/man7/user_namespaces.7.html
- https://man7.org/linux/man-pages/man7/namespaces.7.html
- https://man7.org/linux/man-pages/man1/unshare.1.html
- https://unix.stackexchange.com/questions/594545/can-i-move-mount-to-other-mount-namespace
- https://unix.stackexchange.com/questions/693822/bind-mount-across-namespaces-with-disjoint-roots
It's possible to make a symlink to the target container's context via /proc/${target_container_pid}/root.
/proc/$pid/root looks like a symlink. If you readlink /proc/$pid/root it points to /. But it's the root of the target process, and if you dereference it in the kernel vfs layer you see the target process's root. If you resolve the symlink in userspace you will see the root of the process doing the dereferencing.
I haven't been able to bind mount the tree - mount -o bind /proc/$pid/root/ /target will bind the rootfs of the mount process itself into /target, not the rootfs of the target process. But it doesn't matter much, as a symlink is sufficient.
(I'd write a patch for the kubectl debug documentation but I can't get my org to agree to the mandatory CLA required even for trivial docs patches...)
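The symlink approach can be sketched as follows. This is an illustration only: TARGET_PID is a hypothetical variable (it defaults to the current shell so the script can be dry-run anywhere), and a scratch directory stands in for a path like /run/target. The gdb line is printed rather than executed, and its use of "set sysroot" is an assumption about how gdb could be pointed at the target tree, not something tested in the post.

```shell
#!/bin/sh
# TARGET_PID is a hypothetical variable; it defaults to this shell for a dry run.
target_pid=${TARGET_PID:-$$}

# Scratch location for the link; in the debug container this could be /run/target.
link_dir=$(mktemp -d)
link="$link_dir/target"

# Create the symlink. The kernel resolves /proc/PID/root to that
# process's root directory when the path is looked up.
ln -s "/proc/$target_pid/root" "$link"

# Paths under the link are resolved inside the target's root,
# as seen by the target process.
ls "$link/" 2>/dev/null | head -n 5

# Assumption: gdb's "set sysroot" could then find the target's executables.
echo "gdb -p $target_pid -iex 'set sysroot $link'"
```

Because the link is only a path, the debug container's own tools stay on their usual paths while the target's files appear under the link.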