I will attach the minimized test case below. In short, it is a simple Dockerfile that has these lines:
VOLUME ["/sys/fs/cgroup"]
CMD ["/lib/systemd/systemd"]
It is a debian:buster-slim based image that runs systemd inside the container. I used to run the container like this:
$ docker run --name any --tmpfs /run \
--tmpfs /run/lock --tmpfs /tmp \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro -it image_name
It used to work fine until I upgraded a bunch of host Linux packages. The host kernel/systemd now seems to default to cgroup v2; before, it was cgroup v1. After the upgrade, the container stopped working. However, if I pass the kernel option that makes the host use cgroup v1, it works again.
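One way to check which hierarchy the host is using is the filesystem type mounted at /sys/fs/cgroup (cgroup2fs means cgroup v2, tmpfs means v1/hybrid):
$ stat -fc %T /sys/fs/cgroup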
Without giving the kernel option, the fix was to add --cgroupns=host to docker run, besides mounting /sys/fs/cgroup as read-write (:rw in place of :ro).
I'd like to avoid forcing users to set the kernel option. Although I am far from an expert, forcing the host namespace onto a Docker container does not sound right to me.
I am trying to understand why this is happening and figure out what should be done. My goal is to run systemd inside a Docker container on a host that uses cgroup v2.
Here's the error I am seeing:
$ docker run --name any --tmpfs /run --tmpfs /run/lock --tmpfs /tmp \
-v /sys/fs/cgroup:/sys/fs/cgroup:rw -it image_name
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <5e089ab33b12>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
None of it looks right, but this line in particular seems suspicious:
Failed to create /init.scope control group: Read-only file system
It seems like there should have been something before /init.scope. That was why I reviewed the docker run options and tried the --cgroupns option. If I add --cgroupns=host, it works. If I mount /sys/fs/cgroup as read-only, then it fails with a different error, and the corresponding line looks like this:
Failed to create /system.slice/docker-0be34b8ec5806b0760093e39dea35f4305262d276ecc5047a5f0ff43871ed6d0.scope/init.scope control group: Read-only file system
To me, it looks like the Docker daemon/engine fails to configure XXX.slice or something similar for the container. I assume Docker is to some extent responsible for setting up the namespace, but something is not going well. However, I can't be sure at all. What would be the issue/fix?
The Dockerfile I used for this experiment is as follows:
FROM debian:buster-slim
ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
USER root
WORKDIR /root
RUN set -x
RUN apt-get update -y \
&& apt-get install --no-install-recommends -y systemd \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
&& rm -f /var/run/nologin
RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
/etc/systemd/system/*.wants/* \
/lib/systemd/system/local-fs.target.wants/* \
/lib/systemd/system/sockets.target.wants/*udev* \
/lib/systemd/system/sockets.target.wants/*initctl* \
/lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
/lib/systemd/system/systemd-update-utmp*
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/lib/systemd/systemd"]
I am using Debian. The Docker version is 20.10.3 or so. A Google search told me that Docker supports cgroup v2 as of 20.10, but I don't actually understand what that "support" means.
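For reference, docker info on 20.10 reports both the cgroup driver and the cgroup version the daemon is using, which confirms what the daemon sees:
$ docker info | grep -i cgroup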
tl;dr
It seems to me that this use case is not explicitly supported yet. You can almost get it working but not quite.
The root cause
When systemd sees a unified cgroupfs at /sys/fs/cgroup, it assumes it should be able to write to it, which normally should be possible but is not the case here.
The basics
First of all, you need to create a systemd slice for Docker containers and tell Docker to use it in docker/daemon.json.
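A minimal sketch of such a configuration (the slice name docker.slice is only an illustration):
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "cgroup-parent": "docker.slice"
}
With the systemd cgroup driver, the cgroup parent must be a slice name ending in .slice.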
Each slice gets its own nested cgroup. There is one caveat though: each group may only be a "leaf" or an "intermediary". Once a process takes ownership of a cgroup, no other process can manage it. This means that the actual container process needs, and will get, its own private group attached below the configured one, in the form of a systemd scope.
Now a newly started container will get its own group, which should be available at a path that (depending on your setup) resembles:
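(Illustrative, assuming the docker.slice parent from the sketch above and the systemd cgroup driver:)
/sys/fs/cgroup/docker.slice/docker-<container-id>.scope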
And here is the important part: you must not mount a volume into the container's /sys/fs/cgroup. The path to its private group mentioned above should get mounted there automatically.
The goal
Now, in theory, the container should be able to manage this delegated, private group by itself almost fully. This would allow its own init process to create child groups.
The problem
The problem is that the /sys/fs/cgroup path in the container gets mounted read-only. I've checked the AppArmor rules and switched seccomp to unconfined, to no avail.
The hypothesis
I am not completely certain yet; my current hypothesis is that this is a security feature of docker/moby/containerd. Without private groups it makes perfect sense to mount this path ro.
Potential solutions
What I've also discovered is that enabling user namespace remapping causes the private /sys/fs/cgroup to be mounted rw, as expected! This is far from perfect though: the cgroup mount (among others) has the wrong ownership; it is owned by the real system root (UID 0) while the container has been remapped to a completely different user. Once I manually adjusted the owner, the container was able to start systemd successfully.
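A rough sketch of that experiment (the daemon option is real; the UID and cgroup path below are only illustrative):
/etc/docker/daemon.json:
{
  "userns-remap": "default"
}
# After starting the container, on the host: chown the container's delegated
# cgroup to the remapped root UID (the first UID listed for dockremap in /etc/subuid).
$ chown -R 100000:100000 /sys/fs/cgroup/docker.slice/docker-<container-id>.scope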
I suspect this is a deficiency of Docker's userns remapping feature, which might be fixed sooner or later. Keep in mind that I might be wrong about this; I did not confirm it.
Discussion
Userns remapping has a lot of drawbacks, and the best possible scenario for me would be to get the cgroupfs mounted rw without it. I still don't know if this is done on purpose or if it's some kind of limitation of the cgroup/userns implementation.
Notes
It's not enough that your kernel has cgroup v2 enabled. Depending on the Linux distribution, the bundled systemd might prefer to use v1 by default.
You can tell systemd to use cgroup v2 via a kernel cmdline parameter:
systemd.unified_cgroup_hierarchy=1
It might also be necessary to explicitly disable hybrid cgroup v1 support to avoid problems, using:
systemd.legacy_systemd_cgroup_controller=0
Or completely disable cgroupv1 in the kernel with:
cgroup_no_v1=all
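On a GRUB-based host such as Debian, these parameters can be applied roughly like this (a sketch; adjust to your bootloader and keep your existing cmdline options):
# /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"
$ update-grub
$ reboot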
For those wondering how to solve this with the kernel commandline:
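Presumably the parameter in question (a hedged guess based on the options listed in the notes above, not quoted from the linked comment) is:
systemd.unified_cgroup_hierarchy=0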
This creates a "hybrid" cgroup setup, which makes the host cgroup v1 available again for the container's systemd.
https://github.com/systemd/systemd/issues/13477#issuecomment-528113009
Thanks to @pinkeen's answer, here is my Dockerfile and command line; it works fine. I hope this helps:
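(The exact files are not reproduced here; a sketch consistent with the Dockerfile from the question and the --cgroupns=host / :rw fix described earlier would be:)
$ docker build -t image_name .
$ docker run --name any --tmpfs /run --tmpfs /run/lock --tmpfs /tmp \
    -v /sys/fs/cgroup:/sys/fs/cgroup:rw --cgroupns=host -it image_name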
Note: you MUST use Docker 20.10 or above, and your system must have cgroup v2 enabled (check that /sys/fs/cgroup/cgroup.controllers exists).