I have a bunch of libvirt-lxc containers whose configuration I migrated from a Debian jessie host to a fresh Debian buster host. I re-created the rootfs for each container using lxc-create -t debian -- --release buster and later remapped the uid/gid numbers inside the rootfs with a script I know to work correctly.
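For reference, such a remap essentially shifts every file's owner and group in the rootfs by the idmap offset used in the configuration below (0 to 200000). A minimal sketch of such a script (not the one I actually used; it assumes no uid/gid in the tree is 65535 or above and ignores details like restoring setuid/setgid bits, which chown clears) could look like this:

#!/bin/bash
# shift every uid/gid in the rootfs by the idmap offset (0 -> 200000)
cd /var/lib/lxc/some-container/rootfs || exit 1
find . -depth -print0 | while IFS= read -r -d '' f; do
    uid=$(stat -c %u "$f")
    gid=$(stat -c %g "$f")
    chown -h "$((uid + 200000)):$((gid + 200000))" "$f"
done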
The container configuration looks like this:
<domain type='lxc'>
  <name>some-container</name>
  <uuid>1dbc80cf-e287-43cb-97ad-d4bdb662ca43</uuid>
  <title>Some Container</title>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <memtune>
    <swap_hard_limit unit='KiB'>2306867</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>1</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <idmap>
    <uid start='0' target='200000' count='65535'/>
    <gid start='0' target='200000' count='65535'/>
  </idmap>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/lib/libvirt/libvirt_lxc</emulator>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/var/lib/lxc/some-container/rootfs/'/>
      <target dir='/'/>
    </filesystem>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/var/www/some-container/static/'/>
      <target dir='/var/www/some-container/static/'/>
    </filesystem>
    <interface type='bridge'>
      <mac address='52:54:00:a1:98:03'/>
      <source bridge='guests0'/>
      <ip address='192.0.2.3' family='ipv4' prefix='24'/>
      <ip address='2001:db8::3' family='ipv6' prefix='112'/>
      <route family='ipv4' address='0.0.0.0' prefix='0' gateway='192.0.2.1'/>
      <route family='ipv6' address='2000::' prefix='3' gateway='fe80::1'/>
      <target dev='vcontainer0'/>
      <guest dev='eth0'/>
    </interface>
    <console type='pty' tty='/dev/pts/21'>
      <source path='/dev/pts/21'/>
      <target type='lxc' port='0'/>
      <alias name='console0'/>
    </console>
    <hostdev mode='capabilities' type='misc'>
      <source>
        <char>/dev/net/tun</char>
      </source>
    </hostdev>
  </devices>
</domain>
(IP addresses have been changed to use the documentation/example IPv4/IPv6 prefixes.) The mountpoints exist and are prepared. I have about 15 containers similar to this. The following things happen:
When the host is freshly booted, I can either:
- start a container with user namespacing first, after which only containers without user namespacing will start, or
- start a container without user namespacing first, after which no containers with user namespacing will start.
(A concrete start order illustrating this is sketched below.)
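In other words, with made-up container names (userns-* meaning the domain has an <idmap>, plain-* meaning it does not):

# fresh boot, variant 1: user-namespaced container first
virsh -c lxc:/// start userns-a   # works
virsh -c lxc:/// start plain-b    # works (no user namespacing)
virsh -c lxc:/// start userns-c   # claims to start, but dies immediately

# fresh boot, variant 2: plain container first
virsh -c lxc:/// start plain-b    # works
virsh -c lxc:/// start userns-a   # claims to start, but dies immediately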
When I run virsh -c lxc:/// start some-container after any other container has already been started, libvirt claims to have started the container:
# virsh -c lxc:/// start some-container
Domain some-container started
It also shows as running in the virsh -c lxc:/// list output, but there is no process running under the root UID of the container. Running systemctl restart libvirtd makes libvirt recognize that the container is actually dead and mark it as shut off in virsh -c lxc:/// list again.
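For reference, processes belonging to the mapped UID range (the 200000 offset from the <idmap> above; note that all my user-namespaced containers use the same mapping) can be listed with something like:

ps -eo uid,pid,cmd | awk '$1 >= 200000 && $1 < 265535'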
When looking into the libvirt logs, I can’t find anything useful:
2019-05-09 15:38:38.264+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LIBVIRT_DEBUG=4 LIBVIRT_LOG_OUTPUTS=4:stderr /usr/lib/libvirt/libvirt_lxc --name some-container --console 25 --security=apparmor --handshake 52 --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=1dbc80cf-e287-43cb-97ad-d4bdb662ca43 LIBVIRT_LXC_UUID=1dbc80cf-e287-43cb-97ad-d4bdb662ca43 LIBVIRT_LXC_NAME=some-container /sbin/init
(NB: I tried with and without AppArmor.)
I became quite desperate and attached strace to libvirtd with strace -ff -o somedir/foo -p and then started a container. After a lot of digging, I found that libvirt starts /sbin/init inside the container, which quickly exits with status code 255. This happens right after an EACCES while doing something with cgroups:
openat(AT_FDCWD, "/sys/fs/cgroup/systemd/system.slice/libvirtd.service/init.scope/cgroup.procs", O_WRONLY|O_NOCTTY|O_CLOEXEC) = -1 EACCES (Permission denied)
writev(3, [{iov_base="\33[0;1;31m", iov_len=9}, {iov_base="Failed to create /system.slice/l"..., iov_len=91}, {iov_base="\33[0m", iov_len=4}, {iov_base="\n", iov_len=1}], 4) = 105
epoll_ctl(4, EPOLL_CTL_DEL, 5, NULL) = 0
close(5) = 0
close(4) = 0
writev(3, [{iov_base="\33[0;1;31m", iov_len=9}, {iov_base="Failed to allocate manager objec"..., iov_len=52}, {iov_base="\33[0m", iov_len=4}, {iov_base="\n", iov_len=1}], 4) = 66
openat(AT_FDCWD, "/dev/console", O_WRONLY|O_NOCTTY|O_CLOEXEC) = 4
ioctl(4, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(4, TIOCGWINSZ, {ws_row=0, ws_col=0, ws_xpixel=0, ws_ypixel=0}) = 0
writev(4, [{iov_base="[", iov_len=1}, {iov_base="\33[0;1;31m!!!!!!\33[0m", iov_len=19}, {iov_base="] ", iov_len=2}, {iov_base="Failed to allocate manager objec"..., iov_len=34}, {iov_base="\n", iov_len=1}], 5) = 57
close(4) = 0
writev(3, [{iov_base="\33[0;1;31m", iov_len=9}, {iov_base="Exiting PID 1...", iov_len=16}, {iov_base="\33[0m", iov_len=4}, {iov_base="\n", iov_len=1}], 4) = 30
exit_group(255) = ?
+++ exited with 255 +++
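With strace -ff, each traced process gets its own output file, so one way to locate the trace of the failing init among them is to search for the exit status, e.g.:

grep -l 'exit_group(255)' somedir/foo.*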
Digging further, I figured out that libvirt is not creating a cgroup namespace for the containers, and apparently they all try to use the same cgroup path. With that, the behaviour makes sense: if the first container to be started is user-namespaced, it takes ownership of the cgroup subtree, and other user-namespaced containers cannot use it afterwards. Containers without user namespacing can simply take over the cgroup tree because they run as UID 0.
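The ownership of the cgroup path seen in the strace output can be inspected directly, for example with:

ls -ln /sys/fs/cgroup/systemd/system.slice/libvirtd.service/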
The question is now: why are the cgroups configured incorrectly? Is it a libvirt bug? Is it a misconfiguration on my system?
I came up with the idea of using a separate <partition/> for each container, to isolate them from one another.
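Concretely, that would mean giving each domain its own partition instead of the shared /machine one, something along these lines (the partition name is made up for illustration):

<resource>
  <partition>/machine/some-container</partition>
</resource>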
When I tried that, I got an error that was actually familiar: I had once opened an invalid bug report because of it.
This error is caused by libvirt not detecting systemd correctly, which in turn is caused by the systemd-container package not being installed. The fix is to install that package (see below); that fixes both the original issue and the error from my attempted workaround.
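On Debian buster this amounts to something like the following (the libvirtd restart is presumably needed so that it re-runs its systemd detection):

apt install systemd-container
systemctl restart libvirtd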