2021-10-05 UPDATED QUESTION AND TEXT AFTER MORE ANALYSIS, STRIPPED DOWN TO MINIMAL CASE
Short description
A Nomad / Consul cluster is running, with Traefik (minimal configuration) as a system task on each Nomad client. At this point there are 3 Nomad servers, 3 Consul servers, 3 Nomad clients and 3 Gluster servers. The set-up is very similar to this article series on setting up a Nomad / Consul cluster.
Basic images and sites work well.
The issue
I've started porting the first larger PHP-based site (one that loads a larger number of page dependencies) to this cluster and am running into a strange issue that I have pinpointed but cannot resolve properly.
The tasks start fine and register as up in Consul, Traefik and Nomad. Small pages (with few dependencies) work well.
Whenever a page loads too many dependencies, Apache stalls those specific connections.
When I open a fresh Incognito browser window and go to the URL, the main page and around 10-15 of the dependencies load. The others stay in a pending state in the browser, and the tab keeps 'spinning' (loading). Closing the window and opening a new one lets me repeat the process.
I've nailed the issue down to the fact that the PHP sessions directory is mapped (via Docker) to a directory on a GlusterFS mount.
Moving the volume mapping to a host-local directory on the same server removes the issue and the site loads as it should.
Conclusion: The interaction between Docker volumes and the Gluster mount causes problems under 'heavy load'. With just a few requests everything works well. With many concurrent requests hitting the same PHP session file (which PHP's default files session handler locks for the duration of each request, so parallel requests from one browser session serialise on that lock) things stall and do not recover.
Question: This is probably caused by either a Gluster configuration issue or by the way the mount is configured in /etc/fstab. Please help me fix this issue!
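As a rough check that file locking itself is the slow path on the Gluster mount (PHP's files session handler takes an exclusive lock on the session file for every request), something like the sketch below can be run on a client as a user that can write to the directory. The file name is just an example; with healthy locking, 50 serialised 0.1 s locks should finish in roughly 5 seconds.
cd /data/storage/test/php_sessions
# take and release an exclusive lock on one file 50 times in parallel,
# the same way PHP serialises concurrent requests on a session file
time ( for i in $(seq 1 50); do flock sess_locktest -c 'sleep 0.1' & done; wait )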
Isolation
The PHP sessions directory is set to /var/php_sessions in the image's PHP config and mapped in Nomad / Docker to /data/storage/test/php_sessions.
The /data/storage/test/php_sessions directory is owned by user 20000 to make sure all nodes have access to the same PHP sessions:
client:/data/storage/test$ ls -ln .
drwxr-xr-x 2 20000 20000 6 Oct 5 14:53 php_sessions
drwxr-xr-x 2 20000 20000 6 Oct 5 14:53 upload
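A minimal sketch of how these directories can be prepared (paths and the arbitrary UID 20000 as above; exact commands depend on your environment):
mkdir -p /data/storage/test/php_sessions /data/storage/test/upload
chown 20000:20000 /data/storage/test/php_sessions /data/storage/test/upload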
When changing the Nomad host volume mapping (in /etc/nomad/nomad.hcl) from:
client {
  host_volume "test-sessions" {
    path      = "/data/storage/test/php_sessions"
    read_only = false
  }
}
to
client {
  host_volume "test-sessions" {
    path      = "/tmp/php_sessions"
    read_only = false
  }
}
(And making sure /tmp/php_sessions is also owned by user 20000)
Everything works again.
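(Host volumes are part of the Nomad client configuration, so the client agent has to be restarted for the change to be picked up. A sketch, assuming the agent runs under systemd:)
sudo systemctl restart nomad
nomad node status    # verify the client has re-registered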
Detailed data (More on request)
Contents of /etc/fstab:
LABEL=cloudimg-rootfs / ext4 defaults 0 1
LABEL=UEFI /boot/efi vfat defaults 0 1
gluster-01,gluster-02,gluster-03:/storage /data/storage glusterfs _netdev,defaults,direct-io-mode=disable,rw
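For diagnosis, the active Gluster volume options can be dumped as below (volume name 'storage' as in the fstab entry). Which performance translators actually matter for this small-file / locking workload is an assumption on my part; the grep just narrows the output to the ones I suspect:
gluster volume info storage
gluster volume get storage all | grep -E 'performance.(write-behind|flush-behind|open-behind|quick-read)'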
Dockerfile for site image:
FROM php:7.4.1-apache
ENV APACHE_DOCUMENT_ROOT /var/www/htdocs
WORKDIR /var/www
RUN docker-php-ext-install mysqli pdo_mysql
# Make Apache root configurable
RUN sed -ri -e 's!/var/www/html!${APACHE_DOCUMENT_ROOT}!g' /etc/apache2/sites-available/*.conf
RUN sed -ri -e 's!/var/www/!${APACHE_DOCUMENT_ROOT}!g' /etc/apache2/apache2.conf /etc/apache2/conf-available/*.conf
# Listen on port 1080 by default for non-root user
RUN sed -ri 's/Listen 80/Listen 1080/g' /etc/apache2/ports.conf
RUN sed -ri 's/:80/:1080/g' /etc/apache2/sites-enabled/*
# Use own config
COPY data/000-default.conf /etc/apache2/sites-enabled/
# Enable Production ini
RUN cp /usr/local/etc/php/php.ini-production /usr/local/etc/php/php.ini
RUN a2enmod rewrite && a2enmod remoteip
COPY --from=composer:latest /usr/bin/composer /usr/local/bin/composer
COPY --chown=www-data:www-data . /var/www
RUN /usr/local/bin/composer --no-cache --no-ansi --no-interaction install
# Finally add security changes
COPY data/changes.ini /usr/local/etc/php/conf.d/
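The image is built and pushed to the private registry roughly as follows (the tag matches the image referenced in the Nomad job below):
docker build -t docker-repo:5000/test/test:latest .
docker push docker-repo:5000/test/test:latest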
The Nomad job file, stripped down to what still triggers the issue:
job "test" {
datacenters = ["dc1"]
group "test-staging" {
count = 1
network {
port "php_http" {
to = 1080
}
}
volume "test-sessions" {
type = "host"
read_only = false
source = "test-sessions"
}
volume "test-upload" {
type = "host"
read_only = false
source = "test-upload"
}
service {
name = "test-staging"
port = "php_http"
tags = [
"traefik.enable=true",
"traefik.http.routers.test.php_staging.rule=Host(`staging.xxxxxx.com`)",
]
check {
type = "tcp"
port = "php_http"
interval = "5s"
timeout = "2s"
}
}
task "test" {
driver = "docker"
user = "20000"
config {
image = "docker-repo:5000/test/test:latest"
ports = ["php_http"]
}
volume_mount {
volume = "test-sessions"
destination = "/var/php_sessions"
read_only = false
}
volume_mount {
volume = "test-upload"
destination = "/var/upload"
read_only = false
}
template {
data = <<EOF
1.2.3.4
EOF
destination = "local/trusted-proxies.lst"
}
}
}
}
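For reference, the job is planned and deployed in the usual way (file name is just an example):
nomad job plan test.nomad
nomad job run test.nomad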