I'm trying to configure SLURM on an Ubuntu 23.10 system so that it uses MySQL via slurmdbd. This is a continuation of an earlier question, which I solved through somewhat random guessing...
The funny thing is that the SLURM controller (slurmctld) fails to start upon boot. However, when I manually restart the service, it appears fine.
For example, if I type sudo service slurmctld status after booting, I see these messages:
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: error: Sending PersistInit msg: Connection refused
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: No memory enforcing mechanism configured.
Feb 03 17:10:27 mycomputer slurmctld[1682]: WARNING: MYSQL_OPT_RECONNECT is deprecated and will be removed in a future version.
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: fatal: You haven't inited this storage yet.
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Failed with result 'exit-code'.
which is similar to the information in the /var/log/ log file. However, if I restart it with sudo service slurmctld restart, without changing any configuration files, it starts up with this in the log:
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered information about 0 jobs
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered state of 0 reservations
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: read_slurm_conf: backup_controller not specified
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Running as primary controller
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: No parameter for mcs plugin, default values set
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: mcs: MCSParameters = (null). ondemand set.
Feb 03 23:23:02 mycomputer slurmctld[30777]: slurmctld: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,...
And it seems fine now.
My only guess is that it might have to do with the order in which the slurmdbd, slurmd, and slurmctld services are started. But I have been assuming that the default order is correct. Perhaps this assumption is wrong?
The defaults for slurmctld.service are missing an ordering dependency on mysql.service. Let's add one.
Create a file named /etc/systemd/system/mysql.service.d/99-mysql-ordering-askubuntu-1502374.conf:
Then reboot.
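For illustration, a drop-in at that path could look like the sketch below. It rests on assumptions: because the file sits under mysql.service.d, the ordering is expressed from MySQL's side with Before=, and the unit names slurmdbd.service and slurmctld.service are assumed to match the ones installed by the Ubuntu packages. Adjust them if systemctl list-unit-files shows different names on your system.

```ini
# /etc/systemd/system/mysql.service.d/99-mysql-ordering-askubuntu-1502374.conf
# Sketch only: the unit names below are assumptions, not taken from the packaged files.
[Unit]
# Start MySQL before the SLURM database daemon and the controller.
# Before= only affects ordering; it does not pull these units in.
Before=slurmdbd.service slurmctld.service
```

After the reboot, systemctl show -p After slurmctld.service should list mysql.service if the drop-in was picked up, since systemd mirrors a Before= on one unit as an After= on the other.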