We are currently running an interactive HPC application which presents a graphical interface to the user, attaches to an HPC cluster and allows him to run and observe some computation. The user logs in to a front-end node (which does not participate in the computation) via NoMachine NX Server. He typically sets up his problem, does a few tiny trial runs, then starts a large job. After that he disconnects from the NX session, expecting the computation to continue.
Except it doesn't. All execution within the NX session, and across the cluster, seems to pause when the user disconnects. If he reconnects to the session he can resume the computation, but this is a job which he expects to run for several days, so it may not be feasible to keep an NX session connected throughout.
We're aware that in many ways the correct approach would be for the user to figure out his parameters and then submit a batch (non-interactive) job via ssh, but he strongly prefers the workflow outlined above, so we're trying to make it work.