slurm silliness
Not a disparagement of SLURM, but of the admin that manages it.
Maybe I should rename this blog to IAFH (Idiot Admin From Hell).
Idiocy #1, which took over a week to figure out:
slurmdbd would not start with the following message:
[2023-03-01T19:59:10.773] slurmdbd version 23.02.0 started
[2023-03-01T19:59:10.773] error: Error binding slurm stream socket: Address already in use
[2023-03-01T19:59:10.774] fatal: slurm_init_msg_engine_port error Address already in use
Like, why? strace shows it will be listening on 0.0.0.0, and that's fine;
that should not be a problem...
mysqld is running and can be accessed from the command line.
Turns out it's the port, stupid! Instead of listening on port 6819, the default port for slurmdbd, it was trying to listen on port 3306 and failing. Of course it's in use: 3306 is the MySQL default, and mysqld is already sitting on it.
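A check like the following would have saved me the week. It is only a sketch, and it assumes the mix-up lives in slurmdbd.conf, with DbdPort (the port slurmdbd itself listens on) set to the same value as StoragePort (the port of the database it connects to). The function names and the sample config text are made up for illustration, and I am assuming parameter names can be matched case-insensitively.

```python
DBD_DEFAULT_PORT = 6819    # default port slurmdbd listens on
MYSQL_DEFAULT_PORT = 3306  # default MySQL/MariaDB port


def parse_conf(text):
    """Parse key=value lines, dropping comments and blanks; keys are lowercased."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def check_ports(text):
    """Return (dbd_port, storage_port, problems) for a slurmdbd.conf-style text."""
    conf = parse_conf(text)
    dbd_port = int(conf.get("dbdport", DBD_DEFAULT_PORT))
    storage_port = int(conf.get("storageport", MYSQL_DEFAULT_PORT))
    problems = []
    if dbd_port == storage_port:
        problems.append(
            f"DbdPort and StoragePort are both {dbd_port}: slurmdbd would try "
            "to bind the port the database is already listening on"
        )
    return dbd_port, storage_port, problems


# Hypothetical config text, not my real file.
SAMPLE = """
DbdHost=u14-master
DbdPort=3306        # oops, this is the database port
StorageType=accounting_storage/mysql
StoragePort=3306
"""

if __name__ == "__main__":
    for problem in check_ports(SAMPLE)[2]:
        print("WARNING:", problem)
```

To run it against a real file, pass the file's contents to check_ports() instead of SAMPLE.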
Idiocy #2: sacct only works on the master node.
It fails on the other nodes because it cannot connect to port 6819 on 127.0.0.1.
They look up the DbdHost by name in /etc/hosts, as nsswitch tells them to, and
the entry for the DbdHost, u14-master, is 127.0.0.1!
Changed /etc/hosts - it had a double entry for 127.0.0.1.
But I could also have changed DbdAddr??
DbdAddr is set correctly though...
sad.
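For next time, here is a rough sketch of a check for a host name that /etc/hosts maps to a loopback address. It only reads hosts-file text and does not go through the real resolver, so it assumes nsswitch consults files first, as it does here. The function names and the 10.0.0.x addresses in the sample are invented for illustration.

```python
import ipaddress


def addresses_for(hosts_text, hostname):
    """Return every address that hosts-file text maps to hostname, in file order."""
    found = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or address without names
        addr, names = fields[0], fields[1:]
        if hostname in names:
            found.append(addr)
    return found


def loopback_entries(hosts_text, hostname):
    """Return the addresses for hostname that are loopback (127.0.0.0/8 or ::1)."""
    result = []
    for addr in addresses_for(hosts_text, hostname):
        try:
            if ipaddress.ip_address(addr).is_loopback:
                result.append(addr)
        except ValueError:
            pass  # not a parseable IP address, skip it
    return result


# Hypothetical hosts file; only u14-master and 127.0.0.1 come from the story above.
SAMPLE_HOSTS = """
127.0.0.1   localhost
127.0.0.1   u14-master
10.0.0.1    u14-master
10.0.0.2    node01
"""

if __name__ == "__main__":
    bad = loopback_entries(SAMPLE_HOSTS, "u14-master")
    if bad:
        print("u14-master maps to loopback:", ", ".join(bad))
```

On a compute node, feeding it the contents of /etc/hosts and the DbdHost name should flag exactly the kind of entry that bit me.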
