Monday 5 January 2015

Setting up a Torque/Maui system: debugging the system

This post follows my previous article on setting up a Torque system. I found that Torque 2.6.1 on Ubuntu is out of date and does not work properly. To get around this, I decided to move to Torque/Maui for better scheduling efficiency.
http://www.adaptivecomputing.com/support/download-center/torque-download/
Note also that Adaptive Computing is no longer maintaining Torque and Maui, which means bugs will not be fixed. The real long-term solution is to move to Slurm or Sun Grid Engine.


First, download Torque and Maui from their websites.

Maui has to be installed after Torque.

error 1:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex

solution:
echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
ldconfig


error 2:
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd

solution: make sure trqauthd is running as well as pbs_mom

error 3: at the client
pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc


error 4 at the client
./torque-mom start
 * Starting Torque Mom torque-mom
/usr/sbin/pbs_mom: symbol lookup error: /usr/sbin/pbs_mom: undefined symbol: dis_getc
   ...fail!
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc

solution:
ldd /usr/local/sbin/pbs_mom
        linux-vdso.so.1 =>  (0x00007fff9f7ff000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2abbbed000)
        libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f2abb2f6000)
        libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2 (0x00007f2abaf99000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2abad7c000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2abab74000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2aba873000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2aba577000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2aba361000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2ab9fa1000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2ab9d9d000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2ab9b86000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2abbe0b000)
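
Since the thing to look at in this ldd output is which libtorque.so.2 gets picked up, the check can be scripted. This is only a minimal sketch: the function names are my own (hypothetical), and it assumes the source-built Torque installs its library under /usr/local/lib while the apt package puts its copy somewhere else.

```shell
#!/bin/bash
# Read `ldd` output on stdin and print the path resolved for libtorque.
libtorque_path() {
    awk '$1 ~ /^libtorque\.so/ && $2 == "=>" { print $3 }'
}

# Hypothetical helper: succeed only if that path is under /usr/local/lib
# (assumed here to be the prefix of the source-built Torque).
uses_local_libtorque() {
    case "$(libtorque_path)" in
        /usr/local/lib/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a node where Torque is installed:
#   ldd /usr/local/sbin/pbs_mom | uses_local_libtorque || echo "wrong libtorque"
```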

The solution so far is to reinstall Torque 5.0.1. On 2015-05-27 it took the whole morning to fix.
This happened again on 2015-09-28.
The correct binary is located at
/usr/local/sbin/pbs_mom
Running that one directly should be fine.
The dis_getc error comes from the old package installed with apt-get.
First, remove the Torque package from the apt repository: apt-get remove torque-mom
Now if you run pbs_mom you will see
./pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory

then reinstall torque-5.0.1-1_4fa836f5
torque-package-clients-linux-x86_64.sh  --install
torque-package-mom-linux-x86_64.sh  --install







question 1:
limit the maximum processes per user
http://docs.adaptivecomputing.com/maui/6.2throttlingpolicies.php


Install the Torque PAM module:
libtool --finish /lib64/security
/lib64/security/ is where the PAM modules are located.
/etc/security/access.conf lists the users you wish to give access to.



Configure Maui to limit the jobs and processes per user:

USERCFG[DEFAULT] MAXPROC=64 MAXJOB=5 #working

#GROUPCFG[useraid] MAXJOB[USER]=5  # not working
#CLASSCFG[batch] MAXJOB[USER]=5   # working


CLASSCFG[batch] MAXJOB[USER]=5 MAXPROC=64  # not working


Working solution: use PAM to prevent users from logging into compute nodes
(let some users onto the compute nodes while keeping the others out)

versions: torque-5.0.1-1_4fa836f5 maui-3.3.tar.gz
The official Torque documentation at http://docs.adaptivecomputing.com/torque/3-0-5/3.4hostsecurity.php
says:

1. First, configure Torque with ./configure --with-pam

2. Add the following to /etc/pam.d/sshd:
account required pam_pbssimpleauth.so
account required pam_access.so

3. In /etc/security/access.conf make sure all users who access the compute node are added to the configuration. This is an example which allows the users root, george, allen, and michael access.
-:ALL EXCEPT root george allen michael torque:ALL


However, I found this method too strict: none of root, george or allen could log into the compute node.

My solution:

1. There is no need to reinstall Torque with ./configure --with-pam

2. put
account required     pam_access.so
 into /etc/pam.d/sshd

which means pam_access is consulted on every ssh login

3. put

-:ALL EXCEPT root szhang czhang storres torque:ALL
into /etc/security/access.conf
Now only the listed users (root, szhang, czhang, storres and torque) can log into the compute nodes.

I think this approach works and is easy to understand: at the moment all job submission is handled by pbs_mom, which runs as root, so pam_pbssimpleauth.so does not need to take effect.
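
To double-check who a rule like the one above actually lets in, the EXCEPT list can be pulled out of access.conf. This is a rough sketch only: the function name is hypothetical, and it understands nothing but the exact "-:ALL EXCEPT ... :ALL" form used here, not the full pam_access syntax.

```shell
#!/bin/bash
# Read access.conf-style lines on stdin. For every line of the form
#   -:ALL EXCEPT user1 user2 ...:ALL
# print the excepted names, one per line. Other lines are ignored.
excepted_users() {
    sed -n 's/^-:ALL EXCEPT \([^:]*\):ALL$/\1/p' | tr -s ' ' '\n'
}

# Usage on a compute node:
#   excepted_users < /etc/security/access.conf
```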







Reload Maui

Just restart it; this won't affect the queue.

pkill maui && qterm -t quick && sleep 5&& /usr/local/maui/sbin/maui && pbs_server && ps aux |grep maui


showres               # working
showres -n
checkjob 810          # working
checknode macondo01   # very good feedback
showgrid AVGXFACTOR
showstats
mbal                  # this will kill maui!
mdiag                 # same as diagnose

I still don't understand MAXNODE. Does it mean all jobs for one user have to go to one particular node?

mjobct
ERROR:    corrupt command received


mclient
ERROR:    unknown command: 'mclient'

mprof
USAGE ERROR:  (tracefile not specified)

mstat
ERROR:  command 'mstat' args not handled
ERROR:    service 36 not handled
ERROR:    Service[36] 'mstat' not implemented

showbf
backfill window (user: 'czhang' group: 'useraid' partition: ALL) Sun Jan 18 15:25:07

231 procs available for    7:11:35:38
175 procs available for   21:18:13:37
118 procs available for   40:14:55:01
 62 procs available for   40:21:06:15



diagnose -j | grep -o -P '(?<=job \047).*(?=\047 utilizes more procs than)'
# this line finds all the jobs for which a warning is raised.

diagnose -j
Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features

381                 Running DEF    1 DEF 10:00:00:00 1    1    cwang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
569                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '569' utilizes more procs than dedicated (10.35 > 1)
650                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:20   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '650' utilizes more procs than dedicated (13.00 > 1)
651                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:20   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '651' utilizes more procs than dedicated (10.28 > 1)
669                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:19   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '669' utilizes more procs than dedicated (14.00 > 1)
671                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '671' utilizes more procs than dedicated (9.57 > 1)
672                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '672' utilizes more procs than dedicated (7.80 > 1)


\047 is the octal ASCII code for a single quote

diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)'

\050 is the octal ASCII code for a left parenthesis

adse=$(diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)')
stores the result in adse
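
The two grep calls above give the job IDs and the used procs separately. A possible alternative, sketched here with sed, keeps them on the same line. The function name is my own (hypothetical), and it assumes the WARNING lines look exactly like the ones in the diagnose -j output shown earlier.

```shell
#!/bin/bash
# Read `diagnose -j` output on stdin and print "jobid used dedicated"
# for every "utilizes more procs than dedicated" warning.
overused_jobs() {
    sed -n "s/^WARNING:  *job '\([^']*\)' utilizes more procs than dedicated (\([0-9.]*\) > \([0-9.]*\)).*/\1 \2 \3/p"
}

# Usage on the server:
#   diagnose -j | overused_jobs
```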

if [ "$a" != "$b" ]
then
  echo "$a is not equal to $b."
  echo "(string comparison)"
  #     "4"  != "5"
  # ASCII 52 != ASCII 53
fi

#!/bin/bash
x=5.0
y=3.0
#ans= $(( $x + $y |bc  ))
#ans=$(echo  $x + $y |bc )
#ans=$(echo  $x / $y |bc -l )   # this ends up with good result
#ans=$(echo  $x / $y |bc  )     # this does not give good result

#ans=$(python -c "print $x / $y")    # this one is also ok but format is a problem

#ans=$(python -c "print( "%.2f"     %($x / $y) ) ")  # failed: the inner double quotes end the outer shell string
#alpha=`echo "$a/100" | bc -l | awk '{printf("%06.2f", $1);}'`
ans=`echo "$x/$y" | bc -l | awk '{printf("%6.4f", $1);}'`
echo "$x / $y = $ans"
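
The [ ] test earlier only compares strings, and bc -l is one way to get floating-point results. As a sketch of another option, awk alone can do both the division and a numeric comparison. The function names fdiv and float_gt are hypothetical, and fdiv does not guard against division by zero.

```shell
#!/bin/bash
# Print $1 / $2 with four decimal places.
fdiv() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a / b }'
}

# Succeed (exit status 0) if $1 > $2 when compared as numbers.
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'
}

# Usage:
#   ans=$(fdiv 5.0 3.0)
#   if float_gt 10.35 1; then echo "job uses more procs than dedicated"; fi
```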


Maui is on its way to being deprecated. Use Sun Grid Engine (SGE, now Oracle Grid Engine; Rocks cluster uses it) or Slurm instead.

It seems to me that the soft and hard limits only work for groups, not for users.
/usr/local/maui
http://www.physics.oregonstate.edu/cluster_install

Problem 2016-01-12:
When running trqauthd:
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
This happened on the server, which had been running for a few days. Once trqauthd was killed, it could not be restarted properly.



root@macondo03:/home/users/uqczhan2#  trqauthd
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
root@macondo03:/home/users/uqczhan2# pbs_server
pbs_server: symbol lookup error: pbs_server: undefined symbol: job_log_mutex
root@macondo03:/home/users/uqczhan2# pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# which trqauthd
/usr/local/sbin/trqauthd
root@macondo03:/home/users/uqczhan2# pbs_
pbs_demux    pbs_mom      pbs_restart  pbs_sched    pbs_server   pbs_track  
root@macondo03:/home/users/uqczhan2# pbs_sched
pbs_sched: symbol lookup error: pbs_sched: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# pbs_restart
Cannot connect to default server host 'macondo03' - check pbs_server daemon.
qterm: could not connect to server '' (1) Operation not permitted


After running torque-package-server-linux-x86_64.sh
we get pbs_sched  pbs_server  qschedd  qserverd

./torque-package-mom-linux-x86_64.sh --install

Installing TORQUE archive... 

Done.
root@macondo03:/home/user/uqczhan2/czhang/Downloads/torque-5.0.1-1_4fa836f5# ls /usr/local/sbin
momctl  pbs_demux  pbs_mom  pbs_sched  pbs_server  qnoded  qschedd  qserverd
solution:
ldd trqauthd
        linux-vdso.so.1 =>  (0x00007ffcf55e1000)
        libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f365ed33000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f365eb16000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f365e816000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f365e458000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f365e250000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f365df54000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f365dd3e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f365f62a000)
Today the problem was resolved again:

In fact, the FDS model was what caused the system to hang. It changes the path from which libtorque.so.2 is loaded, so trqauthd stops working.
Solution: I removed everything associated with FDS from LD_LIBRARY_PATH in .bashrc, then checked ldd trqauthd. The correct output should match the listing above.
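
Instead of editing .bashrc by hand, the offending entries can also be filtered out of LD_LIBRARY_PATH for the current shell. A minimal sketch, assuming the FDS directories all contain the string "FDS" in their path (the function name and the example paths are hypothetical):

```shell
#!/bin/bash
# Print the colon-separated list $1 with every entry containing the
# substring $2 removed. The body runs in a subshell, so IFS and the
# working variables do not leak into the caller.
strip_path_entries() (
    IFS=:
    set -f                      # no glob expansion while splitting
    out=
    for entry in $1; do
        [ -z "$entry" ] && continue
        case "$entry" in
            *"$2"*) ;;          # drop matching entry
            *) out="${out:+$out:}$entry" ;;
        esac
    done
    printf '%s\n' "$out"
)

# Usage:
#   export LD_LIBRARY_PATH=$(strip_path_entries "$LD_LIBRARY_PATH" FDS)
#   ldd /usr/local/sbin/trqauthd
```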

Also, after the restore, there was some trouble restarting pbs_mom, pbs_server and pbs_sched.
Solution:
First, apt-get remove torque-mom torque-server torque-sched, to make sure the Torque from the apt system is not installed.
Second, reinstall Torque 5.0.1 with configure, make, make install.
Run the daemons one by one.
Below are the errors that appear when running pbs_mom, pbs_server and pbs_sched.
 pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory


For pbs_server and pbs_sched, after starting them they do not show up as processes in the system.


As long as Torque 5.0.1 is reinstalled, the problem is resolved. 2016-01-12

problem
pbsnodes
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file



cd /var/spool/torque/server_priv
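
The error message points at the nodes file in this directory, which takes one line per compute node in the form "hostname np=N". A small sketch for generating those lines follows. The function name is hypothetical and the np value of 16 in the usage line is only a placeholder, not the real core count of these machines. I would also expect pbs_server to need a restart before it picks the file up.

```shell
#!/bin/bash
# Print one nodes-file line per host: "<hostname> np=<count>".
# $1 is the processor count, the remaining arguments are hostnames.
nodes_file_lines() (
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
)

# Usage (as root on the server; 16 is a placeholder core count):
#   nodes_file_lines 16 macondo01 >> /var/spool/torque/server_priv/nodes
```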


