Troubleshooting

Ippei Kishida

Last-modified:2018/08/02 20:41:52.

1 Jobs submitted with qsub never get executed

1.1 pe_list of Se.q

Se.openmpi was missing from the pe_list of Se.q. Adding it with qconf -mq Se.q fixed the problem.
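
A quick way to catch this before submitting is to look for the PE name in the queue configuration. The following is only a sketch: has_pe is a hypothetical helper (not part of Grid Engine), and it assumes that qconf -sq prints pe_list as a whitespace-separated list and wraps long values with a trailing backslash. Per-host overrides in square brackets are not handled.

```shell
#!/bin/sh
# has_pe PE_NAME
# Reads a queue configuration (the text printed by `qconf -sq QUEUE`) on stdin
# and exits 0 if PE_NAME is listed in pe_list, 1 otherwise.
# Hypothetical helper, not part of Grid Engine.
has_pe() {
    awk -v pe="$1" '
        # Assumption: long attribute values are wrapped with a trailing backslash; rejoin them.
        /\\$/ { sub(/\\$/, ""); buf = buf $0 " "; next }
        {
            line = buf $0; buf = ""
            n = split(line, f, /[ \t,]+/)
            if (f[1] == "pe_list")
                for (i = 2; i <= n; i++)
                    if (f[i] == pe) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Intended use on a submit host:
#   qconf -sq Se.q | has_pe Se.openmpi || echo "Se.openmpi missing; add it with: qconf -mq Se.q"
```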

1.2 [2018-08-02]

Running qjob -j shows:

All queues dropped because of overload or full

Running qstat -g c shows:

CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
Ag.q                              0.63     20      0      8     32      0      4 
Cd.q                              0.67     20      0      4     32      0     12 
Ga.q                              -NA-      0      0      0     40      0     40 
Ge.q                              -NA-      4      0      0     52      0     52 
In.q                              -NA-      0      0      0     40      0     40 
Kr.q                              0.20     12      0      4     36      0     20 
Pd.q                              0.24      8      0      4     64      0     56 
Rh.q                              0.00      2      0      4     16      0     12 
Ru.q                              0.05     12      0     28     64      0     32 
Se.q                              0.34      8      0     12    120      0    100 
Sn.q                              -NA-      0      0      0     20      0     20 
Sr.q                              0.20      8      0     16     80      0     56 
Tc.q                              0.00     12      0     24     80      0     44

AVAIL is 0 (for Ga.q, Ge.q, In.q, and Sn.q in this listing).
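
To pick out the queues with no available slots without reading the table by eye, something like the following could be used. It is a sketch that assumes the column layout shown above, where AVAIL is the 5th field of each data row; full_queues is a name made up for this note.

```shell
#!/bin/sh
# full_queues: reads `qstat -g c` output on stdin and prints the names of
# cluster queues whose AVAIL column (5th field of a data row) is 0.
# The first two lines (column header and dashes) are skipped.
full_queues() {
    awk 'NR > 2 && NF >= 6 && $5 == 0 { print $1 }'
}

# Intended use:
#   qstat -g c | full_queues
```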

2 qmaster dies within about a minute

After I disconnected Ge02 from the network, sge_qmaster stopped dying on the master host. I then reinstalled gridengine-exec:

# apt-get remove  gridengine-exec
# apt-get purge   gridengine-exec
# apt-get install gridengine-exec

When I reconnected it to the network, qmaster started dying again.

Running /etc/init.d/gridengine-exec stop on Ge02 does not kill the process. After I killed it with pkill -9 sge_execd, qmaster stopped dying. Running /etc/init.d/gridengine-exec start made qmaster die immediately. On Ge02:

# apt-get purge   -y gridengine-exec
# apt-get purge   -y gridengine-common
# apt-get purge   -y gridengine-client
# apt-get install -y gridengine-exec

It still dies.

I tried commenting out the following line in /etc/hosts on Ge02, but qmaster still died.

#127.0.1.1  Ge02.calc.atom  Ge02

I watched /var/spool/gridengine/qmaster/messages. Entries are written at startup, but nothing is written when qmaster dies.

I don't understand the cause. Unresolved; giving up.

3 Jobs submitted with qsub from Re are not executed

qstat shows:

5016 0.50000 vasp-Se.qs ippei        Eqw   05/27/2014 21:13:59

qstat -j 5016 shows the following.

==============================================================
job_number:                 5016
exec_file:                  job_scripts/5016
submission_time:            Tue May 27 21:13:59 2014
owner:                      ippei
uid:                        1000
group:                      ippei
gid:                        1000
sge_o_home:                 /home/ippei
sge_o_log_name:             ippei
sge_o_path:                 /home/ippei/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/ippei/.rvm/bin:/home/ippei/.gem/ruby/1.8/bin:/home/ippei/.gem/ruby/1.9.1/bin:/home/ippei/opt/ATLAS/bin:/home/ippei/opt/cogue/bin:/home/ippei/opt/cogue_00/bin:/home/ippei/opt/cogue_01/bin:/home/ippei/opt/mediawiki-1.20.2/bin:/home/ippei/opt/phonopy-1.7.6/bin:/home/ippei/opt/ruby-1.9.3-p125/bin:/opt/bin:/opt/intel/bin:/opt/mendeleydesktop/bin:/opt/mpich2/bin:/opt/openmpi-intel/bin
sge_o_shell:                /usr/bin/zsh
sge_o_workdir:              /home/ippei/tmp/vasp/bench.Hg
sge_o_host:                 Re
account:                    sge
cwd:                        /home/ippei/tmp/vasp/bench.Hg
stderr_path_list:           NONE:NONE:stderr
mail_list:                  ippei@Re.atom
notify:                     FALSE
job_name:                   vasp-Se.qsub
stdout_path_list:           NONE:NONE:stdout
jobshare:                   0
hard_queue_list:            Se.q
shell_list:                 NONE:/bin/sh
env_list:
script_file:                vasp-Se.qsub
parallel environment:  Se.openmpi range: 4
error reason    1:          05/27/2014 21:14:07 [1000:9212]: error: can't chdir to /home/ippei/tmp/vasp/bench.Hg: No such file o
scheduling info:            queue instance "Ge.q@Ge04.calc.atom" dropped because it is temporarily not available
                            queue instance "Ge.q@Ge05.calc.atom" dropped because it is temporarily not available
                            queue instance "Ge.q@Ge13.calc.atom" dropped because it is temporarily not available
                            queue instance "Ge.q@Ge14.calc.atom" dropped because it is temporarily not available
                            queue instance "Kr.q@Kr02.calc.atom" dropped because it is temporarily not available
                            queue instance "Ga.q@Ga09.calc.atom" dropped because it is temporarily not available
                            queue instance "Se.q@Se10.calc.atom" dropped because it is temporarily not available
                            Job is in error state

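The error reason says the directory does not exist. Of course: ~/tmp is not shared over NFS, and that is the cause. ~/computation does not work either, because it is mounted at /mnt/Pt/… . Presumably a symlink is not enough and the filesystem really has to be mounted at that exact path. Submitting from Re is on hold indefinitely for now.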
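
If the symlink guess above is right, a pre-submit check could at least warn when the working directory is reached through a symlink. This is a minimal sketch built on that unverified guess: path_is_physical is a made-up helper, and it only compares the logical and physical paths on the submit host. It cannot tell whether the directory exists on the execution hosts.

```shell
#!/bin/sh
# path_is_physical DIR
# Exits 0 if DIR exists and its absolute path contains no symlinked component,
# i.e. the logical path (pwd -L) equals the physical path (pwd -P).
# Exits 1 if DIR is missing or is reached through a symlink.
path_is_physical() {
    logical=$(cd "$1" >/dev/null 2>&1 && pwd -L) || return 1
    physical=$(cd "$1" >/dev/null 2>&1 && pwd -P) || return 1
    [ "$logical" = "$physical" ]
}

# Intended use before qsub with -cwd:
#   path_is_physical "$PWD" || echo "warning: $PWD goes through a symlink"
```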

4 Jobs submitted to Ga with qsub run on only one machine

qstat shows:

5032 0.75000 vasp-Ga.qs ippei        r     05/27/2014 21:53:06 Ga.q@Ga01.calc.atom                4
5033 0.75000 vasp-Ga.qs ippei        qw    05/27/2014 21:30:50

For the first job in qw state, job ID 5033, the output of qstat -j 5033 ends with:

cannot run in PE "Ga.openmpi" because it only offers 0 slots

That is what came out. Checking with qconf -sp Ga.openmpi showed that slots was 4, so all the slots allowed by the PE were already in use. After I raised slots to 40, the job started immediately.
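
The slots value of a PE is a total shared by all jobs running in that PE at the same time, so one 4-slot job is enough to exhaust a PE with slots 4. A small sketch for reading the current value, assuming qconf -sp prints one "attribute value" pair per line (pe_slots is a made-up helper name):

```shell
#!/bin/sh
# pe_slots: reads a parallel environment definition (as printed by
# `qconf -sp PE_NAME`) on stdin and prints the value of its slots attribute.
pe_slots() {
    awk '$1 == "slots" { print $2 }'
}

# Intended use:
#   qconf -sp Ga.openmpi | pe_slots
```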

5 After qdel, the job stays in dr state and never disappears from qstat

Let JOB_ID be 5039. Even after qdel, the job stays in dr state and does not disappear from qstat. Running qdel again gives:

job 5039 is already in deletion

Try qdel -f 5039.

% qdel -f 5039
job 5039 is already in deletion
............

root@Ir # qdel 5039
job 5039 is already in deletion

That does not work either.

[/home/ippei]
root@Ir # qdel -f 5039

That removed it.
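
When several jobs are stuck like this, their IDs can be collected from the qstat listing first. The sketch below assumes the qstat row layout seen elsewhere in these notes, with the job ID in the 1st field and the state in the 5th; dr_jobs is a made-up helper name.

```shell
#!/bin/sh
# dr_jobs: reads plain `qstat` output on stdin and prints the IDs of jobs
# whose state column (5th field) is exactly "dr".
# Header and separator lines are skipped because their first field is not numeric.
dr_jobs() {
    awk '$1 ~ /^[0-9]+$/ && $5 == "dr" { print $1 }'
}

# Intended use, as root on the master host:
#   qstat | dr_jobs
# and then run qdel -f on each ID that is printed.
```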

6 A hostfile was provided that contains at least one node not

--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  machines
  node:      Kr04.calc.atom

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------

This error sometimes appears, for example, when a job is submitted through Grid Engine. Maybe a name-resolution problem?

Typing the command directly by hand seems to run fine, so maybe it has to do with how environment variables are passed.

Changing -np 16 to -np 4 made it work. The maximum number of processes for the Kr queue is set to 4, so the job was probably rejected because I tried to exceed that.

Was the cause that Kr was swapping because of old calculations? I rebooted it. No, that was not it.

So Grid Engine itself is dispatching the job correctly. It must be a problem on the openmpi side. Environment variables?

Comparing printenv output:

HOST

Should I not specify a machinefile at all?

I removed the machinefile part.

#! /bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -o stdout
#$ -e stderr
#$ -q Kr.q
#$ -pe Kr.openmpi 20

MACHINE_FILE="machines"

LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/opt/intel/mkl/lib/intel64:/opt/intel/lib/intel64:/opt/intel/lib:/opt/openmpi-intel/lib
export LD_LIBRARY_PATH

cd $SGE_O_WORKDIR
printenv | sort > printenv.log
#cut -d " " -f 1,2 $PE_HOSTFILE | sed 's/ / cpu=/' > $MACHINE_FILE

#/opt/openmpi-intel/bin/mpiexec -machinefile machines -np 20 /opt/bin/vasp5212openmpifast
/opt/openmpi-intel/bin/mpiexec -np 20 /opt/bin/vasp5212openmpifast



--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 20 slots
that were requested by the application:
  /opt/bin/vasp5212openmpifast

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

I tried setting slots to 20 with qconf -mq Kr.q.

?

Use the -V option? That did not help, but -V passes the environment variables along, so it seems safer to use it anyway.

The machinefile contains

Kr05.calc.atom cpu=4
Kr06.calc.atom cpu=4

so maybe the problem is that the hosts are written as FQDNs.

The contents of PE_HOSTFILE are

Kr07.calc.atom 4 Kr.q@Kr07.calc.atom UNDEFINED
Kr01.calc.atom 4 Kr.q@Kr01.calc.atom UNDEFINED

I tweaked the script so that the machinefile would be rewritten with short hostnames, as below, but that did not help.

Kr05 cpu=4
Kr06 cpu=4
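
For reference, the machinefile conversion tried above could be written as follows. This is an untested sketch, not a confirmed fix. It assumes that each PE_HOSTFILE line starts with the host name and the slot count, as in the listing above, and it writes slots=N, which is the hostfile keyword I know from the Open MPI documentation; I am not sure Open MPI honors the cpu=N form. The helper names are made up. Taking the process count from the granted slots (Grid Engine also exports it as NSLOTS) avoids a hard-coded -np that exceeds the allocation.

```shell
#!/bin/sh
# pe_hostfile_to_machines: reads PE_HOSTFILE-style lines on stdin
# ("host slots queue processor-range") and prints "host slots=N" lines.
pe_hostfile_to_machines() {
    awk 'NF >= 2 { print $1 " slots=" $2 }'
}

# pe_hostfile_total: prints the sum of the slot counts in the same input.
pe_hostfile_total() {
    awk 'NF >= 2 { total += $2 } END { print total + 0 }'
}

# Intended use inside a job script (untested on this cluster):
#   pe_hostfile_to_machines < "$PE_HOSTFILE" > machines
#   np=$(pe_hostfile_total < "$PE_HOSTFILE")
#   /opt/openmpi-intel/bin/mpiexec -machinefile machines -np "$np" /opt/bin/vasp5212openmpifast
```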

Come to think of it, I seem to remember that openmpi has to be built with some option in order to work with sge.
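
As far as I know, the option in question is --with-sge at configure time, and a build that has it lists a gridengine component in the output of ompi_info. The check below is a sketch under that assumption. has_gridengine_support is a made-up helper, and the ompi_info path is a guess based on where mpiexec lives on these machines.

```shell
#!/bin/sh
# has_gridengine_support: reads `ompi_info` output on stdin and exits 0 if
# any line mentions a gridengine component, 1 otherwise.
has_gridengine_support() {
    grep -q 'gridengine'
}

# Intended use (path is an assumption):
#   /opt/openmpi-intel/bin/ompi_info | has_gridengine_support \
#       && echo "gridengine support found" \
#       || echo "no gridengine support; a rebuild may be needed"
```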

7 error: commlib error: got select error (Connection refused)

[2014-08-29] qstat gives the following error.

error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "Ir.calc.atom": got send error

sge_qmaster had died on Ir. (Related: [[Grid Engine/セットアップ/マスターホスト]])

Try starting it with /etc/init.d/gridengine-master start. Sometimes this alone is enough to get it working again.

If it does not start...

# tail /var/spool/gridengine/qmaster/messages
08/29/2014 13:40:06|  main|Ir|W|local configuration Ir.calc.atom not defined - using global configuration
08/29/2014 13:40:06|  main|Ir|E|can't create queue "Kr.q": host "Kr00.calc.atom" is not known
08/29/2014 13:40:06|  main|Ir|I|read job database with 3 entries in 0 seconds
08/29/2014 13:40:06|  main|Ir|C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

The database seems to be corrupted. Let's reinstall gridengine.

apt-get purge gridengine-master
mv /etc/gridengine /etc/gridengine_old
mv /var/spool/gridengine /var/spool/gridengine_old
apt-get install -y gridengine-master

Then I redid the configuration.
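
In the messages excerpt above, the fatal entry was easy to miss among warnings and informational lines. A filter like the following keeps only the serious ones. It is a sketch that assumes the "|"-separated layout shown above, with the severity letter in the 4th field, and that E and C stand for error and critical; qmaster_errors is a made-up name.

```shell
#!/bin/sh
# qmaster_errors: reads a qmaster messages file on stdin and prints only
# the lines whose 4th "|"-separated field (the severity letter) is E or C.
qmaster_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"'
}

# Intended use on the master host:
#   qmaster_errors < /var/spool/gridengine/qmaster/messages | tail
```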

8 Jobs are not dispatched to Pd00-04

   ippei@Re % qstat -explain E | grep Pd                                 [16-06-14 16:45:13]
   120:Pd.q@Pd00.calc.atom            BIP   0/0/4          -NA-     lx26-amd64    auE
   121:  queue Pd.q marked QERROR as result of job 6223's failure at host Pd00.calc.atom
   123:Pd.q@Pd01.calc.atom            BIP   0/0/4          -NA-     lx26-amd64    auE
   124:  queue Pd.q marked QERROR as result of job 6048's failure at host Pd01.calc.atom
   126:Pd.q@Pd02.calc.atom            BIP   0/0/4          -NA-     lx26-amd64    auE
   127:  queue Pd.q marked QERROR as result of job 6048's failure at host Pd02.calc.atom
   129:Pd.q@Pd03.calc.atom            BIP   0/0/4          -NA-     lx26-amd64    auE
   130:  queue Pd.q marked QERROR as result of job 6050's failure at host Pd03.calc.atom
   132:Pd.q@Pd04.calc.atom            BIP   0/0/4          -NA-     lx26-amd64    auE
   133:  queue Pd.q marked QERROR as result of job 6051's failure at host Pd04.calc.atom
   135:Pd.q@Pd05.calc.atom            BIP   0/4/4          7.90     lx26-amd64
   137:Pd.q@Pd06.calc.atom            BIP   0/0/4          0.67     lx26-amd64
   139:Pd.q@Pd07.calc.atom            BIP   0/4/4          7.84     lx26-amd64
   141:Pd.q@Pd08.calc.atom            BIP   0/4/4          7.89     lx26-amd64

Maybe past jobs have not been released? I don't really know, but I worked around it by reassigning the IP addresses and hostnames.
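
For next time, the queue instances in error state can be listed directly from qstat -f. The sketch below assumes the row layout shown above (without the leading line numbers), where the states are the 6th field; error_queues is a made-up helper. The usual way to clear an E state is qmod -cq on the queue instance, which was not tried here. The u in auE, as I understand it, means that qmaster cannot reach the execution daemon, and clearing the error state would not fix that part.

```shell
#!/bin/sh
# error_queues: reads `qstat -f` output on stdin and prints the queue
# instances (first field, containing "@") whose states field contains E.
# Rows without a states field have only 5 fields and are skipped.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Intended use:
#   qstat -f | error_queues
# then, once the cause is fixed, clear each one with: qmod -cq QUEUE_INSTANCE
```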