Encountering an MPI worker that exited on signal 9

2020-08-06 10:46:17

While running an mpi-operator demo (one that I submitted myself, no less), I ran into the following error.

An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: mpi-sleep-worker-0
  Local PID:  99
  Peer host:  mpi-sleep-worker-1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 58 on node mpi-sleep-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

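After a long look, it turned out that the Worker was configured with too little memory (previously only 1Gi), which is why the process was killed with signal 9 (SIGKILL). If you want to run this demo, increase the Worker's memory to 2Gi.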
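When the mpirun message is not at hand, the same signal can often be recovered from a container's exit code. By shell convention, a process killed by a signal is reported with exit code 128 plus the signal number, so a kill by signal 9 shows up as 137. The following is a minimal sketch of that mapping, assuming a Linux host; the helper name `describe_exit_code` is hypothetical and is not part of mpi-operator or Open MPI.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Hypothetical helper: translate a shell-style exit code into a signal name.

    Assumes the 128 + signal-number convention used by shells and
    container runtimes on Linux.
    """
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # signal.Signals is a standard-library enum of known signals.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exited normally with status {exit_code}"


# 137 = 128 + 9, i.e. the process was killed by signal 9.
print(describe_exit_code(137))
```

A SIGKILL by itself does not prove a memory problem, since anything can send signal 9, but in a container with a tight memory limit it is a good reason to check the limit first.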
