While running an mpi-operator demo (one that I had submitted myself…), I hit the following error.
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: mpi-sleep-worker-0
Local PID: 99
Peer host: mpi-sleep-worker-1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 58 on node mpi-sleep-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
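After staring at it for quite a while, I found that the Worker had been configured with too little memory (previously only 1Gi). If you want to run this demo, increase the Worker's memory to 2Gi.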
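The clue in the log is the last line: a rank that "exited on signal 9 (Killed)" was terminated from outside, which is consistent with the process running past its memory limit, although SIGKILL can have other causes. Below is a minimal sketch for spotting that line in a saved mpirun log. The function name `find_killed_rank` and the regular expression are my own illustration, not part of mpi-operator or Open MPI, and the pattern assumes the message wording shown above.

```python
import re

# Matches the summary line mpirun printed in the log above.
# Assumption: the wording is the same as in that log; other Open MPI
# versions may phrase it differently.
MPIRUN_EXIT_RE = re.compile(
    r"mpirun noticed that process rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>\S+) exited on signal (?P<signal>\d+)"
)


def find_killed_rank(log_text):
    """Return rank, PID and node of a process that exited on signal 9, or None."""
    match = MPIRUN_EXIT_RE.search(log_text)
    if match is None or int(match.group("signal")) != 9:
        return None
    return {
        "rank": int(match.group("rank")),
        "pid": int(match.group("pid")),
        "node": match.group("node"),
    }


log = (
    "mpirun noticed that process rank 1 with PID 58 "
    "on node mpi-sleep-worker-1 exited on signal 9 (Killed)."
)
print(find_killed_rank(log))
```

If this reports a node, a reasonable next step is `kubectl describe pod` on the matching Worker pod (assuming the pod name is the same as the host name in the log) and comparing its memory limit against what the demo actually needs.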