Container exited with a non-zero exit code 134

2022-09-22 16:27:34

Background: the logs reported for this issue are shown below.

1. The job's relevant logs are as follows:

(Screenshot: problem log reported by the customer)

1.1 The customer reported that this log appeared in multiple jobs.

The spark-submit command provided by the customer is as follows:

spark-submit \
  --driver-class-path "$yarn_client_driver_classpath" \
  --jars "$extraJars" --files "$extraFiles" \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.port.maxRetries=30 \
  --conf spark.shuffle.file.buffer=96k \
  --conf spark.reducer.maxSizeInFlight=96m \
  --conf spark.task.maxFailures=20 \
  --conf spark.network.timeout=500s \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC" \
  --conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC" \
  --master yarn --deploy-mode "$yarn_deploy_mode" \
  $kerberos_args \
  "$@"

1.2 The relevant log is as follows:

exec /bin/bash -c "LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH" $JAVA_HOME/bin/java -server -Xmx10240m '-Dfile.encoding=UTF-8' '-XX: UseG1GC' -Djava.io.tmpdir=$PWD/tmp '-Dspark.port.maxRetries=30' '-Dspark.network.timeout=500s' '-Dspark.driver.port=46243' -Dspark.yarn.app.container.log.dir=/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.1.18:46243 --executor-id 67 --hostname 172.31.1.12 --cores 1 --app-id application_1662701224474_3019 --user-class-path file:$PWD/__app__.jar --user-class-path file:$PWD/amqp-client-5.4.3.jar --user-class-path file:$PWD/ant-1.9.1.jar --user-class-path file:$PWD/aviator-3.3.0.jar --user-class-path file:$PWD/aws-java-sdk-core-1.11.60.jar --user-class-path file:$PWD/aws-java-sdk-s3-1.11.60.jar --user-class-path file:$PWD/azure-storage-8.2.0.jar --user-class-path file:$PWD/caffeine-2.9.0.jar --user-class-path file:$PWD/commons-compress-1.8.1.jar --user-class-path file:$PWD/commons-csv-1.8.jar --user-class-path file:$PWD/commons-email-1.3.1.jar --user-class-path file:$PWD/commons-exec-1.1.jar --user-class-path file:$PWD/commons-lang-2.4.jar --user-class-path file:$PWD/commons-lang3-3.5.jar --user-class-path file:$PWD/commons-pool2-2.6.2.jar --user-class-path file:$PWD/cos_api-5.6.55.jar --user-class-path file:$PWD/gson-2.6.2.jar --user-class-path file:$PWD/guava-15.0.jar --user-class-path file:$PWD/hbase-client-2.3.5.jar --user-class-path file:$PWD/hbase-common-2.3.5.jar --user-class-path file:$PWD/hbase-hadoop2-compat-2.3.5.jar --user-class-path file:$PWD/hbase-hadoop-compat-2.3.5.jar --user-class-path file:$PWD/hbase-mapreduce-2.3.5.jar --user-class-path file:$PWD/hbase-metrics-2.3.5.jar --user-class-path file:$PWD/hbase-metrics-api-2.3.5.jar --user-class-path file:$PWD/hbase-protocol-2.3.5.jar --user-class-path file:$PWD/hbase-protocol-shaded-2.3.5.jar --user-class-path file:$PWD/hbase-server-2.3.5.jar --user-class-path file:$PWD/hbase-shaded-miscellaneous-3.3.0.jar --user-class-path file:$PWD/hbase-shaded-netty-3.3.0.jar --user-class-path file:$PWD/hbase-shaded-protobuf-3.3.0.jar --user-class-path file:$PWD/hbase-zookeeper-2.3.5.jar --user-class-path file:$PWD/httpclient-4.5.13.jar --user-class-path file:$PWD/httpcore-4.4.5.jar --user-class-path file:$PWD/insight-shaded-guava-15.0.jar --user-class-path file:$PWD/jackson-dataformat-yaml-2.12.5.jar --user-class-path file:$PWD/javassist-3.28.0-GA.jar --user-class-path file:$PWD/jboss-marshalling-2.0.11.Final.jar --user-class-path file:$PWD/jboss-marshalling-river-2.0.11.Final.jar --user-class-path file:$PWD/jedis-3.1.0.jar --user-class-path file:$PWD/joda-time-2.8.1.jar --user-class-path file:$PWD/lombok-1.18.20.jar --user-class-path file:$PWD/mail-1.4.5.jar --user-class-path file:$PWD/memory-0.12.1.jar --user-class-path file:$PWD/nacos-api-1.3.2.jar --user-class-path file:$PWD/nacos-client-1.3.2.jar --user-class-path file:$PWD/nacos-common-1.3.2.jar --user-class-path file:$PWD/opencsv-2.3.jar --user-class-path file:$PWD/redisson-3.16.3.jar --user-class-path file:$PWD/reflections-0.10.2.jar --user-class-path file:$PWD/RoaringBitmap-0.6.44.jar --user-class-path file:$PWD/simpleclient-0.5.0.jar --user-class-path file:$PWD/sketches-core-0.13.0.jar --user-class-path file:$PWD/spring-core-4.1.8.RELEASE.jar --user-class-path file:$PWD/ua_uc_check-1.0.0.jar --user-class-path file:$PWD/UserAgentUtils-1.20.jar 
--user-class-path file:$PWD/zookeeper-3.5.7.jar 1>/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076/stdout 2>/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076/stderr"

End of LogType:launch_container.sh

************************************************************************************
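Note the quoted '-XX: UseG1GC' in the java command above: the option contains a space. A quick way to spot it in the container launch script (a sketch; assumes you run it in the directory holding launch_container.sh):

grep -o "'-XX:[^']*'" launch_container.sh
# prints: '-XX: UseG1GC'   (note the space after the colon)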

1.3 YARN executor launch context:

env:

CLASSPATH -> $HADOOP_HOME/lib/<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>/usr/local/service/hadoop/etc/hadoop<CPS>/usr/local/service/hadoop/share/hadoop/common/*<CPS>/usr/local/service/hadoop/share/hadoop/common/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/hdfs/*<CPS>/usr/local/service/hadoop/share/hadoop/hdfs/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/mapreduce/*<CPS>/usr/local/service/hadoop/share/hadoop/mapreduce/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/yarn/*<CPS>/usr/local/service/hadoop/share/hadoop/yarn/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__

SPARK_YARN_STAGING_DIR -> hdfs://HDFS19683/user/hadoop/.sparkStaging/application_1662701224474_3019

SPARK_USER -> hadoop

command:

LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH"

{{JAVA_HOME}}/bin/java

-server

-Xmx10240m

'-Dfile.encoding=UTF-8'

'-XX: UseG1GC'

-Djava.io.tmpdir={{PWD}}/tmp

'-Dspark.port.maxRetries=30'

'-Dspark.network.timeout=500s'

'-Dspark.driver.port=46243'

-Dspark.yarn.app.container.log.dir=<LOG_DIR>

-XX:OnOutOfMemoryError='kill %p'

org.apache.spark.executor.CoarseGrainedExecutorBackend

--driver-url

spark://CoarseGrainedScheduler@****:46243

--executor-id

<executorId>

--hostname

<hostname>

--cores

2. The fix is as follows:

2.1) Removing both of the -XX: UseG1GC options below resolved the issue. Note that the option is malformed as written: the stray space after -XX: makes it invalid (the intended flag was presumably -XX:+UseG1GC, with a +). A corrected sketch follows the two lines below.

--conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC"

--conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC"
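For reference, a minimal corrected sketch, assuming G1 was the intended collector (the + enables a boolean JVM flag):

--conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX:+UseG1GC" \
--conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX:+UseG1GC"

The failure can also be reproduced outside Spark; with the embedded space the option is invalid and the JVM fails to start:

java '-XX: UseG1GC' -version   # invalid option, the JVM does not start
java -XX:+UseG1GC -version     # correct form, prints the JVM version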


3. Root-cause analysis:

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

This message comes from the JVM itself. Exit code 134 is 128 + 6, i.e. the process was terminated by SIGABRT; on such a fatal error the JVM attempts to write a core dump. You can inspect the limits on the affected machine with the ulimit -a command:

-t: cpu time (seconds) unlimited

-f: file size (blocks) unlimited

-d: data seg size (kbytes) unlimited

-s: stack size (kbytes) 8192

-c: core file size (blocks) 0

-v: address space (kbytes) unlimited

-l: locked-in-memory size (kbytes) unlimited

-u: processes 1392

-n: file descriptors 2560

The "Failed to write core dump" message appears because the core file size limit is 0 (-c: core file size (blocks) 0), so no core file can be written when the process aborts.
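If a core dump is wanted for further debugging, it can be enabled in the shell that launches the process; a minimal sketch for the current session only (persisting the limit, e.g. via /etc/security/limits.conf, is a separate step):

ulimit -c unlimited   # lift the core file size limit for this shell
ulimit -c             # verify: should now print "unlimited"
kill -l 6             # decode signal 6: prints ABRT (134 = 128 + 6)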

4. Next, let's look at the difference between the G1 (UseG1GC) and CMS collectors.

-XX:+UseG1GC (G1) vs. -XX:+UseConcMarkSweepGC (CMS)

For a comparison of the two collectors, see: https://blog.chriscs.com/2017/06/20/g1-vs-cms/
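For context, each collector is selected with a boolean + flag. G1 has been the default collector since JDK 9; CMS was deprecated in JDK 9 and removed in JDK 14, so the CMS line below only works on JDKs that still ship it:

java -XX:+UseG1GC -version               # select the G1 collector
java -XX:+UseConcMarkSweepGC -version    # select CMS (older JDKs only)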

Note that this error still occurs even after adjusting the value of spark.storage.memoryFraction.

The principle behind this parameter is as follows:

spark.storage.memoryFraction is the parameter Spark uses to cap the total size of cached RDDs: the cache is kept below the heap size multiplied by this value. The unused portion of the RDD cache fraction is also available to the rest of the JVM, so GC analysis of a Spark application should cover memory usage in both fractions.
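A sketch of tuning it at submit time. Note that spark.storage.memoryFraction belongs to the legacy memory manager (Spark before 1.6, or newer versions with spark.memory.useLegacyMode=true); its default is 0.6, and 0.4 below is only an illustrative value:

spark-submit \
  --conf spark.storage.memoryFraction=0.4 \
  ...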

When GC pauses are observed to hurt efficiency, we should first check that the Spark application is using its limited memory space effectively:

the less memory the RDDs occupy, the more heap space remains for program execution, which improves GC efficiency;

conversely, excessive memory consumption by RDDs causes a significant performance penalty, because the old generation fills up with long-lived cached objects. One common mitigation is sketched below.
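One common way to shrink the RDD heap footprint is to cache RDDs in serialized form (StorageLevel.MEMORY_ONLY_SER in the application code) and switch to Kryo serialization, trading some CPU for a much smaller and more GC-friendly heap. A sketch of the submit-side configuration:

spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  ...

spark.rdd.compress additionally compresses serialized cached partitions, again at the cost of some extra CPU.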

Related references:

https://blog.chriscs.com/2017/06/20/g1-vs-cms/

https://www.cnblogs.com/qingyunzong/p/8973857.html

https://www.databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
