Background --- the logs observed for this issue are shown below:
1. The task-related logs are as follows:
1.1 Multiple tasks were reported to show the log message in question.
The spark-submit command provided by the customer is as follows:
spark-submit \
--driver-class-path "$yarn_client_driver_classpath" \
--jars "$extraJars" --files "$extraFiles" \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.userClassPathFirst=true \
--conf spark.port.maxRetries=30 \
--conf spark.shuffle.file.buffer=96k \
--conf spark.reducer.maxSizeInFlight=96m \
--conf spark.task.maxFailures=20 \
--conf spark.network.timeout=500s \
--conf spark.yarn.maxAppAttempts=3 \
--conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC" \
--conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC" \
--master yarn --deploy-mode "$yarn_deploy_mode" \
$kerberos_args \
"$@"
1.2 The relevant log (from launch_container.sh) is as follows:
exec /bin/bash -c "LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH" $JAVA_HOME/bin/java -server -Xmx10240m '-Dfile.encoding=UTF-8' '-XX: UseG1GC' -Djava.io.tmpdir=$PWD/tmp '-Dspark.port.maxRetries=30' '-Dspark.network.timeout=500s' '-Dspark.driver.port=46243' -Dspark.yarn.app.container.log.dir=/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.1.18:46243 --executor-id 67 --hostname 172.31.1.12 --cores 1 --app-id application_1662701224474_3019 --user-class-path file:$PWD/__app__.jar --user-class-path file:$PWD/amqp-client-5.4.3.jar --user-class-path file:$PWD/ant-1.9.1.jar --user-class-path file:$PWD/aviator-3.3.0.jar --user-class-path file:$PWD/aws-java-sdk-core-1.11.60.jar --user-class-path file:$PWD/aws-java-sdk-s3-1.11.60.jar --user-class-path file:$PWD/azure-storage-8.2.0.jar --user-class-path file:$PWD/caffeine-2.9.0.jar --user-class-path file:$PWD/commons-compress-1.8.1.jar --user-class-path file:$PWD/commons-csv-1.8.jar --user-class-path file:$PWD/commons-email-1.3.1.jar --user-class-path file:$PWD/commons-exec-1.1.jar --user-class-path file:$PWD/commons-lang-2.4.jar --user-class-path file:$PWD/commons-lang3-3.5.jar --user-class-path file:$PWD/commons-pool2-2.6.2.jar --user-class-path file:$PWD/cos_api-5.6.55.jar --user-class-path file:$PWD/gson-2.6.2.jar --user-class-path file:$PWD/guava-15.0.jar --user-class-path file:$PWD/hbase-client-2.3.5.jar --user-class-path file:$PWD/hbase-common-2.3.5.jar --user-class-path file:$PWD/hbase-hadoop2-compat-2.3.5.jar --user-class-path file:$PWD/hbase-hadoop-compat-2.3.5.jar --user-class-path file:$PWD/hbase-mapreduce-2.3.5.jar --user-class-path file:$PWD/hbase-metrics-2.3.5.jar --user-class-path file:$PWD/hbase-metrics-api-2.3.5.jar --user-class-path file:$PWD/hbase-protocol-2.3.5.jar --user-class-path file:$PWD/hbase-protocol-shaded-2.3.5.jar --user-class-path file:$PWD/hbase-server-2.3.5.jar --user-class-path file:$PWD/hbase-shaded-miscellaneous-3.3.0.jar --user-class-path file:$PWD/hbase-shaded-netty-3.3.0.jar --user-class-path file:$PWD/hbase-shaded-protobuf-3.3.0.jar --user-class-path file:$PWD/hbase-zookeeper-2.3.5.jar --user-class-path file:$PWD/httpclient-4.5.13.jar --user-class-path file:$PWD/httpcore-4.4.5.jar --user-class-path file:$PWD/insight-shaded-guava-15.0.jar --user-class-path file:$PWD/jackson-dataformat-yaml-2.12.5.jar --user-class-path file:$PWD/javassist-3.28.0-GA.jar --user-class-path file:$PWD/jboss-marshalling-2.0.11.Final.jar --user-class-path file:$PWD/jboss-marshalling-river-2.0.11.Final.jar --user-class-path file:$PWD/jedis-3.1.0.jar --user-class-path file:$PWD/joda-time-2.8.1.jar --user-class-path file:$PWD/lombok-1.18.20.jar --user-class-path file:$PWD/mail-1.4.5.jar --user-class-path file:$PWD/memory-0.12.1.jar --user-class-path file:$PWD/nacos-api-1.3.2.jar --user-class-path file:$PWD/nacos-client-1.3.2.jar --user-class-path file:$PWD/nacos-common-1.3.2.jar --user-class-path file:$PWD/opencsv-2.3.jar --user-class-path file:$PWD/redisson-3.16.3.jar --user-class-path file:$PWD/reflections-0.10.2.jar --user-class-path file:$PWD/RoaringBitmap-0.6.44.jar --user-class-path file:$PWD/simpleclient-0.5.0.jar --user-class-path file:$PWD/sketches-core-0.13.0.jar --user-class-path file:$PWD/spring-core-4.1.8.RELEASE.jar --user-class-path file:$PWD/ua_uc_check-1.0.0.jar --user-class-path file:$PWD/UserAgentUtils-1.20.jar 
--user-class-path file:$PWD/zookeeper-3.5.7.jar 1>/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076/stdout 2>/data/emr/yarn/logs/application_1662701224474_3019/container_e20_1662701224474_3019_01_000076/stderr"
End of LogType:launch_container.sh
************************************************************************************
1.3 YARN executor launch context:
env:
CLASSPATH -> $HADOOP_HOME/lib/<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>/usr/local/service/hadoop/etc/hadoop<CPS>/usr/local/service/hadoop/share/hadoop/common/*<CPS>/usr/local/service/hadoop/share/hadoop/common/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/hdfs/*<CPS>/usr/local/service/hadoop/share/hadoop/hdfs/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/mapreduce/*<CPS>/usr/local/service/hadoop/share/hadoop/mapreduce/lib/*<CPS>/usr/local/service/hadoop/share/hadoop/yarn/*<CPS>/usr/local/service/hadoop/share/hadoop/yarn/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> hdfs://HDFS19683/user/hadoop/.sparkStaging/application_1662701224474_3019
SPARK_USER -> hadoop
command:
LD_LIBRARY_PATH="$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH"
{{JAVA_HOME}}/bin/java
-server
-Xmx10240m
'-Dfile.encoding=UTF-8'
'-XX: UseG1GC'
-Djava.io.tmpdir={{PWD}}/tmp
'-Dspark.port.maxRetries=30'
'-Dspark.network.timeout=500s'
'-Dspark.driver.port=46243'
-Dspark.yarn.app.container.log.dir=<LOG_DIR>
-XX:OnOutOfMemoryError='kill %p'
org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url
spark://CoarseGrainedScheduler@****:46243
--executor-id
<executorId>
--hostname
<hostname>
--cores
2. Solution:
2.1 Removing both of the -XX: UseG1GC options below resolved the issue. (The option as submitted is malformed: the correct HotSpot syntax is -XX:+UseG1GC, with a '+' and no space, so the JVM cannot parse ' UseG1GC' and the executor fails to start.)
--conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC"
--conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX: UseG1GC"
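If G1 is actually wanted rather than simply dropped, the corrected form of the two options (everything else in the submit script unchanged) would be:
--conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -XX:+UseG1GC"
--conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -XX:+UseG1GC"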
3. Root-cause analysis:
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
This is a JVM-level message; run the ulimit -a command on the affected machine to check the local limits:
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 1392
-n: file descriptors 2560
The relevant entry here is the one reported by ulimit -c, namely "-c: core file size (blocks) 0": with the limit set to 0, the JVM cannot write a core dump when it crashes, which is exactly what the message above states.
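As a minimal sketch for reproducing and capturing the crash only (whether to enable core dumps permanently on production NodeManager hosts is a separate decision, and the container user name below is an assumption about this environment):
# check the current limit; 0 means no core file will be written
ulimit -c
# allow unlimited core files for processes started from this shell
ulimit -c unlimited
# to make the change persistent for the container user, add lines like these to /etc/security/limits.conf
#   hadoop  soft  core  unlimited
#   hadoop  hard  core  unlimited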
4. Next, a brief look at the difference between G1 (-XX:+UseG1GC) and CMS (-XX:+UseConcMarkSweepGC).
For a comparison of the two collectors, see: https://blog.chriscs.com/2017/06/20/g1-vs-cms/
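For context, the Databricks post in the references below walks through G1 tuning for Spark executors. A hedged starting point along those lines (the concrete values are illustrative rather than tuned for this job, and the GC-logging flags assume JDK 8) would be:
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4"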
This error still occurs even after adjusting the value of spark.storage.memoryFraction.
The reasoning is as follows:
spark.storage.memoryFraction is the parameter Spark uses to cap the total size of cached RDDs: the cache may not exceed the heap size multiplied by this fraction. The JVM can also use the unused part of the RDD-cache fraction, so GC analysis of a Spark application should cover the memory usage of both fractions.
When GC latency is observed and efficiency drops, first check that the Spark application is using its limited memory space effectively:
the less memory the RDDs occupy, the more heap space is left for program execution, which improves GC efficiency;
conversely, excessive memory consumption by RDDs causes a significant performance loss, because large numbers of cached objects accumulate in the old generation.
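A rough worked example of this trade-off, using the 10 GB executor heap from the -Xmx10240m seen in the log above and the documented default of 0.6 for spark.storage.memoryFraction (note that this parameter belongs to the legacy, pre-1.6 memory model and in Spark 1.6+ only applies when spark.memory.useLegacyMode=true):
executor heap (-Xmx10240m)                       = 10240 MB
RDD cache cap = 10240 MB x 0.6                   ≈ 6144 MB
heap left for task execution and GC headroom     ≈ 4096 MB
with spark.storage.memoryFraction=0.4 instead:   cache cap ≈ 4096 MB, execution heap ≈ 6144 MB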
Related references:
https://blog.chriscs.com/2017/06/20/g1-vs-cms/
https://www.cnblogs.com/qingyunzong/p/8973857.html
https://www.databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html