Note: the environment is Tencent Cloud EMR 3.3.0, which ships Spark 3.0.2.
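Troubleshooting process:
An hourly Spark SQL job on the EMR cluster fails intermittently. The driver log shows the error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree
The log of the corresponding YARN application shows the executors continuously sending the information they create (execution steps, broadcast variables) to the driver.
The timestamps show that from 16:16:37 to 16:16:44 the executors kept sending this information (execution steps, broadcast variables) to the driver, and the corresponding web UI also shows a large number of broadcast variables on the driver. At 16:16:45 the driver reported the error.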
The frame in the stack trace that points to the relevant code is org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:169)
Stack trace:
Caused by: org.apache.spark.util.SparkFatalException
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Corresponding code:
https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala
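Because the job fails only occasionally, it can help to check each failed run's driver log for the same signature. The following is a minimal sketch, assuming the driver log is available as plain text; the function name `looks_like_broadcast_failure` is hypothetical, and the patterns are taken from the stack trace quoted above. A match only indicates that building the broadcast relation hit a fatal error; it does not by itself prove an OOM.

```python
import re

# Exception class and frame pattern copied from the stack trace in this post.
FATAL_EXCEPTION = "org.apache.spark.util.SparkFatalException"
BROADCAST_FRAME = re.compile(
    r"BroadcastExchangeExec\.\$anonfun\$relationFuture\$\d+"
    r"\(BroadcastExchangeExec\.scala:\d+\)"
)


def looks_like_broadcast_failure(log_text, lookahead=3):
    """Return True if a SparkFatalException line is followed, within
    `lookahead` lines, by a frame inside BroadcastExchangeExec.relationFuture."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if FATAL_EXCEPTION not in line:
            continue
        # Inspect the frames directly below the exception line.
        for frame in lines[i + 1 : i + 1 + lookahead]:
            if BROADCAST_FRAME.search(frame):
                return True
    return False
```

Running this over the logs of several failed runs gives a quick indication of whether they share the broadcast-related cause before any configuration is changed.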
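Conclusion:
The error is caused by an out-of-memory (OOM) condition on the driver. The broadcast relation is built on the driver, and this code path throws SparkFatalException when building it hits a fatal error such as OutOfMemoryError.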
Solutions:
1. Disable automatic broadcast joins (set spark.sql.autoBroadcastJoinThreshold = -1);
2. Increase spark.driver.memory, for example to 4g.
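The two settings are applied at different points: spark.sql.autoBroadcastJoinThreshold can be changed inside a session, whereas the driver heap is fixed when the driver JVM starts, so the driver memory is normally passed at launch time. As a minimal sketch, assuming the job is launched through the `spark-sql` command line with a SQL file, the launch command could be assembled like this. The helper `build_spark_sql_command` and the file name `hourly_job.sql` are hypothetical; `--driver-memory`, `--conf`, and `-f` are standard `spark-sql` options.

```python
import shlex


def build_spark_sql_command(sql_file, driver_memory="4g", disable_auto_broadcast=True):
    """Assemble a spark-sql invocation that applies the two mitigations above."""
    cmd = ["spark-sql", "--driver-memory", driver_memory]
    if disable_auto_broadcast:
        # -1 turns off automatic conversion of joins to broadcast joins.
        cmd += ["--conf", "spark.sql.autoBroadcastJoinThreshold=-1"]
    cmd += ["-f", sql_file]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so it can be reviewed first.
    command = build_spark_sql_command("hourly_job.sql")
    print(" ".join(shlex.quote(arg) for arg in command))
```

Either mitigation may be enough on its own. Disabling automatic broadcast joins trades the driver-side memory pressure for shuffle-based joins, which can be slower for small tables.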