Troubleshooting a Spark SQL error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException

2022-12-11 18:19:29

Note: the environment is Tencent Cloud EMR 3.3.0, which ships Spark 3.0.2.

Investigation:

An hourly Spark SQL job on the EMR cluster fails intermittently. The driver log shows the error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree

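The logs of the corresponding YARN application show the executors continuously sending the information they create (execution steps, broadcast variables) to the driver.

Judging by the timestamps, this went on from 16:16:37 to 16:16:44, and the corresponding web UI also shows a large number of broadcast variables on the driver. At 16:16:45 the driver reported the error.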

The error stack points to this code location: org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:169)

Error stack:

Caused by: org.apache.spark.util.SparkFatalException
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Corresponding code:

https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala

Finding:

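The error was caused by an out-of-memory (OOM) condition on the driver. In the linked branch-3.0 source, relationFuture collects and builds the broadcast relation on the driver, and an OutOfMemoryError raised during that step is rethrown wrapped in SparkFatalException, which is consistent with the stack above.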

Solutions:

1. Disable automatic broadcast joins (set spark.sql.autoBroadcastJoinThreshold = -1);

2. Increase the value of spark.driver.memory, for example to 4g.
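
The two settings above can also be passed at launch time. The following is only a minimal sketch, assuming the hourly job is started through the spark-sql command line with a SQL file. The helper name build_spark_sql_command and the file name hourly_job.sql are hypothetical and are not taken from the original job.

```python
import shlex


def build_spark_sql_command(sql_file, driver_memory="4g", disable_auto_broadcast=True):
    """Build the argv list for a spark-sql run with the two workarounds applied.

    sql_file is a hypothetical path to the job's SQL script.
    """
    # --driver-memory sets spark.driver.memory before the driver JVM starts.
    cmd = ["spark-sql", "--driver-memory", driver_memory]
    if disable_auto_broadcast:
        # A threshold of -1 disables automatic broadcast joins.
        cmd += ["--conf", "spark.sql.autoBroadcastJoinThreshold=-1"]
    # -f runs the statements in the given SQL file.
    cmd += ["-f", sql_file]
    return cmd


if __name__ == "__main__":
    # Print the command as a single shell-safe line for inspection.
    print(shlex.join(build_spark_sql_command("hourly_job.sql")))
```

The driver memory is passed on the command line here because, in client mode, the driver JVM is already running by the time the SQL script executes, so a SET statement for spark.driver.memory inside the script would not change its heap size. The broadcast threshold, by contrast, can be changed per session with SET as shown in solution 1.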
