Spark Commands Explained

2021-01-27 10:51:27

In this post, Alice walks you through Spark's command-line tools in detail.

spark-shell

  • Introduction

So far we have been submitting tasks through spark-shell. spark-shell is Spark's built-in interactive shell, designed for interactive programming: you can write Spark programs in Scala directly at this command line, which makes it ideal for learning and testing!

  • Example

spark-shell can take parameters:

spark-shell --master local[N]   run the current task locally with N simulated worker threads

spark-shell --master local[*]   use all available cores on the current machine

With no parameters, the default is --master local[*]

spark-shell --master spark://node01:7077,node02:7077   run on a Standalone cluster

spark-submit

  • Introduction

Interactive programming with spark-shell is certainly convenient for learning and testing, but in practice we usually develop Spark applications in IDEA, package them into a jar, and hand that jar to a Spark cluster or YARN to execute. So we also need to learn the spark-submit command, which submits our jar to the Spark cluster/YARN.

  • Example

We can try one of the algorithms bundled with Spark, for example estimating π with a Monte Carlo method: by simulating a large number of random points, the computer arrives at a fairly accurate value of π.

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node-1:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.0.2.jar
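The idea behind SparkPi is simple: throw random points at the unit square and count how many land inside the quarter circle, so that π ≈ 4 × (hits / total). A minimal sketch of that algorithm in plain Python (an illustration only, not the Spark implementation):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: sample points uniformly in the unit
    square and count those inside the quarter circle x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # close to 3.14
```

The more samples you draw, the closer the estimate gets to π, which is exactly why Spark parallelizes the sampling across executors.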

Parameter summary

  • Master URL formats

http://spark.apache.org/docs/latest/submitting-applications.html

| Master URL | Meaning |
| --- | --- |
| local | Run locally with one worker thread (i.e., no parallelism at all). |
| local[N] | Run locally with N worker threads (ideally, set N to the number of CPU cores on your machine). |
| local[*] | Run locally with as many worker threads as the machine has cores. |
| spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be the one your master is configured with; the default is 7077. |
| mesos://HOST:PORT | Connect to the given Mesos cluster. The port is the one Mesos is configured with, 5050 by default. For Mesos clusters using ZooKeeper, use the form mesos://zk://… |
| yarn-client | Connect to a YARN cluster in client mode. The cluster location is found via the HADOOP_CONF_DIR variable. |
| yarn-cluster | Connect to a YARN cluster in cluster mode. The cluster location is found via the HADOOP_CONF_DIR variable. |

  • Other parameters

We can also look them up from the shell with spark-submit --help:

$ bin/spark-submit --help
 
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
 
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
 
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
 
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
 
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
 
  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.
 
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.
 
 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
 
 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.
 
 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.
 
 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)
 
 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

Parameter notes:

--master spark://node01:7077,node02:7077   the Master address(es)

--executor-memory 1g   1 GB of memory available to each executor

--total-executor-cores 2   a total of 2 CPU cores for the whole application

  • Example 2

--master spark://node01:7077   the Master address

--name "appName"   the name the application runs under

--class   the class containing the program's main method

--jars xx.jar   extra jars the program uses

--driver-memory 512m   memory needed by the driver (default 1g)

--executor-memory 2g   memory available to each executor (default 1g)

--executor-cores 1   number of cores available to each executor

--total-executor-cores 2   a total of 2 CPU cores for the job
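Put together, the flags above form a single spark-submit invocation. As a sketch, here is how the argument list could be assembled programmatically, e.g. before passing it to subprocess (the master URL, class name, and jar path below are placeholders, not values from any real cluster):

```python
# Build a spark-submit argument list from the flags discussed above.
# All concrete values (master URL, class, jar) are placeholders.
def build_submit_command(master, main_class, jar,
                         name="appName",
                         driver_memory="512m",
                         executor_memory="2g",
                         executor_cores=1,
                         total_executor_cores=2):
    return [
        "bin/spark-submit",
        "--master", master,
        "--name", name,
        "--class", main_class,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--total-executor-cores", str(total_executor_cores),
        jar,  # the application jar always comes last, before app arguments
    ]

cmd = build_submit_command("spark://node01:7077",
                           "com.example.MyApp",  # hypothetical main class
                           "myapp.jar")          # hypothetical jar path
print(" ".join(cmd))
```

Keeping the command as a list of arguments avoids shell-quoting issues when the submission is scripted.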

  • Notes:

If a worker node is short of memory, then when launching spark-shell you cannot allocate more memory to an executor than the worker has available; size your task's resources according to your workers' actual capacity.

If --executor-cores exceeds the cores available on each worker, the task sits in a waiting state.

If --total-executor-cores exceeds the available cores, all available cores are used by default; later, as other cluster resources are released, they will also be taken over by this program.

If memory or the cores for a single executor are insufficient, launching spark-submit reports an error, and the task stays in a waiting state and cannot run normally.

  • Summary:

In real development, set these parameters flexibly according to the actual data volume of the task, its priority, and the actual resources of your company's servers, using the parameters of previously submitted job scripts as a reference.


That's it for this share. If you found it helpful, or you're interested in big data, feel free to like and follow Alice (^U^)ノ~YO
