In this post, Alice walks through the Spark command-line tools in detail.
spark-shell
- Introduction
Until now we have submitted our jobs with spark-shell. spark-shell is the interactive shell program that ships with Spark; it makes interactive programming convenient: at its prompt you can write Spark programs in Scala, which makes it a good fit for learning and testing!
- Example
spark-shell can take parameters:
spark-shell --master local[N]   the number N means simulating N threads locally to run the current job
spark-shell --master local[*]   use all resources available on the current machine
With no parameter, the default is --master local[*]
spark-shell --master spark://node01:7077,node02:7077   run on the cluster
spark-submit
- Introduction
Interactive programming with spark-shell is indeed convenient for learning and testing, but in practice we usually develop Spark applications in IDEA, package them into a jar, and hand that jar to a Spark cluster or YARN for execution. So we also need to learn spark-submit, the command that submits a jar to the Spark cluster or YARN for us.
- Example
We can run one of the algorithms that ship with Spark, for example computing π with the Monte Carlo method: the computer simulates a large number of random samples to arrive at a reasonably accurate value of π.
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node-1:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.0.2.jar
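The SparkPi example above estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle. The core logic can be sketched in plain Python (a minimal single-machine sketch; the real SparkPi distributes the sampling across executors with an RDD):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points uniformly in the unit square
    and counting the fraction that lands inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # quarter-circle area / square area = pi / 4
    return 4.0 * inside / num_samples

print(estimate_pi(1_000_000))  # close to 3.14159
```

The more samples you draw, the closer the estimate gets to π, which is why running it on a cluster (many samples in parallel) pays off.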
Parameter summary
- Master URL formats
http://spark.apache.org/docs/latest/submitting-applications.html
Master URL | Meaning
---|---
local | Run locally with one worker thread (i.e. no parallelism at all).
local[N] | Run locally with N worker threads (ideally, set N to the number of cores on your machine).
local[*] | Run locally with as many worker threads as there are cores on your machine.
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be the one your master is configured to use, 7077 by default.
mesos://HOST:PORT | Connect to the given Mesos cluster. The port must be the one you have configured, 5050 by default. Or, with ZooKeeper, use the form mesos://zk://…
yarn-client | Connect to a YARN cluster in client mode. The cluster location is found via the HADOOP_CONF_DIR variable.
yarn-cluster | Connect to a YARN cluster in cluster mode. The cluster location is found via the HADOOP_CONF_DIR variable.
- Other parameters
We can also view them from the shell:
spark-submit --help
$ bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
Parameter notes:
--master spark://node01:7077,node02:7077
Specifies the address of the Master
--executor-memory 1g
Sets the memory available to each executor to 1g
--total-executor-cores 2
Sets the number of CPU cores used across the whole cluster to 2
- Example 2
--master spark://node01:7077
Specifies the address of the Master
--name "appName"
Specifies the name under which the program runs
--class
The class containing the program's main method
--jars xx.jar
Extra jar packages the program uses
--driver-memory 512m
Memory needed to run the Driver, default 1g
--executor-memory 2g
Sets the memory available to each executor to 2g, default 1g
--executor-cores 1
Sets the number of cores available to each executor
--total-executor-cores 2
Sets the number of CPU cores used to run the job to 2
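When scripting submissions, it can help to assemble these flags programmatically rather than hand-editing a long command line. The helper below is hypothetical (not part of Spark); it just builds the argument list in the order spark-submit expects, with the application jar last, using the example values from above:

```python
def build_spark_submit(master: str, name: str, main_class: str,
                       app_jar: str, **opts: str) -> list[str]:
    """Assemble a spark-submit argument list from common flags.
    Extra flags are passed as keyword args, e.g. executor_memory="2g"."""
    cmd = ["bin/spark-submit",
           "--master", master,
           "--name", name,
           "--class", main_class]
    for flag, value in opts.items():
        # executor_memory -> --executor-memory
        cmd += ["--" + flag.replace("_", "-"), value]
    cmd.append(app_jar)  # the application jar comes last, before any app arguments
    return cmd

cmd = build_spark_submit(
    "spark://node01:7077", "appName",
    "org.apache.spark.examples.SparkPi",
    "examples/jars/spark-examples_2.11-2.0.2.jar",
    driver_memory="512m", executor_memory="2g",
    executor_cores="1", total_executor_cores="2")
print(" ".join(cmd))
```

Such a list can then be handed to a process runner (e.g. subprocess.run) on a machine that actually has Spark installed.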
- Notes:
If a worker node does not have enough memory, then when starting spark-shell you must not allocate more memory to the executors than the worker has available; size your job's resources according to your own workers' capacity.
If --executor-cores exceeds the cores available on each worker, the job sits in a waiting state.
If --total-executor-cores exceeds the available cores, all available cores are used by default; later, as other resources in the cluster are released, this program will take them over.
If memory or the cores for a single executor are insufficient, spark-submit reports an error on startup and the job sits in a waiting state, unable to run.
- Summary:
In development, set these parameters flexibly according to the data volume of the actual job, its priority, and the actual resources of your company's servers, using the parameters from previously submitted job scripts as a reference.
That's all for this share. If you found it helpful, or you're interested in big data, like and follow Alice! (^U^)ノ~YO