1 Purpose of This Document
- Document the process of setting up a local Spark development environment
Environment Dependencies
- Operating system: macOS
- IntelliJ IDEA
- Scala 2.11.12
- Spark 2.4.0 (choose the version that matches your cluster)
- JDK 1.8
2 Installing Scala 2.11.12
Download link:
https://www.scala-lang.org/download/2.11.12.html
- Download the archive into the scala directory and extract it:
tar -zxvf scala-2.11.12.tgz
- Configure the environment variables:
vi ~/.bash_profile
# scala setting: add Scala to PATH
export SCALA_HOME=/Users/jackbin/scala/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
# reload the configuration
source ~/.bash_profile
- Type scala in the terminal to verify the installation (a quick check is shown below).
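If the PATH is configured correctly, the scala command drops into the REPL and reports version 2.11.12. The following is a minimal sanity-check sketch to run inside the REPL (any expression will do):

// check the Scala version reported by the runtime
println(util.Properties.versionString)   // expected: version 2.11.12
// evaluate a trivial expression to confirm the interpreter works
(1 to 5).map(_ * 2).sum                  // expected: res1: Int = 30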
3 Downloading Spark
Download link:
https://archive.apache.org/dist/spark/spark-2.4.0/
Choose the Hadoop build that matches your target cluster environment; since CDH 5 is used here, download the hadoop2.6 build.
- Extract the Spark distribution:
tar -zxvf spark-2.4.0-bin-hadoop2.6.tgz
- Configure the environment variables:
vi ~/.bash_profile
# spark setting: add the Spark home
export SPARK_HOME=/Users/jackbin/spark-runtime/spark-2.4.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
- Type spark-shell in the terminal to test; if the shell starts, the Spark setup is complete (a quick smoke test is shown below).
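As a quick smoke test (a minimal sketch; spark-shell already provides the spark SparkSession and the sc SparkContext), the following lines can be pasted into the shell:

// spark-shell provides a SparkSession as `spark` and a SparkContext as `sc`
spark.range(1, 101).count()            // expected: 100
sc.parallelize(1 to 100).sum()         // expected: 5050.0
spark.sql("select 1 + 1 as result").show()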
4 Setting Up the Spark Development Environment in IDEA
- Create a new Maven project
- Install the Scala plugin
- Add Scala support to the project
- Create a scala directory under main, mark it as a source root in the module settings, and set the language level to Java 8 (a small verification object is sketched after the pom below)
- Add the Spark dependencies to the pom:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.eights</groupId>
    <artifactId>spark-demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11</scala.version>
        <spark.version>2.4.0</spark.version>
        <encoding>UTF-8</encoding>
    </properties>

    <dependencies>
        <!-- Spark core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark Hive support -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <!-- plugin for compiling Scala -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- plugin for compiling Java -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- plugin for building the shaded (fat) jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
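Before writing the actual job, a throwaway object under src/main/scala can be used to confirm that the Scala source root and the scala-maven-plugin are wired up correctly; running mvn clean package should then compile it and produce the shaded jar. The file name, package and message below are only illustrative assumptions:

// src/main/scala/org/eights/Hello.scala -- illustrative only
package org.eights

object Hello {
  def main(args: Array[String]): Unit = {
    // if this compiles and runs, the Scala source root and compiler plugins are set up correctly
    println(s"Scala compiles, runtime ${util.Properties.versionString}")
  }
}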
Run the WordCount code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object WordCount {

  /**
    * Spark word count.
    * @param args command-line arguments (unused)
    */
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    val wordString = Array("hadoop", "hadoop", "spark", "spark", "spark", "spark", "flink", "flink", "flink", "flink",
      "flink", "flink", "hive", "flink", "hdfs", "yarn", "zookeeper", "hbase", "impala", "sqoop", "hadoop")

    // build the RDD
    val wordRdd: RDD[String] = spark.sparkContext.parallelize(wordString)

    // count the occurrences of each word
    val resRdd: RDD[(String, Int)] = wordRdd.map((_, 1)).reduceByKey(_ + _)

    resRdd.foreach(elem => {
      println(elem._1 + "-----" + elem._2)
    })

    spark.stop()
  }
}
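Note that foreach runs on the executor threads of local[*], so the order in which the pairs are printed is not deterministic. If a stable printout is wanted, the result can first be collected to the driver and sorted (an optional variation on the loop above):

// optional: collect to the driver and sort by descending count before printing
resRdd.collect()
  .sortBy(-_._2)
  .foreach { case (word, count) => println(s"$word-----$count") }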
The word count runs successfully, and the local Spark development environment is now complete.