First Steps in Machine Learning on a Big Data Stack
Environment: Ubuntu virtual machine on VMware
This tutorial assumes basic Linux skills, so it is kept fairly brief; leave a comment if anything is unclear.
1. vim
sudo apt-get install vim
2. Java JDK installation
Append to /etc/profile (then run source /etc/profile):
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
export PATH=$JAVA_HOME/bin:$PATH
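The `$` expansions matter here; without them the shell prepends the literal string `JAVA_HOME/bin` instead of the JDK directory. A minimal sketch of the intended prepend behaviour (the JDK path is the one used throughout this guide):

```shell
# Prepend the JDK's bin directory so its java wins over any system java.
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
PATH=$JAVA_HOME/bin:$PATH
# The first PATH entry is now the JDK bin dir:
echo "$PATH" | cut -d: -f1
```

Running java -version afterwards should report 1.8.0_261, provided the JDK really is unpacked at that path.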
3. Scala installation
Download: https://www.scala-lang.org/download/all.html
mv <the extracted Scala folder> /usr/local/scala
sudo vim /etc/profile
export SCALA_HOME=/usr/local/scala/scala-2.12.12
export PATH="$PATH:$SCALA_HOME/bin"
source /etc/profile
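Note that PATH entries must be tightly colon-separated; a stray space after a colon creates an entry beginning with a space, which never matches a real directory. A quick sketch of the correct form:

```shell
# Append Scala's bin dir; no spaces around the colon separator.
SCALA_HOME=/usr/local/scala/scala-2.12.12
PATH="$PATH:$SCALA_HOME/bin"
# The last PATH entry is the Scala bin dir:
echo "$PATH" | tr ':' '\n' | tail -n 1
```

Afterwards scala -version should report 2.12.12.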
[Big-data component download archive]
http://archive.apache.org/dist/
4. Hadoop 2.7 installation
Download, then extract to a directory of your choice; mine is /usr/local/hadoop/hadoop-2.7.0
sudo vim /etc/profile
sudo vim ~/.bashrc
Add:
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
source /etc/profile
source ~/.bashrc
1. Open the hadoop-2.7.0/etc/hadoop/hadoop-env.sh file:
vim hadoop-2.7.0/etc/hadoop/hadoop-env.sh
# The java implementation to use. (edit the line below this comment)
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
(Ctrl+Shift+V pastes in the terminal)
2. Open the hadoop-2.7.0/etc/hadoop/core-site.xml file and edit it as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
3. Open the hadoop-2.7.0/etc/hadoop/mapred-site.xml file and edit it as follows (Hadoop 2.7 ships only mapred-site.xml.template, so copy it to mapred-site.xml first):
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
4. Open the hadoop-2.7.0/etc/hadoop/hdfs-site.xml file and edit it as follows:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.0/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.0/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
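With the three XML files in place, HDFS still needs a one-time namenode format before its first start. A hedged sketch — the commented commands assume the $HADOOP_HOME export above and are shown rather than run here; the uncommented part is an offline sanity check that the core-site.xml fragment yields the expected filesystem URI:

```shell
# One-time initialization and start-up (requires the Hadoop install above):
#   $HADOOP_HOME/bin/hdfs namenode -format   # wipes dfs.name.dir; run once
#   $HADOOP_HOME/sbin/start-dfs.sh
#   jps                                      # expect NameNode, DataNode, SecondaryNameNode
# Offline check on a self-contained copy of the core-site.xml snippet:
cat > /tmp/core-site.xml <<'EOF'
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
# Extract the configured default filesystem URI:
sed -n 's:.*<value>\([^<]*\)</value>.*:\1:p' /tmp/core-site.xml
```

The extracted value should be hdfs://localhost:9000, matching what the NameNode will listen on.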
5. Spark installation
Download, then extract to a directory of your choice; mine is /usr/local/spark/spark
sudo vim /etc/profile
sudo vim ~/.bashrc
Add:
export SPARK_HOME=/usr/local/spark/spark
export PATH=${SPARK_HOME}/bin:$PATH
source /etc/profile
source ~/.bashrc
Next, edit the spark-env.sh file.
First create it from the shipped template: cp spark-env.sh.template spark-env.sh
Then open spark-env.sh and append:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0
export SCALA_HOME=/usr/local/scala/scala-2.12.12
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.0/etc/hadoop
export SPARK_MASTER_IP=ubuntu
export SPARK_WORKER_MEMORY=512M
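Once spark-env.sh is in place, the standalone master and worker can be started. A sketch — the commented commands need the Spark install above and are shown rather than run here; 7077 is Spark's default standalone master port:

```shell
# Start-up (requires $SPARK_HOME; shown, not run here):
#   $SPARK_HOME/sbin/start-all.sh   # starts the master plus one worker
#   jps                             # expect Master and Worker processes
# The standalone master URL is built from SPARK_MASTER_IP and port 7077:
SPARK_MASTER_IP=ubuntu
echo "spark://${SPARK_MASTER_IP}:7077"
```

spark-shell --master spark://ubuntu:7077 should then attach to the cluster, and the master's web UI is served on port 8080.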
6. SSH configuration
1. Install the SSH service
sudo apt-get install openssh-client
sudo apt-get install openssh-server
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub
Add the SSH key to GitHub (under Settings, add a new SSH key); this is needed for the SSH-based git clone in section 9.
2. Passwordless login
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
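If ssh localhost still prompts for a password after this, the usual culprit is permissions: sshd ignores authorized_keys when ~/.ssh or the key file is group- or world-writable. Tightening them is safe to do unconditionally:

```shell
# sshd requires strict permissions on the key material:
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Show the resulting mode bits (should be 600):
stat -c '%a' ~/.ssh/authorized_keys
```

ssh localhost should then log in without prompting, which Hadoop's start-dfs.sh relies on.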
7. Maven installation
Download page: http://maven.apache.org/download.cgi
Download and extract:
tar -xvf apache-maven-3.6.3-bin.tar.gz
sudo mv -f apache-maven-3.6.3 /usr/local/
Edit the /etc/profile file (sudo vim /etc/profile) and append the following at the end:
export M2_HOME=/usr/local/apache-maven-3.6.3
export PATH=${M2_HOME}/bin:$PATH
Save the file, then run the following to apply the environment variables and verify:
source /etc/profile
mvn -v
8. PySpark installation
sudo apt-get install python
sudo apt-get install python-pip
sudo pip install pyspark==2.4.0 -i https://pypi.doubanio.com/simple
(the -i flag points pip at the Douban PyPI mirror; keep the pinned pyspark version in step with your Spark install)
9. spark-iforest outlier detection
https://github.com/titicaca/spark-iforest
git clone git@github.com:titicaca/spark-iforest.git
Step 1:
cd spark-iforest/
mvn clean package -DskipTests
cp target/spark-iforest-<version>.jar $SPARK_HOME/jars/
Step 2:
cd spark-iforest/python
python setup.py sdist
pip install dist/pyspark-iforest-<version>.tar.gz
Test example:
from pyspark.ml.linalg import Vectors
import tempfile
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder.master("local[*]") \
    .appName("IForestExample") \
    .getOrCreate()
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([7.0,9.0]),),
(Vectors.dense([9.0,8.0]),), (Vectors.dense([8.0, 9.0]),)]
# NOTE: features need to be dense vectors for the model input
df = spark.createDataFrame(data, ["features"])
from pyspark_iforest.ml.iforest import *
# Init an IForest Object
iforest = IForest(contamination=0.3, maxDepth=2)
# Fit on a given data frame
model = iforest.fit(df)
# Check whether the model has a summary; a newly trained model has the summary info
model.hasSummary
# Show model summary
summary = model.summary
# Show the number of anomalies
summary.numAnomalies
# Predict for a new data frame based on the fitted model
transformed = model.transform(df)
# Collect spark data frame into local df
rows = transformed.collect()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
# Save the iforest estimator into the path
iforest.save(iforest_path)
# Load iforest estimator from a path
loaded_iforest = IForest.load(iforest_path)
model_path = temp_path + "/iforest_model"
# Save the fitted model into the model path
model.save(model_path)
# Load a fitted model from a model path
loaded_model = IForestModel.load(model_path)
# The loaded model has no summary info
loaded_model.hasSummary
# Use the loaded model to predict a new data frame
loaded_model.transform(df).show()
The final output is shown in the figure below: