First Steps in Machine Learning on a Big Data Stack
Environment: Ubuntu virtual machine on VMware
This tutorial assumes basic Linux skills, so it is kept fairly brief; leave a comment if anything is unclear.
1. vim
sudo apt-get install vim
2. Java JDK installation
Append to /etc/profile (then run source /etc/profile):
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
export PATH=$JAVA_HOME/bin:$PATH
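The `$` expansions matter here; without them the shell prepends the literal string `JAVA_HOME/bin` instead of the JDK directory. A minimal sketch of the intended prepend behaviour (the JDK path is the one used throughout this guide):

```shell
# Prepend the JDK's bin directory so its java wins over any system java.
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
PATH=$JAVA_HOME/bin:$PATH
# The first PATH entry is now the JDK bin dir:
echo "$PATH" | cut -d: -f1
```

Running java -version afterwards should report 1.8.0_261, provided the JDK really is unpacked at that path.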
3. Scala installation
Download: https://www.scala-lang.org/download/all.html
mv <the extracted Scala folder> /usr/local/scala
sudo vim /etc/profile
export SCALA_HOME=/usr/local/scala/scala-2.12.12
export PATH="$PATH:$SCALA_HOME/bin"
source /etc/profile
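Note that PATH entries must be tightly colon-separated; a stray space after a colon creates an entry beginning with a space, which never matches a real directory. A quick sketch of the correct form:

```shell
# Append Scala's bin dir; no spaces around the colon separator.
SCALA_HOME=/usr/local/scala/scala-2.12.12
PATH="$PATH:$SCALA_HOME/bin"
# The last PATH entry is the Scala bin dir:
echo "$PATH" | tr ':' '\n' | tail -n 1
```

Afterwards scala -version should report 2.12.12.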
[Big-data component download archive]
http://archive.apache.org/dist/
4. Hadoop 2.7 installation
Download, then extract to a directory of your choice; mine is /usr/local/hadoop/hadoop-2.7.0
sudo vim /etc/profile
sudo vim ~/.bashrc
Add:
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
source /etc/profile
source ~/.bashrc
1. Open the hadoop-2.7.0/etc/hadoop/hadoop-env.sh file:
vim hadoop-2.7.0/etc/hadoop/hadoop-env.sh
# The java implementation to use. (edit the line below this comment)
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
(Ctrl+Shift+V pastes in the terminal)
2. Open the hadoop-2.7.0/etc/hadoop/core-site.xml file and edit it as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
3. Open the hadoop-2.7.0/etc/hadoop/mapred-site.xml file and edit it as follows (Hadoop 2.7 ships only mapred-site.xml.template, so copy it to mapred-site.xml first):
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
4. Open the hadoop-2.7.0/etc/hadoop/hdfs-site.xml file and edit it as follows:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.0/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.0/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
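With the three XML files in place, HDFS still needs a one-time namenode format before its first start. A hedged sketch — the commented commands assume the $HADOOP_HOME export above and are shown rather than run here; the uncommented part is an offline sanity check that the core-site.xml fragment yields the expected filesystem URI:

```shell
# One-time initialization and start-up (requires the Hadoop install above):
#   $HADOOP_HOME/bin/hdfs namenode -format   # wipes dfs.name.dir; run once
#   $HADOOP_HOME/sbin/start-dfs.sh
#   jps                                      # expect NameNode, DataNode, SecondaryNameNode
# Offline check on a self-contained copy of the core-site.xml snippet:
cat > /tmp/core-site.xml <<'EOF'
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
# Extract the configured default filesystem URI:
sed -n 's:.*<value>\([^<]*\)</value>.*:\1:p' /tmp/core-site.xml
```

The extracted value should be hdfs://localhost:9000, matching what the NameNode will listen on.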
5. Spark installation
Download, then extract to a directory of your choice; mine is /usr/local/spark/spark
sudo vim /etc/profile
sudo vim ~/.bashrc
Add:
export SPARK_HOME=/usr/local/spark/spark
export PATH=${SPARK_HOME}/bin:$PATH
source /etc/profile
source ~/.bashrc
Next, edit the spark-env.sh file.
First create it from the shipped template: cp spark-env.sh.template spark-env.sh
Then open spark-env.sh and append:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0
export SCALA_HOME=/usr/local/scala/scala-2.12.12
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.0/etc/hadoop
export SPARK_MASTER_IP=ubuntu
export SPARK_WORKER_MEMORY=512M
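Once spark-env.sh is in place, the standalone master and worker can be started. A sketch — the commented commands need the Spark install above and are shown rather than run here; 7077 is Spark's default standalone master port:

```shell
# Start-up (requires $SPARK_HOME; shown, not run here):
#   $SPARK_HOME/sbin/start-all.sh   # starts the master plus one worker
#   jps                             # expect Master and Worker processes
# The standalone master URL is built from SPARK_MASTER_IP and port 7077:
SPARK_MASTER_IP=ubuntu
echo "spark://${SPARK_MASTER_IP}:7077"
```

spark-shell --master spark://ubuntu:7077 should then attach to the cluster, and the master's web UI is served on port 8080.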
6. SSH configuration
1. Install the SSH service
sudo apt-get install openssh-client
sudo apt-get install openssh-server
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub
Add the SSH key to GitHub (under Settings, add a new SSH key); this is needed for the SSH-based git clone in section 9.
2. Passwordless login
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
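If ssh localhost still prompts for a password after this, the usual culprit is permissions: sshd ignores authorized_keys when ~/.ssh or the key file is group- or world-writable. Tightening them is safe to do unconditionally:

```shell
# sshd requires strict permissions on the key material:
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Show the resulting mode bits (should be 600):
stat -c '%a' ~/.ssh/authorized_keys
```

ssh localhost should then log in without prompting, which Hadoop's start-dfs.sh relies on.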
7. Maven installation
Download page: http://maven.apache.org/download.cgi
Download and extract:
tar -xvf apache-maven-3.6.3-bin.tar.gz
sudo mv -f apache-maven-3.6.3 /usr/local/
Edit the /etc/profile file (sudo vim /etc/profile) and append the following at the end:
export M2_HOME=/usr/local/apache-maven-3.6.3
export PATH=${M2_HOME}/bin:$PATH
Save the file, then run the following to apply the environment variables and verify:
source /etc/profile
mvn -v
8. PySpark installation
sudo apt-get install python
sudo apt-get install python-pip
sudo pip install pyspark==2.4.0 -i https://pypi.doubanio.com/simple
(the -i flag points pip at the Douban PyPI mirror; keep the pinned pyspark version in step with your Spark install)
9. spark-iforest outlier detection
https://github.com/titicaca/spark-iforest
git clone git@github.com:titicaca/spark-iforest.git
Step 1:
cd spark-iforest/
mvn clean package -DskipTests
cp target/spark-iforest-<version>.jar $SPARK_HOME/jars/
Step 2:
cd spark-iforest/python
python setup.py sdist
pip install dist/pyspark-iforest-<version>.tar.gz
Test example:
from pyspark.ml.linalg import Vectors
import tempfile
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder.master("local[*]") \
    .appName("IForestExample") \
    .getOrCreate()
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([7.0,9.0]),),
(Vectors.dense([9.0,8.0]),), (Vectors.dense([8.0, 9.0]),)]
# NOTE: features need to be dense vectors for the model input
df = spark.createDataFrame(data, ["features"])
from pyspark_iforest.ml.iforest import *
# Init an IForest Object
iforest = IForest(contamination=0.3, maxDepth=2)
# Fit on a given data frame
model = iforest.fit(df)
# Check whether the model has a summary; a newly trained model has the summary info
model.hasSummary
# Show model summary
summary = model.summary
# Show the number of anomalies
summary.numAnomalies
# Predict for a new data frame based on the fitted model
transformed = model.transform(df)
# Collect spark data frame into local df
rows = transformed.collect()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
# Save the iforest estimator into the path
iforest.save(iforest_path)
# Load iforest estimator from a path
loaded_iforest = IForest.load(iforest_path)
model_path = temp_path + "/iforest_model"
# Save the fitted model into the model path
model.save(model_path)
# Load a fitted model from a model path
loaded_model = IForestModel.load(model_path)
# The loaded model has no summary info
loaded_model.hasSummary
# Use the loaded model to predict a new data frame
loaded_model.transform(df).show()
The final output is shown in the figure below: