Machine Learning Series: Bayesian Classification Algorithms


Introduction

Bayesian classification is a general term for a large family of classification algorithms.

These algorithms classify a sample according to the probability that it belongs to each class.

Naive Bayes is the simplest algorithm in this family.

Note: "naive" refers to the assumption that the features are conditionally independent of one another given the class.

To really understand this, some basic probability theory is needed.

Conditional independence of the features given class A means, for example, P(x1 x2 x3 x4 | A) = P(x1 | A) * P(x2 | A) * P(x3 | A) * P(x4 | A).

In general, P(x y | z) = P(x y z) / P(z); when x and y are conditionally independent given z, this factors as P(x | z) * P(y | z) = [P(x z) / P(z)] * [P(y z) / P(z)].
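
A quick check with made-up numbers (purely an illustration): suppose

$$P(z) = 0.5,\quad P(x, z) = 0.25,\quad P(y, z) = 0.3,\quad P(x, y, z) = 0.15.$$

Then $P(x \mid z) = 0.5$, $P(y \mid z) = 0.6$, and $P(x, y \mid z) = 0.15 / 0.5 = 0.3 = 0.5 \times 0.6$, so in this example x and y are conditionally independent given z.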

Algorithm

If, given the observed attributes, the probability that an item belongs to class A is greater than the probability that it belongs to class B, the item is classified as A.

Formula
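
In symbols, for a class $C$ and observed features $x_1, \dots, x_n$, Bayes' theorem together with the naive independence assumption gives

$$P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \propto P(C) \prod_{i=1}^{n} P(x_i \mid C).$$

The denominator is the same for every class, so only the right-hand side needs to be compared across classes.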

Steps

1. Extract the features from the labeled training (prior) samples of each class.

2. For each class, estimate the conditional probability of each feature given that class.

(For example: the probability that feature 1 appears given class A, p(feature 1 | A); given class B, p(feature 1 | B); given class C, p(feature 1 | C); and so on.)

3. Extract the features of the sample to be classified (feature 1, feature 2, feature 3, feature 4, ...).

4. For each class, multiply the class prior by the conditional probabilities of the observed features, as shown below (a log-space sketch follows this list):

Score for class A: p(A) * p(feature 1 | A) * p(feature 2 | A) * p(feature 3 | A) * p(feature 4 | A) * ...

Score for class B: p(B) * p(feature 1 | B) * p(feature 2 | B) * p(feature 3 | B) * p(feature 4 | B) * ...

Score for class C: p(C) * p(feature 1 | C) * p(feature 2 | C) * p(feature 3 | C) * p(feature 4 | C) * ...

5. The class with the largest score is the class the sample is assigned to.
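
Multiplying many small probabilities quickly underflows, so implementations (including the code below) add logarithms instead of multiplying raw probabilities. A minimal Scala sketch of this scoring step, assuming logPrior holds log p(class) and logCondProb(i) holds log p(feature i | class) (these names are illustrative, not taken from the code below):

  // Log-space score for one class: log p(c) + sum over features of x_i * log p(feature_i | c),
  // where features is a 0/1 vector saying which vocabulary features appear in the sample.
  def logScore(features: Array[Int], logPrior: Double, logCondProb: Array[Double]): Double =
    logPrior + features.zip(logCondProb).map { case (x, lp) => x * lp }.sum

The predicted class (step 5) is simply the one whose logScore is largest.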

Code

import scala.collection.mutable.ArrayBuffer

object NaiveBayes {

  /**
   * Prior (training) data: six short documents and their class labels
   * (0 = normal, 1 = abusive).
   */
  def dataSet(): (Array[Array[String]], Array[Int]) = {
    val dataList = Array(
      Array("my", "dog", "has", "flea", "problems", "help", "please"),
      Array("maybe", "not", "take", "him", "to", "dog", "park", "stupid"),
      Array("my", "dalmation", "is", "so", "cute", "I", "love", "him"),
      Array("stop", "posting", "stupid", "worthless", "garbage"),
      Array("mr", "licks", "ate", "my", "steak", "how", "to", "stop", "him"),
      Array("quit", "buying", "worthless", "dog", "food", "stupid"))
    // class labels
    val dataType = Array(0, 1, 0, 1, 0, 1)
    (dataList, dataType)
  }

  /**
   * Convert a document into a 0/1 word vector over the vocabulary (set-of-words model).
   * @param dataList the vocabulary
   * @param inputSet the words of the input document
   */
  def setWordsType(dataList: Array[String], inputSet: Array[String]): Array[Int] = {
    val returnList = new Array[Int](dataList.length)
    val dataIndex = dataList.zipWithIndex
    for (word <- inputSet) {
      if (dataList.contains(word)) {
        // mark the vocabulary position of every word that appears in the document
        returnList(dataIndex.filter(_._1 == word)(0)._2) = 1
      } else {
        println(s"the word: $word is not in my Vocabulary!")
      }
    }
    returnList
  }

  /**
   * Training: estimate the word conditional probabilities (in log space) and the class prior.
   * @param trainData the 0/1 word vectors of the training documents
   * @param trainType the class label of each document
   * @return (log P(word | class 1), log P(word | class 0), P(class 1))
   */
  def trainSet(trainData: Array[Array[Int]], trainType: Array[Int]): (Array[Double], Array[Double], Double) = {
    val trainLength = trainData.length
    val wordsNum = trainData(0).length
    // prior probability of class 1 (labels are only 0/1, so one prior is enough)
    val pType = trainType.sum / trainLength.toDouble
    // Laplace smoothing: word counts start at 1 and denominators at 2,
    // so no conditional probability can be exactly zero
    var p0Num = Array.fill(wordsNum)(1)
    var p1Num = Array.fill(wordsNum)(1)
    var p0Denom = 2.0
    var p1Denom = 2.0

    for (i <- 0 until trainLength) {
      if (trainType(i) == 1) {
        var cnt = 0
        // add this class-1 document's word counts to p1Num
        p1Num = p1Num.map { x =>
          val v = x + trainData(i)(cnt)
          cnt += 1
          v
        }
        p1Denom += trainData(i).sum
      } else {
        var cnt = 0
        // add this class-0 document's word counts to p0Num
        p0Num = p0Num.map { x =>
          val v = x + trainData(i)(cnt)
          cnt += 1
          v
        }
        p0Denom += trainData(i).sum
      }
    }
    // for the toy corpus above: p1Denom = 21.0, p0Denom = 26.0
    (p1Num.map(x => Math.log(x / p1Denom)), p0Num.map(x => Math.log(x / p0Denom)), pType)
  }

  /**
   * Classify one word vector by comparing the two log posteriors:
   * log(P(w|c)P(c)) = sum(vec2Classify * pVec) + log(P(c))
   */
  def classifyNB(vec2Classify: Array[Int], p0Vec: Array[Double], p1Vec: Array[Double], pClass1: Double): Int = {
    var cnt = 0
    val p1 = vec2Classify.map { x =>
      val v = x * p1Vec(cnt)
      cnt += 1
      v
    }.sum + math.log(pClass1)
    cnt = 0
    val p0 = vec2Classify.map { x =>
      val v = x * p0Vec(cnt)
      cnt += 1
      v
    }.sum + math.log(1.0 - pClass1)
    if (p1 > p0) 1 else 0
  }

  def main(args: Array[String]): Unit = {
    val DataSet = dataSet()
    val listOPosts = DataSet._1
    val listClasses = DataSet._2
    /**
     * Vocabulary: every distinct word in the training documents (32 words for the toy data), e.g.:
     * quit, buying, worthless, dog, food, stupid, mr, licks, ate, my, steak, how, to, stop, him,
     * posting, garbage, dalmation, is, so, cute, I, love, maybe, not, take, park, has, flea, problems, help, please
     */
    val myVocabList = listOPosts.reduce((a1, a2) => a1 ++ a2).distinct
    // one 0/1 word vector per training document
    var trainMat = new ArrayBuffer[Array[Int]](listOPosts.length)
    listOPosts.foreach(postinDoc => trainMat.append(setWordsType(myVocabList, postinDoc)))

    // training
    val p = trainSet(trainMat.toArray, listClasses)
    val p0V = p._2
    val p1V = p._1
    val pAb = p._3

    val testEntry = Array("love", "my", "dalmation")
    val thisDoc = setWordsType(myVocabList, testEntry)
    println(testEntry.mkString(",") + " classified as: " + classifyNB(thisDoc, p0V, p1V, pAb))

    val testEntry2 = Array("stupid", "garbage")
    val thisDoc2 = setWordsType(myVocabList, testEntry2)
    println(testEntry2.mkString(",") + " classified as: " + classifyNB(thisDoc2, p0V, p1V, pAb))
  }
}
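
With the toy data above, main should report that "love,my,dalmation" is classified as 0 and "stupid,garbage" as 1. Note that setWordsType implements a set-of-words model: each vocabulary word is only marked present (1) or absent (0). A common variant is a bag-of-words model that counts occurrences instead; a minimal sketch of such a variant (the function name bagOfWordsVec is my own, it is not part of the code above):

  // Bag-of-words variant of setWordsType: count how many times each vocabulary word occurs.
  def bagOfWordsVec(vocab: Array[String], inputSet: Array[String]): Array[Int] = {
    val vec = new Array[Int](vocab.length)
    for (word <- inputSet) {
      val idx = vocab.indexOf(word)
      if (idx >= 0) vec(idx) += 1
    }
    vec
  }

Counting occurrences can help when a word appearing many times in one document is itself informative.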
