# Problem Statement [1]

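• $P(C=1|X_1,X_2,X_3) > P(C=2|X_1,X_2,X_3)$: given the sample's attribute values $X_1$, $X_2$, $X_3$, the probability that it belongs to class 1 is greater than the probability that it belongs to class 2. The existing samples therefore support assigning the unknown sample to class 1, and it is classified as class 1.
• $P(C=1|X_1,X_2,X_3) < P(C=2|X_1,X_2,X_3)$: the existing samples support assigning the unknown sample to class 2, and it is classified as class 2.

In short:

• if $P(C=1|X_1,X_2,X_3) > P(C=2|X_1,X_2,X_3)$, classify the sample as class 1;
• if $P(C=1|X_1,X_2,X_3) < P(C=2|X_1,X_2,X_3)$, classify the sample as class 2.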
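
The decision rule above can be sketched in a few lines of Python. The function name `decide` and the posterior values are hypothetical and chosen only for illustration. The text does not say what to do when the two posteriors are equal, so this sketch simply returns the first class it encounters in that case.

```python
def decide(posteriors):
    """Return the class label with the largest posterior probability.

    `posteriors` maps a class label to P(C=label | X1, X2, X3).
    """
    return max(posteriors, key=posteriors.get)


# Hypothetical posteriors for one unknown sample (illustrative values only).
posteriors = {1: 0.7, 2: 0.3}
print(decide(posteriors))  # prints 1, since 0.7 > 0.3
```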

# Bayes' Theorem

$P(C|X) = \frac{ P(X|C)P(C)}{ P(X) }$
• $P(C|X)$ is the posterior probability of $C$ given the attributes $X$
• $P(C)$ is the prior probability of $C$
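
As a small numeric illustration of the theorem, the sketch below uses counts from the PlayTennis data that appears later in this post: 14 days in total, 9 "yes" days, 5 sunny days, and 2 days that are both sunny and "yes". The helper name `posterior` is an assumption made for this example. `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence


p_yes = Fraction(9, 14)             # prior P(C=yes)
p_sunny_given_yes = Fraction(2, 9)  # likelihood P(outlook=sunny | C=yes)
p_sunny = Fraction(5, 14)           # evidence P(outlook=sunny)

p_yes_given_sunny = posterior(p_sunny_given_yes, p_yes, p_sunny)
print(p_yes_given_sunny)  # 2/5
```

The result agrees with direct counting: 2 of the 5 sunny days are "yes" days.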

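# Naive Bayes Classification

$P(C|X) = \frac{ P(X|C)P(C)}{ P(X) } = \frac{P(C)}{P(X)} \prod_{i = 1}^{d}P(X_i|C)$

Since $P(X)$ is the same for every class, it can be dropped when comparing classes, which gives the naive Bayes classifier:

$h_{naivebayes}^{*}(X) = \arg\max_{C} P(C) \prod_{i=1}^{d} P(X_i|C)$

(1) $P(C=i) = S_i/S$, where $S_i$ is the number of training samples in class $C_i$ and $S$ is the total number of training samples;

(2) Computing $P(X|C=i)$ directly can be very expensive because it involves many attribute variables. Here we can assume that the attribute values are conditionally independent of one another given the class, i.e. there are no dependencies among the attributes, so that $P(X|C=i) = \prod_{k=1}^{d} P(X_k|C=i)$.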
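
Steps (1) and (2) amount to counting, and can be sketched roughly as follows. The function names `train` and `predict` and the toy data set are hypothetical and only meant to illustrate the formulas. This is not the implementation used later in this post, which works on a list of dictionaries.

```python
from collections import Counter


def train(samples, labels):
    """Estimate P(C) and collect the counts needed for P(X_k | C)."""
    class_counts = Counter(labels)  # S_i for each class
    priors = {c: n / len(labels) for c, n in class_counts.items()}  # S_i / S
    # cond_counts[(k, value, c)] = number of class-c samples with X_k == value
    cond_counts = Counter(
        (k, value, c)
        for x, c in zip(samples, labels)
        for k, value in enumerate(x)
    )
    return priors, class_counts, cond_counts


def predict(x, priors, class_counts, cond_counts):
    """Return argmax_C P(C) * prod_k P(X_k | C), plus all class scores."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= cond_counts[(k, value, c)] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two categorical attributes, two classes.
samples = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("a", "x")]
labels = [1, 1, 2, 2, 1]

model = train(samples, labels)
label, scores = predict(("a", "x"), *model)
print(label, scores)  # the predicted label is 1
```

Note that class 2 gets a score of exactly zero here, because the value "a" never occurs with class 2 in the toy data. A single unseen attribute value is enough to zero out the whole product.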

# Naive Bayes

PlayTennis (i.e., decide whether our friend will play tennis or not on a given day) [3]

#data
data = [
{"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"weak", "class":"no" },
{"outlook":"sunny", "temp":"hot", "humidity":"high", "wind":"strong", "class":"no" },
{"outlook":"overcast", "temp":"hot", "humidity":"high", "wind":"weak", "class":"yes" },
{"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"weak", "class":"yes" },
{"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
{"outlook":"rain", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"no" },
{"outlook":"overcast", "temp":"cool", "humidity":"normal", "wind":"strong", "class":"yes" },
{"outlook":"sunny", "temp":"mild", "humidity":"high", "wind":"weak", "class":"no" },
{"outlook":"sunny", "temp":"cool", "humidity":"normal", "wind":"weak", "class":"yes" },
{"outlook":"rain", "temp":"mild", "humidity":"normal", "wind":"weak", "class":"yes" },
{"outlook":"sunny", "temp":"mild", "humidity":"normal", "wind":"strong", "class":"yes" },
{"outlook":"overcast", "temp":"mild", "humidity":"high", "wind":"strong", "class":"yes" },
{"outlook":"overcast", "temp":"hot", "humidity":"normal", "wind":"weak", "class":"yes" },
{"outlook":"rain", "temp":"mild", "humidity":"high", "wind":"strong", "class":"no" }]

import pandas as pd
pd.DataFrame(data)

| | class | humidity | outlook | temp | wind |
|---|---|---|---|---|---|
| 0 | no | high | sunny | hot | weak |
| 1 | no | high | sunny | hot | strong |
| 2 | yes | high | overcast | hot | weak |
| 3 | yes | high | rain | mild | weak |
| 4 | yes | normal | rain | cool | weak |
| 5 | no | normal | rain | cool | strong |
| 6 | yes | normal | overcast | cool | strong |
| 7 | no | high | sunny | mild | weak |
| 8 | yes | normal | sunny | cool | weak |
| 9 | yes | normal | rain | mild | weak |
| 10 | yes | normal | sunny | mild | strong |
| 11 | yes | high | overcast | mild | strong |
| 12 | yes | normal | overcast | hot | weak |
| 13 | no | high | rain | mild | strong |

test = {"outlook": "sunny", "temp": "cool", "humidity": "high", "wind": "strong"}

#Calculate the Prob. of class:cls

def P(data, cls_val, cls_name="class"):
    count = 0.0
    for e in data:
        if e[cls_name] == cls_val:
            count += 1
    return count / len(data)

# The probability of play or not
PY, PN = P(data,"yes"), P(data, "no")
PY, PN

(0.6428571428571429, 0.35714285714285715)

#Calculate the Prob(attr|cls)
def PT(data, cls_val, attr_name, attr_val, cls_name="class"):
    count1 = 0.0  # number of samples in class cls_val
    count2 = 0.0  # number of those samples with attr_name == attr_val
    for e in data:
        if e[cls_name] == cls_val:
            count1 += 1
            if e[attr_name] == attr_val:
                count2 += 1
    return count2 / count1

# The conditional probability of play or not
PT(data,"yes", "outlook", "sunny"), PT(data,"no", "outlook", "sunny")

(0.2222222222222222, 0.6)

#Calculate the NB
def NB(data, test, cls_y, cls_n):
    PY = P(data, cls_y)
    PN = P(data, cls_n)
    print('The probability of play or not:', PY, 'vs.', PN)
    for key, val in test.items():
        # multiply in P(attribute value | class) for each attribute in turn
        PY *= PT(data, cls_y, key, val)
        PN *= PT(data, cls_n, key, val)
        print(key, val, '-->play or not:-->', PY, PN)
    return {cls_y: PY, cls_n: PN}

#calculate
NB(data,test,"yes","no")

The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind strong -->play or not:--> 0.047619047619 0.128571428571
temp cool -->play or not:--> 0.015873015873 0.0257142857143
humidity high -->play or not:--> 0.00529100529101 0.0205714285714

{'no': 0.020571428571428574, 'yes': 0.005291005291005291}
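
Written out as fractions counted from the table, the two products for this test day (sunny, cool, high, strong) are shown below. The factors appear in the same order as the printed trace: prior, outlook, wind, temp, humidity.

$P(yes)\prod_{i} P(X_i|yes) = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} = \frac{1}{189} \approx 0.0053$

$P(no)\prod_{i} P(X_i|no) = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} = \frac{18}{875} \approx 0.0206$

Since $0.0206 > 0.0053$, the classifier predicts "no" for this day.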

#calculate
NB(data,{"outlook":"sunny","temp":"hot","humidity":"normal","wind":"weak"},"yes","no")

The probability of play or not: 0.642857142857 vs. 0.357142857143
outlook sunny -->play or not:--> 0.142857142857 0.214285714286
wind weak -->play or not:--> 0.0952380952381 0.0857142857143
temp hot -->play or not:--> 0.021164021164 0.0342857142857
humidity normal -->play or not:--> 0.0141093474427 0.00685714285714

{'no': 0.006857142857142858, 'yes': 0.014109347442680775}
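
The values returned by `NB` are unnormalized scores $P(C)\prod_i P(X_i|C)$ rather than probabilities, because the common factor $P(X)$ is left out. If posterior probabilities are wanted, one option is to divide each score by the sum of the scores over the two classes. The sketch below does this for the two results printed above. The helper name `normalize` is an assumption made for this example.

```python
def normalize(scores):
    """Rescale unnormalized class scores so that they sum to 1."""
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


# Scores copied from the two NB(...) results shown above.
day1 = {"no": 0.020571428571428574, "yes": 0.005291005291005291}
day2 = {"no": 0.006857142857142858, "yes": 0.014109347442680775}

for name, scores in (("day1", day1), ("day2", day2)):
    posteriors = normalize(scores)
    print(name, {cls: round(p, 3) for cls, p in posteriors.items()})
# day1: "no" is about 0.795 and "yes" about 0.205
# day2: "no" is about 0.327 and "yes" about 0.673
```

Normalizing does not change which class wins. It only makes the margin between the two classes easier to read.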


# Note

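1. The problem statement above is taken from 【数说工作室】金融数据挖掘之朴素贝叶斯 (Naive Bayes for Financial Data Mining) http://www.ppvke.com/Blog/archives/6431

2. Zhou Zhihua (周志华), 2016, 机器学习 (Machine Learning), pp. 150-151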

3. Mitchell Machine Learning http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml
