Decision-tree generation is a recursive process. In the basic decision-tree algorithm, three situations cause the recursion to return:

1. All samples at the current node belong to the same class; no split is needed.
2. The attribute set is empty, or all samples take identical values on every remaining attribute; no split is possible.
3. The sample set at the current node is empty; no split can be made.

In case (2) we mark the current node as a leaf and set its class to the majority class of the samples it contains; in case (3) we likewise mark the current node as a leaf, but set its class to the majority class of its parent node. The two cases differ in essence: case (2) uses the posterior distribution at the current node, while case (3) takes the parent node's sample distribution as the current node's prior.
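The three return cases can be made concrete in a minimal recursive sketch (our own simplified illustration, not the book's pseudocode; the attribute-selection step is elided and `majority_class` is a hypothetical helper):

```python
from collections import Counter

def majority_class(labels):
    """Most frequent class label in a non-empty label list."""
    return Counter(labels).most_common(1)[0][0]

def tree_generate(samples, labels, attrs, parent_labels=()):
    # Case (3): empty sample set -> leaf labeled with the PARENT's majority class (prior)
    if not samples:
        return majority_class(parent_labels)
    # Case (1): all samples belong to one class -> leaf with that class, no split needed
    if len(set(labels)) == 1:
        return labels[0]
    # Case (2): attribute set empty, or all samples identical on every attribute
    #           -> leaf labeled with THIS node's majority class (posterior)
    if not attrs or all(s == samples[0] for s in samples):
        return majority_class(labels)
    a = attrs[0]  # placeholder: real ID3/C4.5/CART picks the best attribute here
    tree = {}
    for value in sorted(set(s[a] for s in samples)):
        branch = [(s, y) for s, y in zip(samples, labels) if s[a] == value]
        tree[(a, value)] = tree_generate([s for s, _ in branch],
                                         [y for _, y in branch],
                                         [b for b in attrs if b != a],
                                         parent_labels=labels)
    return tree

samples = [{"color": "green"}, {"color": "green"}, {"color": "pale"}]
labels = ["good", "good", "bad"]
print(tree_generate(samples, labels, ["color"]))
# {('color', 'green'): 'good', ('color', 'pale'): 'bad'}
```

Case (3) never fires in this tiny example, but it would whenever some value of the chosen attribute has no samples under it; the leaf then inherits the parent's majority class, which is exactly the prior-versus-posterior distinction above.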
💡 Case (2) uses a posterior distribution: given the samples $\boldsymbol{x}$ at the node, we infer the node's class, i.e. $p(y \mid \boldsymbol{x})$. Case (3) uses a prior distribution: the current node has no samples of its own to infer from, so we fall back on the known distribution of the parent node as a prior — an assumption made in advance, in the absence of samples.
## The ID3 algorithm

### Information entropy

Entropy is the most commonly used measure of the purity of a sample set. It quantifies how much information a system carries: the greater a system's uncertainty, the more possibilities it admits and the larger its entropy.
Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k\ (k=1,2,\dots,|\mathcal{Y}|)$. The information entropy of $D$ is

$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$$

and it satisfies

$$0 \leq \operatorname{Ent}(D) \leq \log _{2}|\mathcal{Y}|$$

where $|\mathcal{Y}|$ is the number of classes in $D$.

#### Proof of the entropy bounds (optional)

$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$$

where $|\mathcal{Y}|$ is the total number of classes and $p_k$ is the proportion of class-$k$ samples, with $0\leq p_k \leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_k=1$.
Let $|\mathcal{Y}|=n$ and $p_k=x_k$; then $\operatorname{Ent}(D)$ can be viewed as an $n$-variable real function:

$$\operatorname{Ent}(D)=f\left(x_{1}, \ldots, x_{n}\right)=-\sum_{k=1}^{n} x_{k} \log _{2} x_{k}$$

where $0 \leq x_{k} \leq 1$ and $\sum_{k=1}^{n} x_{k}=1$.

To find the constrained extremum under $0 \leq x_{k} \leq 1,\ \sum_{k=1}^{n} x_{k}=1$, introduce a Lagrange multiplier $\lambda$:

$$L\left(x_{1}, \ldots, x_{n}, \lambda\right)=-\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)$$

Take the first-order partial derivatives of $L$ with respect to each $x_k$ and $\lambda$, and set them to $0$:
$$\begin{aligned}\frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial x_{1}} &=\frac{\partial}{\partial x_{1}}\left[-\sum_{k=1}^{n} x_{k} \log _{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right]=0 \\&\Rightarrow-\log _{2} x_{1}-x_{1} \cdot \frac{1}{x_{1} \ln 2}+\lambda=0 \\&\Rightarrow-\log _{2} x_{1}-\frac{1}{\ln 2}+\lambda=0 \\& \Rightarrow \lambda=\log _{2} x_{1}+\frac{1}{\ln 2}\end{aligned}$$

The same holds for every coordinate, so

$$\lambda=\log _{2} x_{1}+\frac{1}{\ln 2}=\log _{2} x_{2}+\frac{1}{\ln 2}=\cdots=\log _{2} x_{n}+\frac{1}{\ln 2}$$

Combined with the constraint $\sum_{k=1}^{n} x_{k}=1$, this forces

$$x_1=x_2=\cdots=x_n=\frac{1}{n}$$

We still have to check whether this stationary point is a maximum or a minimum. When $x_1=x_2=\cdots=x_n=\frac{1}{n}$,

$$f\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)=-\sum_{k=1}^{n} \frac{1}{n} \log _{2} \frac{1}{n}=-n \cdot \frac{1}{n} \log _{2} \frac{1}{n}=\log _{2} n$$

whereas when $x_1=1,\ x_2=x_3=\cdots=x_n=0$,

$$f(1,0, \ldots, 0)=-1 \cdot \log _{2} 1-0 \cdot \log _{2} 0-\cdots-0 \cdot \log _{2} 0=0$$

Since $\log_2 n \geq 0$, the point $x_1=x_2=\cdots=x_n=\frac{1}{n}$ is the maximum point, with maximum value $\log_2 n$.
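The maximum can also be checked numerically: sampling random points on the probability simplex, $f$ never exceeds its value at the uniform point (a small self-check we add here; nothing beyond the formula above is assumed):

```python
import random
from math import log2

def f(xs):
    """f(x_1, ..., x_n) = -sum x_k log2 x_k, with 0*log2(0) := 0."""
    return -sum(x * log2(x) for x in xs if x > 0)

n = 5
f_uniform = f([1.0 / n] * n)  # should equal log2(n)

random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    s = sum(raw)
    point = [r / s for r in raw]  # a random point on the simplex
    assert f(point) <= f_uniform + 1e-12
print(f_uniform)  # log2(5) ≈ 2.3219
```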
Now consider the minimum of $f(x_1,x_2,\cdots,x_n)$. Note that $f$ can be viewed as the sum of $n$ mutually independent one-variable functions:

$$f\left(x_{1}, \ldots, x_{n}\right)=\sum_{k=1}^{n} g\left(x_{k}\right), \qquad g\left(x_{k}\right)=-x_{k} \log _{2} x_{k},\ 0 \leq x_{k} \leq 1$$

To find the minimum of $g(x_1)$, first take its first and second derivatives with respect to $x_1$:

$$\begin{aligned}&g^{\prime}\left(x_{1}\right)=\frac{d\left(-x_{1} \log _{2} x_{1}\right)}{d x_{1}}=-\log _{2} x_{1}-x_{1} \cdot \frac{1}{x_{1} \ln 2}=-\log _{2} x_{1}-\frac{1}{\ln 2} \\&g^{\prime \prime}\left(x_{1}\right)=\frac{d\left(-\log _{2} x_{1}-\frac{1}{\ln 2}\right)}{d x_{1}}=-\frac{1}{x_{1} \ln 2}\end{aligned}$$

On the domain $0\le x_1 \le 1$ we always have $g^{\prime \prime}\left(x_{1}\right)=-\frac{1}{x_{1} \ln 2}\le 0$, so $g(x_1)$ is concave (opening downward) and attains its minimum at a boundary point, $x_1=0$ or $x_1=1$:

$$g(0)=-0 \log _{2} 0=0, \qquad g(1)=-1 \log _{2} 1=0$$

So the minimum of $g(x_1)$ is $0$; by the same argument the minima of $g(x_2),\cdots,g(x_n)$ are also $0$, and hence the minimum of $f(x_1,x_2,\cdots,x_n)$ is $0$.
If some $x_k=1$, then by the constraints $0 \leq x_{k} \leq 1,\ \sum_{k=1}^{n} x_{k}=1$ all other coordinates must vanish:

$$x_1=x_2=\cdots=x_{k-1}=x_{k+1}=\cdots=x_n=0$$

Substituting into $f(x_1,x_2,\cdots,x_n)$ gives

$$f(0, \ldots, 0,1,0, \ldots, 0)=-0 \log _{2} 0-\cdots-0 \log _{2} 0-1 \log _{2} 1-0 \log _{2} 0-\cdots-0 \log _{2} 0=0$$

so every such point is a minimum point of $f$ under the constraints, with minimum value $0$. Therefore

$$0 \leq \operatorname{Ent}(D) \leq \log _{2} n$$

### Information gain

Suppose a discrete attribute $a$ has $V$ possible values $\{ a^1,a^2,\dots,a^{V}\}$. Splitting the dataset $D$ on attribute $a$ produces $V$ branch nodes; the $v$-th branch node contains exactly the samples of $D$ whose value on attribute $a$ is $a^{v}$, denoted $D^{v}$. Since different branch nodes contain different numbers of samples, each branch is weighted by its size, so branches with more samples carry more influence. The information gain obtained by splitting $D$ on attribute $a$ is then

$$\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|}\operatorname{Ent}\left(D^{v}\right)$$

### A worked example of ID3

Starting from the class labels of the training set (好瓜, good melon: 8 positive and 9 negative samples out of 17), we compute the information entropy of the root node:
$$\operatorname{Ent}(D)=-\frac{8}{17} \times \log _{2} \frac{8}{17}-\frac{9}{17} \times \log _{2} \frac{9}{17}=0.9975$$

#### First split

Compute the information gain of each attribute.
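As a sanity check, the root entropy can be reproduced in a couple of lines (a sketch assuming the 8 good / 9 bad label counts used above):

```python
from math import log2

good, bad = 8, 9
total = good + bad
ent_root = -(good / total) * log2(good / total) - (bad / total) * log2(bad / total)
print(round(ent_root, 4))  # 0.9975
```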
**Attribute $a_1$, 色泽 (color).** 色泽 takes three values: 青绿 (6 samples), 乌黑 (6 samples), and 浅白 (5 samples). Among the 青绿 samples there are 3 good melons and 3 bad melons, so the entropy of that branch is

$$\operatorname{Ent}\left(D^{1}\right)=-\frac{3}{6} \log _{2} \frac{3}{6}-\frac{3}{6} \log _{2} \frac{3}{6}=1.000$$

Similarly, the entropies of the 乌黑 and 浅白 branches are $\operatorname{Ent}\left(D^{2}\right)=0.918$ and $\operatorname{Ent}\left(D^{3}\right)=0.722$. Each branch is weighted by the fraction of all samples that take its value; subtracting the attribute's weighted entropy from the root entropy gives the information gain:
$$\begin{aligned}\operatorname{Gain}\left(D, a_{1}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{6}{17}\left(-\frac{3}{6} \log _{2} \frac{3}{6}-\frac{3}{6} \log _{2} \frac{3}{6}\right)+\frac{6}{17}\left(-\frac{4}{6} \log _{2} \frac{4}{6}-\frac{2}{6} \log _{2} \frac{2}{6}\right)+\frac{5}{17}\left(-\frac{1}{5} \log _{2} \frac{1}{5}-\frac{4}{5} \log _{2} \frac{4}{5}\right)\right)\\&=0.1081\end{aligned}$$

**Attribute $a_2$, 根蒂 (root).**

$$\begin{aligned}\operatorname{Gain}\left(D, a_{2}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{8}{17}\left(-\frac{3}{8} \log _{2} \frac{3}{8}-\frac{5}{8} \log _{2} \frac{5}{8}\right)+\frac{2}{17}\left(-1 \log _{2} 1\right)+\frac{7}{17}\left(-\frac{3}{7} \log _{2} \frac{3}{7}-\frac{4}{7} \log _{2} \frac{4}{7}\right)\right)\\&=0.1427\end{aligned}$$

**Attribute $a_3$, 敲声 (knock sound).**

$$\begin{aligned}\operatorname{Gain}\left(D, a_{3}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{10}{17}\left(-\frac{6}{10} \log _{2} \frac{6}{10}-\frac{4}{10} \log _{2} \frac{4}{10}\right)+\frac{5}{17}\left(-\frac{2}{5} \log _{2} \frac{2}{5}-\frac{3}{5} \log _{2} \frac{3}{5}\right)+\frac{2}{17}\left(-1 \log _{2} 1\right)\right)\\&=0.1408\end{aligned}$$

**Attribute $a_4$, 纹理 (texture).**

$$\begin{aligned}\operatorname{Gain}\left(D, a_{4}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{9}{17}\left(-\frac{7}{9} \log _{2} \frac{7}{9}-\frac{2}{9} \log _{2} \frac{2}{9}\right)+\frac{5}{17}\left(-\frac{1}{5} \log _{2} \frac{1}{5}-\frac{4}{5} \log _{2} \frac{4}{5}\right)+\frac{3}{17}\left(-1 \log _{2} 1\right)\right)\\&=0.3806\end{aligned}$$

**Attribute $a_5$, 脐部 (navel).**

$$\begin{aligned}\operatorname{Gain}\left(D, a_{5}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{7}{17}\left(-\frac{5}{7} \log _{2} \frac{5}{7}-\frac{2}{7} \log _{2} \frac{2}{7}\right)+\frac{6}{17}\left(-\frac{3}{6}\log _{2} \frac{3}{6}-\frac{3}{6} \log_{2} \frac{3}{6}\right)+\frac{4}{17}\left(-1 \log _{2} 1\right)\right)\\&=0.2892\end{aligned}$$

**Attribute $a_6$, 触感 (touch).**

$$\begin{aligned}\operatorname{Gain}\left(D, a_{6}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{2} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=0.9975-\left(\frac{12}{17}\left(-\frac{6}{12} \log _{2} \frac{6}{12}-\frac{6}{12} \log _{2} \frac{6}{12}\right)+\frac{5}{17}\left(-\frac{2}{5} \log _{2} \frac{2}{5}-\frac{3}{5} \log _{2} \frac{3}{5}\right)\right)\\&=0.0060\end{aligned}$$

Select the attribute to split on:
| 属性 (attribute) | 信息增益 (information gain) |
| --- | --- |
| 色泽 | 0.1081 |
| 根蒂 | 0.1427 |
| 敲声 | 0.1408 |
| 纹理 | 0.3806 |
| 脐部 | 0.2892 |
| 触感 | 0.0060 |
Split on the attribute with the largest information gain: here, 纹理 (texture).
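The entire first-level table can be reproduced from the per-value class counts that appear in the computations above (a sketch; `ent` and `gain` are our own helper names):

```python
from math import log2

def ent(counts):
    """Entropy of a node given its per-class counts, with 0*log2(0) := 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(root, branches):
    """Information gain: Ent(D) minus the size-weighted entropy of the branches."""
    n = sum(root)
    return ent(root) - sum(sum(b) / n * ent(b) for b in branches)

root = (8, 9)  # 8 good melons, 9 bad melons
attrs = {      # per-value class counts, as they appear in the formulas above
    "色泽": [(3, 3), (4, 2), (1, 4)],
    "根蒂": [(3, 5), (2, 0), (3, 4)],
    "敲声": [(6, 4), (2, 3), (2, 0)],
    "纹理": [(7, 2), (1, 4), (0, 3)],
    "脐部": [(5, 2), (3, 3), (0, 4)],
    "触感": [(6, 6), (2, 3)],
}
gains = {name: round(gain(root, branches), 4) for name, branches in attrs.items()}
print(gains)
print(max(gains, key=gains.get))  # 纹理
```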
#### Second split

Next, each branch node is split further. Take the branch node "纹理=清晰" as an example. Its sample set contains the 9 samples numbered {1,2,3,4,5,6,8,10,15}, and the available attribute set is {色泽, 根蒂, 敲声, 脐部, 触感}. We compute each attribute's information gain on this sample subset.

The information entropy of the subset ("纹理=清晰") is

$$\operatorname{Ent}(D_2)=-\frac{7}{9} \times \log _{2} \frac{7}{9}-\frac{2}{9} \times \log _{2} \frac{2}{9}=0.7642$$

Compute each attribute's information gain on the subset "纹理=清晰":
**Attribute $a_1$, 色泽.**

$$\begin{aligned}\operatorname{Gain}\left(D_{2}, a_{1}\right)&=\operatorname{Ent}\left(D_{2}\right)-\sum_{v=1}^{3} \frac{\left|D_{2}^{v}\right|}{|D_{2}|} \operatorname{Ent}\left(D_{2}^{v}\right) \\&=0.7642-\left[\frac{4}{9}\left(-\frac{3}{4} \log _{2} \frac{3}{4}-\frac{1}{4} \log _{2} \frac{1}{4}\right)+\frac{4}{9}\left(-\frac{3}{4} \log _{2} \frac{3}{4}-\frac{1}{4} \log _{2} \frac{1}{4}\right)+\frac{1}{9}\left(-1 \log _{2} 1\right)\right] \\&=0.0431\end{aligned}$$

**Attribute $a_2$, 根蒂.**

$$\begin{aligned}\operatorname{Gain}\left(D_{2}, a_{2}\right)&=\operatorname{Ent}\left(D_{2}\right)-\sum_{v=1}^{3} \frac{\left|D_{2}^{v}\right|}{|D_{2}|} \operatorname{Ent}\left(D_{2}^{v}\right) \\&=0.7642-\left[\frac{5}{9}\left(-1 \log _{2} 1\right)+\frac{3}{9}\left(-\frac{2}{3} \log _{2} \frac{2}{3}-\frac{1}{3} \log _{2} \frac{1}{3}\right)+\frac{1}{9}\left(-1 \log _{2} 1\right)\right] \\&=0.4581\end{aligned}$$

**Attribute $a_3$, 敲声.**

$$\begin{aligned}\operatorname{Gain}\left(D_{2}, a_{3}\right)&=\operatorname{Ent}\left(D_{2}\right)-\sum_{v=1}^{3} \frac{\left|D_{2}^{v}\right|}{|D_{2}|} \operatorname{Ent}\left(D_{2}^{v}\right) \\&=0.7642-\left[\frac{6}{9}\left(-\frac{5}{6} \log _{2} \frac{5}{6}-\frac{1}{6} \log _{2} \frac{1}{6}\right)+\frac{2}{9}\left(-1 \log _{2} 1\right)+\frac{1}{9}\left(-1 \log _{2} 1\right)\right] \\&=0.3308\end{aligned}$$

**Attribute $a_5$, 脐部.** The computation parallels that of 根蒂 and likewise yields $\operatorname{Gain}(D_{2}, a_{5})=0.4581$.

**Attribute $a_6$, 触感.**

$$\begin{aligned}\operatorname{Gain}\left(D_{2}, a_{6}\right)&=\operatorname{Ent}\left(D_{2}\right)-\sum_{v=1}^{2} \frac{\left|D_{2}^{v}\right|}{|D_{2}|} \operatorname{Ent}\left(D_{2}^{v}\right) \\&=0.7642-\left[\frac{6}{9}\left(-1 \log _{2} 1\right)+\frac{3}{9}\left(-\frac{1}{3} \log _{2} \frac{1}{3}-\frac{2}{3} \log _{2} \frac{2}{3}\right)\right] \\&=0.4581\end{aligned}$$

Select the split attribute:
| 属性 (attribute) | 信息增益 (information gain) |
| --- | --- |
| 色泽 | 0.0431 |
| 根蒂 | 0.4581 |
| 敲声 | 0.3308 |
| 脐部 | 0.4581 |
| 触感 | 0.4581 |
Again we split on the attribute with the largest information gain, but here three attributes tie for the maximum; in that case any one of them may be chosen as the split attribute. We choose 根蒂 (root). Applying the same procedure to the other values of 纹理 yields the second level of the tree.
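The same two helpers reproduce the second-level table on the subset $D_2$ ("纹理=清晰") from its per-value counts (a sketch; the counts for 脐部, whose derivation the text omits, are assumed to parallel 根蒂, which matches the tabulated 0.4581):

```python
from math import log2

def ent(counts):
    """Entropy of a node given its per-class counts, with 0*log2(0) := 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(root, branches):
    """Information gain: Ent(D) minus the size-weighted entropy of the branches."""
    n = sum(root)
    return ent(root) - sum(sum(b) / n * ent(b) for b in branches)

root2 = (7, 2)  # "纹理=清晰": 7 good melons, 2 bad melons
attrs2 = {      # per-value class counts, read off the formulas above
    "色泽": [(3, 1), (3, 1), (1, 0)],
    "根蒂": [(5, 0), (2, 1), (0, 1)],
    "敲声": [(5, 1), (2, 0), (0, 1)],
    "脐部": [(5, 0), (2, 1), (0, 1)],  # assumed: parallels 根蒂
    "触感": [(6, 0), (1, 2)],
}
for name, branches in attrs2.items():
    print(name, round(gain(root2, branches), 4))
```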
#### Third split

Next we split the node "根蒂=稍蜷" in the figure above. Its sample set is {6,8,15}, 3 samples in total, and the available attribute set is {色泽, 敲声, 脐部, 触感}; the information gains are computed in the same way.

The information entropy of the subset ("根蒂=稍蜷") is

$$\operatorname{Ent}\left(D_{3}\right)=-\sum_{k=1}^{2} p_{k} \log _{2} p_{k}=-\left(\frac{2}{3} \log _{2} \frac{2}{3}+\frac{1}{3} \log _{2} \frac{1}{3}\right)=0.918$$

Compute each attribute's information gain on the subset "根蒂=稍蜷":
**Attribute $a_1$, 色泽.**

$$\begin{aligned}\operatorname{Gain}\left(D_{3}, a_{1}\right)&=\operatorname{Ent}\left(D_{3}\right)-\sum_{v=1}^{2} \frac{\left|D_{3}^{v}\right|}{|D_{3}|} \operatorname{Ent}\left(D_{3}^{v}\right) \\&=0.918-\left[\frac{1}{3}\left(-1 \log _{2} 1-0 \cdot \log _{2} 0\right)+\frac{2}{3}\left(-\frac{1}{2} \log _{2} \frac{1}{2}-\frac{1}{2} \log _{2} \frac{1}{2}\right)\right] \\&=0.251\end{aligned}$$

**Attribute $a_2$, 敲声.**

$$\begin{aligned}\operatorname{Gain}\left(D_{3}, a_{2}\right)&=\operatorname{Ent}\left(D_{3}\right)-\sum_{v=1}^{1} \frac{\left|D_{3}^{v}\right|}{|D_{3}|} \operatorname{Ent}\left(D_{3}^{v}\right) \\&=0.918-\left[\frac{3}{3}\left(-\frac{2}{3} \log _{2} \frac{2}{3}-\frac{1}{3} \log _{2} \frac{1}{3}\right)\right] \\&=0\end{aligned}$$

**Attribute $a_3$, 脐部.** Like 敲声, 脐部 induces only a single subset here, so its information gain is also $0$: $\operatorname{Gain}\left(D_{3}, a_{3}\right)=0$.

**Attribute $a_4$, 触感.**

$$\begin{aligned}\operatorname{Gain}\left(D_{3}, a_{4}\right)&=\operatorname{Ent}\left(D_{3}\right)-\sum_{v=1}^{2} \frac{\left|D_{3}^{v}\right|}{|D_{3}|} \operatorname{Ent}\left(D_{3}^{v}\right) \\&=0.918-\left[\frac{1}{3}\left(-1 \log _{2} 1-0 \cdot \log _{2} 0\right)+\frac{2}{3}\left(-\frac{1}{2} \log _{2} \frac{1}{2}-\frac{1}{2} \log _{2} \frac{1}{2}\right)\right] \\&=0.251\end{aligned}$$

Select the split attribute:
| 属性 (attribute) | 信息增益 (information gain) |
| --- | --- |
| 色泽 | 0.251 |
| 敲声 | 0 |
| 脐部 | 0 |
| 触感 | 0.251 |
The two attributes 色泽 and 触感 both attain the maximum information gain; either may be chosen as the split attribute. We take 色泽 (color).
### The final tree model

### Shortcomings of ID3

- ID3 cannot handle continuous features.
- ID3 prefers the attribute with the largest information gain when building a node, and attributes with many values tend to have larger information gain than attributes with few values.
- ID3 does not handle missing values.
- ID3 does not address overfitting.

## The C4.5 algorithm

### Gain ratio

Information gain is biased toward attributes with many values and therefore prone to overfitting. To remedy this shortcoming, C4.5 does not use information gain directly; instead it selects the optimal splitting attribute by the gain ratio. For a discrete attribute $a$ of sample set $D$, the gain ratio is

$$\operatorname{Gain\_ratio}(D,a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}$$

where $\operatorname{IV}(a)$ is the intrinsic value of attribute $a$:

$$\operatorname{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}$$
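A minimal sketch of the gain-ratio computation: an ID-like attribute that takes a distinct value on every sample gets the maximal information gain but also a large intrinsic value, so its gain ratio is modest (function names are our own):

```python
from math import log2

def ent(counts):
    """Entropy of a node given its per-class counts, with 0*log2(0) := 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(root, branches):
    n = sum(root)
    return ent(root) - sum(sum(b) / n * ent(b) for b in branches)

def iv(branches):
    """Intrinsic value IV(a): the entropy of the branch-size distribution."""
    n = sum(sum(b) for b in branches)
    return -sum(sum(b) / n * log2(sum(b) / n) for b in branches)

def gain_ratio(root, branches):
    return gain(root, branches) / iv(branches)

root = (2, 2)                               # 4 samples: 2 positive, 2 negative
id_like = [(1, 0), (1, 0), (0, 1), (0, 1)]  # one distinct value per sample
binary = [(1, 1), (1, 1)]                   # two balanced values
print(gain(root, id_like), gain(root, binary))  # 1.0 0.0
print(gain_ratio(root, id_like))                # 0.5: IV(a) = 2 halves the gain
```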
**Gain ratio versus information gain.** As an extreme example, take a dataset of 4 samples, half positive and half negative, and compare an attribute $a_1$ that takes a distinct value on every sample with an attribute $a_2$ that splits the samples into two balanced halves. The information entropy is

$$\operatorname{Ent}(D)=-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}=1$$

$$\begin{aligned}\operatorname{Gain}\left(D, a_{1}\right) &=\operatorname{Ent}(D)-\sum_{v=1}^{4} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=1-\left(\frac{1}{4}\left(-1 \log _{2} 1\right)+\frac{1}{4}\left(-1 \log _{2} 1\right)+\frac{1}{4}\left(-1 \log _{2} 1\right)+\frac{1}{4}\left(-1 \log _{2} 1\right)\right) \\&=1\end{aligned}$$

$$\begin{aligned}\operatorname{Gain}\left(D, a_{2}\right)&=\operatorname{Ent}(D)-\sum_{v=1}^{2} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\&=1-\left(\frac{2}{4} \left(-\frac{1}{2} \log _{2} \frac{1}{2}-\frac{1}{2} \log _{2} \frac{1}{2}\right)+\frac{2}{4}\left(-\frac{1}{2} \log _{2} \frac{1}{2}-\frac{1}{2} \log _{2} \frac{1}{2}\right)\right) \\&<1\end{aligned}$$

If we split the original dataset on an attribute such as 信用级别 (credit level) that takes a distinct value on every sample, the dataset is divided into as many subsets as it has samples, and the second term of the information-gain formula becomes $0$. Such a split is useless, yet by the information-gain criterion it is the best possible one. Information gain measures how much an attribute $A$ reduces the classification uncertainty of the dataset, and an attribute with larger information gain has stronger discriminative power; but by the entropy formula, the more values an attribute has, the larger its intrinsic value $\operatorname{IV}(a)$. Using $\operatorname{IV}(a)$ as the denominator therefore penalizes attributes with many branches and corrects information gain's bias toward many-valued attributes.

### Shortcomings of C4.5

- C4.5 builds multiway trees, so generating the tree is relatively slow.
- C4.5 can only be used for classification.
- C4.5 relies on the entropy model, which involves many costly logarithm operations.

## The CART algorithm

CART is short for Classification and Regression Tree, a well-known decision-tree learning algorithm that can be used for both classification and regression tasks.
### Gini value

The Gini value $\operatorname{Gini}(D)$ measures the purity of a dataset: it is the probability that two samples drawn at random from the dataset carry different class labels. The smaller $\operatorname{Gini}(D)$, the purer the dataset.

Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k\ (k=1,2,\dots,|\mathcal{Y}|)$. The Gini value of $D$ is

$$\begin{aligned}\operatorname{Gini}(D) &=\sum_{k=1}^{|\mathcal{Y}|} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}} \\&=p_{1} \cdot\left(p_{2}+p_{3}+\ldots+p_{|\mathcal{Y}|}\right)+\ldots+p_{|\mathcal{Y}|} \cdot\left(p_{1}+p_{2}+p_{3}+\ldots+p_{|\mathcal{Y}|-1}\right) \\&=p_{1}\left(1-p_{1}\right)+p_{2}\left(1-p_{2}\right)+\ldots+p_{|\mathcal{Y}|}\left(1-p_{|\mathcal{Y}|}\right) \\&=\left(p_{1}+p_{2}+\ldots+p_{|\mathcal{Y}|}\right)-\left(p_{1}^{2}+p_{2}^{2}+\ldots+p_{|\mathcal{Y}|}^{2}\right) \\&=1-\sum_{k=1}^{|\mathcal{Y}|} p_{k}^{2}\end{aligned}$$

### Gini index

The Gini index of an attribute is the size-weighted Gini value of the subsets it induces; it can be read as the probability that a randomly selected sample is misclassified. The larger the Gini index, the greater the uncertainty of the samples.
$$\operatorname{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$$

where $D^v$ is the subset of samples taking the $v$-th value of attribute $a$.
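Both quantities are short one-liners in code (a sketch; `gini` and `gini_index` are our own names, and the example split reuses the 触感 counts from the ID3 example):

```python
def gini(counts):
    """Gini value: 1 - sum(p_k^2) over the per-class counts of a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_index(branches):
    """Gini index of a split: size-weighted Gini value of each branch."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(b) for b in branches)

print(gini((8, 9)))                  # root of the watermelon set, ≈ 0.4983
print(gini((17, 0)))                 # a pure node has Gini value 0
print(gini_index([(6, 6), (2, 3)]))  # split on 触感, ≈ 0.4941
```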
## References

- 《机器学习》, 周志华 (Zhou Zhihua, *Machine Learning*)