AdaBoost: Adaptive Boosting (A Hands-On NumPy Implementation)

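AdaBoost is a boosting algorithm that learns a sequence of weak classifiers by repeatedly changing the weights of the training samples, and then combines them linearly into a strong classifier.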

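Any boosting method has to answer two key questions:

First, how to change the weights (or probability distribution) of the training samples during training;

Second, how to combine multiple weak classifiers into one strong classifier.

AdaBoost's answers are as follows. First, it raises the weights of the samples that the previous round's weak classifier misclassified and lowers the weights of the samples it classified correctly. Second, it combines the weak classifiers linearly, giving larger weights to weak classifiers that classify well and smaller weights to those with a high classification error rate.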

1. Import the required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

2. Generate and inspect the dataset

X, y = make_blobs(n_samples=150, n_features=2, centers=2, cluster_std=1.2, random_state=40)
# Convert the labels from {0, 1} to {-1, +1}
y_ = y.copy()
# Relabel class 0 as -1
y_[y_ == 0] = -1
# astype returns a new array, so the result has to be assigned back
y_ = y_.astype(float)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.3, random_state=43)
# Plot the two classes: red for +1, blue for -1
colors = {1: 'red', -1: 'blue'}
plt.scatter(X[:, 0], X[:, 1], c=pd.Series(y_).map(colors))
plt.show()
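
The label conversion can be checked in isolation. The following is a minimal sketch, assuming a small hand-written array `y_demo` in place of the `make_blobs` output (the array and the variable names are illustrative, not from the original post). It also shows that `astype` returns a converted copy rather than changing the array in place.

```python
import numpy as np

# Hypothetical 0/1 labels standing in for the output of make_blobs
y_demo = np.array([0, 1, 1, 0, 1])

# Map {0, 1} -> {-1, +1} without modifying the original array
y_signed = np.where(y_demo == 0, -1, 1)

# astype returns a new float array; y_signed keeps its integer dtype
y_float = y_signed.astype(float)

print(y_signed, y_signed.dtype.kind)
print(y_float, y_float.dtype.kind)
```

Calling `y_signed.astype(float)` without assigning the result would leave `y_signed` unchanged.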

3. Define a decision stump class as AdaBoost's weak classifier

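A single-level decision tree (also called a decision stump) is the simplest kind of decision tree. Because the tree splits only once, it really is nothing more than a stump.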

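class decision_stump:
    def __init__(self):
        # Polarity: with +1, samples below the threshold are classified as -1;
        # with -1, samples above the threshold are classified as -1
        self.label = 1
        # Index of the feature to split on
        self.feature_index = None
        # Split threshold for that feature
        self.threshold = None
        # Weight of this stump in the final vote (larger when its error is smaller)
        self.alpha = None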
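
How the three attributes turn into a prediction can be sketched as a standalone function. This is an illustrative helper, not part of the original class: the name `stump_predict` and the toy matrix `X_demo` are assumptions made here, while the comparison rule is the one used later in the post.

```python
import numpy as np

def stump_predict(X, feature_index, threshold, label=1):
    """Predict -1/+1 from a single threshold on one feature.

    label=+1: samples with feature < threshold get -1, the rest +1.
    label=-1: samples with feature > threshold get -1, the rest +1.
    """
    preds = np.ones(X.shape[0])
    negative_idx = label * X[:, feature_index] < label * threshold
    preds[negative_idx] = -1
    return preds

# Hypothetical data: three samples, two features
X_demo = np.array([[0.5, 3.0],
                   [1.5, 1.0],
                   [2.5, 2.0]])

pred_pos = stump_predict(X_demo, feature_index=0, threshold=1.5, label=1)
pred_neg = stump_predict(X_demo, feature_index=0, threshold=1.5, label=-1)
print(pred_pos)
print(pred_neg)
```

Note that the sample lying exactly on the threshold is predicted as +1 under both polarities, so flipping `label` is not an exact negation of the rule.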

4. Define the AdaBoost algorithm class

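Let us first go through the AdaBoost algorithm.

Suppose we are given a training dataset for binary classification:
$T=\{(x_1,y_1)\;,\;(x_2,y_2)\;,\;\cdots\;,\;(x_N,y_N)\}$
where each sample consists of an instance and a label. The instance $x_i\in \chi \subseteq \mathbb{R}^n$ , the label $y_i\in Y=\{-1\;,\;+1\}$ , $\chi$ is the instance space and $Y$ is the label set. AdaBoost uses the following algorithm to learn a series of weak (base) classifiers from the training data and to combine them linearly into a strong classifier.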

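The steps of the algorithm are as follows:

Input: the training dataset $T=\{(x_1,y_1)\;,\;(x_2,y_2)\;,\;\cdots\;,\;(x_N,y_N)\}$ , where $x_i\in \chi \subseteq \mathbb{R}^n$ and $y_i\in Y=\{-1,+1\}$ ; a weak learning algorithm;

Output: the final classifier $G(x)$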

(1) Initialise the weight distribution over the training data:
$D_1=(\omega_{11}\;,\;\cdots\;,\;\omega_{1i}\;,\;\cdots\;,\;\omega_{1N})\;,\;\omega_{1i}=\frac{1}{N}\;,\;i=1,2,\cdots,N$
(2) For $m=1,2,\cdots,M$ :

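Learn from the training data with weight distribution $D_m$ to obtain the base classifier:
$G_m(x):\chi \rightarrow\{-1,+1\}$
Compute the classification error rate of $G_m(x)$ on the training data, that is, the total weight of the samples it misclassifies:
$e_m=\sum_{i=1}^N\omega_{mi}I(G_m(x_i)\neq y_i)$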
Compute the coefficient of $G_m(x)$ (the logarithm is the natural logarithm):
$\alpha_m=\frac{1}{2}\log\frac{1-e_m}{e_m}$
Update the weight distribution over the training data:
$D_{m+1}=(\omega_{m+1,1}\;,\;\cdots\;,\;\omega_{m+1,i}\;,\;\cdots\;,\;\omega_{m+1,N})$
$\omega_{m+1,i}=\frac{\omega_{mi}}{Z_m}\exp(-\alpha_my_iG_m(x_i))\;,\;i=1,2,\cdots,N$
Here $Z_m$ is the normalisation factor:
$Z_m=\sum_{i=1}^N\omega_{mi}\exp(-\alpha_my_iG_m(x_i))$
It makes $D_{m+1}$ a probability distribution.

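(3) Build the linear combination of the base classifiers:
$f(x)=\sum_{m=1}^M\alpha_mG_m(x)$
and obtain the final classifier:
$G(x)=\operatorname{sign}(f(x))=\operatorname{sign}\left(\sum_{m=1}^M\alpha_mG_m(x)\right)$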
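
To make steps (1) and (2) concrete, here is a minimal sketch of a single boosting round on a toy case: ten samples with uniform weights, of which a hypothetical weak classifier gets three wrong. The arrays `y_true` and `g_pred` are made up for illustration and do not come from the dataset used in this post.

```python
import numpy as np

# Hypothetical labels and weak-classifier outputs; samples 6, 7 and 8 are misclassified
y_true = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
g_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])

N = len(y_true)
w = np.full(N, 1 / N)                       # (1) D_1: uniform weights 1/N

miss = g_pred != y_true
e = np.sum(w[miss])                         # weighted error rate e_m
alpha = 0.5 * np.log((1 - e) / e)           # coefficient alpha_m

unnormalised = w * np.exp(-alpha * y_true * g_pred)
Z = np.sum(unnormalised)                    # normalisation factor Z_m
w_next = unnormalised / Z                   # D_{m+1}

print(round(e, 4), round(alpha, 4), round(Z, 4))
print(np.round(w_next, 4))
```

With an error rate of 0.3, each of the three misclassified samples ends up with weight 1/6 and each of the seven correctly classified samples with weight 1/14, so the next weak classifier is pushed towards the earlier mistakes.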

Number of weak classifiers

class Adaboost:
    def __init__(self, n_estimators=10):
        # Number of weak classifiers, i.e. boosting rounds M
        self.n_estimators = n_estimators

The AdaBoost fitting algorithm

# Method of the Adaboost class
def fit(self, X, y):
    m, n = X.shape
    # (1) Initialise the sample weights uniformly to 1/N
    w = np.full(m, 1 / m)
    # List of fitted base classifiers
    self.estimators = []
    # (2) For each boosting round m = 1, 2, ..., M
    for _ in range(self.n_estimators):
        # (2.a) Train one weak classifier: a decision stump
        estimator = decision_stump()
        # Smallest weighted error found so far
        min_error = float('inf')
        # Scan every feature and keep the split with the smallest weighted error
        for i in range(n):
            # Distinct values of feature i
            unique_values = np.unique(X[:, i])
            # Try each distinct value as the split threshold
            for threshold in unique_values:
                p = 1
                # Start with every prediction set to +1
                pred = np.ones(np.shape(y))
                # Samples below the threshold are predicted as -1
                pred[X[:, i] < threshold] = -1
                # (2.b) Weighted error rate: total weight of the misclassified samples
                error = np.sum(w[y != pred])
                # If the error exceeds 0.5, flip the polarity of the stump
                # e.g. error = 0.6 => (1 - error) = 0.4
                if error > 0.5:
                    error = 1 - error
                    p = -1

                # Keep the configuration with the smallest error so far
                if error < min_error:
                    estimator.label = p
                    estimator.threshold = threshold
                    estimator.feature_index = i
                    min_error = error

Computing the base classifier's weight

# (2.c) Still inside the boosting loop of fit
estimator.alpha = 0.5 * np.log((1 - min_error) / (min_error + 1e-9))
# Start with every prediction set to +1
preds = np.ones(np.shape(y))
# Samples on the negative side of the threshold, taking the polarity into account
negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold)
# Set those predictions to -1
preds[negative_idx] = -1
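
The coefficient formula is easiest to read from a few values. The sketch below uses only the standard library; the helper name `stump_alpha` is chosen here for illustration, and it applies the same `1e-9` guard against division by zero as the fitting code.

```python
import math

def stump_alpha(error, eps=1e-9):
    """alpha = 0.5 * ln((1 - error) / (error + eps)); eps avoids division by zero."""
    return 0.5 * math.log((1 - error) / (error + eps))

for err in (0.0, 0.1, 0.3, 0.5):
    print(f"error={err:.1f}  alpha={stump_alpha(err):.4f}")
```

A stump with zero error receives a large but finite weight thanks to the guard, the weight shrinks as the error grows, and a stump at chance level (error 0.5) receives a weight of essentially zero.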

Updating the sample weights

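# (2.d) Raise the weights of misclassified samples and lower the others
w *= np.exp(-estimator.alpha * y * preds)
# Renormalise so that the weights sum to 1
w /= np.sum(w)
# Store this weak classifier
self.estimators.append(estimator)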
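
One consequence of this update is worth seeing numerically: under the new weights, the classifier that was just trained has a weighted error of exactly one half, so the next round has to find a different split. The sketch below checks this on made-up data; the labels, predictions and starting weights are hypothetical, and the coefficient is computed without the `1e-9` guard so the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels, weak-classifier outputs and a non-uniform weight vector
y_true = rng.choice([-1.0, 1.0], size=20)
preds = y_true.copy()
preds[:4] *= -1                       # make the first four predictions wrong
w = rng.random(20)
w /= w.sum()

miss = preds != y_true
error = np.sum(w[miss])
alpha = 0.5 * np.log((1 - error) / error)

# Same two-step update as in the fitting loop, applied to a copy
w_new = w * np.exp(-alpha * y_true * preds)
w_new /= np.sum(w_new)

print(error, np.sum(w_new[miss]))
```

The second printed value is 0.5 up to floating-point rounding, whatever the starting weights are, as long as the error is strictly between 0 and 1.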

Defining the prediction function

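# Method of the Adaboost class
def predict(self, X):
    m = len(X)
    y_pred = np.zeros((m, 1))
    # Accumulate the weighted prediction of every weak classifier
    for estimator in self.estimators:
        # Start with every prediction set to +1
        predictions = np.ones(np.shape(y_pred))
        # Samples on the negative side of the threshold, taking the polarity into account
        negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold)
        # Set those predictions to -1
        predictions[negative_idx] = -1
        # (2.e) Add this classifier's prediction, weighted by its alpha
        y_pred += estimator.alpha * predictions
    # After the loop: the sign of the weighted sum is the final prediction
    y_pred = np.sign(y_pred).flatten()
    return y_pred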
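
The weighted vote itself can be illustrated without any fitted model. In this minimal sketch the stump outputs and the coefficients are invented numbers, chosen so that the weighted vote and a plain majority vote disagree on one sample.

```python
import numpy as np

# Hypothetical outputs of three weak classifiers on four samples (one row per classifier)
stump_outputs = np.array([[ 1,  1, -1, -1],
                          [ 1, -1, -1,  1],
                          [-1,  1, -1,  1]])
# Hypothetical coefficients alpha_m for those classifiers
alphas = np.array([0.9, 0.4, 0.3])

# f(x) = sum_m alpha_m * G_m(x), then G(x) = sign(f(x))
f = alphas @ stump_outputs
final = np.sign(f)
print(f)
print(final)
```

On the last sample two of the three classifiers vote +1, yet the final prediction is -1 because the dissenting classifier carries the largest coefficient. `np.sign` returns 0 when the weighted sum is exactly zero, a case that neither this sketch nor the prediction function handles specially.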


Computing the accuracy

# Create an Adaboost model instance
clf = Adaboost(n_estimators=5)
# Fit the model
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Compute the prediction accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

Run result (as reported in the original post)

0.9777777777777777

Complete code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

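# Generate and inspect the data
X, y = make_blobs(n_samples=150, n_features=2, centers=2, cluster_std=1.2, random_state=40)
# Convert the labels from {0, 1} to {-1, +1}
y_ = y.copy()
# Relabel class 0 as -1
y_[y_ == 0] = -1
# astype returns a new array, so the result has to be assigned back
y_ = y_.astype(float)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.3, random_state=43)
# Plot the two classes: red for +1, blue for -1
colors = {1: 'red', -1: 'blue'}
plt.scatter(X[:, 0], X[:, 1], c=pd.Series(y_).map(colors))
plt.show()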

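# Decision stump used as AdaBoost's weak classifier: a tree with a single split node
class decision_stump:
    def __init__(self):
        # Polarity: with +1, samples below the threshold are classified as -1;
        # with -1, samples above the threshold are classified as -1
        self.label = 1
        # Index of the feature to split on
        self.feature_index = None
        # Split threshold for that feature
        self.threshold = None
        # Weight of this stump in the final vote (larger when its error is smaller)
        self.alpha = None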

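# Define the AdaBoost algorithm class
class Adaboost:
    # Number of weak classifiers
    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators

    # AdaBoost fitting algorithm
    def fit(self, X, y):
        m, n = X.shape
        # (1) Initialise the sample weights uniformly to 1/N
        w = np.full(m, 1 / m)
        # List of fitted base classifiers
        self.estimators = []
        # (2) For each boosting round m = 1, 2, ..., M
        for _ in range(self.n_estimators):
            # (2.a) Train one weak classifier: a decision stump
            estimator = decision_stump()
            # Smallest weighted error found so far
            min_error = float('inf')
            # Scan every feature and keep the split with the smallest weighted error
            for i in range(n):
                # Distinct values of feature i
                unique_values = np.unique(X[:, i])
                # Try each distinct value as the split threshold
                for threshold in unique_values:
                    p = 1
                    # Start with every prediction set to +1
                    pred = np.ones(np.shape(y))
                    # Samples below the threshold are predicted as -1
                    pred[X[:, i] < threshold] = -1
                    # (2.b) Weighted error rate: total weight of the misclassified samples
                    error = np.sum(w[y != pred])
                    # If the error exceeds 0.5, flip the polarity of the stump
                    # e.g. error = 0.6 => (1 - error) = 0.4
                    if error > 0.5:
                        error = 1 - error
                        p = -1

                    # Keep the configuration with the smallest error so far
                    if error < min_error:
                        estimator.label = p
                        estimator.threshold = threshold
                        estimator.feature_index = i
                        min_error = error

            # (2.c) Compute the weight of the base classifier
            estimator.alpha = 0.5 * np.log((1 - min_error) / (min_error + 1e-9))
            # Start with every prediction set to +1
            preds = np.ones(np.shape(y))
            # Samples on the negative side of the threshold, taking the polarity into account
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold)
            # Set those predictions to -1
            preds[negative_idx] = -1
            # (2.d) Update the sample weights and renormalise them
            w *= np.exp(-estimator.alpha * y * preds)
            w /= np.sum(w)
            # Store this weak classifier
            self.estimators.append(estimator)

    # Prediction function
    def predict(self, X):
        m = len(X)
        y_pred = np.zeros((m, 1))
        # Accumulate the weighted prediction of every weak classifier
        for estimator in self.estimators:
            # Start with every prediction set to +1
            predictions = np.ones(np.shape(y_pred))
            # Samples on the negative side of the threshold, taking the polarity into account
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label * estimator.threshold)
            # Set those predictions to -1
            predictions[negative_idx] = -1
            # (2.e) Add this classifier's prediction, weighted by its alpha
            y_pred += estimator.alpha * predictions
        # The sign of the weighted sum is the final prediction
        y_pred = np.sign(y_pred).flatten()
        return y_pred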

# Compute the accuracy

# Create an Adaboost model instance
clf = Adaboost(n_estimators=10)
# Fit the model
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Compute the prediction accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

Original article: https://blog.csdn.net/wzk4869/article/details/126767022