The Boosting process closely resembles how humans learn. Learning new material is usually iterative: on the first pass we retain part of it but also make some mistakes, and those mistakes leave a deep impression. On the second pass we concentrate on the material we got wrong, to reduce similar errors. The cycle repeats until the error rate drops to a very low level.
Key questions:
- How are the voting weights of the base learners determined?
- How is the training data distribution adjusted between rounds?
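For reference, classical AdaBoost answers both questions in closed form (this is the textbook binary formulation with labels $y_i \in \{-1, +1\}$, not something stated in the code below). A weak learner $h_t$ with weighted error $\epsilon_t$ receives voting weight

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$$

and the sample weights are then rescaled to emphasize the examples that $h_t$ misclassified:

$$D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$$

where $Z_t$ is a normalizing constant.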
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

x, y = datasets.make_moons(n_samples=50000, noise=0.3, random_state=42)

# train_test_split defaults to a 75% / 25% train/test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier  # this import was missing in the original snippet

# Passing a full DecisionTreeClassifier(); AdaBoost's default base learner
# is a depth-1 decision stump
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=500)
ada_clf.fit(x_train, y_train)
ada_clf.score(x_test, y_test)
Gradient descent updates the weights step by step:

$$w_1 = w_0 - \alpha \nabla f(w_0; x)$$

$$w_2 = w_1 - \alpha \nabla f(w_1; x)$$

$$w_3 = w_2 - \alpha \nabla f(w_2; x)$$

$$w_4 = w_3 - \alpha \nabla f(w_3; x)$$

$$w_5 = w_4 - \alpha \nabla f(w_4; x)$$

Substituting each step into the next (telescoping):

$$w_5 = w_0 - \alpha \nabla f(w_0; x) - \alpha \nabla f(w_1; x) - \alpha \nabla f(w_2; x) - \alpha \nabla f(w_3; x) - \alpha \nabla f(w_4; x)$$

Let $\alpha = 1$ and define $h_i(x) = -\nabla f(w_i; x)$.
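A minimal numeric sketch of this telescoping, assuming a toy objective $f(w) = (w-3)^2$ (the objective and step size are illustrative, not from the original):

```python
# Toy objective f(w) = (w - 3)^2, with gradient 2 * (w - 3)
def grad_f(w):
    return 2.0 * (w - 3.0)

alpha = 0.1
w0 = 0.0

w = w0
steps = []  # each entry is -alpha * grad f(w_i), i.e. alpha * h_i(x) in the notation above
for _ in range(5):
    step = -alpha * grad_f(w)
    steps.append(step)
    w += step

# w_5 equals w_0 plus the accumulated negative-gradient steps:
# the iterative updates and the telescoped sum give the same number.
print(w, w0 + sum(steps))
```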
Defining the ensemble as the sum of these per-step models gives:

$$H(x) = h_0(x) + h_1(x) + h_2(x) + \cdots$$
$H(x)$ is exactly the boosting ensemble's additive representation. If each $h_i(x)$ in the sum above is a decision tree model, the expression becomes GBDT:

GBDT = gradient descent + Boosting + decision trees
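Concretely, for the squared-error loss $L(y, H) = \frac{1}{2}\,(y - H(x))^2$ the negative gradient with respect to the current prediction is just the residual, so each new tree is fit to the residuals of the ensemble so far:

$$-\frac{\partial L}{\partial H(x)} = y - H(x)$$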
ID | Age (years) | Weight (kg) | Height (m) |
---|---|---|---|
1 | 5 | 20 | 1.1 |
2 | 7 | 30 | 1.3 |
3 | 21 | 70 | 1.7 |
4 | 30 | 60 | 1.8 |
5 | 25 | 65 | ? |
Result: with age 21 as the split point, the left group {1.1, 1.3} has variance 0.01 and the right group {1.7, 1.8} has variance 0.0025, so the total variance = 0.01 + 0.0025 = 0.0125.

Predicted height of sample 5 = 1.475 (the base prediction, i.e. the mean of the four training heights) + 0.03 + 0.275 (the corrections from the fitted residual trees) = 1.78 m.
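A minimal sketch of this worked example with sklearn's GradientBoostingRegressor (the hyperparameters here, two depth-1 trees with learning_rate=1.0, are my assumption chosen to mirror the two hand-built trees; the exact splits sklearn finds may differ from the hand computation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# The four labeled rows from the table: [age, weight] -> height
X = np.array([[5, 20], [7, 30], [21, 70], [30, 60]])
y = np.array([1.1, 1.3, 1.7, 1.8])

# Base prediction = mean height 1.475, then two residual-fitting stumps
gbr = GradientBoostingRegressor(n_estimators=2, learning_rate=1.0, max_depth=1)
gbr.fit(X, y)

# Predict sample 5 (age 25, weight 65); expect a value close to 1.78
print(gbr.predict(np.array([[25, 65]])))
```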
# GradientBoostingClassifier also uses decision trees as its base learner by default;
# increasing the number of base learners improves accuracy
from sklearn.ensemble import GradientBoostingClassifier  # gradient boosted decision trees
gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=500)
gb_clf.fit(x_train, y_train)
gb_clf.score(x_test, y_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier  # this import was missing in the original snippet

x, y = datasets.make_moons(n_samples=50000, noise=0.3, random_state=42)

# oob_score=True evaluates each sample on the trees that never saw it,
# so the whole dataset can be used for fitting; n_jobs=-1 uses all CPU cores
rc_clf = RandomForestClassifier(n_estimators=500, random_state=666,
                                oob_score=True, n_jobs=-1)
rc_clf.fit(x, y)
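Because oob_score=True was set, the out-of-bag accuracy estimate is available after fitting:

```python
# Out-of-bag accuracy: each sample is scored only by the trees that did not
# see it during bootstrap sampling, so no separate test split is needed here
rc_clf.oob_score_
```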
from sklearn.ensemble import VotingClassifier
# logistic regression
from sklearn.linear_model import LogisticRegression
# SVM
from sklearn.svm import SVC
# decision tree
from sklearn.tree import DecisionTreeClassifier

# 'hard' voting: each classifier casts one vote and the majority wins
voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC()),
    ('dt_clf', DecisionTreeClassifier())], voting='hard')

# 'soft' voting averages the predicted class probabilities instead
voting_clf2 = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC(probability=True)),  # SVC needs probability=True for soft voting
    ('dt_clf', DecisionTreeClassifier())], voting='soft')
voting_clf2.fit(x_train, y_train)
voting_clf2.score(x_test, y_test)
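The hard-voting classifier defined above is never fit in the original snippet; for comparison it can be evaluated the same way:

```python
# Majority vote over the three base classifiers
voting_clf.fit(x_train, y_train)
voting_clf.score(x_test, y_test)
```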