Decision Trees can perform both classification and regression tasks, and even multioutput tasks.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),  # image_path: the notes' local path helper
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
$ dot -Tpng iris_tree.dot -o iris_tree.png
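Once fitted, the tree can also report per-class probabilities for a new flower; a quick sketch (the 5 cm x 1.5 cm petal measurements are just an illustrative input):

tree_clf.predict_proba([[5, 1.5]])  # per-class probabilities, e.g. array([[0., 0.907, 0.093]])
tree_clf.predict([[5, 1.5]])        # most likely class, e.g. array([1]) (versicolor)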
Training is recursive: at each node, CART searches for the feature k and threshold t_k that produce the purest split. The CART cost function for classification is:
J(k, t_k)=\frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right} \quad (G \text{ measures the impurity of the subset})
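A minimal sketch of scoring one candidate split with this cost (gini and cart_cost are hypothetical helper names; CART would evaluate every feature/threshold pair and greedily keep the minimum, then recurse on each side):

import numpy as np

def gini(labels):
    # Gini impurity of one node: 1 - sum_k p_k^2 (defined as 0.0 for an empty node)
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cart_cost(feature_values, labels, threshold):
    # weighted impurity J(k, t_k) of splitting one feature at `threshold`
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    m = len(labels)
    return len(left) / m * gini(left) + len(right) / m * gini(right)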
Besides the Gini index, Shannon entropy can also serve as the impurity measure G:
H_i=-\sum_{k=1,\,p_{i,k}\neq 0}^{n}p_{i,k}\log(p_{i,k})
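As a sketch, it can be computed like the Gini helper above (entropy is a hypothetical name; base-2 log is used here, and the base only rescales the value):

def entropy(labels):
    # Shannon entropy H_i; np.unique returns only the classes actually present,
    # which matches the p_{i,k} != 0 restriction in the sum
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))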
Scikit-Learn uses the Gini index by default because it is slightly cheaper to compute, and in practice the two criteria produce very similar trees. When they do differ: Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
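Switching the criterion in scikit-learn is a one-liner:

tree_clf = DecisionTreeClassifier(max_depth=2, criterion="entropy")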
For regression, the impurity measure is replaced by the mean squared error.
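Concretely, the CART cost function for regression keeps the same form but replaces G with the MSE of each child node, where each node predicts the mean target \hat{y}_{node} of its samples:

J(k, t_k)=\frac{m_{left}}{m}MSE_{left} + \frac{m_{right}}{m}MSE_{right} \quad \text{with} \quad MSE_{node}=\frac{1}{m_{node}}\sum_{i \in node}(\hat{y}_{node}-y^{(i)})^2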
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 1)                                # assumed sample inputs
y = 4 * (X[:, 0] - 0.5) ** 2 + np.random.randn(200) / 10  # noisy quadratic target
# regularize with max_depth=2 (or e.g. min_samples_leaf=10), else the tree obviously overfits
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
The value a leaf returns is simply the mean target of all the training samples that fall in that region.
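A quick check of that claim (x=0.6 is an arbitrary query point; tree_reg.apply returns the index of the leaf each sample falls into):

leaf = tree_reg.apply([[0.6]])[0]       # leaf reached by the query point
in_leaf = tree_reg.apply(X) == leaf     # training samples in the same leaf
assert np.isclose(tree_reg.predict([[0.6]])[0], y[in_leaf].mean())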