Support Vector Machine Study Notes (2): Parameter Comparison and a Face Recognition Project

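Recap of the previous post:

A support vector machine (SVM) is a generalized linear classifier that classifies data in a supervised learning setting; its decision boundary is the maximum-margin hyperplane fitted to the training samples.

SVM computes the empirical risk with the hinge loss and adds a regularization term to the optimization problem to control structural risk, which makes it a sparse and robust classifier. With kernel methods, SVM can also perform nonlinear classification, and it is one of the common kernel learning methods.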
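
As a small illustration of the hinge loss mentioned above, the sketch below computes max(0, 1 - y * f(x)) by hand and compares the mean with sklearn.metrics.hinge_loss. The labels and decision values are made up for the example and do not come from any trained model.

```python
import numpy as np
from sklearn.metrics import hinge_loss

# Labels in {-1, +1} and hypothetical decision values f(x) = w.x + b
y_true = np.array([1, 1, -1, -1])
decision = np.array([2.0, 0.5, -0.3, 1.0])

# Hinge loss per sample: max(0, 1 - y * f(x))
per_sample = np.maximum(0.0, 1.0 - y_true * decision)
mean_loss = per_sample.mean()

print(per_sample)                    # one loss value per sample
print(mean_loss)                     # average hinge loss computed by hand
print(hinge_loss(y_true, decision))  # the same average computed by scikit-learn
```

Samples that sit on the correct side with a margin of at least 1 contribute zero loss, which is where the sparsity of the solution comes from.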

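(1) Linear SVM classification

A linear SVM fits a hyperplane and optimizes the model so that the margin to the points closest to the hyperplane is as large as possible; this is called large-margin classification. SVMs are very sensitive to feature scales, which strongly affect the margin, so the features should be standardized first.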
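
One common way to standardize before fitting is to put StandardScaler in front of the SVM in a pipeline. The sketch below is only an illustration: the data comes from make_blobs, and the second feature is multiplied by 1000 purely to mimic features measured in very different units.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=100, centers=2,
                            random_state=0, cluster_std=0.6)
# Blow up the scale of the second feature (illustrative assumption only)
X_demo[:, 1] *= 1000.0

# The scaler is fitted on the training data and applied before the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
scaled_svm.fit(X_demo, y_demo)

# Inspect what the SVM actually sees: zero mean and unit variance per feature
X_scaled = scaled_svm.named_steps['standardscaler'].transform(X_demo)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
print(scaled_svm.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline means the same transformation is reused automatically at prediction time.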

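If every instance must be classified strictly, with all instances outside the margin and on the correct side, it is called hard-margin classification; if some margin violations are tolerated, it is called soft-margin classification.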

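(2) Nonlinear SVM classification

Approach 1: use PolynomialFeatures to add polynomial combinations of the original features as new, higher-dimensional features

Approach 2: use the kernel trick of the support vector machine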
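
The two approaches can be sketched side by side as follows. The make_moons dataset and all hyperparameter values (degree, C, coef0) are assumptions chosen for illustration, not settings from the previous post.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Approach 1: explicit polynomial features followed by a linear SVM
poly_features_svm = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10000),
)
poly_features_svm.fit(X_moons, y_moons)

# Approach 2: polynomial kernel, no explicit feature expansion
poly_kernel_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel='poly', degree=3, coef0=1, C=5),
)
poly_kernel_svm.fit(X_moons, y_moons)

# Training accuracy of both models
acc_features = poly_features_svm.score(X_moons, y_moons)
acc_kernel = poly_kernel_svm.score(X_moons, y_moons)
print(acc_features, acc_kernel)
```

The explicit expansion grows quickly with the degree and the number of features, while the kernel version never materializes the extra columns.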

This post discusses the parameters of SVM and a hands-on face recognition exercise.

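Parameter comparison: the data is generated with make_blobs from sklearn.datasets. First, three candidate separating lines are drawn: xfit holds the x values, and a loop over (slope, intercept) pairs plots each line. Next, a helper function draws the maximum-margin boundary using the decision_function method of the fitted SVC. The model is then trained with the linear kernel. Plotting the fit for different numbers of data points shows that, as long as the points on the margin do not change, the fitted line does not change either.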


## Import packages
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns;sns.set()
from sklearn.datasets import make_blobs
X,y=make_blobs(n_samples=50,centers=2,random_state=0,cluster_std=0.6)  ## cluster_std controls how spread out each cluster is
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap='autumn')  ## c: color sequence, s: marker size, cmap: colormap
xfit =np.linspace(-1,3.5)
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap='autumn')
for m ,b in[(1,0.65),(0.5,1.6),(-0.2,2.9)]:
    plt.plot(xfit,m*xfit+b,'-k')
plt.xlim(-1,3.5)
def plot_svc_decision_function(model,ax=None,plot_support=True):
    if ax is None:
        ax=plt.gca()  ## use the current axes
    xlim=ax.get_xlim()
    ylim =ax.get_ylim()
    x = np.linspace(xlim[0],xlim[1],30)
    y = np.linspace(ylim[0],ylim[1],30)
    Y,X = np.meshgrid(y,x)  ## build a grid from the points on the two axes
    xy =np.vstack([X.ravel(),Y.ravel()]).T
    P =model.decision_function(xy).reshape(X.shape)
    ## draw the decision boundary (level 0) and the two margins (levels -1 and 1)
    ax.contour(X,Y,P,colors='k',
              levels=[-1,0,1],alpha=0.5,
              linestyles=['--','-','--'])
    if plot_support:
        ## circle the support vectors; an explicit edge color keeps the hollow markers visible
        ax.scatter(model.support_vectors_[:,0],model.support_vectors_[:,1],
                  s=300,linewidth=1,facecolors='none',edgecolors='k')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
from sklearn.svm import SVC
model =SVC(kernel='linear')
model.fit(X,y)
## plot the data and the fitted boundary once the model has been trained
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap='autumn')
plot_svc_decision_function(model)
model.support_vectors_  ## the points lying on the margin
def plot_svm(N=10, ax=None):
    X, y = make_blobs(n_samples=200, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)
    
    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 6)
    plot_svc_decision_function(model, ax)
# Plot the fit for different numbers of data points
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, N in zip(ax, [60, 120]):
    plot_svm(N, axi)
    axi.set_title('N = {0}'.format(N))

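This line is the decision boundary we were hoping for.
Notice that three points are specially marked; they lie exactly on the margin.
These are the support vectors.
In Scikit-Learn they are stored in the support_vectors_ attribute.

The left panel shows the result for 60 points and the right panel the result for 120 points.
As long as the support vectors stay the same, adding other data points does not change the model.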
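
The observation can also be checked numerically. The sketch below is an illustration under stated assumptions: it uses two synthetic Gaussian clusters rather than the make_blobs data above, and it only adds points that already lie clearly outside the margin of the first model (signed decision value above 1.5, an arbitrary threshold). Under those conditions the refitted hyperplane should match the original up to solver tolerance.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def sample_two_clusters(n_per_class):
    # Two well-separated Gaussian clusters (synthetic, for illustration only)
    neg = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_per_class, 2))
    pos = rng.normal(loc=[6.0, 6.0], scale=0.5, size=(n_per_class, 2))
    X_out = np.vstack([neg, pos])
    y_out = np.array([0] * n_per_class + [1] * n_per_class)
    return X_out, y_out

X_small, y_small = sample_two_clusters(30)
base = SVC(kernel='linear', C=1e10).fit(X_small, y_small)

# Draw extra points and keep only those clearly outside the margin of the first model
X_extra, y_extra = sample_two_clusters(200)
signed = base.decision_function(X_extra) * np.where(y_extra == 1, 1.0, -1.0)
keep = signed > 1.5
X_big = np.vstack([X_small, X_extra[keep]])
y_big = np.concatenate([y_small, y_extra[keep]])

refit = SVC(kernel='linear', C=1e10).fit(X_big, y_big)

print(base.support_vectors_)
print(refit.support_vectors_)
print(base.coef_, refit.coef_)
```

Points that fall inside the margin or on the wrong side would change the solution, which is why the filter on the signed decision value matters here.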

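Next, switch to a nonlinear model and test how changing the kernel function affects the result, lifting the data into a higher dimension. Then tune the penalty parameter C: as C approaches infinity, classification is strict and no errors are allowed; when C is very small, more errors are tolerated. Finally, tune gamma: the larger gamma is, the more complex the model.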


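from mpl_toolkits import mplot3d
from sklearn.datasets import make_circles
# Ring-shaped dataset; the make_circles parameter values are an assumption
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)
r = np.exp(-(X ** 2).sum(1))
# Picture the ring-shaped dataset being stretched up and down in a third dimension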
def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')
 
plot_3D(elev=45, azim=45, X=X, y=y)
clf = SVC(kernel='rbf')
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none');
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
 
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
# Compare two values of C: 10 and 0.1
for axi, C in zip(ax, [10.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none');
    axi.set_title('C = {0:.1f}'.format(C), size=14)
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=1.1)
 
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
# Try different gamma values and compare the fitted models
for axi, gamma in zip(ax, [10.0, 0.1]):
    model = SVC(kernel='rbf', gamma=gamma).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none');
    axi.set_title('gamma = {0:.1f}'.format(gamma), size=14)

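The code above sets the penalty parameter C, which controls how much slack is tolerated, to 10 and 0.1, and sets gamma to 10 and 0.1; the resulting plots show the effect. With C=10 the classifier is strict: it looks for the widest margin only among boundaries that classify the training points correctly, so although the classification is entirely correct, the margin is narrow. With C=0.1 a few points are misclassified, but the margin is wider.

With gamma=10 the trained model is very complex: every training point is classified correctly, but the shape of the decision boundary suggests a high risk of overfitting. With gamma=0.1 the model is simpler: some points are misclassified, but the decision boundary is more stable.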
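
The same comparison can be put into numbers rather than read off a plot. The sketch below is one possible way to do it, not part of the original experiment: for the linear kernel it reports the margin width 2 / ||w|| for each C, and for the RBF kernel it compares training accuracy with 5-fold cross-validated accuracy for each gamma. The choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Effect of C on a linear SVM: the margin width is 2 / ||w||
X_c, y_c = make_blobs(n_samples=100, centers=2,
                      random_state=0, cluster_std=0.8)
margin_width = {}
for C in [10.0, 0.1]:
    linear_model = SVC(kernel='linear', C=C).fit(X_c, y_c)
    margin_width[C] = 2.0 / np.linalg.norm(linear_model.coef_)
    print('C =', C, 'margin width =', margin_width[C],
          'support vectors =', len(linear_model.support_vectors_))

# Effect of gamma on an RBF SVM: training accuracy vs cross-validated accuracy
X_g, y_g = make_blobs(n_samples=100, centers=2,
                      random_state=0, cluster_std=1.1)
scores = {}
for gamma in [10.0, 0.1]:
    rbf_model = SVC(kernel='rbf', gamma=gamma)
    train_acc = rbf_model.fit(X_g, y_g).score(X_g, y_g)
    cv_acc = cross_val_score(SVC(kernel='rbf', gamma=gamma), X_g, y_g, cv=5).mean()
    scores[gamma] = (train_acc, cv_acc)
    print('gamma =', gamma, 'train =', train_acc, 'cv =', cv_acc)
```

A large gap between training and cross-validated accuracy is a typical sign of overfitting.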

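Face recognition: the data is loaded with sklearn.datasets.fetch_lfw_people. So that no person has too few samples, only people with at least 60 images are kept (min_faces_per_person=60).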


## Load the data
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
## Data preprocessing: split into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(faces.data,faces.target,random_state=40)

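Model building: a pipeline chains the two estimators together, using either Pipeline or make_pipeline, and GridSearchCV tunes the parameters. Because the inputs are images made of pixels, the number of features is very large, so PCA is used for dimensionality reduction, and the number of components is tuned by GridSearchCV as well.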


## Build the model
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA 
pca_svc = Pipeline(steps=[('pca',PCA( whiten=True, random_state=42)),('svc',SVC(class_weight='balanced'))])
from sklearn.model_selection import GridSearchCV
param_grid = [{'pca__n_components':[100,150,200,500],'svc__C': [1,5,10],'svc__gamma':[0.0001,0.0005,0.001],
               'svc__kernel':['linear','poly','rbf','sigmoid']}]
grid_search =GridSearchCV(pca_svc,param_grid)
grid_search.fit(X_train,Y_train)
grid_search.best_params_


from sklearn.pipeline import make_pipeline
pca = PCA( whiten=True, random_state=42)
svc = SVC(class_weight='balanced')
# Reduce the dimensionality first, then apply the SVM
model = make_pipeline(pca, svc)
grid_search =GridSearchCV(model,param_grid)
grid_search.fit(X_train,Y_train)
grid_search.best_params_

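Note: every key in the parameter grid must carry the step-name prefix (pca__ or svc__), because names the pipeline does not recognize are rejected with an error. A step that needs no tuning can simply be left out of the grid.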
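
When a pipeline grid search fails, one likely cause is a grid key that lacks the step-name prefix. The sketch below shows how the names are resolved. It uses load_digits as a stand-in for the LFW faces so that it runs without a download; the dataset, n_components=20, and cv=3 are all assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# load_digits stands in for the LFW faces so the sketch runs offline
digits = load_digits()
pipe = make_pipeline(PCA(n_components=20, whiten=True, random_state=42),
                     SVC(class_weight='balanced'))

# Valid grid keys have the form '<step name>__<parameter>'
valid_names = sorted(pipe.get_params().keys())
print([name for name in valid_names if '__' in name][:5])

# Tuning only the SVC step is accepted; PCA keeps its fixed settings
search = GridSearchCV(pipe, {'svc__C': [1, 10]}, cv=3)
search.fit(digits.data, digits.target)
print(search.best_params_)

# A key without the step prefix is rejected by the pipeline
try:
    pipe.set_params(C=1)
    bad_name_error = None
except ValueError as err:
    bad_name_error = err
    print('rejected:', err)
```

With make_pipeline the step names are the lowercased class names, which is why the pca__ and svc__ prefixes work for both ways of building the pipeline.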

Predict on the test set and build the confusion matrix:


model=grid_search.best_estimator_
y_pred = model.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(Y_test,y_pred,target_names=faces.target_names))
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
matrix = confusion_matrix(Y_test,y_pred)
sns.heatmap(matrix.T,square=True,annot=True,fmt='d',cbar=False,
           xticklabels=faces.target_names,
           yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')

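Precision = correctly predicted positives (TP) / all predicted positives (TP + FP)
Recall = correctly predicted positives (TP) / all actual positives (TP + FN)
F1 = 2 × precision × recall / (precision + recall)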
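
The three formulas can be verified on a small example. The labels below are hypothetical and unrelated to the face data; the sketch derives TP, FP, and FN per class from the confusion matrix and compares the result with scikit-learn's precision_recall_fscore_support.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tp = np.diag(cm)                       # correct predictions per class
fp = cm.sum(axis=0) - tp               # predicted as the class but actually another
fn = cm.sum(axis=1) - tp               # belong to the class but predicted as another

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Reference values computed by scikit-learn
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(y_true, y_pred)
print(precision, recall, f1)
```

Note that this confusion matrix is not transposed, unlike the heatmap above, so its rows are the true labels.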

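The diagonal of the confusion matrix holds the correctly predicted samples. The classification_report above likewise shows that George W Bush has the highest recall, that is, the largest share of his images is identified correctly, and he also has the largest number of correct predictions in the confusion matrix.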

Original article: https://blog.csdn.net/lovexyyforever/article/details/125886559