• Unsupervised learning: KMeans study notes and examples


    K-means is a simple algorithm that can cluster a dataset quickly and efficiently, usually converging within just a few iterations. Besides being a clustering tool, it can also serve as a way to reduce the dimensionality of features.

    In scikit-learn it is available as sklearn.cluster.KMeans.

        from sklearn.datasets import make_blobs
        import numpy as np

        blob_centers = np.array(
            [[ 0.2,  2.3],
             [-1.5,  2.3],
             [-2.8,  1.8],
             [-2.8,  2.8],
             [-2.8,  1.3]])
        blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
        X, y = make_blobs(n_samples=2000, centers=blob_centers,
                          cluster_std=blob_std, random_state=7)

        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=5)
        kmeans.fit(X)
        y_pred = kmeans.predict(X)
        y_pred
        y_pred is kmeans.labels_
        kmeans.cluster_centers_  # centroid positions

    As shown above, a fitted KMeans model exposes two attributes: labels_ holds the cluster index assigned to each training instance, and cluster_centers_ holds the coordinates of the cluster centroids. (Note that these are attributes, not methods, and predict(X) on the training set returns the same array as labels_.)
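
    For instance (a small addition to these notes), labels_ makes it easy to count how many instances ended up in each cluster:

        # Count how many training instances were assigned to each of the 5 clusters.
        import numpy as np
        print(np.bincount(kmeans.labels_))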

    New samples can now be assigned to clusters:

        x_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
        kmeans.predict(x_new)
        kmeans.transform(x_new)  # distance from each instance to each of the 5 centroids

    kmeans.transform() returns the distance from each input instance to every cluster centroid.
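
    As a quick sanity check (a sketch added to these notes, not part of the original), the values returned by transform() can be reproduced by computing the Euclidean distances to cluster_centers_ by hand:

        # transform() should match hand-computed Euclidean distances from
        # each new instance to each centroid.
        import numpy as np
        dists = np.linalg.norm(x_new[:, np.newaxis, :] - kmeans.cluster_centers_, axis=2)
        print(np.allclose(dists, kmeans.transform(x_new)))  # expected: True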

        good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
        kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)  # init: the initial centroids; n_init: number of runs
        kmeans.fit(X)
        kmeans.inertia_  # within-cluster sum of squared distances
        kmeans.score(X)  # returns the negative inertia

    The init hyperparameter controls how the initial centroids are chosen, and n_init is the number of times the algorithm is run with different initializations (the run with the lowest inertia is kept).

    kmeans.inertia_ is the sum of the squared distances between each instance and its closest centroid, known as the model's inertia; kmeans.score(X) returns the negative inertia.
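
    To make these definitions concrete (a sketch added here, using the kmeans model fitted above), the inertia can be recomputed by hand and compared with score(X):

        # Recompute the inertia as the sum of squared distances from each
        # instance to its closest centroid, then check score(X) == -inertia_.
        import numpy as np
        dists = np.linalg.norm(X[:, np.newaxis, :] - kmeans.cluster_centers_, axis=2)
        inertia_by_hand = np.sum(dists.min(axis=1) ** 2)
        print(np.isclose(inertia_by_hand, kmeans.inertia_))   # expected: True
        print(np.isclose(kmeans.score(X), -kmeans.inertia_))  # expected: True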

    The k-means++ algorithm: it spreads the initial centroids out more widely, which reduces the probability of converging to a suboptimal solution. In scikit-learn it is the default, selected with init='k-means++' (setting init='random' falls back to plain random initialization instead).

        # k-means++ initialization (the default in scikit-learn)
        kmeans_plus = KMeans(n_clusters=5, init='k-means++')
        kmeans_plus.fit(X)
        kmeans_plus.inertia_

    Accelerated k-means: this variant exploits the triangle inequality to skip many distance computations, improving running speed. It can be selected with algorithm='elkan' (algorithm='full' selects the classic algorithm instead).

        # Accelerated k-means (Elkan's algorithm)
        kmeans_add = KMeans(n_clusters=5, algorithm='elkan')
        kmeans_add.fit(X)
        kmeans_add.inertia_

    Mini-batch k-means: instead of using the full dataset at each iteration, this variant moves the centroids slightly at each step using a small random batch of instances. It is implemented by MiniBatchKMeans.

        # Mini-batch k-means
        from sklearn.cluster import MiniBatchKMeans
        minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
        minibatch_kmeans.fit(X)
        minibatch_kmeans.inertia_

    One way to judge whether a clustering is reasonable is to compute the silhouette score of the data, which ranges over [-1, 1]. A score close to 1 means the instances sit well inside their own cluster and far from the other clusters.

        from sklearn.metrics import silhouette_score
        import matplotlib.pyplot as plt

        silhouette_score(X, kmeans.labels_)
        kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)
                        for k in range(1, 10)]
        silhouette_scores = [silhouette_score(X, model.labels_)
                             for model in kmeans_per_k[1:]]  # skip k=1: the score needs at least 2 clusters
        inertias = [model.inertia_ for model in kmeans_per_k]
        # A silhouette score close to 1 means the instance sits well inside its
        # own cluster and far from the others; close to -1 means it was most
        # likely assigned to the wrong cluster.
        plt.figure(figsize=(8, 3))
        plt.plot(range(2, 10), silhouette_scores, "bo-")
        plt.xlabel("$k$", fontsize=14)
        plt.ylabel("Silhouette score", fontsize=14)
        plt.axis([1.8, 8.5, 0.55, 0.7])
        plt.show()

    This plot shows the silhouette score for each value of k.
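
    The inertias list computed above can be plotted the same way to look for an "elbow" (a short sketch added here, reusing the plotting conventions above):

        # Plot inertia versus k; the bend ("elbow") in the curve is another
        # common heuristic for choosing the number of clusters.
        plt.figure(figsize=(8, 3))
        plt.plot(range(1, 10), inertias, "bo-")
        plt.xlabel("$k$", fontsize=14)
        plt.ylabel("Inertia", fontsize=14)
        plt.show()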

    Example: image segmentation with k-means

        # Image segmentation using clustering
        # Download the ladybug image
        import os
        import urllib.request

        images_path = os.path.join(".", "images", "unsupervised_learning")
        os.makedirs(images_path, exist_ok=True)
        DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
        filename = "ladybug.png"
        print("Downloading", filename)
        url = DOWNLOAD_ROOT + "images/unsupervised_learning/" + filename
        urllib.request.urlretrieve(url, os.path.join(images_path, filename))

        from matplotlib.image import imread
        image = imread(os.path.join(images_path, filename))
        X = image.reshape(-1, 3)  # one row per pixel, one column per color channel
        kmeans = KMeans(n_clusters=8).fit(X)
        segmented_img = kmeans.cluster_centers_[kmeans.labels_]  # replace each pixel with its cluster centroid
        segmented_img = segmented_img.reshape(image.shape)

        segmented_imgs = []
        n_colors = (10, 8, 6, 4, 2)
        for n_clusters in n_colors:
            kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
            segmented_img = kmeans.cluster_centers_[kmeans.labels_]
            segmented_imgs.append(segmented_img.reshape(image.shape))

        plt.figure(figsize=(10, 5))
        plt.subplots_adjust(wspace=0.05, hspace=0.1)
        plt.subplot(231)
        plt.imshow(image)
        plt.title("Original image")
        plt.axis('off')
        for idx, n_clusters in enumerate(n_colors):
            plt.subplot(232 + idx)
            plt.imshow(segmented_imgs[idx])
            plt.title("{} colors".format(n_clusters))
            plt.axis('off')
        plt.show()

     Here we download the image, cluster its pixel colors with k-means, and then redraw it with several different numbers of clusters.

    Example 2: using k-means for dimensionality reduction and preprocessing

    Here we preprocess the digits dataset (a small, MNIST-like set of 8×8 images) with k-means.

        from sklearn.datasets import load_digits
        X_digits, y_digits = load_digits(return_X_y=True)

        from sklearn.model_selection import train_test_split
        x_train, x_test, y_train, y_test = train_test_split(X_digits, y_digits)

        from sklearn.linear_model import LogisticRegression
        log_reg = LogisticRegression()
        log_reg.fit(x_train, y_train)
        log_reg.score(x_test, y_test)

        from sklearn.pipeline import Pipeline
        log_kmeans = Pipeline([
            ('kmeans', KMeans(n_clusters=50)),
            ('log_reg', LogisticRegression())
        ])
        log_kmeans.fit(x_train, y_train)

        from sklearn.model_selection import GridSearchCV
        param_grid = dict(kmeans__n_clusters=range(2, 100))
        grid_clf = GridSearchCV(log_kmeans, param_grid, cv=3, verbose=2)
        grid_clf.fit(x_train, y_train)
        grid_clf.best_params_
        grid_clf.score(x_test, y_test)

     Here logistic regression is used as the classifier, and we compare its test accuracy with and without k-means as a preprocessing step; the accuracy improves with it.
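
    To make the comparison explicit (a small addition), the two test accuracies can be printed side by side:

        # Compare test accuracy without and with the k-means preprocessing step.
        print("Logistic regression alone:   ", log_reg.score(x_test, y_test))
        print("KMeans + logistic regression:", log_kmeans.score(x_test, y_test))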

    Example 3: semi-supervised learning with k-means

        # Semi-supervised learning with clustering
        k = 50
        kmeans = KMeans(n_clusters=k)
        x_digist_dist = kmeans.fit_transform(x_train)
        representative_digit_idx = np.argmin(x_digist_dist, axis=0)  # for each of the 50 centroids, the index of the closest image
        x_representative_digists = x_train[representative_digit_idx]
        # These 50 representative images are the ones we would label by hand;
        # here the true labels stand in for that manual labeling.
        y_representative_digists = y_train[representative_digit_idx]
        log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
        log_reg.fit(x_representative_digists, y_representative_digists)
        log_reg.score(x_test, y_test)

        # Label propagation: spread each representative label to its whole cluster
        y_train_propagated = np.empty(len(x_train), dtype=np.int32)
        for i in range(k):
            y_train_propagated[kmeans.labels_ == i] = y_representative_digists[i]
        log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
        log_reg.fit(x_train, y_train_propagated)
        log_reg.score(x_test, y_test)

     Above, 50 representative samples are labeled by hand; after training on them, each label is propagated to every instance in the same cluster. This includes instances near the cluster boundaries, which can lead to mislabeling.

        percentile_closest = 20
        # distance from each training instance to its own cluster's centroid
        x_cluster_dist = x_digist_dist[np.arange(len(x_train)), kmeans.labels_]
        x_cluster_dist
        for i in range(k):
            in_cluster = (kmeans.labels_ == i)
            cluster_dist = x_cluster_dist[in_cluster]
            cutoff_distance = np.percentile(cluster_dist, percentile_closest)
            above_cutoff = (x_cluster_dist > cutoff_distance)
            x_cluster_dist[in_cluster & above_cutoff] = -1  # mark the farthest 80% for removal
        partially_propagated = (x_cluster_dist != -1)
        x_train_partially = x_train[partially_propagated]
        y_train_partially = y_train_propagated[partially_propagated]  # use the propagated labels, not the true ones
        log_reg = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)
        log_reg.fit(x_train_partially, y_train_partially)
        log_reg.score(x_test, y_test)

    The code above keeps only the 20% of instances closest to their cluster centroid, labels them via propagation, and trains on those.
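
    Because the true labels are available in this experiment, we can also measure how accurate the propagated labels are (a short check added to these notes):

        # Fraction of propagated labels that match the true labels, over the
        # instances kept by the 20th-percentile distance cutoff.
        print(np.mean(y_train_partially == y_train[partially_propagated]))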

    The DBSCAN clustering algorithm: it defines clusters as continuous regions of high density. For each instance it counts how many instances lie within a distance eps of it (a circle of radius eps); an instance is a core instance if that neighborhood contains at least min_samples instances. Note that DBSCAN can only assign labels to the data it was fitted on; it cannot predict clusters for new instances.

        # DBSCAN
        from sklearn.cluster import DBSCAN
        from sklearn.datasets import make_moons
        X, y = make_moons(n_samples=1000, noise=0.05)
        dbscan = DBSCAN(eps=0.05, min_samples=5)  # a core instance needs at least 5 instances within a radius of eps=0.05
        dbscan.fit(X)
        dbscan.labels_  # a label of -1 means the algorithm treats the instance as an anomaly
        dbscan.core_sample_indices_  # indices of the core instances
        dbscan.components_  # the core instances themselves

    dbscan.labels_ holds the cluster label assigned to each instance, and dbscan.core_sample_indices_ holds the indices of the core instances.

    dbscan.components_ holds the coordinates of the core instances themselves.
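
    Since DBSCAN has no predict() method, a common workaround (used in Hands-On ML; added here as a sketch) is to train a classifier such as KNeighborsClassifier on the core instances and use it to assign new points to clusters:

        # Train a KNN on the core instances so new points can be mapped to
        # one of the DBSCAN clusters.
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors=50)
        knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])
        x_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
        knn.predict(x_new)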

    Example 4: cluster the Olivetti faces dataset and check whether we end up with a sensible number of clusters.

        from sklearn.datasets import fetch_olivetti_faces
        data = fetch_olivetti_faces()

    Stratified split of the dataset:

        from sklearn.model_selection import StratifiedShuffleSplit
        sss = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)
        train_index, test_index = next(sss.split(data.data, data.target))
        x_train = data.data[train_index]
        y_train = data.target[train_index]
        x_test = data.data[test_index]
        y_test = data.target[test_index]
        sss_val = StratifiedShuffleSplit(n_splits=1, test_size=80, random_state=42)
        train_index, val_index = next(sss_val.split(x_train, y_train))
        x_train_new = x_train[train_index]
        y_train_new = y_train[train_index]
        x_val = x_train[val_index]
        y_val = y_train[val_index]

    Choose the number of clusters that works best:

        from sklearn.cluster import KMeans
        kmeans_per_k = [KMeans(n_clusters=n).fit(x_train) for n in range(1, 200, 5)]
        from sklearn.metrics import silhouette_score
        # Skip the first model (k=1): the silhouette score needs at least 2 clusters.
        silhouette_scores = [silhouette_score(x_train, model.labels_) for model in kmeans_per_k[1:]]
        silhouette_scores
        plt.figure(figsize=(20, 20))
        plt.plot(range(6, 201, 5), silhouette_scores, "bo-")
        plt.xlabel("$k$", fontsize=14)
        plt.ylabel("Silhouette score", fontsize=14)
        plt.show()

        kmeans = KMeans(n_clusters=135)  # k chosen from the silhouette plot above
        kmeans.fit(x_train)
        kmeans.inertia_

    Display the result, for example by plotting a few faces from each cluster (see the sketch below).
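
    A simple way to inspect the clusters (a sketch added here; fetch_olivetti_faces returns each image as a flattened 64×64 grayscale array) is to plot a few faces from each of the first clusters:

        # Show up to 5 faces from each of the first 5 clusters; each image is
        # a flattened 64x64 grayscale array.
        for cluster_id in range(5):
            faces = x_train[kmeans.labels_ == cluster_id][:5]
            plt.figure(figsize=(10, 2))
            for i, face in enumerate(faces):
                plt.subplot(1, 5, i + 1)
                plt.imshow(face.reshape(64, 64), cmap="gray")
                plt.axis("off")
            plt.suptitle("Cluster {}".format(cluster_id))
            plt.show()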

  • Original article: https://blog.csdn.net/lovexyyforever/article/details/126132552