文章详情

短信预约-IT技能 免费直播动态提醒

请输入下面的图形验证码

提交验证

短信预约提醒成功

密度峰值聚类算法DPC(Density Peak Clustering)理论基础与python实现

2023-09-01 19:31

关注

基于密度峰值的聚类算法全称为基于快速搜索和发现密度峰值的聚类算法(clustering by fast search and find of density peaks, DPC)。它是2014年在Science上提出的聚类算法,该算法能够自动地发现簇中心,实现任意形状数据的高效聚类。

密度峰值聚类算法是对K-Means算法的一种改进,回顾K-Means算法,它需要人为指定聚类的簇的个数K,并且需要不断地去迭代更新聚类中心。如果K值指定的不恰当,那么最终得到的结果也将千差万别。此外K-Means算法在迭代过程中容易受到离群点的干扰,对于非簇状的数据效果很差。如下图的聚类结果,几乎不能使用K-Means算法得到。
在这里插入图片描述
密度峰值聚类(DP)算法是一种不需要迭代的,可以一次性找到聚类中心的方法聚类方法。(当时看到这篇文章的时候还是很震惊的,毕竟是发表在顶级刊物Science上的文章)

密度峰值聚类算法有两个基本的假设:

在这里插入图片描述
如上图所示:
点 1 密度最大是一个聚类中心;
点2,6,4密度也比较大,但是距离比他们密度更大的点(点1)太近,所以不是聚类中心;
点10 密度较大,且离密度比它大的点(1,2,4,6)较远是聚类中心;

基于以上两个假设,衍生出两个基本的概念:

密度峰值聚类算法认为两者都大的点就是聚类中心点

DPC算法的执行步骤

在这里插入图片描述

导入需要用到的包

import numpy as npimport matplotlib.pyplot as plt

步骤一:计算数据点两两之间的距离

# 计算数据点两两之间的距离def getDistanceMatrix(datas):    N,D = np.shape(datas)    dists = np.zeros([N,N])        for i in range(N):        for j in range(N):            vi = datas[i,:]            vj = datas[j,:]            dists[i,j]= np.sqrt(np.dot((vi-vj),(vi-vj)))    return dists

步骤二:确定邻域截断距离 dc d_c dc

# 找到密度计算的阈值dc# 要求平均每个点周围距离小于dc的点的数目占总点数的1%-2%def select_dc(dists):        '''算法1'''    N = np.shape(dists)[0]    tt = np.reshape(dists,N*N)    percent = 2.0    position = int(N * (N - 1) * percent / 100)    dc = np.sort(tt)[position  + N]        ''' 算法 2 '''    # N = np.shape(dists)[0]    # max_dis = np.max(dists)    # min_dis = np.min(dists)    # dc = (max_dis + min_dis) / 2       # while True:        # n_neighs = np.where(dists        # rate = n_neighs/(N*(N-1))                # if rate>=0.01 and rate<=0.02:            # break        # if rate<0.01:            # min_dis = dc        # else:            # max_dis = dc                    # dc = (max_dis + min_dis) / 2        # if max_dis - min_dis < 0.0001:            # break    return dc

步骤三:计算每个点的局部密度 ρi ρ_i ρi

# 计算每个点的局部密度    def get_density(dists,dc,method=None):    N = np.shape(dists)[0]    rho = np.zeros(N)        for i in range(N):        if method == None:            rho[i]  = np.where(dists[i,:]<dc)[0].shape[0]-1        else:            rho[i] = np.sum(np.exp(-(dists[i,:]/dc)**2))-1    return rho

步骤四:计算每个点的偏移距离 δi δ_i δi j j j

# 计算每个数据点的密度距离# 即对每个点,找到密度比它大的所有点# 再在这些点中找到距离其最近的点的距离def get_deltas(dists,rho):    N = np.shape(dists)[0]    deltas = np.zeros(N)    nearest_neiber = np.zeros(N)    # 将密度从大到小排序    index_rho = np.argsort(-rho)    for i,index in enumerate(index_rho):        # 对于密度最大的点        if i==0:            continue                    # 对于其他的点        # 找到密度比其大的点的序号            index_higher_rho = index_rho[:i]        # 获取这些点距离当前点的距离,并找最小值        deltas[index] = np.min(dists[index,index_higher_rho])                #保存最近邻点的编号        index_nn = np.argmin(dists[index,index_higher_rho])        nearest_neiber[index] = index_higher_rho[index_nn].astype(int)        deltas[index_rho[0]] = np.max(deltas)       return deltas,nearest_neiber

步骤五:估算聚类中心点

# 通过阈值选取 rho与delta都大的点# 作为聚类中心    def find_centers_auto(rho,deltas):    rho_threshold = (np.min(rho) + np.max(rho))/ 2    delta_threshold  = (np.min(deltas) + np.max(deltas))/ 2    N = np.shape(rho)[0]        centers = []    for i in range(N):        if rho[i]>=rho_threshold and deltas[i]>delta_threshold:            centers.append(i)    return np.array(centers)# 选取 rho与delta乘积较大的点作为# 聚类中心   def find_centers_K(rho,deltas,K):    rho_delta = rho*deltas    centers = np.argsort(-rho_delta)    return centers[:K]

步骤六:对非聚类中心数据点进行归类

def cluster_PD(rho,centers,nearest_neiber):    K = np.shape(centers)[0]    if K == 0:        print("can not find centers")        return        N = np.shape(rho)[0]    labs = -1*np.ones(N).astype(int)        # 首先对几个聚类中进行标号    for i, center in enumerate(centers):        labs[center] = i       # 将密度从大到小排序    index_rho = np.argsort(-rho)    for i, index in enumerate(index_rho):        # 从密度大的点进行标号        if labs[index] == -1:            # 如果没有被标记过            # 那么聚类标号与距离其最近且密度比其大            # 的点的标号相同            labs[index] = labs[int(nearest_neiber[index])]    return labs

可视化展示

def draw_decision(rho,deltas,name="0_decision.jpg"):           plt.cla()    for i in range(np.shape(datas)[0]):        plt.scatter(rho[i],deltas[i],s=16.,color=(0,0,0))        plt.annotate(str(i), xy = (rho[i], deltas[i]),xytext = (rho[i], deltas[i]))        plt.xlabel("rho")        plt.ylabel("deltas")    plt.savefig(name)def draw_cluster(datas,labs,centers, dic_colors, name="0_cluster.jpg"):         plt.cla()    K = np.shape(centers)[0]        for k in range(K):        sub_index = np.where(labs == k)        sub_datas = datas[sub_index]        # 画数据点        plt.scatter(sub_datas[:,0],sub_datas[:,1],s=16.,color=dic_colors[k])        # 画聚类中心        plt.scatter(datas[centers[k],0],datas[centers[k],1],color="k",marker="+",s = 200.)    plt.savefig(name)

主函数入口

if __name__== "__main__":    #画图保存的颜色卡    dic_colors = {0:(.8,0,0),1:(0,.8,0),                  2:(0,0,.8),3:(.8,.8,0),                  4:(.8,0,.8),5:(0,.8,.8),                  6:(0,0,0)}    #读取文件    file_name = "spiral"    with open(file_name+".txt","r",encoding="utf-8") as f:        lines = f.read().splitlines()    lines = [line.split("\t")[:-1] for line in lines]    datas = np.array(lines).astype(np.float32)        # 计算距离矩阵    dists = getDistanceMatrix(datas)    # 计算dc    dc = select_dc(dists)    print("dc",dc)    # 计算局部密度     rho = get_density(dists,dc,method="Gaussion")    # 计算密度距离    deltas, nearest_neiber= get_deltas(dists,rho)      # 绘制密度/距离分布图    draw_decision(rho,deltas,name=file_name+"_decision.jpg")    # 获取聚类中心点    centers = find_centers_K(rho,deltas,3)    # centers = find_centers_auto(rho,deltas)    print("centers",centers)    #聚类    labs = cluster_PD(rho,centers,nearest_neiber)    #可视化展示    draw_cluster(datas,labs,centers, dic_colors, name=file_name+"_cluster.jpg") 

结果展示如下:
在这里插入图片描述
使用到的数据如下,可自行复制(记得文件名与路径需要与程序中的需要一致)

31.957.95331.157.3330.456.65329.76328.95.55328.055327.24.55326.354.15325.43.85324.63.6323.63.3322.753.15321.853.05320.933202.9319.13318.23.2317.33.25316.553.5315.73.7314.854.1314.154.4313.44.75312.75.2312.055.65311.456.15310.96.65310.37.2539.77.8539.358.3538.99.0538.559.6538.1510.3537.9510.9537.7511.737.5512.3537.451337.3513.7537.314.3537.3514.9537.3515.7537.5516.3537.716.9537.817.5538.0518.1538.318.7538.6519.338.919.8539.320.339.6520.8310.221.25310.621.65311.122.15311.5522.45311.9522.7312.5523313.0523.2313.4523.431423.55314.5523.6315.123.75315.723.75316.1523.85316.723.8317.1523.75317.7523.75318.223.6318.6523.5319.123.35319.623.1532022.95320.422.7320.722.5532122.15321.4521.95321.7521.5532221.25322.2521322.520.7322.6520.35322.7520.05322.919.6532319.35323.119323.1518.65323.218.25323.218.05323.217.8323.117.45323.0517.15322.916.9322.8516.6322.716.4322.616.2322.5516.05322.415.95322.3515.8322.215.65322.1515.5532215.4321.915.3321.8515.25321.7515.15321.6515.05321.5515321.514.9319.3531.65120.3531.45121.3531.1122.2530.9123.230.45123.9530.05124.929.65125.629.05126.3528.5127.1527.9127.7527.35128.326.6128.9525.85129.525.15129.9524.45130.423.7130.622.9130.922.1131.2521.3131.3520.55131.519.7131.5518.9131.6518.15131.617.35131.4516.55131.315.8131.1515.05130.914.35130.613.65130.313129.912.3129.511.7512911.15128.510.612810.1127.559.65126.99.1126.258.8125.78.4125.158.05124.57.75123.97.65123.157.4122.57.3121.97.1121.257.05120.57119.96.95119.257.05118.757.1118.057.25117.57.35116.97.6116.357.8115.88.05115.48.35114.98.7114.458.9113.959.3113.69.65113.2510.1112.9510.55112.6510.9112.3511.4112.211.75111.9512.2111.812.65111.7513.05111.5513.6111.5514111.5514.35111.5514.7111.615.25111.6515.7111.816.05111.8516.511216.75112.1517.2112.317.6112.5517.85112.818.05113.118.4113.318.6113.5518.85113.819.05114.1519.25114.4519.5114.8519.5511519.7115.2519.7115.5519.85115.9519.9116.219.9116.5519.9116.8519.9117.219.9117.419.8117.6519.75117.819.711819.6118.219.5513.99.623.5510.6523.3511.423.112.3523.113.2523.0514.152315.123.11623.216.8523.4517.7523.718.723.9519.5524.3520.2524.721.125.1521.825.622.526.223.326.823.8527.3524.4528.0524.9528.825.4529.526210.226.35210.926.75211.727212.4527.25213.327.6214.0527.6214.727.75215.5527.75216.427.75217.127.75217.927.75218.5527.7219.3527.6220.127.35220.727.1221.4526.8222.0526.5222.726.15223.3525.65223.825.3224.324.85224.7524.35225.2523.95225.6523.45226.0523226.222.3226.621.8226.7521.2522720.7227.1520.15227.1519.6227.3519.1227.3518.45227.418227.317.4227.1516.922716.422715.9226.7515.35226.5514.85226.314.45225.9514.1225.7513.7225.3513.3225.0512.95224.812.7224.412.45224.0512.2223.5511.85223.211.65222.7511.4222.311.3221.911.1221.4511.05221.111220.710.95220.3510.95219.9511219.5511219.1511.05218.8511.1218.4511.25218.1511.35217.8511.5217.511.7217.211.9521712.05216.7512.2216.6512.35216.512.5216.3512.7216.212.8216.1512.9521613.1215.9513.25215.913.4215.813.5215.813.65215.7513.85215.6514.05215.6514.25215.6514.5215.6514.62

来源地址:https://blog.csdn.net/qq_37055672/article/details/130000567

阅读原文内容投诉

免责声明:

① 本站未注明“稿件来源”的信息均来自网络整理。其文字、图片和音视频稿件的所属权归原作者所有。本站收集整理出于非商业性的教育和科研之目的,并不意味着本站赞同其观点或证实其内容的真实性。仅作为临时的测试数据,供内部测试之用。本站并未授权任何人以任何方式主动获取本站任何信息。

② 本站未注明“稿件来源”的临时测试数据将在测试完成后最终做删除处理。有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341

软考中级精品资料免费领

  • 历年真题答案解析
  • 备考技巧名师总结
  • 高频考点精准押题
  • 2024年上半年信息系统项目管理师第二批次真题及答案解析(完整版)

    难度     813人已做
    查看
  • 【考后总结】2024年5月26日信息系统项目管理师第2批次考情分析

    难度     354人已做
    查看
  • 【考后总结】2024年5月25日信息系统项目管理师第1批次考情分析

    难度     318人已做
    查看
  • 2024年上半年软考高项第一、二批次真题考点汇总(完整版)

    难度     435人已做
    查看
  • 2024年上半年系统架构设计师考试综合知识真题

    难度     224人已做
    查看

相关文章

发现更多好内容

猜你喜欢

AI推送时光机
位置:首页-资讯-后端开发
咦!没有更多了?去看看其它编程学习网 内容吧
首页课程
资料下载
问答资讯