Clustering Using Convex Hulls

I recently came across the article High-dimensional data clustering by using local affine/convex hulls by Hakan Cevikalp in Pattern Recognition Letters. It proposes a novel algorithm for clustering high-dimensional data using local affine/convex hulls. Inspired by the idea of using convex hulls for clustering, I wanted to try implementing a simple clustering approach of my own based on convex hulls. In this article, I will walk you through that implementation. Before we get into the code, let’s see what a convex hull is.

Convex Hull

According to Wikipedia, a convex hull is defined as follows.

In geometry, the convex hull or convex envelope or convex closure of a shape is the smallest convex set that contains it.

A simple analogy helps here. Assume that a few nails are hammered half-way into a plank of wood, as shown in Figure 1. You take a rubber band, stretch it so that it encloses all the nails, and let it go. It will snap tight around the outermost nails (shown in blue) and take the shape that minimizes its length. The area enclosed by the rubber band is called the convex hull of the set of nails.

In 2-dimensional space, this convex hull (shown in Figure 1) is a convex polygon, where all interior angles are less than 180°. In 3-dimensional space the convex hull is a convex polyhedron, and in higher-dimensional spaces it is a convex polytope.

There are several algorithms that can determine the convex hull of a given set of points. Some famous algorithms are the gift wrapping algorithm and the Graham scan algorithm.
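
To make the idea concrete, here is a minimal sketch of the gift wrapping (Jarvis march) approach for 2D points. The helper names cross, dist2 and gift_wrapping are my own, chosen for illustration, and the sketch assumes the points are given as (x, y) tuples.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); > 0 means o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def dist2(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def gift_wrapping(points):
    """Return the hull vertices of 2D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    hull = []
    start = pts[0]  # the leftmost (then lowest) point is always on the hull
    current = start
    while True:
        hull.append(current)
        # Pick any other point as the first candidate for the next vertex
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p == current:
                continue
            turn = cross(current, candidate, p)
            # p lies to the right of current -> candidate, so it wraps tighter.
            # Among collinear points, keep the farthest one.
            if turn < 0 or (turn == 0 and dist2(current, p) > dist2(current, candidate)):
                candidate = p
        current = candidate
        if current == start:
            break
    return hull


nails = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
print(gift_wrapping(nails))
```

Each step scans every point to find the next hull vertex, so the running time grows with the number of points times the number of hull vertices. For real data I would rely on a library routine such as scipy.spatial.ConvexHull, which is what the rest of this article uses.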

Since a convex hull encloses a set of points, it can act as a cluster boundary, allowing us to determine points within a cluster. Hence, we can make use of convex hulls and perform clustering. Let’s get into the code.
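
As a small illustration of the "cluster boundary" idea, the sketch below checks whether a point falls inside a hull. It relies on the equations attribute of scipy.spatial.ConvexHull, where each row holds a facet normal and an offset, and points inside the hull satisfy normal · x + offset <= 0 for every facet. The function name in_hull, the tolerance and the unit-cube toy data are my own assumptions.

```python
import numpy as np
from scipy.spatial import ConvexHull


def in_hull(point, hull, tol=1e-12):
    """True if `point` satisfies every facet inequality A @ x + b <= 0."""
    A = hull.equations[:, :-1]  # outward facet normals
    b = hull.equations[:, -1]   # facet offsets
    return bool(np.all(A @ np.asarray(point, dtype=float) + b <= tol))


# Corners of the unit cube as a toy "cluster"
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)
cube_hull = ConvexHull(cube)

print(in_hull([0.5, 0.5, 0.5], cube_hull))  # centre of the cube
print(in_hull([1.5, 0.5, 0.5], cube_hull))  # a point outside the cube
```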

A Simple Example

I will be using Python for this example. Before getting started, we need the following Python libraries.

sklearn
numpy
matplotlib
mpl_toolkits
itertools
scipy
quadprog

Dataset

To create our sample dataset, I will use the make_blobs function from the scikit-learn library. I will make 3 clusters.

import numpy as np
from sklearn.datasets import make_blobs

centers = [[0, 1, 0], [1.5, 1.5, 1], [1, 1, 1]]
stds = [0.13, 0.12, 0.12]

X, labels_true = make_blobs(n_samples=1000, centers=centers, cluster_std=stds, random_state=0)
point_indices = np.arange(1000)

Since this is a dataset of 3-dimensional points, I will draw a 3D plot to show our ground truth clusters. Figure 2 shows the scatter plot of the dataset with coloured clusters.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

x = X[:,0]
y = X[:,1]
z = X[:,2]

# Creating figure
fig = plt.figure(figsize = (15, 10))
ax = plt.axes(projection ="3d")

# Add gridlines (passed positionally, since the keyword was `b` in older matplotlib and `visible` in newer releases)
ax.grid(True, color ='grey',
        linestyle ='-.', linewidth = 0.3,
        alpha = 0.2)

# Creating color map
mycolours = ["red", "green", "blue"]
col = [mycolours[i] for i in labels_true]

# Creating plot
sctt = ax.scatter3D(x, y, z, c = col, marker ='o')

plt.title("3D scatter plot of the data\n")
ax.set_xlabel('X-axis', fontweight ='bold')
ax.set_ylabel('Y-axis', fontweight ='bold')
ax.set_zlabel('Z-axis', fontweight ='bold')

# show plot
plt.show()
Fig 2. Initial scatter plot of the dataset

Obtaining an Initial Clustering

First, we need to break our dataset into 2 parts. One part will be used as seeds to obtain an initial clustering using K-means. The points in the other part will be assigned to clusters based on the initial clustering.

from sklearn.model_selection import train_test_split

X_seeds, X_rest, y_seeds, y_rest, id_seeds, id_rest = train_test_split(X, labels_true, point_indices, test_size=0.33, random_state=42)

Now we perform K-means clustering on the seed points.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=9).fit(X_seeds)
initial_result = kmeans.labels_

Since K-means numbers its clusters arbitrarily, the resulting labels may not match the ground truth labels, so we have to map one set of labels onto the other. For this, we can use the following function.

from itertools import permutations

# Source: https://stackoverflow.com/questions/11683785/how-can-i-match-up-cluster-labels-to-my-ground-truth-labels-in-matlab
def remap_labels(pred_labels, true_labels):
    pred_labels, true_labels = np.array(pred_labels), np.array(true_labels)
    assert pred_labels.ndim == 1 == true_labels.ndim
    assert len(pred_labels) == len(true_labels)
    cluster_names = np.unique(pred_labels)
    accuracy = 0

    perms = np.array(list(permutations(np.unique(true_labels))))

    remapped_labels = true_labels
    for perm in perms:
        flipped_labels = np.zeros(len(true_labels))
        for label_index, label in enumerate(cluster_names):
            flipped_labels[pred_labels == label] = perm[label_index]

        testAcc = np.sum(flipped_labels == true_labels) / len(true_labels)
        if testAcc > accuracy:
            accuracy = testAcc
            remapped_labels = flipped_labels

    return accuracy, remapped_labels

We can get the accuracy and the mapped initial labels from the above function.

initial_accuracy, remapped_initial_result = remap_labels(initial_result, y_seeds)
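
The remap_labels function tries every permutation of the labels, which is fine for 3 clusters but grows factorially with the number of clusters. As a possible alternative (not what this article uses), the sketch below matches labels with the Hungarian algorithm via scipy.optimize.linear_sum_assignment. The name remap_labels_hungarian is my own, and the sketch assumes both label sets contain the same number of distinct labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def remap_labels_hungarian(pred_labels, true_labels):
    """Map predicted labels onto true labels so that total agreement is maximal."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    pred_names = np.unique(pred_labels)
    true_names = np.unique(true_labels)

    # overlap[i, j] = number of points with predicted label i and true label j
    overlap = np.array([[np.sum((pred_labels == p) & (true_labels == t))
                         for t in true_names] for p in pred_names])

    # linear_sum_assignment minimises cost, so negate to maximise the overlap
    rows, cols = linear_sum_assignment(-overlap)

    remapped = np.empty_like(true_labels)
    for r, c in zip(rows, cols):
        remapped[pred_labels == pred_names[r]] = true_names[c]
    accuracy = overlap[rows, cols].sum() / len(true_labels)
    return accuracy, remapped


toy_pred = [2, 2, 0, 0, 1, 1]
toy_true = [0, 0, 1, 1, 2, 2]
toy_accuracy, toy_remapped = remap_labels_hungarian(toy_pred, toy_true)
print(toy_accuracy, toy_remapped)
```

Unlike remap_labels, this variant returns labels with the same dtype as the ground truth labels rather than floats.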

Figure 3 shows the initial clustering of the seed points.

Fig 3. Initial clustering of the seed points using K-means

Get Convex Hulls of the Initial Clustering

Once we have obtained an initial clustering, we can get the convex hull of each cluster. First, we have to get the indices of the data points in each cluster.

# Get the indices of the data points belonging to each cluster
indices = {}

for i in range(len(id_seeds)):
    if int(remapped_initial_result[i]) not in indices:
        indices[int(remapped_initial_result[i])] = [i]
    else:
        indices[int(remapped_initial_result[i])].append(i)
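
The same grouping can be written more compactly with NumPy. This is only a sketch: group_indices and the toy labels are names I made up for illustration. The labels are cast to int because remap_labels builds its result with np.zeros, so the remapped labels are floats.

```python
import numpy as np


def group_indices(labels):
    """Map each cluster label to the array of row positions carrying it."""
    labels = np.asarray(labels).astype(int)
    return {int(c): np.flatnonzero(labels == c) for c in np.unique(labels)}


toy_labels = np.array([0., 2., 1., 0., 2.])  # float labels, as remap_labels returns
groups = group_indices(toy_labels)
print(groups)
```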

Now we can obtain the convex hulls from each cluster.

from scipy.spatial import ConvexHull

# Get convex hulls for each cluster
hulls = {}

for i in indices:
    hull = ConvexHull(X_seeds[indices[i]])
    hulls[i] = hull
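
To get a feel for what a ConvexHull object holds, here is a small sketch on toy data of my own (the corners of the unit cube plus some random interior points, not the dataset from this article). It prints three attributes: vertices, volume and equations.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)

# Toy cluster: the 8 corners of the unit cube plus 50 random interior points
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)
interior = rng.uniform(0.1, 0.9, size=(50, 3))
toy_cluster = np.vstack([corners, interior])

toy_hull = ConvexHull(toy_cluster)
print(sorted(toy_hull.vertices.tolist()))  # row indices of the points lying on the hull
print(toy_hull.volume)                     # volume enclosed by the hull
print(toy_hull.equations.shape)            # (number of facets, number of dimensions + 1)
```

Only the corner points end up as hull vertices; the interior points do not affect the hull at all. The equations attribute is what the projection step later in this article works with.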

Figure 4 shows the convex hulls representing each of the 3 clusters.

Fig 4. Convex hulls of each cluster

Assign Remaining Points to the Cluster of the Closest Convex Hull

Now that we have the convex hulls of the initial clusters, we can assign the remaining points to the cluster of the closest convex hull. First, we have to find the projection of a data point onto a convex hull. To do so, we can use the following function.

from quadprog import solve_qp

# Source: https://stackoverflow.com/questions/42248202/find-the-projection-of-a-point-on-the-convex-hull-with-scipy
def proj2hull(z, equations):
    G = np.eye(len(z), dtype=float)
    a = np.array(z, dtype=float)
    C = np.array(-equations[:, :-1], dtype=float)
    b = np.array(equations[:, -1], dtype=float)
    x, f, xu, itr, lag, act = solve_qp(G, a, C.T, b, meq=0, factorized=True)
    return x

The problem of finding the projection of a point on a convex hull can be solved using quadratic programming. The above function makes use of the quadprog module. You can install the quadprog module using conda or pip.

conda install -c omnia quadprog
OR
pip install quadprog

I won’t go into details about how to solve this problem using quadratic programming. If you are interested, you can read more from here and here.
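
For orientation only, the problem that proj2hull hands to the solver can be written down briefly. The notation is my own: z is the query point, and the rows of hull.equations are split into facet normals A and offsets c, so the hull is the set of points x with Ax + c <= 0.

```latex
% Assumed notation: z is the query point; each row [A_i, c_i] of hull.equations
% describes one facet, so the hull is the set of x with A x + c <= 0.
\begin{aligned}
p &= \arg\min_{x}\ \tfrac{1}{2}\lVert x - z\rVert^{2}
     \quad \text{subject to} \quad A x + c \le 0 \\
  &= \arg\min_{x}\ \tfrac{1}{2}\,x^{\top} I\, x - z^{\top} x
     \quad \text{subject to} \quad (-A)\,x \ge c
\end{aligned}
```

The second line drops the constant term (1/2) zᵀz, which does not change the minimiser. quadprog's solve_qp minimises (1/2) xᵀGx − aᵀx subject to Cᵀx >= b, so this corresponds to G = I, a = z, Cᵀ = −A and b = c, which are the arguments built inside proj2hull.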

Fig 5. The distance from a point to its projection on to a convex hull

Once you have obtained the projection on the convex hull, you can calculate the distance from the point to the convex hull as shown in Figure 5. Based on this distance, now let’s assign the remaining data points to the cluster of the closest convex hull.

I will consider the Euclidean distance from the data point to its projection on the convex hull. Then the data point will be assigned to the cluster with the convex hull having the shortest distance from that data point. If a point lies within the convex hull, then the distance will be 0.

prediction = []

for z1 in X_rest:
    min_cluster_distance = 100000
    min_distance_point = ""
    min_cluster_distance_hull = ""

    for i in indices:
        p = proj2hull(z1, hulls[i].equations)
        dist = np.linalg.norm(z1 - p)
        if dist < min_cluster_distance:
            min_cluster_distance = dist
            min_distance_point = p
            min_cluster_distance_hull = i

    prediction.append(min_cluster_distance_hull)

prediction = np.array(prediction)
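
If you want to sanity-check the distance rule without installing quadprog, the sketch below solves the same projection problem with SciPy's general-purpose SLSQP optimiser. This is a swapped-in technique, not the proj2hull function used above, and the name distance_to_hull and the unit-square toy hull are my own. It illustrates that a point inside the hull gets distance 0, while a point outside gets the distance to the nearest point of the hull.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import ConvexHull


def distance_to_hull(z, equations):
    """Euclidean distance from z to the hull described by `equations`."""
    z = np.asarray(z, dtype=float)
    A, c = equations[:, :-1], equations[:, -1]
    # SLSQP expects inequality constraints in the form fun(x) >= 0
    constraints = {"type": "ineq", "fun": lambda x: -(A @ x + c), "jac": lambda x: -A}
    result = minimize(lambda x: 0.5 * np.sum((x - z) ** 2), x0=z,
                      jac=lambda x: x - z, constraints=constraints, method="SLSQP")
    return float(np.linalg.norm(result.x - z))


# Toy hull: the unit square in 2D
square = ConvexHull(np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float))

inside = distance_to_hull([0.5, 0.5], square.equations)   # point inside the square
outside = distance_to_hull([2.0, 0.5], square.equations)  # point to the right of the square
print(inside, outside)
```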

Figure 6 shows the final clustering result.

Fig 6. Final result with convex hulls

Evaluate the Final Result

Let’s evaluate our result to see how accurate it is.

from sklearn.metrics import accuracy_score

Y_pred = np.concatenate((remapped_initial_result, prediction))
Y_real = np.concatenate((y_seeds, y_rest))
print(accuracy_score(Y_real, Y_pred))
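
accuracy_score reports a single number. If you also want to see which clusters get confused with each other, a small confusion matrix helps. The sketch below uses plain NumPy on toy labels of my own, and the function name confusion_matrix_counts is hypothetical. It assumes the labels are integers from 0 to n_clusters - 1.

```python
import numpy as np


def confusion_matrix_counts(y_true, y_pred, n_clusters):
    """counts[i, j] = number of points with true label i and predicted label j."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    counts = np.zeros((n_clusters, n_clusters), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)  # accumulate one count per point
    return counts


toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
counts = confusion_matrix_counts(toy_true, toy_pred, n_clusters=3)
accuracy = np.trace(counts) / counts.sum()  # correctly assigned points sit on the diagonal
print(counts)
print(accuracy)
```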

I got an accuracy of 1.0 (100%)! Awesome and exciting right? 😊

If you want to know more about evaluating clustering results, you can check out my previous article Evaluating Clustering Results.

I have used a very simple dataset. You can try this method with more complex datasets and see what happens.

High-dimensional Data

I also tried clustering a dataset of 8-dimensional data points using my convex hull method. You can find the Jupyter notebook showing the code and results. The final results are as follows.

Accuracy of K-means method: 0.866
Accuracy of Convex Hull method: 0.867

There is a slight improvement in my convex hull method over K-means.

Final Thoughts

The article High-dimensional data clustering by using local affine/convex hulls by Hakan Cevikalp shows that the convex hull-based method it proposes avoids the “hole artefacts” problem (sparse and irregular distributions in high-dimensional spaces can make nearest-neighbour distances unreliable) and improves accuracy on high-dimensional datasets over other state-of-the-art subspace clustering methods.

You can find the Jupyter notebook containing the code used for this article.

Hope this article was interesting and useful.

Cheers! 😃

Translated from: https://towardsdatascience.com/clustering-using-convex-hulls-fddafeaa963c
