t-SNE Explained: Math and Intuition

t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction method, used mainly for visualizing data in 2D and 3D maps. It can capture non-linear relationships in the data, which makes it highly popular. In this post, I’ll give an intuitive explanation of how t-SNE works and then describe the math behind it.

See your data in a lower dimension


So when and why would you want to visualize your data in a low dimension? When working on data with more than 2–3 features you might want to check if your data has clusters in it. This information can help you understand your data and, if needed, choose the number of clusters for clustering models such as k-means.


Now let’s look at a short example that will help understand what we want to get. Let’s say we have data in a 2D space and we want to reduce its dimension into 1D. Here’s an example of data in 2D:


In this example, each color represents a cluster. We can see that each cluster has a different density. We will see how the model deals with that in the dimensional reduction process.


Now, if we try to simply project the data onto just one of its dimensions, we see an overlap of at least two of the clusters:


Figure 2: Data projections to one dimension

So we understand that we need to find a better way to do this dimensionality reduction.

The t-SNE algorithm deals with this problem, and I’ll explain how it works in three stages:

  • Calculating a joint probability distribution that represents the similarities between the data points (don’t worry, I’ll explain that soon!).

  • Creating a dataset of points in the target dimension and then calculating the joint probability distribution for them as well.

  • Using gradient descent to change the dataset in the low-dimensional space so that the joint probability distribution representing it would be as similar as possible to the one in the high dimension.


The Algorithm

First Stage — Dear points, how likely are you to be my neighbors?

The first stage of the algorithm is calculating the Euclidean distances of each point from all of the other points. These distances are then transformed into conditional probabilities that represent the similarity between every two points. What does that mean? It means that we want to evaluate how similar every two points in the data are, or in other words, how likely they are to be neighbors.

The conditional probability of point xⱼ being next to point xᵢ is represented by a Gaussian centered at xᵢ with a standard deviation of σᵢ (I’ll mention later what influences σᵢ). It is written mathematically in the following way:

p(j|i) = exp(−‖xᵢ − xⱼ‖² / 2σᵢ²) / Σ_{k≠i} exp(−‖xᵢ − xₖ‖² / 2σᵢ²)

The probability of point xᵢ to have xⱼ as its neighbor

The reason for dividing by the sum of the affinities of all the other points to xᵢ is that we may need to deal with clusters of different densities. To explain that, let’s go back to the example in Figure 1. As you can see, the density of the orange cluster is lower than the density of the blue cluster. Therefore, if we computed the similarity of every two points with a Gaussian alone, we would see lower similarities between the orange points than between the blue ones. In our final output we won’t mind that some clusters had different densities; we just want to see them as clusters, and that is why we do this normalization.
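This first stage can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the optimized routines real t-SNE libraries use, and `conditional_probs` is a name I chose for the sketch:

```python
import math

def conditional_probs(X, i, sigma_i):
    """p(j|i): how likely point i is to pick each point j as its neighbor,
    normalized over all j != i (so the result sums to 1)."""
    n = len(X)
    # Squared Euclidean distances from x_i to every point.
    d2 = [sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
    # Unnormalized Gaussian affinities; a point is never its own neighbor.
    w = [math.exp(-d2[j] / (2 * sigma_i ** 2)) if j != i else 0.0
         for j in range(n)]
    total = sum(w)
    return [wj / total for wj in w]
```

Note that nearby points get larger probabilities, and the normalization makes each point’s probabilities sum to 1 regardless of how dense its cluster is.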

From the conditional distributions created we calculate the joint probability distribution, using the following equation:


pᵢⱼ = (p(j|i) + p(i|j)) / 2n

Using the joint probability distribution rather than the conditional probability is one of t-SNE’s improvements over the original SNE method. The symmetric property of the pairwise similarities (pᵢⱼ = pⱼᵢ) helps simplify the calculation at the third stage of the algorithm.
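The symmetrization step can be sketched as a small helper over the matrix of conditional probabilities (again plain-Python illustration; `joint_probs` is a name of my choosing):

```python
def joint_probs(P_cond):
    """Symmetrize conditional probabilities: p_ij = (p(j|i) + p(i|j)) / (2n).
    If each row of P_cond sums to 1, the resulting matrix sums to 1 overall."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2 * n)
             for j in range(n)] for i in range(n)]
```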

Second Stage — Creating data in a low dimension

In this stage, we create a dataset of points in a low-dimensional space and calculate a joint probability distribution for them as well.


To do that, we build a random dataset of points with the same number of points as we had in the original dataset, and K features, where K is our target dimension. Usually, K will be 2 or 3 if we want to use the dimension reduction for visualization. If we go back to our example, at this stage the algorithm builds a random dataset of points in 1D:


Figure 3: A random set of points in 1D
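The random initialization can be as simple as drawing each coordinate from a small Gaussian around the origin. The scale below is my own choice for illustration; the original paper initializes from a Gaussian with a very small variance:

```python
import random

def init_lowdim(n, k=1, scale=0.01):
    """n random points with k coordinates each, drawn near the origin."""
    return [[random.gauss(0.0, scale) for _ in range(k)] for _ in range(n)]
```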

For this set of points, we will create their joint probability distribution, but this time using the t-distribution rather than the Gaussian distribution we used for the original dataset. This is another advantage of t-SNE over the original SNE (the t in t-SNE stands for t-distribution), which I will explain shortly. We will mark the probabilities here by q, and the points by y.

qᵢⱼ = (1 + ‖yᵢ − yⱼ‖²)⁻¹ / Σ_{k≠l} (1 + ‖yₖ − yₗ‖²)⁻¹

The reason for choosing the t-distribution rather than the Gaussian distribution is the heavy-tails property of the t-distribution. This property causes moderate distances between points in the high-dimensional space to become extreme in the low-dimensional space, which helps prevent “crowding” of the points in the lower dimension. Another advantage of using the t-distribution is an improvement in the optimization process in the third part of the algorithm.
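In the same spirit as the first-stage sketch, the low-dimensional affinities can be computed with the Student t-kernel with one degree of freedom (plain-Python illustration; `joint_probs_lowdim` is my own name for it):

```python
def joint_probs_lowdim(Y):
    """q_ij from a Student t-distribution with one degree of freedom,
    normalized over all pairs k != l so the matrix sums to 1."""
    n = len(Y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                # Heavy-tailed kernel: decays much slower than a Gaussian.
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in w)
    return [[w[i][j] / total for j in range(n)] for i in range(n)]
```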

Third Stage — Let the magic happen!

Or in other words, change your dataset in the low-dimensional space so it will best visualize your data


Now we use the Kullback–Leibler divergence to make the joint probability distribution of the data points in the low dimension as similar as possible to the one from the original dataset. If this transformation succeeds, we get a good dimensionality reduction.

I’ll briefly explain what the Kullback–Leibler divergence (KL divergence) is. KL divergence is a measure of how different two distributions are from one another. For distributions P and Q over the probability space χ, the KL divergence is defined by:

D(P‖Q) = Σ_{x∈χ} P(x) log(P(x)/Q(x))

The definition of KL divergence between the probability distributions P and Q

The more similar the two distributions are, the smaller the value of the KL divergence, reaching zero when the distributions are identical.
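For discrete distributions, this definition is a direct one-line transcription of the formula:

```python
import math

def kl_divergence(P, Q):
    """KL(P || Q) for two discrete distributions given as equal-length lists.
    Terms with p = 0 contribute nothing, by the usual convention."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
```

It is zero for identical distributions and strictly positive otherwise; note that it is not symmetric in P and Q.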

Back to our algorithm — we try to change the lower-dimensional dataset so that its joint probability distribution is as similar as possible to that of the original data. This is done using gradient descent. The cost function the gradient descent tries to minimize is the KL divergence of the joint probability distribution P from the high-dimensional space and Q from the low-dimensional space.

C = KL(P‖Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

The cost function for the gradient descent is the KL divergence between P and Q, the joint probability distributions of the high and low dimensions respectively

From this optimization, we get the values of the points in the low-dimensional dataset and use them for our visualization. In our example, we see the clusters in the low-dimensional space as follows:
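To make the third stage concrete, here is one gradient-descent step for a 1-D embedding in plain Python. The gradient of the KL cost with respect to each point is 4·Σⱼ (pᵢⱼ − qᵢⱼ)(yᵢ − yⱼ)(1 + (yᵢ − yⱼ)²)⁻¹, matching the original paper; the bare loop without momentum and the learning-rate value are my simplifications:

```python
def tsne_step(P, Y, lr=10.0):
    """One gradient-descent step on KL(P || Q) for a 1-D embedding Y."""
    n = len(Y)
    # Low-dimensional affinities with the t-kernel; the same (1 + d^2)^-1
    # factor also appears in the gradient below.
    w = [[1.0 / (1.0 + (Y[i] - Y[j]) ** 2) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    Q = [[w[i][j] / total for j in range(n)] for i in range(n)]
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + (y_i - y_j)^2)^-1
    grads = [4.0 * sum((P[i][j] - Q[i][j]) * (Y[i] - Y[j]) * w[i][j]
                       for j in range(n) if j != i)
             for i in range(n)]
    return [Y[i] - lr * grads[i] for i in range(n)]
```

Pairs with pᵢⱼ > qᵢⱼ attract each other and pairs with pᵢⱼ < qᵢⱼ repel, so repeating this step pulls the embedding’s distribution toward P.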

Figure 4: The clusters in the low-dimensional space

Parameters in the model

There are several parameters in this model that you can adjust to your needs. Some of them relate to the process of gradient descent, where the most important ones are the learning rate and the number of iterations. If you are not familiar with gradient descent I recommend going through its explanation for better understanding.


Another parameter in t-SNE is the perplexity. It is used for choosing the standard deviation σᵢ of the Gaussian representing the conditional distribution in the high-dimensional space. I will not elaborate on the math behind it, but it can be interpreted as the number of effective neighbors each point has. The model is fairly robust for perplexities between 5 and 50, but changes in perplexity can still noticeably affect t-SNE results.
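The link between perplexity and σᵢ can be made concrete: for each point, t-SNE searches for the σᵢ whose conditional distribution has the requested perplexity, defined as 2 raised to the Shannon entropy of that distribution. A plain-Python binary-search sketch (the search bounds and tolerance are my own choices):

```python
import math

def sigma_for_perplexity(d2, target_perplexity, tol=1e-5):
    """Binary-search the sigma whose Gaussian conditional distribution over
    the given squared distances d2 has the requested perplexity (2 ** entropy).
    Larger sigma -> flatter distribution -> higher perplexity."""
    lo, hi = 1e-10, 1e4
    sigma = (lo + hi) / 2
    for _ in range(100):
        sigma = (lo + hi) / 2
        w = [math.exp(-d / (2 * sigma ** 2)) for d in d2]
        total = sum(w)
        p = [x / total for x in w]
        entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
        perp = 2 ** entropy
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma   # distribution too flat: shrink sigma
        else:
            lo = sigma   # distribution too peaked: grow sigma
    return sigma
```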

Conclusion

That’s it! I hope this post helped you better understand the algorithm behind t-SNE and will help you use it effectively. For more details on the math of the method, I recommend reading the original t-SNE paper. Thank you for reading :)

Translated from: https://medium.com/swlh/t-sne-explained-math-and-intuition-94599ab164cf


