余弦相似度和欧氏距离_欧氏距离和余弦相似度

余弦相似度和欧氏距离

Image for post
Photo by Markus Winkler on Unsplash
Markus Winkler在Unsplash上拍摄的照片

This is a quick and straight to the point introduction to Euclidean distance and cosine similarity with a focus on NLP.

这是对欧氏距离和余弦相似度的快速而直接的介绍,重点是NLP。

欧氏距离 (Euclidean Distance)

The Euclidean distance metric allows you to identify how far two points or two vectors are apart from each other.

欧几里德距离度量标准可让您确定两个点或两个向量彼此相距多远。

Now suppose you are a high school student and you have three classes. A math class, a philosophy class, and a psychology class. You want to check the similarity between these classes based on the words your professors use in class. For the sake of simplicity, let’s consider these two words: “theory” and “harmony”. You could then create a table like this to record the occurrence of these words in each class:

现在假设您是一名高中生,您有3个班级。 数学课,哲学课和心理学课。 您想根据您的教授在课堂上使用的单词来检查这些课程之间的相似性。 为了简单起见,让我们考虑以下两个词:“理论”和“和谐”。 然后,您可以创建一个像这样的表来记录每个类中这些单词的出现情况:

Image for post

In this table, the word “theory” is repeated 60 times in math class, 20 times in philosophy class, and 25 times in psychology class whereas the word harmony is repeated 10, 40, and 70 times in math, philosophy, and psychology classes respectively. Let’s translate this data into a 2D plane.

在此表中,“理论”一词在数学课中重复了60次,在哲学课中重复了20次,在心理学课中重复了25次,而在数学,哲学和心理学课中,“和谐”一词重复了10、40和70次分别。 让我们将此数据转换为2D平面。

Word vectors in 2D plane

The Euclidean distance is simply the distance between the points. In the graph below.

欧几里得距离就是点之间的距离。 在下图中。

Image for post

You can see clearly that d1 which is the distance between psychology and philosophy is smaller than d2 which is the distance between philosophy and math. But how do you calculate d1 and d2?

您可以清楚地看到,心理学与哲学之间的距离d1小于哲学与数学之间的距离d2。 但是,如何计算d1和d2?

The generic formula is the following.

通用公式如下。

Image for post

In our case, for d1, d(v, w) = d(philosophy, psychology)`, which is:

在我们的情况下,对于d1, d(v, w) = d(philosophy, psychology) `,即:

Image for post

And d2

和d2

Image for post

As expected d2 > d1.

如预期的那样,d2> d1。

How to do this in python?

如何在python中做到这一点?

import numpy as np# define the vectorsmath = np.array([60, 10])philosophy = np.array([20, 40])psychology = np.array([25, 70])# calculate d1d1 = np.linalg.norm(philosophy - psychology)# calculate d2d2 = np.linalg.norm(philosophy - math)

余弦相似度 (Cosine Similarity)

Suppose you only have 2 hours of psychology class per week and 5 hours of both math class and philosophy class. Because you attend more of these two classes, the occurrence of the words “theory” and “harmony” will be greater than for the psychology class. Thus the updated table:

假设您每周只有2个小时的心理学课,而数学课和哲学课则只有5个小时。 由于您参加这两个课程中的更多课程,因此“理论”和“和谐”一词的出现将比心理学课程中的要大。 因此,更新后的表:

Image for post

And the updated 2D graph:

以及更新后的2D图形:

Image for post

Using the formula we’ve given earlier for Euclidean distance, we will find that, in this case, d1 is greater than d2. But we know psychology is closer to philosophy than it is to math. The frequency of the courses, trick the Euclidean distance metric. Cosine similarity is here to solve this problem.

使用我们先前给出的欧几里得距离公式,我们会发现,在这种情况下,d1大于d2。 但是我们知道心理学比数学更接近于哲学。 课程的频率欺骗欧几里德距离度量标准。 余弦相似度在这里解决了这个问题。

Instead of calculating the straight line distance between the points, cosine similarity cares about the angle between the vectors.

余弦相似度关心的是矢量之间的角度,而不是计算点之间的直线距离。

Image for post

Zooming in on the graph, we can see that the angle α, is smaller than the angle β. That’s all cosine similarity wants to know. In other words, the smaller the angle, the closer the vectors are to each other.

放大该图,我们可以看到角度α小于角度β。 这就是所有余弦相似度想要知道的。 换句话说,角度越小,向量彼此越接近。

The generic formula goes as follows

通用公式如下

Image for post

β is the angle between the vectors philosophy (represented by v) and math (represented by w).

β是向量原理(用v表示)和数学(用w表示)之间的夹角。

Image for post

Whereas cos(alpha) = 0.99 which is higher than cos(beta) meaning philosophy is closer to psychology than it is to math.

cos(alpha) = 0.99 (高于cos(beta)意味着哲学比数学更接近心理学。

Recall that

回想起那个

Image for post

and

Image for post

This implies that the smaller the angle, the greater your cosine similarity will be and the greater your cosine similarity, the more similar your vectors are.

这意味着角度越小,您的余弦相似度就越大,并且您的余弦相似度越大,向量就越相似。

Python implementation

Python实现

import numpy as npmath = np.array([80, 45])philosophy = np.array([50, 60])psychology = np.array([15, 20])cos_beta = np.dot(philosophy, math) / (np.linalg.norm(philosophy) * np.linalg.norm(math))print(cos_beta)

带走 (Takeaway)

I bet you should know by now how Euclidean distance and cosine similarity works. The former considers the straight line distance between two points whereas the latter cares about the angle between the two vectors in question.

我敢打赌,您现在应该知道欧几里得距离和余弦相似度是如何工作的。 前者考虑了两个点之间的直线距离,而后者则考虑了所讨论的两个向量之间的角度。

Euclidean distance is more straightforward and is guaranteed to work whenever your features distribution is balanced. But most of the time, we deal with unbalanced data. In such cases, it’s better to use cosine similarity.

欧几里得距离更简单明了,并且可以保证只要要素分布平衡就可以使用。 但是大多数时候,我们处理不平衡的数据。 在这种情况下,最好使用余弦相似度。

翻译自: https://medium.com/@josmyfaure/euclidean-distance-and-cosine-similarity-which-one-to-use-and-when-28c97a18fe68

余弦相似度和欧氏距离

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/389936.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

七、 面向对象(二)

匿名类对象 创建的类的对象是匿名的。当我们只需要一次调用类的对象时,我们就可以考虑使用匿名的方式创建类的对象。特点是创建的匿名类的对象只能够调用一次! package day007;//圆的面积 class circle {double radius;public double getArea() {// TODO…

机器学习 客户流失_通过机器学习预测流失

机器学习 客户流失介绍 (Introduction) This article is part of a project for Udacity “Become a Data Scientist Nano Degree”. The Jupyter Notebook with the code for this project can be downloaded from GitHub.本文是Udacity“成为数据科学家纳米学位”项目的一部分…

Qt中的坐标系统

转载:原野追逐 Qt使用统一的坐标系统来定位窗口部件的位置和大小。 以屏幕的左上角为原点即(0, 0)点,从左向右为x轴正向,从上向下为y轴正向,这整个屏幕的坐标系统就用来定位顶层窗口; 此外,窗口内部也有自己…

预测股票价格 模型_建立有马模型来预测股票价格

预测股票价格 模型前言 (Preface) If you are reading this, it’s most likely because you love to solve puzzles. I’m a very competitive person by nature. The Mt. Everest of puzzles, in my opinion, is trying to find excess returns through active trading in th…

Python 模块 timedatetime

time & datetime 模块 在平常的代码中,我们常常需要与时间打交道。在Python中,与时间处理有关的模块就包括:time,datetime,calendar(很少用,不讲),下面分别来介绍。 在开始之前,首先要说明几…

柠檬工会_工会经营者

柠檬工会Hey guys! This week we’ll be going over some ways to work with result sets in MySQL. These result sets are the outputs of your everyday queries, such as:大家好! 本周,我们将介绍一些在MySQL中处理结果集的方法。 这些结果集是您日常…

写给Java开发者看的JavaScript对象机制

帮助面向对象开发者理解关于JavaScript对象机制 本文是以一个熟悉OO语言的开发者视角,来解释JavaScript中的对象。 对于不了解JavaScript 语言,尤其是习惯了OO语言的开发者来说,由于语法上些许的相似会让人产生心理预期,JavaScrip…

大数据ab 测试_在真实数据上进行AB测试应用程序

大数据ab 测试Hello Everyone!大家好! I am back with another article about Data Science. In this article, I will write about what is A-B testing and how to use it on real life data-set to compare two advertisement methods.我回来了另一篇有关数据科…

node:爬虫爬取网页图片

前言 周末自己在家闲着没事,刷着微信,玩着手机,发现自己的微信头像该换了,就去网上找了一下头像,看着图片,自己就想着作为一个码农,可以把这些图片都爬取下来做成一个微信小程序,说干…

如何更好的掌握一个知识点_如何成为一个更好的讲故事的人3个关键点

如何更好的掌握一个知识点You’re launching a digital transformation initiative in the middle of the ongoing pandemic. You are pretty excited about this big-ticket investment, which has the potential to solve remote-work challenges that your organization fac…

centos 搭建jenkins+git+maven

gitmavenjenkins持续集成搭建发布人:[李源] 2017-12-08 04:33:37 一、搭建说明 系统:centos 6.5 jdk:1.8.0_144 jenkins:jenkins-2.93-1.1 git:git-2.9.0 maven:Maven 3.3.9 二、部署 2.1、jdk安装 1)下…

什么事数据科学_如果您想进入数据科学,则必须知道的7件事

什么事数据科学No way. No freaking way to enter data science any time soon…That is exactly what I thought a year back.没门。 很快就不会出现进入数据科学的怪异方式 ……这正是我一年前的想法。 A little bit about my data science story: I am a complete beginner…

Java基础-基本数据类型

Java中常见的转义字符: 某些字符前面加上\代表了一些特殊含义: \r :return 表示把光标定位到本行行首. \n :next 表示把光标定位到下一行同样的位置. 单独使用在某些平台上会产生不同的效果.通常这两个一起使用,即:\r\n. 表示换行. \t :tab键,长度上相当于四个或者是八个空格 …

季节性时间序列数据分析_如何指导时间序列数据的探索性数据分析

季节性时间序列数据分析为什么要进行探索性数据分析? (Why Exploratory Data Analysis?) You might have heard that before proceeding with a machine learning problem it is good to do en end-to-end analysis of the data by carrying a proper exploratory …

TortoiseGit上传项目到GitHub

1. 简介 gitHub是一个面向开源及私有软件项目的托管平台,因为只支持git 作为唯一的版本库格式进行托管,故名gitHub。 2. 准备 2.1 安装git:https://git-scm.com/downloads。无脑安装 2.2 安装TortoiseGit(小乌龟):https://torto…

利用PHP扩展Taint找出网站的潜在安全漏洞实践

一、背景 笔者从接触计算机后就对网络安全一直比较感兴趣,在做PHP开发后对WEB安全一直比较关注,2016时无意中发现Taint这个扩展,体验之后发现确实好用;不过当时在查询相关资料时候发现关注此扩展的人数并不多;最近因为…

美团骑手检测出虚假定位_在虚假信息活动中检测协调

美团骑手检测出虚假定位Coordination is one of the central features of information operations and disinformation campaigns, which can be defined as concerted efforts to target people with false or misleading information, often with some strategic objective (…

CertUtil.exe被利用来下载恶意软件

1、前言 经过国外文章信息,CertUtil.exe下载恶意软件的样本。 2、实现原理 Windows有一个名为CertUtil的内置程序,可用于在Windows中管理证书。使用此程序可以在Windows中安装,备份,删除,管理和执行与证书和证书存储相…

335. 路径交叉

335. 路径交叉 给你一个整数数组 distance 。 从 X-Y 平面上的点 (0,0) 开始,先向北移动 distance[0] 米,然后向西移动 distance[1] 米,向南移动 distance[2] 米,向东移动 distance[3] 米,持续移动。也就是说&#x…

回归分析假设_回归分析假设的最简单指南

回归分析假设The Linear Regression is the simplest non-trivial relationship. The biggest mistake one can make is to perform a regression analysis that violates one of its assumptions! So, it is important to consider these assumptions before applying regress…