Imbalanced Data: When Details Matter

Defining Imbalanced Data

When we speak of imbalanced data, we mean that at least one class is underrepresented. For example, consider the problem of building a classifier; let’s call it the Idealisstic-Voter. We give it the task of identifying politicians that the American public finds trustworthy. Looking at the dataset, we realize that almost every sample is hugely unpopular.

In classification terminology, we call the classes with very few samples the minority-classes and the ones with many samples the majority-classes.

So What Is the Issue?

Let’s assume there are 100 politicians, and only one of them is considered trustworthy by the broad public; let’s call him Abraham. We could now quickly build an Idealisstic-Voter that achieves a precision of 99% by simply saying “all politicians are liars,” i.e., classifying all politicians as untrustworthy. The loss will be low, and we can give an excellent presentation about how well our Idealisstic-Voter scores on the most common metrics.

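Treating untrustworthy as the positive class, we get TP = 99, FP = 1 (Abraham), FN = 0, and TN = 0:

PRECISION = TP / (TP + FP) = 99 / 100 = 99 %
RECALL = TP / (TP + FN) = 99 / 99 = 100 %
ACCURACY = (TP + TN) / (TP + TN + FP + FN) = 99 / 100 = 99 %
F1 = (2*PRECISION*RECALL)/(PRECISION+RECALL) ≈ 99.5 %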
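To make these numbers concrete, here is a small sketch that recomputes the scores for the "everyone is untrustworthy" classifier. The 0/1 label encoding and the choice of untrustworthy as the positive class are assumptions made for illustration; the helper name `confusion_counts` is hypothetical.

```python
# Toy data: 1 = untrustworthy (treated as the positive class), 0 = trustworthy.
y_true = [1] * 99 + [0]   # 99 untrustworthy politicians plus Abraham
y_pred = [1] * 100        # the classifier says "untrustworthy" for everyone


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

All four scores look excellent, even though the model never identifies a single trustworthy politician.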

But that is not the goal! What we want is to separate the good from the bad, and we did not really separate anything. So how can we detect that we have a problem, that we did an injustice to Abraham? And we all know he does not like injustices …

Balanced Metrics

Whenever you work with imbalanced data, make it a habit to also look at the balanced metrics. They are computed like the ones you are familiar with, but per class, and then averaged over all classes, so every class counts equally regardless of its size.

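BALANCED_PRECISION = (P_TRUSTWORTHY + P_UNTRUSTWORTHY) / 2 = (0 % + 99 %) / 2 = 49.5 % ≈ 50 %

Here P_TRUSTWORTHY is counted as 0 %: the classifier never predicts trustworthy, so that precision is formally 0/0 and is conventionally set to zero.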

Another view that works very nicely in this context is to compare the False Positive Rate with the True Positive Rate. In the multilabel case, you will mostly opt for some macro-averaged score.
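As a rough sketch of what "averaging over all classes" means, the snippet below computes precision and recall once per class and then takes the unweighted (macro) mean. The helper `precision_recall` and the rule of scoring a 0/0 case as 0 are assumptions made for this illustration.

```python
# Same toy data as in the story: 1 = untrustworthy, 0 = trustworthy (Abraham).
y_true = [1] * 99 + [0]
y_pred = [1] * 100  # "all politicians are liars"


def precision_recall(y_true, y_pred, label):
    """Precision and recall when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Convention (assumption): a 0/0 score counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = sorted(set(y_true))
scores = {c: precision_recall(y_true, y_pred, c) for c in labels}

# Macro average: every class contributes equally, regardless of its size.
macro_precision = sum(p for p, _ in scores.values()) / len(labels)
balanced_accuracy = sum(r for _, r in scores.values()) / len(labels)  # mean per-class recall

print(scores)
print(f"macro precision={macro_precision:.3f} balanced accuracy={balanced_accuracy:.3f}")
```

Both balanced scores sit at about one half, which is what we would expect from a model that has not learned to separate anything. Libraries such as scikit-learn ship ready-made versions of these metrics (for example `balanced_accuracy_score`).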

We have realized that we have a problem. How can we solve it?

Upsampling and Downsampling

Probably the simplest way to make the Idealisstic-Voter solve the problem by himself is to present him with more reasonable options. This means either collecting more data or using the lovely Upsampling technique.

This technique clones our beloved Abraham, meaning that we duplicate the members of the minority class. Once we have balanced the dataset to a sufficient degree, we will notice that our model actually has to learn something to attain a good score!
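Cloning Abraham can be sketched in a few lines of plain Python. The function name `upsample`, the sample names, and the target count of 30 are all made up for illustration; this is one simple way to do random oversampling, not the only one.

```python
import random
from collections import Counter


def upsample(samples, labels, minority_label, target_count, seed=0):
    """Duplicate random minority samples until the class has `target_count` members."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    n_missing = max(0, target_count - len(minority))
    clones = [rng.choice(minority) for _ in range(n_missing)]
    return samples + clones, labels + [minority_label] * n_missing


# 99 untrustworthy politicians (label 0) and one trustworthy Abraham (label 1).
samples = [f"politician_{i}" for i in range(99)] + ["abraham"]
labels = [0] * 99 + [1]

new_samples, new_labels = upsample(samples, labels, minority_label=1, target_count=30)
print(Counter(new_labels))
```

In practice you would apply this to the training split only, so that copies of the same sample do not end up in both the training and the evaluation data.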

Downsampling

The opposite of Upsampling is Downsampling, aka Undersampling. This technique is great if you have more than enough data to do your job (which has never happened to me, but I assume the big data collectors of this world may sometimes be in this position).

In this case, you can eliminate some of the instances of the majority class. You don’t have to go 50–50; it only has to be enough to motivate the Idealisstic-Voter to learn something. The same holds for all of the techniques we are discussing today.
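A minimal sketch of random downsampling, assuming a toy dataset of 1000 majority and 50 minority samples. The function name `downsample` and the number of majority samples to keep are hypothetical choices.

```python
import random
from collections import Counter


def downsample(samples, labels, majority_label, keep, seed=0):
    """Keep only `keep` randomly chosen majority samples; leave other classes untouched."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    kept = set(rng.sample(majority_idx, min(keep, len(majority_idx))))
    idx = [i for i, l in enumerate(labels) if l != majority_label or i in kept]
    return [samples[i] for i in idx], [labels[i] for i in idx]


samples = list(range(1050))
labels = [0] * 1000 + [1] * 50   # 0 = majority class, 1 = minority class

small_samples, small_labels = downsample(samples, labels, majority_label=0, keep=200)
print(Counter(small_labels))     # a 4:1 ratio instead of 20:1, deliberately not 50-50
```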

Generating Synthetic Samples

Figure: Generating Synthetic Examples, Image by Author

OK, now we are getting to the cool things. How about we generate data out of thin air? How about we add some Abrahams with cool techno glasses? He’d surely be just as trustworthy and would additionally win the youth vote!

Adding a little bit of ϵps

Let’s say our dataset has the feature age. We can clearly see that the only trustworthy politicians are 2020 - 1809 = 211 years old, 1809 being Abraham’s birth year. But what about politicians that have all the same accomplishments but are 211.001 or 210.999 years old?

Probably the same! But be extremely careful when using this technique, since you are effectively falsifying your data. It can easily backfire, and you should only use it when you are very confident that it is the right choice. This brings us to a slightly more elaborate and automated technique.
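The ϵps idea can be sketched as adding a tiny random offset to an existing minority value. The helper `jitter` and the noise width of 0.001 years are assumptions chosen to match the example above; pick the width from domain knowledge, not from this snippet.

```python
import random


def jitter(values, n_new, eps=0.001, seed=0):
    """Create `n_new` synthetic values by adding uniform noise in [-eps, eps]."""
    rng = random.Random(seed)
    return [rng.choice(values) + rng.uniform(-eps, eps) for _ in range(n_new)]


trustworthy_ages = [211.0]  # Abraham, the only trustworthy politician
synthetic_ages = jitter(trustworthy_ages, n_new=5)
print(synthetic_ages)
```

Every synthetic age stays within 0.001 years of a real one, which is exactly why the technique is only as trustworthy as your belief that such a small change cannot flip the label.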

SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique. It constructs new samples by looking at the attributes of the minority-class samples: it interpolates between a minority sample and one of its nearest minority-class neighbours, so the new values stay within the range observed in our minority-class. It sounds complicated, so let’s look at a simple example, and it should become as clear as your mouse clicks on the clap. Let’s assume we add JFK to the dataset and look again at the age of trustworthy politicians.

Figure: SMOTE illustration, Image by Author

As we can see, the area indicated as SMOTE SAFE is the area in which we can freely generate new synthetic samples! You can find a more in-depth explanation here. It is fascinating, but I think the most important thing is to know that such methods exist.
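The core interpolation step of SMOTE can be sketched as follows. This is a simplified, dependency-free illustration and not the reference implementation: the function name `smote_like` is hypothetical, and JFK's age of 103 is my own calculation (2020 - 1917), by analogy with Abraham's.

```python
import random


def smote_like(minority, n_new, k=1, seed=0):
    """Generate points on the line between a minority sample and a near minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        # k nearest minority neighbours of x by squared Euclidean distance
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
        )[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position between x (u = 0) and its neighbour (u = 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic


# One feature (age): Abraham is 211, JFK is 103 (assumed, see lead-in).
minority = [(211.0,), (103.0,)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

All generated ages fall between 103 and 211, which corresponds to the SMOTE SAFE area in the illustration. For real projects, the imbalanced-learn package provides a full `SMOTE` implementation.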

Get Creative

We can find some incredible examples of this technique in the autonomous driving scene. While all the major competitors in this field have collected several petabytes of data, situations such as kids running in front of cars should be tested before they are actually encountered. These cases may also happen so rarely that you will never find enough training data.

Generating synthetic data to test these and various other edge cases under many different lighting and weather conditions is a must. One often uses GANs or similar networks to convert already existing samples into various other forms, e.g., in winter or in the rain. There are multiple ways to do this. A short video that illustrates this can be seen here (I am not affiliated with them, nor do I know how good they are, but they have an excellent short video ;)

A video demonstrating synthetic data generation

Change the Loss Function

But we do have one more thing! What if I told you that instead of dealing with the problem at the data level, we could deal with it at the heart of our algorithm, the loss function?

Over the years, many approaches have been proposed, and there truly is no one-size-fits-all solution. The right choice often depends heavily on how important it is to identify the minority-class and on what kind of costs occur when you misclassify a sample of the majority class.

Weighted Loss Function, Define It Yourself


One way of constructing such a loss is simply to multiply the terms in which a minority-class sample occurs by a constant > 1. This effectively encourages the model to pay more attention to the instances of the minority class. In Keras, this is built in and can be applied without additional hassle:

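# Weight per class index: errors on class 1 count 50 times as much as errors on class 0
class_weight = {0: 1.0, 1: 50.0}
# Note: the keyword argument of model.fit is `class_weight` (singular)
model.fit(X_train, Y_train, class_weight=class_weight)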

In this particular case, it tells Keras to treat one occurrence of class 1 like 50 instances of class 0. In some easy Python pseudo-code, we could think of it as something like this:

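# Pseudo-code: `model` and `loss` stand for your network and your per-sample loss
def loss_weighted(x, y_true):
    y_pred = model.forward(x)
    if y_true == 1:
        return 50 * loss(y_pred, y_true)  # minority class counts 50 times
    elif y_true == 0:
        return 1 * loss(y_pred, y_true)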
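For a runnable version of the same idea, here is a sketch that uses binary cross-entropy as the per-sample loss and averages the weighted losses over the batch. The choice of loss, the averaging, and the helper names `bce` and `weighted_mean_loss` are assumptions for illustration; a framework may normalise differently.

```python
import math

CLASS_WEIGHT = {0: 1.0, 1: 50.0}  # 1 = minority class (trustworthy)


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and one label y."""
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def weighted_mean_loss(probs, labels, class_weight=CLASS_WEIGHT):
    """Mean of the per-sample losses, each scaled by the weight of its true class."""
    losses = [class_weight[y] * bce(p, y) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


labels = [0] * 99 + [1]
lazy_probs = [0.01] * 100  # a model that calls everyone untrustworthy

unweighted = weighted_mean_loss(lazy_probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_mean_loss(lazy_probs, labels)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

With equal weights the lazy model gets away with a small loss; with the 50x weight, missing Abraham dominates the total, so the model can no longer ignore him cheaply.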

Class Balanced Loss

As the number of classes grows, it becomes tedious to define the weights explicitly, as we did above. For this, a variety of solutions have been proposed, and one of the most notable works is “Class-Balanced Loss Based on Effective Number of Samples.”

What the authors effectively propose is a reweighting of the loss function that takes the effective number of samples per class into account. A hyperparameter additionally controls how strongly the loss is rebalanced.
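As a sketch of that reweighting: the paper defines the effective number of samples of a class with n samples as (1 - β^n) / (1 - β) and weights each class by its inverse, with β in [0, 1) as the hyperparameter. The function name, the class counts, and the normalisation so that the weights sum to the number of classes are choices made for this illustration.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to the inverse effective number of samples."""
    effective = [(1 - beta ** n) / (1 - beta) for n in counts]
    raw = [1 / e for e in effective]
    scale = len(counts) / sum(raw)  # normalise so the weights sum to the number of classes
    return [w * scale for w in raw]


counts = [5000, 500, 5]  # hypothetical samples per class
for beta in (0.0, 0.99, 0.999):
    weights = class_balanced_weights(counts, beta)
    print(beta, [round(w, 4) for w in weights])
```

With β = 0 every class gets the same weight (no rebalancing), and as β approaches 1 the weights approach plain inverse class frequency, so β lets you slide between the two extremes.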

While this is a very simplistic view of their process, I feel that it captures the most important aspects to keep in mind. If you ever encounter a case where you have a lot of classes and significant imbalances between them, make sure to give this approach a go.

Conclusion

Congratulations, you now know what you can do when you encounter imbalanced data. Several approaches exist, and choosing the right one for your problem may not always be an easy task. How much time and resources you want to put into balancing things more fairly will depend heavily on your business needs. Being aware of potential solutions is the first step to solving the problem!

Translated from: https://towardsdatascience.com/imbalanced-data-when-details-matter-16bd3ec7ef74
