Class Imbalance Problem

本文转自:http://www.chioka.in/class-imbalance-problem/#comment-202282

What is the Class Imbalance Problem?

It is the problem in machine learning where the total number of a class of data (positive) is far less than the total number of another class of data (negative). This problem is extremely common in practice and can be observed in various disciplines including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, facial recognition, etc.

Why is it a problem?

Most machine learning algorithms and works best when the number of instances of each classes are roughly equal. When the number of instances of one class far exceeds the other, problems arise. This is best illustrated below with an example.

Given a dataset of transaction data, we would like to find out which are fraudulent and which are genuine ones. Now, it is highly cost to the e-commerce company if a fraudulent transaction goes through as this impacts our customers trust in us, and costs us money. So we want to catch as many fraudulent transactions as possible.

If there is a dataset consisting of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine transactions. The reason can be easily explained by the numbers. Suppose the machine learning algorithm has two possibly outputs as follows:

  1. Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.
  2. Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

If the classifier’s performance is determined by the number of mistakes, then clearly Model 1 is better as it makes only a total of 17 mistakes while Model 2 made 102 mistakes. However, as we want to minimize the number of fraudulent transactions happening, we should pick Model 2 instead which only made 2 mistakes classifying the fraudulent transactions. Of course, this could come at the expense of more genuine transactions being classified as fraudulent transactions, but will be a cost we can bear for now. Anyhow, a general machine learning algorithm will just pick Model 1 than Model 2, which is a problem. In practice, this means we will let a lot of fraudulent transactions go through although we could have stopped them by using Model 2. This translates to unhappy customers and money lost for the company.

How to tell the machine learning algorithm which is the better solution?

To tell the machine learning algorithm (or the researcher) that Model 2 is better than Model 1, we need to show that Model 2 above is better than Model 1 above. For that, we will need better metrics than just counting the number of mistakes made.

We introduce the concept of True Positive, True Negative, False Positive and False Negative:

  • True Positive (TP) – An example that is positive and is classified correctly as positive
  • True Negative (TN) – An example that is negative and is classified correctly as negative
  • False Positive (FP) – An example that is negative but is classified wrongly as positive
  • False Negative (FN) – An example that is positive but is classified wrongly as negative

Based on this above. We will have also the following of True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate:

Metrics Table

With these new metrics, let’s compare it with the conventional metrics of counting the number of mistakes made with the example above. First, we will use the old metrics to calculate the number of mistakes made (error):

Traditional Metrics

As illustrated above, Model 1 looks like it has lower error (0.1% error) than Model 2 (1.0% error) but we know that Model 2 is the better one, as it makes less false negatives (FN) (maximize true positive (TP)). Now let’s see what the performance of Model 1 and Model 2 are like with the new metrics:

New Metrics

Now, we can see that the false negative rate of Model 1 is at 70% while the false negative rate of Model 2 is just at 20%, which is clearly a better classifier. This is what we should educate the machine learning algorithm (or us) to use in order to allow it to pick a better algorithm.

How to mitigate this problem?

Now knowing what the Class Imbalance Problem is and why is it a problem, we need to know how to deal with this problem.

We can roughly classify the approaches into two major categories: sampling based approaches and cost function based approaches.

Cost function based approaches

The intuition behind cost function based approaches is that if we think one false negative is worse than one false positive, we will count that one false negative as, e.g., 100 false negatives instead. For example, if 1 false negative is as costly as 100 false positives, then the machine learning algorithm will try to make fewer false negatives compared to false positives (since it is cheaper). For example, in the case of SVM, the generic formula is:

SVM_formula

where w is the normal vector to the hyperplane. and E[i] is the error of each data instance, Cis a cost constant and n is the number of data instances. To assign a different cost function to false negative and false positive, we can modify the formula to as follows:

SVM_formula_2

Where C+ is a cost constant for positive cases and C- is a cost constant for negative cases, n+is the total number of positive cases and n- is the total number of negative cases. Without diving too deep into the formula above, this is just an example how one may assign different to cost to positive and negative classes.

Sampling based approaches

This can be roughly classified into three categories:

  1. Oversampling, by adding more of the minority class so it has more effect on the machine learning algorithm
  2. Undersampling, by removing some of the majority class so it has less effect on the machine learning algorithm
  3. Hybrid, a mix of oversampling and undersampling

However, these approaches have clear drawbacks, as explained below.

Undersampling

By undersampling, we could risk removing some of the majority class instances which is more representative, thus discarding useful information. This can be illustrated as follows: Undersampling

Here the green line is the ideal decision boundary we would like to have, and blue is the actual result. On the left side is the result of just applying a general machine learning algorithm without using undersampling. On the right, we undersampled the negative class but removed some informative negative class, and caused the blue decision boundary to be slanted, causing some negative class to be classified as positive class wrongly.

Oversampling

By oversampling, just duplicating the minority classes could lead the classifier to overfitting to a few examples, which can be illustrated below:Oversampling

On the left hand side is before oversampling, where as on the right hand side is oversampling has been applied. On the right side, The thick positive signs indicate there are multiple repeated copies of that data instance. The machine learning algorithm then sees these cases many times and thus designs to overfit to these examples specifically, resulting in a blue line boundary as above.

Hybrid approach

By combining undersampling and oversampling approaches, we get the advantages but also drawbacks of both approaches as illustrated above, which is still a tradeoff.

More recent approaches to the problem

In 2002, an sampling based algorithm called SMOTE (Synthetic Minority Over-Sampling Technique) was introduced that try to address the class imbalance problem. It is one of the most adopted approaches due to its simplicity and effectiveness. It is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.

In traditional oversampling, minority class is replicated exactly. In SMOTE, new minority instances are constructed in this way:

SMOTE

The intuition behind the construction algorithm is that oversampling causes overfit because of repeated instances causes the decision boundary to tighten. Instead, we will create “similar” examples instead. To the machine learning algorithm, these new constructed instances are not exact copies and thus softens the decision boundary as a result. This be can illustrated as follows:

SMOTE boundary

As a result, the classifier is more general and does not overfit.

Even more recent approaches

Interested readers may look into more recent literature regarding RUSBoost, SMOTEBagging and Underbagging, which are all regarded as more promising approaches since SMOTE.

However, SMOTE is still very popular due to its simplicity.

Summary

The Class Imbalance Problem is a common problem affecting machine learning due to having disproportionate number of class instances in practice. To compare solutions, we will use alternative metrics (True Positive, True Negative, False Positive, False Negative) instead of general accuracy of counting number of mistakes.

Due to its prevalence, there are many approaches out there to deal with this problem. They can be generally classified into two major categories of 1) sampling based, and 2) cost function based. Sampling based can be broken into three major categories: a) over sampling b) under sampling c) hybrid of oversampling and undersampling.

Now armed with what, why and how, you can start dealing with real world scenarios of handling datasets with this problem.


本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/247057.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

matlab中的类标转换程序

matlab中的类标转换程序 原始类标为Y,新类标为Y_new %进行排序,针对类标数目orig_labels sort(unique(Y)); Y_new Y;new_labels 1:length(orig_labels);for i1:length(orig_labels)Y_new(find(Yorig_labels(i)))Inf;Y_new(isinf(Y_new))new_labels(…

this关键字+super关键字

一.this关键字1.实例一:(1)需求:使用Java类描述一个动物;(2)实例:class Animal{ String name; //成员变量 String color; public Animal(String n,String c){ na…

python中的print

python3 中去除了print语句,加入print()函数实现相同的功能 print() 会在输出窗口中显示一些文本。 >>> print "hello,world!" SyntaxError: Missing parentheses in call to print >>> print("hello,world!") hello,world…

final+static

final final关键字顾名思义代表“最后的”,意味着不能被更改。它的定义,可以概括地分为以下三点: 被final修饰的类不能被继承;被final修饰的方法不能被重写;被final修饰的变量不能被改变。注:引用类型的变量…

程序代码编辑器和浏览器代码编辑器&代码可视化执行过程

tutorialspoint http://www.tutorialspoint.com/codingground.htm 1. Sublime Text :http://blog.l1n3.net/editor/sublime-text-introduce/ 下载 :http://www.sublimetext.com/3 2. Notepad https://notepad-plus-plus.org/zh/ 更多细节请查看 htt…

匿名对象+内部类

匿名对象 普通的类对象在使用时会定义一个类类型的变量,用来保存new出来的类所在的地址。而匿名类取消掉了这个变量,这个地址由编译器来处理,并且在new出来之后,它占用的内存会有JVM自动回收掉。后续无法再使用了。例如 public cl…

听技术播客:一边学Python编程一边学英语

本文转自:http://codingpy.com/article/recommended-python-podcasts/ 学技术的朋友一般都会关注不少技术博客(blog),但是关注技术播客(podcast)的人估计不会太多。这里一方面也是由于相关的播客数量&#…

mysql补充

mysql补充 mysql使用流程 开启服务端,mysqld或者net start mysqlcmd下键入mysql -u root -p,输入设置好的密码,连接mysql客户端show databases;展示所有的mysql仓库创建一个库:create database CRM;然后sho…

编程书单:十本Python编程语言的入门书籍

本文转自:http://codingpy.com/article/10-python-beginner-books/ 本文与大家分享一些Python编程语言的入门书籍,其中不乏经典。我在这里分享的,大部分是这些书的英文版,如果有中文版的我也加上了。有关书籍的介绍,大…

JavaScript异步

JavaScript异步类型 延迟类型:setTimeout、setInterval、setImmediate监听事件:监听new Image加载状态、监听script加载状态、监听iframe加载状态、Message带有异步功能类型: Promise、ajax、Worker、async/awaitJavaScript常用异步编程 Prom…

Sublime配置与各种插件

本文转自:http://www.cnblogs.com/yyhh/p/4232063.html Sublime Text 3 安装Package Control 点击View -> Show Console 在下方命令行内,输入以下命令。 import urllib.request,os;pfPackage Control.sublime-package;ippsublime.installed_packages_…

把Sublime Text 2打造成一个轻量级Python的IDE

本文转自:http://blog.l1n3.net/python/sublime-text-to-python-ide/ 因为这段时间迷上了Python,所以想吧Sublime Text 2弄成一个Python的简易IDE,Python自带的IDLE简直太难用!!!! 配置Python环…

数据库表操作、数据类型及完整性约束

数据库表操作、数据类型及完整性约束 库操作补充 数据库命名规则: 可以由字母、数字、下划线、@、#、$区分大小写唯一性不能使用关键字如 create select不能单独使用数字最长128位表操作补充 #语法: create table 表名…

算法第一次作业

1.代码规范(由于日后可能会用C和Java,就找了两种) Google C代码规范:https://blog.csdn.net/freeking101/article/details/78930381 Ggoogle Jave代码规范:https://www.jianshu.com/p/4e50269037ed 2.《数学之美》读后…

Sublime Text官方文档 中英文版本

英文版本:http://docs.sublimetext.info/en/latest/index.html 中文翻译版本:http://sublime-text.readthedocs.org/en/latest/reference/build_systems.html

第99:真正理解拉格朗日乘子法和 KKT 条件

转载于:https://www.cnblogs.com/invisible2/p/11441485.html

有了CodinGame,玩着游戏就能学编程

本文转自:http://www.codingpy.com/article/learning-to-code-becomes-a-game/ 今天编程派向大家推荐一个有趣的编程练习平台,而它与其他平台的差异,就在于它将编程练习变成了一个个游戏。这个平台的名字叫CodinGame,是一家法国创…

第98:svd原理

SVD分解:任何矩阵都可以分解成第一行的形式,3个相乘。UV都是正交矩阵,中间的是奇异值。 3个相乘的形式可以拆分。即奇异值*第一行*第一列。在相加。 奇异值有时很小,在这种情况下,丢掉,可以减少计算量&…

Python学习(变量与字符串)

print()、input()、if/else就可以做一个简陋的游戏 print() # 打印函数,将信息打印出来input() # 将信息打印,并且要求输入一段话,并且把这段话。input函数,这个函数会将字符串显示在IDLE上,并且让用户输入信息&#…

第97:一文读懂协方差与协方差矩阵

转载于:https://www.cnblogs.com/invisible2/p/11442777.html