The So-Called Simpson’s Paradox

We all know the Simpsons family from television, but have you heard of Simpson’s Paradox from statistics? This article illustrates the definition of Simpson’s Paradox with an example and shows how it can harm your statistical tests and analyses.

What is Simpson’s Paradox?

Simpson’s Paradox refers to situations in which a trend or relationship observed within multiple groups disappears or reverses when the groups are combined. The short explanation for why Simpson’s Paradox occurs is the existence of confounding variables. I will illustrate it with the example below.

An example of Simpson’s Paradox

Let’s take a simple example from a study of the difference in mortality rate between smokers and non-smokers, conducted by Appleton, French, and Vanderpump in 1996. Here is the data they collected in the study:

[Figure: the mortality rate for smokers and non-smokers]

One would expect the mortality rate to be higher for smokers than for non-smokers because of the harm caused by smoking. However, the data show that the mortality rate is higher for non-smokers. The relationship is easier to see here:

[Figure: mortality rate chart]

The grey line in the chart represents the mortality rate, and it is higher for non-smokers. Why is that? Let’s break the data down into multiple groups by age:

[Figure: the data broken down by age group]

Here is the chart plotting the mortality rate by age group and smoking status:

[Figure: mortality rate by age group and smoking status]

The chart shows that, in this dataset, the mortality rate increases with age for both smokers and non-smokers. It is reasonable to conclude, both from this data and from common sense, that age is positively correlated with the mortality rate.

Meanwhile, the chart below compares the smoking rate across the age groups:

[Figure: smoking rate by age group]

There are more smokers than non-smokers in all age groups except 65–74 and 75+. 27% of the non-smokers are older than 65, while only 8% of the smokers are older than 65. The chart therefore shows that the age distributions of smokers and non-smokers are substantially different: in this data, the smoking population is younger than the non-smoking population. In other words, age is negatively correlated with the probability of being a smoker.
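The two associations described so far (age with mortality, and age with smoking status) can be checked with a few lines of Python. This is a minimal sketch: the counts below are made up for illustration and are not the study’s data, and the two age bands and the names `counts`, `share_65_plus`, and `mortality_by_band` are assumptions of the example.

```python
# Hypothetical counts for illustration only (NOT the study's data).
# counts[group][age_band] = (people, deaths)
counts = {
    "smoker":     {"under_65": (800, 40), "65_plus": (200, 120)},
    "non_smoker": {"under_65": (400, 12), "65_plus": (600, 300)},
}


def share_65_plus(group):
    """Fraction of a smoking-status group that falls in the 65+ band."""
    people = {band: n for band, (n, _) in counts[group].items()}
    return people["65_plus"] / sum(people.values())


def mortality_by_band(band):
    """Mortality rate of an age band, smokers and non-smokers pooled."""
    people = sum(counts[g][band][0] for g in counts)
    deaths = sum(counts[g][band][1] for g in counts)
    return deaths / people


# Association 1: how old is each group?
old_share = {g: share_65_plus(g) for g in counts}
# Association 2: how does mortality change with age?
band_rate = {b: mortality_by_band(b) for b in ("under_65", "65_plus")}

print("share aged 65+:", old_share)
print("mortality by age band:", band_rate)
```

In this made-up data, 20% of smokers but 60% of non-smokers are in the older band, and the older band has a much higher mortality rate. A variable that is tied to both the grouping and the outcome in this way is a candidate confounder.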

The evidence above shows that when we examine the relationship between smoking and the mortality rate, we cannot ignore age, which is called a confounding variable (or a lurking variable). Age is positively correlated with the mortality rate but negatively correlated with smoking: older groups have a higher mortality rate, but fewer of their members are smokers. Thus, the greater proportion of older non-smokers, whose mortality rate reaches 100% in this dataset, pushes up the average mortality rate for the non-smoker group. That is why the mortality rate is lower for non-smokers within the age groups, yet higher for non-smokers when we combine all the groups. This example illustrates what Simpson’s Paradox is and why it happens.
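The mechanism can be made concrete: a group’s overall mortality rate is a weighted average of its per-age-group rates, with the weights given by the group’s own age distribution. The sketch below uses made-up counts (not the study’s data) with two hypothetical age bands to show how the reversal arises; the function names are assumptions of the example.

```python
# Hypothetical counts for illustration only (NOT the study's data).
# counts[group][age_band] = (people, deaths)
counts = {
    "smoker":     {"under_65": (800, 40), "65_plus": (200, 120)},
    "non_smoker": {"under_65": (400, 12), "65_plus": (600, 300)},
}


def stratum_rate(group, band):
    """Mortality rate inside one age band of one group."""
    people, deaths = counts[group][band]
    return deaths / people


def pooled_rate(group):
    """Mortality rate of a group with all age bands combined."""
    people = sum(n for n, _ in counts[group].values())
    deaths = sum(d for _, d in counts[group].values())
    return deaths / people


def weighted_rate(group):
    """The same pooled rate, written as an age-share-weighted average."""
    total = sum(n for n, _ in counts[group].values())
    return sum(
        (n / total) * stratum_rate(group, band)
        for band, (n, _) in counts[group].items()
    )


for band in ("under_65", "65_plus"):
    print(band, {g: round(stratum_rate(g, band), 3) for g in counts})
print("pooled", {g: round(pooled_rate(g), 3) for g in counts})
```

With these made-up counts, smokers have the higher rate in both bands (5% vs 3%, and 60% vs 50%), yet the pooled rate is 16% for smokers and 31.2% for non-smokers. The non-smokers’ average is dominated by the older band because 60% of them are in it, compared with 20% of the smokers.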

How to deal with Simpson’s Paradox?

Now that we know the what and the why, it is time to look at how to deal with it. Simpson’s Paradox can do great harm to statistical analyses and tests, because ignoring a confounding variable can make a relationship appear reversed or insignificant. The way to deal with it is therefore to identify the confounding variable and control for it in your analysis. Taking the previous data as an example, you cannot conclude from the group averages alone that non-smokers have a higher mortality rate and that smoking is therefore good for your health. Breaking the data down into age groups gives a better understanding of the relationship.
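One common way to control for a confounder such as age is direct standardization: apply each group’s per-age-band mortality rates to a single shared age distribution, so that the groups are compared as if they had the same age mix. This technique is not named in the article; it is offered here as one possible sketch, using made-up counts (not the study’s data) and hypothetical names.

```python
# Hypothetical counts for illustration only (NOT the study's data).
# counts[group][age_band] = (people, deaths)
counts = {
    "smoker":     {"under_65": (800, 40), "65_plus": (200, 120)},
    "non_smoker": {"under_65": (400, 12), "65_plus": (600, 300)},
}
bands = ("under_65", "65_plus")

# Shared reference age distribution: everyone in the data, pooled.
# (Choosing the pooled population as the standard is an assumption.)
standard = {b: sum(counts[g][b][0] for g in counts) for b in bands}
total_standard = sum(standard.values())


def age_adjusted_rate(group):
    """Mortality rate the group would have under the reference age mix."""
    expected_deaths = sum(
        standard[b] * counts[group][b][1] / counts[group][b][0]
        for b in bands
    )
    return expected_deaths / total_standard


adjusted = {g: age_adjusted_rate(g) for g in counts}
print("age-adjusted mortality:", adjusted)
```

With these made-up counts, the age-adjusted rate is 27% for smokers and 21.8% for non-smokers, so the comparison agrees with the per-age-group picture once both groups are given the same age mix.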

I hope this article helps you understand Simpson’s Paradox. Thank you for reading!

Translated from: https://medium.com/the-innovation/the-so-called-simpsons-paradox-6d0efdca6fdc
