Chi-Square Test for Independence in Python with Examples from the IBM HR Analytics Dataset

Suppose you are exploring a dataset and you want to examine if two categorical variables are dependent on each other.

The motivation could be a better understanding of the relationship between an outcome variable and a predictor, identification of dependent predictors, etc.

In this case, a Chi-square test can be an effective statistical tool.

In this post, I will discuss how to do this test in Python (both from scratch and using SciPy), with examples from a popular HR analytics dataset: the IBM Employee Attrition & Performance dataset.

Table of Curiosities

  1. What is the Chi-square test?

  2. What are the categorical variables that we want to examine?

  3. How to perform this test from scratch?

  4. Is there a shortcut to do this?

  5. What else can we do?

  6. What are the limitations?

Overview

A Chi-square test is a statistical hypothesis test whose test statistic follows a Chi-square distribution under the null hypothesis. In particular, the Chi-square test for independence is often used to examine whether two categorical variables are independent [1].

The key assumptions associated with this test are: 1. the data are a random sample from the population; 2. each subject belongs to exactly one category of each variable.

To better illustrate this test, I have chosen the IBM HR dataset from Kaggle (link), which includes a sample of employee HR information regarding attrition, work satisfaction, performance, etc. People often use it to uncover insights about the relationship between employee attrition and other factors.

Note that this is a fictional data set created by IBM data scientists [2].

To see the full Python code, check out my Kaggle kernel.

Without further ado, let’s get to the details!

Exploration

Let’s first check out the number of employees and the number of attributes:

data.shape
--------------------------------------------------------------------
(1470, 35)

There are 1470 employees and 35 attributes.
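The snippets in this post assume a pandas DataFrame named `data` along with the usual `np`, `pd` and `stats` aliases, none of which are shown being created. A minimal setup sketch follows. The helper name `load_hr_data` is hypothetical, and the in-memory CSV is only a tiny stand-in, so in practice you would pass the path of the file you downloaded from Kaggle.

```python
import io

import numpy as np
import pandas as pd
from scipy import stats


def load_hr_data(path_or_buffer):
    """Read the HR CSV into a DataFrame (hypothetical helper, not from the post)."""
    return pd.read_csv(path_or_buffer)


# Tiny stand-in for the real file, only to show the call pattern.
sample_csv = io.StringIO(
    "Age,Attrition,JobSatisfaction\n"
    "41,Yes,4\n"
    "49,No,2\n"
    "37,Yes,3\n"
)
data = load_hr_data(sample_csv)
print(data.shape)  # (rows, columns)
```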

Next, we can check what these attributes are and whether any of them contains missing values:

data.isna().any()
--------------------------------------------------------------------
Age False
Attrition False
BusinessTravel False
DailyRate False
Department False
DistanceFromHome False
Education False
EducationField False
EmployeeCount False
EmployeeNumber False
EnvironmentSatisfaction False
Gender False
HourlyRate False
JobInvolvement False
JobLevel False
JobRole False
JobSatisfaction False
MaritalStatus False
MonthlyIncome False
MonthlyRate False
NumCompaniesWorked False
Over18 False
OverTime False
PercentSalaryHike False
PerformanceRating False
RelationshipSatisfaction False
StandardHours False
StockOptionLevel False
TotalWorkingYears False
TrainingTimesLastYear False
WorkLifeBalance False
YearsAtCompany False
YearsInCurrentRole False
YearsSinceLastPromotion False
YearsWithCurrManager False
dtype: bool

Identify Categorical Variables

Suppose we want to examine if there is a relationship between ‘Attrition’ and ‘JobSatisfaction’.

Counts for the two categories of ‘Attrition’:

data['Attrition'].value_counts()
--------------------------------------------------------------------
No 1233
Yes 237
Name: Attrition, dtype: int64

Counts for the four categories of ‘JobSatisfaction’ ordered by frequency:

data['JobSatisfaction'].value_counts()
--------------------------------------------------------------------
4 459
3 442
1 289
2 280
Name: JobSatisfaction, dtype: int64

Note that for ‘JobSatisfaction’, 1 is ‘Low’, 2 is ‘Medium’, 3 is ‘High’, and 4 is ‘Very High’.

Null Hypothesis and Alternative Hypothesis

For our Chi-square test for independence here, the null hypothesis is that ‘Attrition’ and ‘JobSatisfaction’ are independent, i.e. there is no relationship between them.

The alternative hypothesis is that there is a relationship between ‘Attrition’ and ‘JobSatisfaction’, i.e. the two variables are dependent.

Contingency Table

In order to compute the Chi-square test statistic, we would need to construct a contingency table.

We can do that using the ‘crosstab’ function from pandas:

ct = pd.crosstab(data.Attrition, data.JobSatisfaction, margins=True)
ct
[Image: contingency table of ‘Attrition’ (rows) by ‘JobSatisfaction’ (columns), with ‘All’ totals]

The numbers in this table represent frequencies. For example, the ‘46’ shown under both ‘2’ in ‘JobSatisfaction’ and ‘Yes’ in ‘Attrition’ means that out of the 1470 employees, 46 of them rated their job satisfaction as ‘Medium’ and they did leave the company.
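To make the structure of such a table concrete, here is a small sketch on made-up data (the six rows below are hypothetical, not taken from the IBM dataset). It shows how `pd.crosstab` with `margins=True` adds an ‘All’ row and column, and how a single cell can be read back by its row and column labels.

```python
import pandas as pd

# Hypothetical toy records, for illustration only.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "JobSatisfaction": [1, 1, 2, 2, 2, 1],
})

toy_ct = pd.crosstab(toy.Attrition, toy.JobSatisfaction, margins=True)
print(toy_ct)

# Cell lookup by row label and column label.
yes_and_2 = toy_ct.loc["Yes", 2]
grand_total = toy_ct.loc["All", "All"]
```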

Chi-square Statistic

The formula for calculating the Chi-square statistic (X²) is shown as follows:

X² = sum of [(observed-expected)² / expected]

The term ‘observed’ refers to the numbers we have seen in the contingency table, and the term ‘expected’ refers to the expected numbers when the null hypothesis is true.

Under the null hypothesis, ‘Attrition’ and ‘JobSatisfaction’ are independent, which means the attrition rate should be the same across the four categories of job satisfaction. As an example, the expected frequency for ‘JobSatisfaction’ of ‘4’ and ‘Attrition’ of ‘Yes’ is the number of employees who rate their job satisfaction as ‘Very High’ * (total attrition / total employee count), which is 459*237/1470, or about 74.
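Written out, the expected count for a cell is the row total times the column total divided by the grand total. The symbols below (E for an expected count, N for the grand total) are notation introduced here for illustration, not taken from the original post:

```latex
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
E_{\text{Yes},\,4} = \frac{237 \times 459}{1470} \approx 74.0
```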

Let’s compute all the expected numbers and store them in a list called ‘exp’:

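# Row totals for 'No' and 'Yes' (taken from the 'All' column)
row_sum = ct.iloc[0:2,4].values
exp = []
for j in range(2):
    # Column totals for JobSatisfaction 1..4 (taken from the 'All' row)
    for val in ct.iloc[2,0:4].values:
        exp.append(val * row_sum[j] / ct.loc['All', 'All'])
print(exp)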
--------------------------------------------------------------------
[242.4061224489796,
234.85714285714286,
370.7387755102041,
384.99795918367346,
46.593877551020405,
45.142857142857146,
71.26122448979592,
74.00204081632653]

Note that the last term (about 74) matches the value we worked out by hand above, which is a quick check that the calculation is correct.
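The same expected counts can be obtained without an explicit loop by taking the outer product of the row totals and the column totals. This is a sketch under one assumption: the observed cell counts are typed in by hand rather than read from `ct`, chosen to agree with the totals shown earlier and with the 46 quoted in the text.

```python
import numpy as np

# Observed counts: rows are Attrition (No, Yes), columns are JobSatisfaction 1..4.
# Assumed values; they match the row and column totals printed earlier.
observed = np.array([
    [223, 234, 369, 407],
    [66, 46, 73, 52],
])

row_totals = observed.sum(axis=1)   # totals for No and Yes
col_totals = observed.sum(axis=0)   # totals for satisfaction levels 1..4
n = observed.sum()                  # grand total

# E[i, j] = row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
```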

Now we can compute X²:

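obs = ct.iloc[0:2, 0:4].values.flatten()  # observed counts, in the same order as exp
chi_sq_stats = ((obs - np.array(exp))**2 / np.array(exp)).sum()
chi_sq_stats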
--------------------------------------------------------------------
17.505077010348
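As a cross-check, the whole statistic can be computed in a few vectorized lines, keeping the per-cell contributions so we can see which cells drive the result. The observed counts are again an assumption, typed in by hand to match the totals shown earlier.

```python
import numpy as np

# Assumed observed counts (rows: Attrition No/Yes, columns: JobSatisfaction 1..4).
observed = np.array([
    [223, 234, 369, 407],
    [66, 46, 73, 52],
])

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contributions to the statistic.
contributions = (observed - expected) ** 2 / expected
chi_sq = contributions.sum()
print(round(chi_sq, 3))
print(contributions.round(2))
```

With these counts, the two largest contributions come from the ‘Yes’ row at satisfaction levels 1 and 4, which hints at where the observed counts depart most from independence.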

Degrees of Freedom

Besides X², the other parameter we need is the degrees of freedom, computed as (number of categories in the first variable - 1) * (number of categories in the second variable - 1). In this case it is (2-1)*(4-1), or 3.

dof = (len(row_sum)-1)*(len(ct.iloc[2,0:4].values)-1)
dof
--------------------------------------------------------------------
3

Interpretation

With both X² and the degrees of freedom, we can use a Chi-square table or calculator to determine the corresponding p-value and conclude whether there is a significant relationship at a specified significance level alpha.

In other words, under the null hypothesis the ‘observed’ counts should be close to the ‘expected’ counts, which means X² should be reasonably small. When X² exceeds the critical value for the given degrees of freedom, the p-value (the probability of seeing an X² at least this large if the null hypothesis is true) falls below the significance level, and we reject the null hypothesis.

In Python, we can compute the p-value as follows:

1 - stats.chi2.cdf(chi_sq_stats, dof)
--------------------------------------------------------------------
0.000556300451038716

Suppose the significance level is 0.05. Since the p-value is well below 0.05, we reject the null hypothesis and conclude that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
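Two equivalent ways of reaching this decision are sketched below, using the statistic and degrees of freedom reported above. `stats.chi2.sf` is SciPy's survival function (1 - CDF), which avoids the subtraction from 1 when the p-value is tiny, and `stats.chi2.ppf` gives the critical value that X² must exceed at a given alpha. The variable names are illustrative.

```python
from scipy import stats

chi_sq_stats = 17.505077010348  # statistic reported above
dof = 3
alpha = 0.05

# Survival function: P(X >= chi_sq_stats) for a chi-square variable with `dof` degrees of freedom.
p_value = stats.chi2.sf(chi_sq_stats, dof)

# Critical value: reject the null hypothesis if the statistic exceeds it.
critical_value = stats.chi2.ppf(1 - alpha, dof)

reject_null = chi_sq_stats > critical_value
print(p_value, critical_value, reject_null)
```

Comparing the p-value with alpha and comparing the statistic with the critical value always lead to the same decision; the p-value just carries more information about how strong the evidence is.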

Using SciPy

There is a shortcut to perform this test in Python, which leverages the SciPy library (documentation).

obs = ct.iloc[0:2, 0:4].values  # observed counts without the 'All' margins
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(17.505077010348, 0.0005563004510387556, 3)

Note that the three terms are the X² statistic, the p-value, and the degrees of freedom, respectively. These results are consistent with the ones we computed by hand earlier.
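`chi2_contingency` also returns the table of expected counts as a fourth value, which the slice `[0:3]` drops. A small sketch of unpacking all four follows, with the observed counts typed in by hand as an assumption (they match the totals shown earlier).

```python
import numpy as np
from scipy import stats

# Assumed observed counts (rows: Attrition No/Yes, columns: JobSatisfaction 1..4).
obs = np.array([
    [223, 234, 369, 407],
    [66, 46, 73, 52],
])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(obs)

# The fourth return value has the same shape as the observed table.
print(expected.shape)
print(expected.min() >= 5)   # rule-of-thumb check on expected counts
```

By default SciPy applies Yates' continuity correction only when the table has one degree of freedom (a 2x2 table), so it does not affect this 2x4 example.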

‘Attrition’ and ‘Education’

It is fairly intuitive that whether an employee leaves the company is related to job satisfaction. Now let’s look at another example, where we examine whether there is a significant relationship between ‘Attrition’ and ‘Education’:

ct = pd.crosstab(data.Attrition, data.Education, margins=True)
obs = ct.iloc[0:2, 0:5].values  # five Education levels, margins excluded
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(3.0739613982367193, 0.5455253376565949, 4)

The p-value is over 0.5, so at the significance level of 0.05, we fail to reject the null hypothesis that ‘Attrition’ and ‘Education’ are independent.

Break Down the Analysis by Department

We can also check whether a significant relationship holds within each department. For example, we know there is a significant relationship between ‘Attrition’ and ‘WorkLifeBalance’, but we want to examine whether it holds regardless of department. First, let’s see what the departments are and how many employees each has:

data['Department'].value_counts()
--------------------------------------------------------------------
Research & Development 961
Sales 446
Human Resources 63
Name: Department, dtype: int64

To ensure enough samples for the Chi-square test, we will only focus on R&D and Sales in this analysis.

alpha = 0.05
dep_counts = data['Department'].value_counts()
# Loop over the two largest departments: R&D and Sales
for i in dep_counts.index[0:2]:
    sub_data = data[data.Department == i]
    ct = pd.crosstab(sub_data.Attrition, sub_data.WorkLifeBalance, margins=True)
    obs = ct.iloc[0:2, 0:4].values  # observed counts without the 'All' margins
    print("For " + i + ": ")
    print(ct)
    print('With an alpha value of {}:'.format(alpha))
    if stats.chi2_contingency(obs)[1] <= alpha:
        print("Dependent relationship between Attrition and Work Life Balance")
    else:
        print("Independent relationship between Attrition and Work Life Balance")
    print("")
--------------------------------------------------------------------
For Research & Development:
WorkLifeBalance   1    2    3   4  All
Attrition
No               41  203  507  77  828
Yes              19   32   68  14  133
All              60  235  575  91  961
With an alpha value of 0.05:
Dependent relationship between Attrition and Work Life Balance

For Sales:
WorkLifeBalance   1    2    3   4  All
Attrition
No               10   78  226  40  354
Yes               6   24   50  12   92
All              16  102  276  52  446
With an alpha value of 0.05:
Independent relationship between Attrition and Work Life Balance

From this output, we can see that there is a significant relationship in the R&D department, but at the 0.05 level we cannot reject independence in the Sales department.
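The loop above can also be wrapped in a small function that returns one p-value per group, which makes the stratified results easier to compare or reuse. This is only a sketch: the names `chi2_by_group` and `expand` are hypothetical, and the demo data is rebuilt from the two tables printed above rather than read from the CSV.

```python
import pandas as pd
from scipy import stats


def chi2_by_group(df, group_col, row_col, col_col):
    """Return {group: p-value} from a chi-square independence test within each group."""
    p_values = {}
    for name, sub in df.groupby(group_col):
        table = pd.crosstab(sub[row_col], sub[col_col])  # no margins needed here
        p_values[name] = stats.chi2_contingency(table.values)[1]
    return p_values


def expand(counts, department):
    """Turn a 2x4 table of counts back into one record per employee."""
    rows = []
    for attrition, row in zip(["No", "Yes"], counts):
        for wlb, k in zip([1, 2, 3, 4], row):
            rows += [(department, attrition, wlb)] * k
    return rows


# Counts copied from the printed output above.
records = expand([[41, 203, 507, 77], [19, 32, 68, 14]], "Research & Development")
records += expand([[10, 78, 226, 40], [6, 24, 50, 12]], "Sales")
demo = pd.DataFrame(records, columns=["Department", "Attrition", "WorkLifeBalance"])

p_by_dept = chi2_by_group(demo, "Department", "Attrition", "WorkLifeBalance")
for dept, p in p_by_dept.items():
    print(dept, "dependent" if p <= 0.05 else "independent")
```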

Caveats and Limitations

There are a few caveats when conducting this analysis as well as some limitations of this test:

  1. In order to draw a meaningful conclusion, the number of observations in each cell of the contingency table needs to be sufficiently large (a common rule of thumb is an expected count of at least 5 per cell), which might not be the case in reality.

  2. A significant relationship does not imply causality.

  3. The Chi-square test itself does not provide additional insights besides ‘significant relationship or not’. For example, the test does not tell us that as job satisfaction increases, the proportion of employees who leave the company tends to decrease.
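Two of these caveats can be checked with a few lines of code. The sketch below flags tables whose expected counts fall under the rule-of-thumb threshold of 5, and computes the attrition rate per category to look at the direction of a relationship. The function names and the default threshold are illustrative assumptions; the Sales counts are copied from the output above, and the JobSatisfaction counts are typed in by hand to match the totals shown earlier.

```python
import numpy as np


def expected_counts(observed):
    """Expected cell counts under independence."""
    observed = np.asarray(observed, dtype=float)
    return np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()


def has_small_cells(observed, threshold=5):
    """True if any expected count is below the threshold."""
    return bool((expected_counts(observed) < threshold).any())


def rate_per_column(observed, row=1):
    """Share of each column that falls in the given row (here: Attrition == Yes)."""
    observed = np.asarray(observed, dtype=float)
    return observed[row] / observed.sum(axis=0)


# Sales department, Attrition (No, Yes) by WorkLifeBalance 1..4.
sales = [[10, 78, 226, 40], [6, 24, 50, 12]]
print(has_small_cells(sales))

# Whole company, Attrition (No, Yes) by JobSatisfaction 1..4 (assumed counts).
job_sat = [[223, 234, 369, 407], [66, 46, 73, 52]]
rates = rate_per_column(job_sat)
print(rates.round(3))
```

With these counts, the Sales table has one expected count of about 3.3, so its non-significant result should be read with some caution, and the attrition rate falls from roughly 23% at satisfaction level 1 to about 11% at level 4.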

Summary

Let’s quickly recap.

We performed a Chi-square test for independence to examine the relationship between variables in the IBM HR Analytics dataset. We discussed two ways to do it in Python: from scratch and using SciPy. Last, we showed that when a significant relationship exists, we can also stratify the data and check whether the relationship holds within each level.

I hope you enjoyed this blog post and please share any thoughts that you may have :)

Translated from: https://towardsdatascience.com/chi-square-test-for-independence-in-python-with-examples-from-the-ibm-hr-analytics-dataset-97b9ec9bb80a
