Simplest Guide to Regression Analysis Assumptions

Linear regression models the simplest non-trivial relationship between variables. The biggest mistake one can make is to run a regression analysis that violates one of its assumptions, so it is important to check these assumptions before applying regression to a dataset.

This article covers both the assumptions and the measures that can fix the problem when a dataset violates one of them.

1. Linearity: The specified model must represent a linear relationship.

This is the simplest assumption to deal with. It states that the relationship between the dependent variable and the independent variables is linear: each independent variable is multiplied by its coefficient, and the sum of these terms, plus an intercept and an error term, gives the dependent variable.

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$$

This assumption is easy to check: plotting an independent variable against the dependent variable on a scatterplot shows whether the pattern can be represented by a straight line. If a line cannot fit the data, linear regression is not appropriate. In that case, one can run a non-linear regression, or apply a logarithmic or exponential transformation to the data to turn the relationship into a linear one.
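As a rough illustration of the transformation idea, the sketch below fits a straight line to simulated exponential data before and after taking the logarithm of the dependent variable. The data, the noise level, and the helper `r_squared` are assumptions made up for this example; they do not come from the article.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
# Hypothetical exponential relationship with multiplicative noise.
y = 2.0 * np.exp(0.5 * x) * np.exp(rng.normal(0.0, 0.1, x.size))

r2_raw = r_squared(x, y)          # line fitted to the raw data
r2_log = r_squared(x, np.log(y))  # line fitted after a log transform of y

print(f"R^2 raw: {r2_raw:.3f}, R^2 after log(y): {r2_log:.3f}")
```

A noticeably higher R^2 after the transform suggests that the transformed relationship is closer to linear. A scatterplot of both versions remains the more direct check.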

2. No endogeneity of regressors: The independent variables shouldn’t be correlated with the error term.

This assumption rules out any link between the independent variables and the error term. Mathematically, it can be expressed as follows.

$$\sigma_{x,\varepsilon} = 0 \quad \forall\, x, \varepsilon$$

Independent variables that are relevant to a model are usually correlated with each other to some degree. Wrongly excluding one or more relevant independent variables produces omitted variable bias: the effect of the excluded variable ends up in the error term, and because that variable is correlated with the included ones, the covariance between the independent variables and the error term becomes non-zero.

The main way to deal with this assumption is to try different variables in the model, so as to make sure that the relevant ones are properly considered.
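Omitted variable bias can be seen in a small simulation. The sketch below is a minimal example under invented assumptions: the variables `x1` and `x2`, the true coefficients (2 and 3), and the helper `ols` are all hypothetical and serve only to show the mechanism.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X, with an intercept prepended."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

full = ols(np.column_stack([x1, x2]), y)  # both relevant variables included
short = ols(x1, y)                        # x2 wrongly omitted

# With x2 omitted, the error term absorbs 3.0 * x2, so it covaries with x1.
absorbed_error = y - (1.0 + 2.0 * x1)
cov_x1_error = np.cov(x1, absorbed_error)[0, 1]

print("coefficient on x1, full model: ", round(full[1], 2))
print("coefficient on x1, x2 omitted: ", round(short[1], 2))
print("cov(x1, error) when x2 omitted:", round(cov_x1_error, 2))
```

In the full model the coefficient on `x1` stays close to its true value, while in the short model it also picks up part of the effect of `x2`. The non-zero covariance between `x1` and the absorbed error is the violation the assumption describes.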

3. Normality and Homoscedasticity: The errors should be normally distributed, with a variance that is consistent across observations.

This assumption states that the error term is normally distributed with an expected value (mean) of zero. Note that normality of the error term is only required for making inferences.

$$\varepsilon \sim N(0, \sigma^2)$$

Homoscedasticity simply means that the error terms have the same variance for all values of the independent variables. The example below shows a dataset where the variance of the error terms differs. A regression on this dataset would fit better for smaller values of the independent and dependent variables.

[Figure: example dataset in which the variance of the error terms differs across observations]

To address a violation of this assumption, look for omitted variable bias, look for outliers, and apply a log transformation.
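The effect of a log transformation on unequal error variance can be sketched with simulated data. Everything below is an assumption for illustration: the data-generating process, the choice to take logs of both variables, and the helper `residual_spread_ratio`, which is a crude diagnostic rather than a formal test.

```python
import numpy as np

def residual_spread_ratio(x, y):
    """Residual variance in the upper half of x divided by that in the lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    order = np.argsort(x)
    half = len(x) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return high.var() / low.var()

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 2000)
# Hypothetical data with multiplicative noise, so the spread grows with x.
y = 3.0 * x**1.5 * np.exp(rng.normal(0.0, 0.3, x.size))

ratio_raw = residual_spread_ratio(x, y)
ratio_log = residual_spread_ratio(np.log(x), np.log(y))

print(f"spread ratio, raw data:   {ratio_raw:.2f}")
print(f"spread ratio, log-log fit: {ratio_log:.2f}")
```

A ratio far from 1 points to heteroscedasticity, and a ratio near 1 after the transformation suggests that the variance has been stabilized. A plot of residuals against fitted values gives the same picture visually.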

4. No Autocorrelation: No identifiable relationship should exist between the values of the error term.

This assumption is the least popular of all because a violation is hard to fix. Mathematically, it is represented as follows.

$$\sigma_{\varepsilon_i \varepsilon_j} = 0 \quad \forall\, i \neq j$$

The error terms are assumed to be uncorrelated with each other. A common way to check this is the Durbin-Watson test, which is provided in the regression summary table. The statistic ranges from 0 to 4: a value of 2 indicates no autocorrelation, while a value below 1 or above 3 indicates autocorrelation. When autocorrelation is present, it is better to avoid linear regression.
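The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it by hand for two simulated error series. The series, the AR(1) coefficient of 0.8, and the function name `durbin_watson` are assumptions for illustration; in practice the statistic is computed on the residuals of a fitted regression.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(3)
n = 2000
white = rng.normal(size=n)  # uncorrelated errors

# AR(1) errors: each value carries over 0.8 of the previous one.
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)

print(f"Durbin-Watson, uncorrelated errors: {dw_white:.2f}")
print(f"Durbin-Watson, AR(1) errors:        {dw_ar1:.2f}")
```

The uncorrelated series should land near 2, while the positively autocorrelated series falls well below 1, matching the rule of thumb above.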

5. No Multicollinearity: No predictor variable should be perfectly (or almost perfectly) explained by the other predictors.

Multicollinearity is observed when two or more independent variables are highly correlated. The logic behind this assumption is that if two variables are highly collinear, there is no point in including both of them in the model.

$$\rho_{x_i x_j} \not\approx 1 \quad \forall\, i, j;\ i \neq j$$

This is easy to fix by dropping one of the variables or by combining them into a single variable.
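One simple way to spot candidates for dropping is to scan the pairwise correlations between predictors. The sketch below assumes a made-up housing dataset; the variable names, the 0.9 threshold, and the helper `highly_correlated_pairs` are illustrative choices, not part of the article.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for predictor pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

rng = np.random.default_rng(4)
n = 1000
size_sqm = rng.uniform(40.0, 200.0, n)
# Hypothetical second predictor carrying nearly the same information.
size_sqft = size_sqm * 10.764 + rng.normal(0.0, 5.0, n)
age_years = rng.uniform(0.0, 50.0, n)

X = np.column_stack([size_sqm, size_sqft, age_years])
names = ["size_sqm", "size_sqft", "age_years"]

flagged = highly_correlated_pairs(X, names)
print(flagged)
```

Each flagged pair is a candidate for keeping only one of the two variables. Pairwise correlation does not catch a predictor that is explained by a combination of several others, so this check is only a first pass.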

Criticisms/suggestions are really welcome 🙂.

Translated from: https://medium.com/swlh/simplest-guide-to-regression-analysis-assumptions-1a51d9ed69ae
