MissForest: The Best Missing Data Imputation Algorithm

Missing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like simply using the column median or average don’t work well.

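For illustration, here is a minimal sketch of that 'lazy' baseline with pandas, using a small made-up table (the column names and values are only for demonstration):

```python
import numpy as np
import pandas as pd

# A tiny hypothetical dataset with missing entries
df = pd.DataFrame({
    "Age":   [22, np.nan, 31, 45, np.nan],
    "Score": [88, 72, np.nan, 95, 60],
})

# 'Lazy' imputation: every missing value gets its column's median,
# ignoring any relationship between Age and Score
df_lazy = df.fillna(df.median(numeric_only=True))
print(df_lazy)
```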

On the other hand, KNN-Impute is a machine learning-based imputation algorithm that has seen some success, but it requires tuning of the parameter k and inherits many of KNN's weaknesses, like sensitivity to outliers and noise. Additionally, depending on the circumstances it can be computationally expensive, since the entire dataset must be stored and distances computed between every pair of points.

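As a rough sketch of what KNN-based imputation looks like in practice, scikit-learn's KNNImputer can stand in for KNN-Impute here (it is not necessarily the exact implementation the author benchmarked); note that k has to be chosen up front:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [22.0,   88.0],
    [np.nan, 72.0],
    [31.0,   np.nan],
    [45.0,   95.0],
])

# n_neighbors (k) must be picked by the user, and the result is sensitive
# to that choice and to the scale of each column
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```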

MissForest is another machine learning-based data imputation algorithm, this one built on the Random Forest algorithm. Stekhoven and Buhlmann, the creators of the algorithm, conducted a study in 2011 in which imputation methods were compared on datasets with randomly introduced missing values. MissForest outperformed all other algorithms on all metrics, including KNN-Impute, in some cases by over 50%.

First, the missing values are filled in using median/mode imputation. Then, the rows with missing values are marked as 'Predict' rows and the rest as training rows; the training rows are fed into a Random Forest model which, in this example, learns to predict Age from Score. The prediction generated for each 'Predict' row is then filled in to produce a transformed dataset.

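A minimal sketch of one such pass, using scikit-learn's RandomForestRegressor and the Age/Score example (the column names and masking logic are illustrative, not the missingpy implementation):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "Age":   [22, np.nan, 31, 45, np.nan, 28],
    "Score": [88, 72, np.nan, 95, 60, 79],
})

# Step 1: remember which Age values are missing, then fill every hole with the column median
missing_age = df["Age"].isna()
df_filled = df.fillna(df.median(numeric_only=True))

# Step 2: rows with an observed Age become training rows, the rest are 'Predict' rows
train = df_filled[~missing_age]
predict = df_filled[missing_age]

# Step 3: fit a Random Forest on the training rows and overwrite the
# placeholder medians with its predictions
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(train[["Score"]], train["Age"])
df_filled.loc[missing_age, "Age"] = rf.predict(predict[["Score"]])
print(df_filled)
```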

[Image: Assume that the dataset is truncated. Image created by author.]

This process of looping through the missing data points repeats several times, each iteration training on progressively better data. It's like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further.

The model may decide in the following iterations to adjust predictions or to keep them the same.

[Image created by author.]

Iterations continue until some stopping criterion is met or a set number of iterations has elapsed. As a general rule, datasets become well imputed after four to five iterations, but this depends on the size of the dataset and the amount of missing data.

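A hedged sketch of that outer loop: the stopping rule below (stop when the imputed values barely change, or after max_iter passes) mirrors the description above, and impute_once is a hypothetical helper standing in for one column-by-column Random Forest pass:

```python
import numpy as np

def iterative_imputation(impute_once, X_filled, missing_mask, max_iter=5, tol=1e-3):
    """Repeat the Random Forest imputation pass until the imputed values
    stabilise or a maximum number of iterations is reached."""
    for _ in range(max_iter):
        X_new = impute_once(X_filled, missing_mask)
        # Measure how much the originally-missing cells moved this iteration
        change = np.mean((X_new[missing_mask] - X_filled[missing_mask]) ** 2)
        X_filled = X_new
        if change < tol:  # predictions have (nearly) stopped changing
            break
    return X_filled
```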

There are many benefits of using MissForest. For one, it can be applied to mixed data types, numerical and categorical. Using KNN-Impute on categorical data requires it to be first converted into some numerical measure. This scale (usually 0/1 with dummy variables) is almost always incompatible with the scales of other dimensions, so the data must be standardized.

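To make the comparison concrete, here is a rough sketch of the extra work KNN-Impute forces on a mixed-type table: the categorical column has to be dummy-encoded and everything rescaled before distances make sense (this is only an illustration of the encoding burden, not a full categorical-imputation recipe):

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age":  [22.0, None, 31.0, 45.0],
    "City": ["NY", "LA", None, "NY"],
})

# KNN needs everything numeric: dummy-encode the categorical column...
dummies = pd.get_dummies(df["City"], prefix="City")
X = pd.concat([df[["Age"]], dummies], axis=1).astype(float)

# ...and rescale, otherwise the 0/1 dummies and the raw Age values sit on
# incompatible scales and distort the Euclidean distances
X_scaled = StandardScaler().fit_transform(X)
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_scaled)
```

MissForest, by contrast, can be pointed at the raw mixed-type matrix (missingpy exposes a cat_vars argument for the categorical column indices, if memory serves) with no scaling step at all.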

In a similar vein, no pre-processing is required. Since KNN uses naïve Euclidean distances, all sorts of steps like categorical encoding, standardization, normalization, scaling, data splitting, etc. need to be taken to ensure its success. Random Forest, on the other hand, can handle these aspects of the data because it doesn't make the same assumptions about feature relationships that K-Nearest Neighbors does.

MissForest is also robust to noisy data and multicollinearity, since random forests have built-in feature selection (evaluating entropy and information gain). KNN-Impute yields poor predictions when datasets have weak predictors or heavy correlation between features.

The results of KNN are also heavily determined by the value of k, which must be found through what is essentially a trial-and-error search. Random Forest, on the other hand, is non-parametric, so no such tuning is required. It can also work with high-dimensional data, and is not prone to the Curse of Dimensionality to the heavy extent that KNN-Impute is.

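One way that trial-and-error search for k typically looks in practice (a sketch under the assumption that you can mask some known entries and score each candidate k against them):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(200, 5))       # stand-in for fully observed data

# Hide 10% of the entries so each candidate k can be scored
mask = rng.random(X_complete.shape) < 0.10
X_missing = X_complete.copy()
X_missing[mask] = np.nan

for k in (1, 3, 5, 10, 20):
    X_hat = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_hat[mask] - X_complete[mask]) ** 2))
    print(f"k={k:2d}  RMSE={rmse:.3f}")
```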

On the other hand, MissForest does have some downsides. For one, even though it takes up less space, it may be more expensive to run on a sufficiently small dataset. Additionally, it's an algorithm, not a model object; this means it must be re-run every time data is imputed, which may not work in some production environments.

Using MissForest is simple. In Python, it can be done through the missingpy library, which has a sklearn-like interface and many of the same parameters as RandomForestClassifier/RandomForestRegressor. The complete documentation can be found on GitHub.

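A minimal usage sketch, assuming missingpy's MissForest class with its sklearn-style fit_transform (parameter names are worth double-checking against the library's documentation):

```python
import numpy as np
from missingpy import MissForest

X = np.array([
    [22.0,   88.0],
    [np.nan, 72.0],
    [31.0,   np.nan],
    [45.0,   95.0],
])

# sklearn-like interface: no k to tune; Random Forest settings such as
# n_estimators are passed straight through
imputer = MissForest(n_estimators=100, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```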

The model is only as good as the data, so taking proper care of the dataset is a must. Consider using MissForest next time you need to impute missing data!

Thanks for reading!

Translated from: https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3
