MissForest
Missing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like simply using the column median or average don’t work well.
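As a point of reference, this is what such a 'lazy' fill looks like with scikit-learn's SimpleImputer (a minimal sketch on an invented toy array):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries marked as NaN.
X = np.array([[7.0, 1.0],
              [np.nan, 0.0],
              [5.0, np.nan],
              [8.0, 1.0]])

# Median imputation: every missing value in a column receives that
# column's median, ignoring what the other features might suggest.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
```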
On the other hand, KNN-Impute is a machine learning-based imputation algorithm that has seen success, but it requires tuning the parameter k and inherits many of KNN's weaknesses, such as sensitivity to outliers and noise. Depending on the circumstances, it can also be computationally expensive, since it requires storing the entire dataset and computing distances between every pair of points.
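For comparison, a minimal KNN imputation sketch using scikit-learn's KNNImputer (the data and the choice of k are invented for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[25.0, 80.0],
              [40.0, 65.0],
              [np.nan, 78.0],
              [31.0, np.nan]])

# k (n_neighbors) must be tuned by hand, and every imputation computes
# distances against the stored training data.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```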
MissForest is another machine learning-based data imputation algorithm, this one built on the Random Forest algorithm. Stekhoven and Bühlmann, the creators of the algorithm, conducted a study in 2011 comparing imputation methods on datasets with randomly introduced missing values. MissForest outperformed all of the other algorithms, including KNN-Impute, on every metric, in some cases by over 50%.
First, the missing values are filled in using median/mode imputation. Then, we mark the rows with missing values as 'Predict' rows and the others as training rows, which are fed into a Random Forest model trained to predict, in this case, Age based on Score. The generated prediction for each such row is then filled in to produce a transformed dataset.
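A minimal sketch of that single step, using scikit-learn's RandomForestRegressor on a hypothetical two-column Age/Score dataset (this illustrates the idea rather than the missingpy implementation):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({"Age": [23.0, 35.0, np.nan, 41.0, np.nan],
                   "Score": [88.0, 72.0, 80.0, 65.0, 91.0]})

missing = df["Age"].isna()                         # the 'Predict' rows
df["Age"] = df["Age"].fillna(df["Age"].median())   # initial median fill

# Train on the observed rows, then overwrite the initial fill
# with the Random Forest's predictions.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(df.loc[~missing, ["Score"]], df.loc[~missing, "Age"])
df.loc[missing, "Age"] = rf.predict(df.loc[missing, ["Score"]])
```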
This process of looping through the missing data points repeats several times, each iteration training on progressively better data. It's like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further.
The model may decide in the following iterations to adjust predictions or to keep them the same.
Iterations continue until some stopping criterion is met or a set number of iterations has elapsed. As a general rule, datasets become well imputed after four to five iterations, though this depends on the size of the dataset and the amount of missing data.
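Written out as a loop, the scheme might look like the sketch below (numeric-only and simplified; the stopping rule follows Stekhoven and Bühlmann's criterion of halting once the difference between successive imputed matrices first increases):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_rf_impute(X, max_iter=5):
    """Simplified MissForest-style loop for an all-numeric matrix X."""
    X = X.copy()
    mask = np.isnan(X)
    medians = np.nanmedian(X, axis=0)
    for j in range(X.shape[1]):                    # initial median fill
        X[mask[:, j], j] = medians[j]

    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X.copy()
        for j in range(X.shape[1]):                # re-impute column by column
            if not mask[:, j].any():
                continue
            other = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            rf.fit(X[~mask[:, j]][:, other], X[~mask[:, j], j])
            X[mask[:, j], j] = rf.predict(X[mask[:, j]][:, other])
        diff = np.sum((X - X_old) ** 2) / np.sum(X ** 2)
        if diff > prev_diff:                       # change got worse: stop
            return X_old                           # keep the previous pass
        prev_diff = diff
    return X
```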
There are many benefits to using MissForest. For one, it can be applied to mixed data types, numerical and categorical. Using KNN-Impute on categorical data requires first converting it into some numerical measure. This scale (usually 0/1 with dummy variables) is almost always incompatible with the scales of the other dimensions, so the data must be standardized.
In a similar vein, no pre-processing is required. Since KNN uses naïve Euclidean distances, all sorts of steps, like categorical encoding, standardization, normalization, scaling, and data splitting, need to be taken to ensure its success. Random Forest, on the other hand, can handle these aspects of the data because it doesn't make the assumptions about feature relationships that K-Nearest Neighbors does.
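To make that concrete, here is the kind of manual preparation KNN-Impute needs before its distances mean anything (a toy sketch; the second column is already a 0/1 dummy, so only the numeric column gets standardized):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0 is numeric (say, income); column 1 is a 0/1 dummy variable.
X = np.array([[52000.0, 1.0],
              [61000.0, 0.0],
              [np.nan, 1.0],
              [48000.0, np.nan]])

# Without standardization, the income column would dominate
# every Euclidean distance that KNN computes.
mu, sigma = np.nanmean(X[:, 0]), np.nanstd(X[:, 0])
X[:, 0] = (X[:, 0] - mu) / sigma
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```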
MissForest is also robust to noisy data and multicollinearity, since Random Forests have built-in feature selection (evaluating entropy and information gain at each split). KNN-Impute yields poor predictions when datasets have weak predictors or heavy correlation between features.
The results of KNN are also heavily determined by the value of k, which must be found through what is essentially trial and error. Random Forest, on the other hand, is non-parametric, so no tuning is required. It can also work with high-dimensional data, and it is not nearly as prone to the Curse of Dimensionality as KNN-Impute.
On the other hand, MissForest does have some downsides. For one, even though it takes up less space, it may be more expensive to run on a sufficiently small dataset. Additionally, it's an algorithm, not a model object; this means it must be re-run every time data is imputed, which may not be workable in some production environments.
Using MissForest is simple. In Python, it can be done through the missingpy library, which has a sklearn-like interface and many of the same parameters as RandomForestClassifier/RandomForestRegressor. The complete documentation can be found on GitHub.
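A minimal usage sketch (assuming missingpy is installed, e.g. via pip install missingpy; the data here is invented):

```python
import numpy as np
from missingpy import MissForest

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0],
              [4.0, 5.0, 7.0]])

# MissForest exposes RandomForest-style parameters such as
# n_estimators, plus max_iter for the imputation loop itself.
imputer = MissForest(max_iter=10, n_estimators=100, random_state=0)
X_imputed = imputer.fit_transform(X)
```

Categorical columns can be flagged by passing their indices via the cat_vars argument to fit/fit_transform, so mixed-type data is handled without dummy encoding.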
The model is only as good as the data, so taking proper care of the dataset is a must. Consider using MissForest next time you need to impute missing data!
Thanks for reading!