Two hours later and still running? How to keep your sklearn.fit under control
Written by Gabriel Lerner and Nathan Toubiana
All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning algorithms so that you will not be caught off guard by an endless fit.
Whether you are in the process of building a machine learning model or deploying your code to production, knowing how long your algorithm will take to fit is key to streamlining your workflow. With Scitime you will be able to estimate, in a matter of seconds, how long the fit should take for the most commonly used Scikit Learn algorithms.
There have been a couple of research articles (such as this one) published on that subject. However, as far as we know, there’s no practical implementation of it. The goal here is not to predict the exact runtime of the algorithm but more to give a rough approximation.
What is Scitime?
Scitime is a Python package requiring at least Python 3.6, with pandas, scikit-learn, psutil and joblib as dependencies. You will find the Scitime repo here.
The main function in this package is called “time”. Given an input matrix X, an output vector y, and the Scikit Learn model of your choice, time will output both the estimated fit time and its confidence interval. The package currently supports the following Scikit Learn algorithms, with plans to add more in the near future:
KMeans
RandomForestRegressor
SVC
RandomForestClassifier
Quick Start
Let’s install the package and run the basics.
First create a new virtualenv (this is optional, to avoid any version conflicts!)
❱ virtualenv env
❱ source env/bin/activate
and then run:
❱ (env) pip install scitime
or with conda:
❱ (env) conda install -c conda-forge scitime
Once the installation has succeeded, you are ready to estimate the time of your first algorithm.
Let’s say you wanted to train a kmeans clustering, for example. You would first need to import the scikit-learn package, set the kmeans parameters, and also choose the inputs (a.k.a X), here generated randomly for simplicity.
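A minimal sketch of that setup (the matrix size and kmeans parameters below are arbitrary, chosen only for illustration):

import numpy as np
from sklearn.cluster import KMeans

# inputs generated randomly, for simplicity
X = np.random.rand(100000, 10)
# a kmeans model with the parameters of your choice
km = KMeans(n_clusters=10, init='k-means++')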
Running this before doing the actual fit would give an approximation of the runtime:
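A sketch of that call, assuming the Estimator class (documented in the usage guide below) is importable from the scitime package top level; the printed values will vary by machine:

from scitime import Estimator

# instantiate the runtime estimator, using the RF meta algo
estimator = Estimator(meta_algo='RF', verbose=3)
# the one extra line: estimated fit time plus its confidence interval
estimation, lower_bound, upper_bound = estimator.time(km, X)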
As you can see, you can get this info in just one extra line of code! The inputs of the time function are exactly what’s needed to run the fit (that is, the algo itself and X), which makes it even easier to use.
Looking more closely at the last line of the above code, the first output (estimation: 15 seconds in this case) is the predicted runtime you’re looking for. Scitime will also output it with a confidence interval (lower_bound and upper_bound: 10 and 30 seconds in this case). You can always compare it to the actual training time by running:
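For instance, with Python’s built-in time module:

from time import time

start = time()
km.fit(X)  # the actual training
print(f'actual fit time: {time() - start:.1f} seconds')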
In this case, on our local machine, the estimation is 15 seconds, whereas the actual training time is 20 seconds (but you might not get the same results, as we’ll explain later).
As a quick usage guide:
Estimator(meta_algo, verbose, confidence) class:
meta_algo: The estimator used to predict the time, either ‘RF’ or ‘NN’ (see details in the next section); defaults to ‘RF’
verbose: Controls the amount of log output (either 0, 1, 2 or 3); defaults to 0
confidence: Confidence level for the intervals; defaults to 95%
estimator.time(algo, X, y) function:
algo: the algorithm whose runtime the user wants to predict
X: numpy array of inputs to be trained
y: numpy array of outputs to be trained (set to None if the algo is unsupervised)
Quick note: to avoid any confusion, it’s worth highlighting that algo and meta_algo are two different things here: algo is the algorithm whose runtime we want to estimate, meta_algo is the algorithm used by Scitime to predict the runtime.
How Scitime works
We are able to predict the runtime to fit by using our own estimator, which we call the meta algorithm (meta_algo), whose weights are stored in a dedicated pickle file in the package metadata. For each Scikit Learn model, you will find a corresponding meta algo pickle file in Scitime’s code base.
You might be thinking:
Why not manually estimate the time complexity with big O notations?
That’s a fair point. It’s a valid way of approaching the problem, and something we thought about at the beginning of the project. However, we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases given the number of factors that play a role in the runtime. The meta_algo basically does all that work for you, and we’ll explain how.
Two types of meta algos have been trained to estimate the time to fit (both from Scikit Learn):
The RF meta algo, a RandomForestRegressor estimator.
The NN meta algo, a basic MLPRegressor estimator.
These meta algos estimate the time to fit using an array of ‘meta’ features. Here’s a summary of how we build these features:
Firstly, we fetch the shape of your input matrix X and output vector y. Secondly, the parameters you feed to the Scikit Learn model are taken into consideration, as they will impact the training time as well. Lastly, hardware specific to your machine, such as available memory and CPU count, is also considered.
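As an illustration only (the actual feature set is defined in Scitime’s code base, and the helper below is hypothetical), assembling such meta features could look like this:

import os
import psutil

def build_meta_features(X, y, algo):
    # shapes of the input matrix and output vector
    features = {'num_rows': X.shape[0], 'num_cols': X.shape[1], 'has_y': y is not None}
    # hyperparameters of the Scikit Learn model, since they impact training time
    features.update(algo.get_params())
    # machine-specific hardware information
    features['available_memory'] = psutil.virtual_memory().available
    features['num_cpus'] = os.cpu_count()
    return features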
As shown earlier, we also provide confidence intervals on the time prediction. The way these are computed depends on the meta algo chosen:
For RF, since any random forest regressor is a combination of multiple trees (also called estimators), the confidence interval will be based on the distribution of the set of predictions computed by each estimator (see the sketch after this list).
For NN, the process is a little less straightforward: we first compute a set of MSEs along with the number of observations on a test set, grouped by predicted duration bins (that is from 0 to 1 second, 1 to 5 seconds, and so on), and we then compute a t-stat to get the lower and upper bounds of the estimation. As we don’t have a lot of data for very long models, the confidence interval for such data might get very broad.
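For the RF case, a minimal sketch of that computation, assuming a fitted RandomForestRegressor meta model meta_rf and a single meta-feature row x_meta of shape (1, n_features):

import numpy as np

# one runtime prediction per tree in the forest
per_tree_preds = np.array([tree.predict(x_meta)[0] for tree in meta_rf.estimators_])
# a 95% confidence interval taken from the distribution of tree predictions
lower_bound, upper_bound = np.percentile(per_tree_preds, [2.5, 97.5])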
How we built it
You might be thinking:
How did you get enough data on the training time of all these scikit-learn fits over various parameters and hardware configurations?
The (unglamorous) answer is that we generated the data ourselves, using a combination of computers and VM hardware to simulate what the training time would be on different systems. We then fitted our meta algos on these randomly generated data points to build an estimator meant to be reliable regardless of your system.
While the estimate.py file handles the runtime prediction, the _model.py file helped us generate data to train our meta algos, using our dedicated Model class. Here’s a corresponding code sample, for kmeans:
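Roughly, it looks like the sketch below (the method names are assumptions drawn from the repo Readme; check the code base for the exact API):

from scitime._model import Model

# a trainer for kmeans, randomly skipping 90% of the parameter combinations
trainer = Model(drop_rate=0.9, verbose=3, algo='KMeans')
# generate (meta-features, observed runtime) data points
inputs, outputs, _ = trainer._generate_data()
# fit the meta algo on the generated data
meta_algo = trainer.model_fit(df=inputs, outputs=outputs)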
Note that you can also use the file _data.py directly with the command line to generate data or train a new model. Related instructions can be found in the repo Readme file.
When generating data points, you can edit the parameters of the Scikit Learn models you want to train on. You can head to scitime/_config.json and edit the parameters of the models as well as the number of rows and columns you would want to train with.
We use an itertools function to loop through every possible combination, along with a drop rate (set between 0 and 1) that controls how many of the possible combinations are randomly skipped.
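Conceptually, that loop behaves like this sketch (the parameter grid below is made up for illustration; the real grids live in scitime/_config.json):

import itertools
import random

param_grid = {'n_clusters': [2, 10, 50], 'init': ['k-means++', 'random']}
drop_rate = 0.9  # randomly skip ~90% of the combinations

for combo in itertools.product(*param_grid.values()):
    if random.random() < drop_rate:
        continue  # dropped, to speed up data generation
    params = dict(zip(param_grid, combo))
    # ... fit a model with these params and record the observed runtime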
How accurate is Scitime?
Below, we highlight how our predictions perform for the specific case of kmeans. Our generated dataset contains ~100k data points, which we split into train and test sets (75% / 25%).
We grouped predicted training times into different time buckets and computed the MAPE and RMSE over each of those buckets for all our estimators, using both the RF meta-algo and the NN meta-algo.
Please note that these results were computed on a restricted data set, so they might differ on unexplored data points (such as other systems / extreme values of certain model parameters). For this specific training set, the R-squared is around 80% for NN and 90% for RF.
As we can see, and not surprisingly, the accuracy is consistently higher on the train set than on the test set, for both NN and RF. We also see that RF seems to perform way better than NN overall. The MAPE for RF is around 20% on the train set and 40% on the test set. The NN MAPE is surprisingly very high.
Let’s slice the MAPE (on the test set) by the number of predicted seconds:
One important thing to keep in mind is that for some cases the time prediction is sensitive to the meta algo chosen (RF or NN). In our experience RF has performed very well within the data set input ranges, as shown above. However, for out of range points, NN might perform better, as suggested by the end of the above chart. This would explain why NN MAPE is quite high while the RMSE is decent: it performs poorly on small values.
As an example, if you try to predict the runtime of a kmeans with default parameters and with an input matrix of a few thousand lines, the RF meta algo will be precise because our training dataset contains similar data points. However, for predicting very specific parameters (for instance, a very high number of clusters), NN might perform better because it extrapolates from the training set, whereas RF doesn’t. NN performs worse on the above charts because these plots are only based on data close to the set of inputs of the training data.
However, as shown in this graph, the out of range values (thin lines) are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.
Now let’s look at the most important ‘meta’ features for the example of kmeans:
As we can see, only 6 features account for more than 80% of the model variance. Among them, the most important is a parameter of the scikit-learn kmeans class itself (the number of clusters), but a number of external factors, such as the number of rows/columns and available memory, have a great influence on the runtime.
Limitations
As mentioned earlier, the first limitation is related to the confidence intervals: they may be very wide, especially for NN, and for heavy models (that would take at least an hour).
Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.
Another limitation of the estimator arises when ‘special’ algo parameter values are used. For example, in a RandomForest scenario, when max_depth is set to None, the depth could take any value. This might result in a much longer time to fit, which is more difficult for the meta algo to pick up, although we did our best to account for such cases.
When running estimator.time(algo, X, y) we do require the user to enter the actual X and y vectors, which may seem unnecessary, as we could simply request the shape of the data to estimate the training time. The reason for this is that we actually try to fit the model before predicting the runtime, in order to raise any instant errors. We run algo.fit(X, y) in a subprocess for one second to check for any fit errors, after which we move on to the prediction part. However, there are times when the algo (and/or the input matrix) is so big that running algo.fit(X, y) will eventually throw a memory error, which we can’t account for.
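Conceptually, that one-second sanity check behaves like the following simplified sketch (the actual implementation may differ):

from multiprocessing import Process

def _sanity_fit(algo, X, y=None):
    # any instant error makes this child process exit with a non-zero code
    if y is None:
        algo.fit(X)
    else:
        algo.fit(X, y)

p = Process(target=_sanity_fit, args=(km, X))
p.start()
p.join(timeout=1)  # give the fit one second to surface errors
if p.exitcode not in (None, 0):
    raise RuntimeError('algo.fit failed during the sanity check')
p.terminate()  # the fit itself does not need to finish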
Future improvements
The most effective and obvious way to improve the performance of our current predictions would be to generate more data points on different systems to better support a wide range of hardware/parameters.
We will be looking at adding more supported Scikit Learn algos in the near future. We could also implement other algos such as lightGBM or xgboost. Feel free to contact us if there’s an algorithm you would like us to implement in the next iterations of Scitime!
Other interesting avenues for improving the performance of the estimator would be to include more granular information about the input matrix, such as its variance or its correlation with the output. We currently generate data completely randomly, for which the fit time might be higher than for real world datasets. So in some cases the estimator might overestimate the training time.
In addition, we could track finer hardware specific information, such as the CPU frequency or the current CPU usage.
Ideally, as the algorithm might change from one scikit-learn version to another, and thus have an impact on the runtime, we would also account for it, for example by using the version as a ‘meta’ feature.
As we acquire more data to fit our meta algos, we might think of using more complex meta algos, such as sophisticated neural networks (using regularization techniques like dropout or batch normalization). We could even consider using tensorflow to fit the meta algo (and add it as optional): it would not only help us get a better accuracy, but also build more robust confidence intervals using dropout.
Contributing to Scitime and sending us your feedback
First, any kind of feedback, especially on the performance of the predictions and on ideas to improve this process of generating data, is very much appreciated!
As discussed before, you can use our repo to generate your own data points in order to train your own meta algorithm. When doing so, you can help make Scitime better by sharing your data points found in the result csv (~/scitime/scitime/[algo]_results.csv) so that we can integrate it to our model.
To generate your own data you can run a command similar to this one (from the package repo source):
❱ python _data.py --verbose 3 --algo KMeans --drop_rate 0.99
Note: if run directly using the code source (with the Model class), do not forget to set write_csv to true, otherwise the generated data points will not be saved.
We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you have found a bug or wish to see a new feature implemented. More info about how to contribute can be found in the Scitime repo.
For issues with training time predictions, when submitting feedback, including the full dictionary of parameters you are fitting in your model might help, so that we can diagnose why the performance is subpar for your specific use case. To do so, simply set the verbose parameter to 3 and copy-paste the log of the parameter dict into the issue description.
Find the code source
Find the documentation
Credits
Gabriel Lerner & Nathan Toubiana are the main contributors of this package and co-authors of this article
Special thanks to Philippe Mizrahi for helping along the way
Thanks for all the help we got from early reviews / beta testing
Originally published at: https://www.freecodecamp.org/news/two-hours-later-and-still-running-how-to-keep-your-sklearn-fit-under-control-cc603dc1283b/