knn 机器学习_机器学习:通过预测意大利葡萄酒的品种来观察KNN的工作方式

knn 机器学习

Introduction

介绍

For this article, I’d like to introduce you to KNN with a practical example.

对于本文,我想通过一个实际的例子向您介绍KNN。

I will consider one of my project that you can find in my GitHub profile. For this project, I used a dataset from Kaggle.

我将考虑可以在我的GitHub个人资料中找到的我的项目之一。 对于这个项目,我使用了Kaggle的数据集。

The dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars organized in three classes. The analysis was done by considering the quantities of 13 constituents found in each of the three types of wines.

该数据集是对意大利同一地区种植的葡萄酒进行化学分析的结果,这些葡萄酒来自三个不同类别的三个品种。 通过考虑三种葡萄酒中每种葡萄酒中13种成分的数量来进行分析。

This article will be structured in three-part. In the first part, I will make a theoretical description of KNN, then I will focus on the part about exploratory data analysis in order to show you the insights that I found and at the end, I will show you the code that I used to prepare and evaluate the machine learning model.

本文将分为三部分。 在第一部分中,我将对KNN进行理论上的描述,然后,我将重点介绍探索性数据分析这一部分,以便向您展示我发现的见解,最后,我将向您展示我曾经使用过的代码准备和评估机器学习模型。

Part I: What is KNN and how it works mathematically?

第一部分:什么是KNN及其在数学上的作用?

The k-nearest neighbour algorithm is not a complex algorithm. The approach of KNN to predict and classify data consists of looking through the training data and finds the k training points that are closest to the new point. Then it assigns to the new data the class label of the nearest training data.

k最近邻居算法不是复杂的算法。 KNN预测和分类数据的方法包括浏览训练数据并找到最接近新点的k个训练点。 然后,它将新的训练数据的类别标签分配给新数据。

But how KNN works? To answer this question we have to refer to the formula of the euclidian distance between two points. Suppose you have to compute the distance between two points A(5,7) and B(1,4) in a Cartesian plane. The formula that you will apply is very simple:

但是KNN是如何工作的? 要回答这个问题,我们必须参考两点之间的欧几里得距离的公式。 假设您必须计算笛卡尔平面中两个点A(5,7)和B(1,4)之间的距离。 您将应用的公式非常简单:

Image for post

Okay, but how can we apply that in machine learning? Imagine to be a bookseller and you want to classify a new book called Ubick of Philip K. Dick with 240 pages which cost 14 euro. As you can see below there are 5 possible classes where to put our new book.

好的,但是我们如何将其应用到机器学习中呢? 想象成为一个书商,您想对一本名为Philip K. Dick的Ubick的新书进行分类,共有240页,售价14欧元。 如您在下面看到的,有5种可能的类别可用于放置我们的新书。

Image for post
image by author
图片作者

To know which is the best class for Ubick we can use the euclidian formula in order to compute the distance with each observation in the dataset.

要知道哪个是Ubick的最佳分类,我们可以使用欧几里得公式来计算数据集中每个观测值的距离。

Formula:

式:

Image for post
image by author
图片作者

output:

输出:

Image for post
image by author
图片作者

As you can see above the nearest class for Ubick is class C.

如您所见,Ubick最近的课程是C类

Part II: insights that I found to create the model

第二部分:我发现的创建模型的见解

Before to start to speak about the algorithm, that I used to create my model and predict the varieties of wine, let me show you briefly the main insights that I found.

在开始谈论算法之前,我曾用它来创建模型并预测葡萄酒的种类,然后让我简要地向您展示我发现的主要见解。

In the following heatmap, there are correlations between the different features. This is very useful to have a first look at the situation of our dataset and knowing if it is possible to apply a classification algorithm.

在下面的热图中,不同功能之间存在关联。 首先了解一下数据集的情况,并了解是否有可能应用分类算法,这非常有用。

Image for post
image by author
图片作者

The heatmap is great for a first look but that is not enough. I’d like also to know if there are some elements whose absolute sum of correlations is low in order to delete them before to train the machine learning model. So, I construct a histogram as you can see below.

该热图乍一看很棒,但这还不够。 我还想知道是否存在某些元素的相关绝对和很低,以便在训练机器学习模型之前将其删除。 因此,如下图所示,我构建了一个直方图。

You can see that there are three elements with low total absolute correlation. The elements are ash, magnesium and the color_intensity.

您会看到三个绝对绝对相关性较低的元素。 元素是灰,镁和color_intensity。

Image for post
image by author
图片作者

Thanks to these observations now we are sure that there is the possibility to apply a KNN algorithm to create a predictive model.

现在,由于这些观察,我们确信可以应用KNN算法创建预测模型。

Part III: use scikit-learn to make predictions

第三部分:使用scikit-learn进行预测

In this part, we will see how to prepare the model and evaluate it thanks to scikit-learn.

在这一部分中,我们将借助scikit-learn了解如何准备模型并进行评估。

Below you can observe that I split the model into two parts: 80% for training and 20% for testing. I chose this proportion because the data set is not big.

在下面,您可以看到我将模型分为两个部分:80%用于训练,20%用于测试。 我选择此比例是因为数据集不大。

# split data to train and test
y = df['class']
X = input_data.drop(columns=['ash','magnesium', 'color_intensity'])X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)# to be sure that the data was split rightly (80% for train data and 20% for test data)print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

out:

出:

X_train shape: (141, 10)
y_train shape: (141,)X_test shape: (36, 10)
y_test shape: (36,)

You have to know that all machine learning models in scikit-learn are implemented in their own classes. For example, the k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class.

您必须知道scikit-learn中的所有机器学习模型都是在各自的类中实现的。 例如,在KNeighborsClassifier类中实现了k最近邻居分类算法。

The first step is to instantiate the class into an object that I called cli as you can see below. The object contains the algorithm that I will use to build the model from the training data and make predictions on new data points. It contains also the information that the algorithm has extracted from the training data.

第一步是将类实例化为一个我称为cli的对象,如下所示。 该对象包含用于从训练数据构建模型并对新数据点进行预测的算法。 它还包含算法已从训练数据中提取的信息。

Finally, to build the model on the training set, we call the fit method of the cli object.

最后,要在训练集上构建模型,我们调用cli对象的fit方法

from sklearn.neighbors import KNeighborsClassifiercli = KNeighborsClassifier(n_neighbors=1)
cli.fit(X_train, y_train)

out:

出:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=1, p=2,weights='uniform')

In the output of the fit method, you can see the parameters used in creating the model.

在fit方法的输出中,您可以看到用于创建模型的参数。

Now, it is time to evaluate the model. Below, the first output shows us that the model predict the 89% of the test data. Instead the second output give us a complete overview of the accuracy for each class.

现在,该评估模型了。 下面的第一个输出向我们展示了该模型预测了89%的测试数据。 相反,第二个输出为我们提供了每个类别的准确性的完整概述。

y_pred = cli.predict(X_test)
print("Test set score: {:.2f}".format(cli.score(X_test, y_test))) # below the values of the model 
from sklearn.metrics import classification_report
print("Final result of the model \n {}".format(classification_report(y_test, y_pred)))

out:

出:

Test set score: 0.89

out:

出:

Image for post

Conclusion

结论

I think that the best way to learn something is by practising. So in my case, I download the dataset from Kaggle which is one of the best places where to find a good dataset on which you can apply your machine learning algorithms and learn how they work.

我认为最好的学习方法是练习。 因此,就我而言,我是从Kaggle下载数据集的,这是找到良好数据集的最佳位置之一,您可以在该数据集上应用机器学习算法并了解它们的工作方式。

Thanks for reading this. There are some other ways you can keep in touch with me and follow my work:

感谢您阅读本文。 您可以通过其他方法与我保持联系并关注我的工作:

  • Subscribe to my newsletter.

    订阅我的时事通讯。

  • You can also get in touch via my Telegram group, Data Science for Beginners.

    您也可以通过我的电报小组“ 面向初学者的数据科学”来联系

翻译自: https://towardsdatascience.com/machine-learning-observe-how-knn-works-by-predicting-the-varieties-of-italian-wines-a64960bb2dae

knn 机器学习

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/391041.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

MMU内存管理单元(看书笔记)

http://note.youdao.com/noteshare?id8e12abd45bba955f73874450e5d62b5b&subD09C7B51049D4F88959668B60B1263B5 笔记放在了有道云上面了,不想再写一遍了。 韦东山《嵌入式linux完全开发手册》看书笔记转载于:https://www.cnblogs.com/coversky/p/7709381.html

Java中如何读取文件夹下的所有文件

问题:Java中如何读取文件夹下的所有文件 Java里面是如何读取一个文件夹下的所有文件的? 回答一 public void listFilesForFolder(final File folder) {for (final File fileEntry : folder.listFiles()) {if (fileEntry.isDirectory()) {listFilesFor…

github pages_如何使用GitHub Actions和Pages发布GitHub事件数据

github pagesTeams who work on GitHub rely on event data to collaborate. The data recorded as issues, pull requests, and comments become vital to understanding the project.在GitHub上工作的团队依靠事件数据进行协作。 记录为问题,请求和注释的数据对于…

c# .Net 缓存 使用System.Runtime.Caching 做缓存 平滑过期,绝对过期

1 public class CacheHeloer2 {3 4 /// <summary>5 /// 默认缓存6 /// </summary>7 private static CacheHeloer Default { get { return new CacheHeloer(); } }8 9 /// <summary>10 /// 缓存初始化11 /// </summary>12 …

python 实现分步累加_Python网页爬取分步指南

python 实现分步累加As data scientists, we are always on the look for new data and information to analyze and manipulate. One of the main approaches to find data right now is scraping the web for a particular inquiry.作为数据科学家&#xff0c;我们一直在寻找…

Java 到底有没有析构函数呢?

Java 到底有没有析构函数呢&#xff1f; ​ ​ Java 到底有没有析构函数呢&#xff1f;我没能找到任何有关找个的文档。如果没有的话&#xff0c;我要怎么样才能达到一样的效果&#xff1f; ​ ​ ​ 为了使得我的问题更加具体&#xff0c;我写了一个应用程序去处理数据并且说…

关于双黑洞和引力波,LIGO科学家回答了这7个你可能会关心的问题

引力波的成功探测&#xff0c;就像双黑洞的碰撞一样&#xff0c;一石激起千层浪。 关于双黑洞和引力波&#xff0c;LIGO科学家回答了这7个你可能会关心的问题 最近&#xff0c;引力波的成功探测&#xff0c;就像双黑洞的碰撞一样&#xff0c;一石激起千层浪。 大家兴奋之余&am…

如何使用HTML,CSS和JavaScript构建技巧计算器

A Tip Calculator is a calculator that calculates a tip based on the percentage of the total bill.小费计算器是根据总账单的百分比计算小费的计算器。 Lets build one now.让我们现在建立一个。 第1步-HTML&#xff1a; (Step 1 - HTML:) We create a form in order to…

用于MLOps的MLflow简介第1部分:Anaconda环境

在这三部分的博客中跟随了演示之后&#xff0c;您将能够&#xff1a; (After following along with the demos in this three part blog you will be able to:) Understand how you and your Data Science teams can improve your MLOps practices using MLflow 了解您和您的数…

[WCF] - 使用 [DataMember] 标记的数据契约需要声明 Set 方法

WCF 数据结构中返回的只读属性 TotalCount 也需要声明 Set 方法。 [DataContract]public class BookShelfDataModel{ public BookShelfDataModel() { BookList new List<BookDataModel>(); } [DataMember] public List<BookDataModel>…

sql注入语句示例大全_SQL Group By语句用示例语法解释

sql注入语句示例大全GROUP BY gives us a way to combine rows and aggregate data.GROUP BY为我们提供了一种合并行和汇总数据的方法。 The data used is from the campaign contributions data we’ve been using in some of these guides.使用的数据来自我们在其中一些指南…

ConcurrentHashMap和Collections.synchronizedMap(Map)的区别是什么?

ConcurrentHashMap和Collections.synchronizedMap(Map)的区别是什么&#xff1f; 我有一个会被多个线程同时修改的Map 在Java的API里面&#xff0c;有3种不同的实现了同步的Map实现 HashtableCollections.synchronizedMap(Map)ConcurrentHashMap 据我所知&#xff0c;HashT…

pymc3 贝叶斯线性回归_使用PyMC3估计的贝叶斯推理能力

pymc3 贝叶斯线性回归内部AI (Inside AI) If you’ve steered clear of Bayesian regression because of its complexity, this article shows how to apply simple MCMC Bayesian Inference to linear data with outliers in Python, using linear regression and Gaussian ra…

Hadoop Streaming详解

一&#xff1a; Hadoop Streaming详解 1、Streaming的作用 Hadoop Streaming框架&#xff0c;最大的好处是&#xff0c;让任何语言编写的map, reduce程序能够在hadoop集群上运行&#xff1b;map/reduce程序只要遵循从标准输入stdin读&#xff0c;写出到标准输出stdout即可 其次…

mongodb分布式集群搭建手记

一、架构简介 目标 单机搭建mongodb分布式集群(副本集 分片集群)&#xff0c;演示mongodb分布式集群的安装部署、简单操作。 说明 在同一个vm启动由两个分片组成的分布式集群&#xff0c;每个分片都是一个PSS(Primary-Secondary-Secondary)模式的数据副本集&#xff1b; Confi…

归约归约冲突_JavaScript映射,归约和过滤-带有代码示例的JS数组函数

归约归约冲突Map, reduce, and filter are all array methods in JavaScript. Each one will iterate over an array and perform a transformation or computation. Each will return a new array based on the result of the function. In this article, you will learn why …

为什么Java里面的静态方法不能是抽象的

为什么Java里面的静态方法不能是抽象的&#xff1f; 问题是为什么Java里面不能定义一个抽象的静态方法&#xff1f;例如&#xff1a; abstract class foo {abstract void bar( ); // <-- this is okabstract static void bar2(); //<-- this isnt why? }回答一 因为抽…

python16_day37【爬虫2】

一、异步非阻塞 1.自定义异步非阻塞 1 import socket2 import select3 4 class Request(object):5 def __init__(self,sock,func,url):6 self.sock sock7 self.func func8 self.url url9 10 def fileno(self): 11 return self.soc…

朴素贝叶斯实现分类_关于朴素贝叶斯分类及其实现的简短教程

朴素贝叶斯实现分类Naive Bayes classification is one of the most simple and popular algorithms in data mining or machine learning (Listed in the top 10 popular algorithms by CRC Press Reference [1]). The basic idea of the Naive Bayes classification is very …

python:改良廖雪峰的使用元类自定义ORM

概要本文仅仅是对廖雪峰老师的使用元类自定义ORM进行改进&#xff0c;并不是要创建一个ORM框架 编写fieldclass Field(object):def __init__(self, column_type,max_length,**kwargs):1&#xff0c;删除了参数name&#xff0c;field参数全部为定义字段类型相关参数&#xff0c;…