How to Start Your Data Science Journey

Machine learning fascinates many beginners, but they often get lost in the pool of information spread across different resources. It is true that there are a lot of different algorithms and steps to learn, but starting from a strong base not only gives confidence but also the motivation to learn and explore further. In this story, we will go through the steps to set up your environment and start learning with the help of a well-known dataset: the Iris dataset, a multi-class classification problem in machine learning. We will also go through some helpful Python libraries that can speed up the learning process, which can help you even if you are already a data scientist. If you have done the setup already, you can skip the setup steps. Let's begin with the first step of your journey.

Setting up the Environment

We will use the Anaconda distribution to set up the data science environment. Download the latest version of Anaconda from here, open the Anaconda prompt, and run the following command:

jupyter notebook

The above command will start the Jupyter server and load the notebook directory in your browser.

Create a virtual environment

I hope you are aware of virtual environments; if not, you can read about them here. Although Anaconda comes with a base environment that already has most of the libraries installed, it is recommended to use virtual environments: they let us manage environments with different packages, and if something goes wrong in one environment it will not affect the others. Here are the commands you can use to create and activate a virtual environment and install packages into it.

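The command snippet did not survive in this copy, so here is a typical sketch; the environment name `ml-env` and the Python version are placeholders, pick your own:

```shell
# Create a new environment (the name "ml-env" is just an example)
conda create -n ml-env python=3.8

# Activate it
conda activate ml-env

# Install packages into the active environment
conda install pandas
```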
Link virtual environment with the Notebook

By default, the new environment will not show up in the Jupyter notebook. You need to run the following commands to link your environment with the Jupyter client.

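The linking commands were also lost in this copy; a common approach, assuming the environment created above is active, is to register it as a kernel via `ipykernel`:

```shell
# Install ipykernel inside the environment, then register the
# environment as a Jupyter kernel (the names are placeholders)
conda install -c conda-forge ipykernel
python -m ipykernel install --user --name ml-env --display-name "Python (ml-env)"
```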
Starting notebook and useful commands

Once you have a virtual environment, go to the browser and open a new notebook as shown below. Select the environment you just created.

Jupyter Notebook provides many handy shortcuts. The two below are my favorites:

  1. Tab: acts as autocomplete.
  2. Shift + Tab: shows the command details, so you do not need to open the library documentation every time.

See how these shortcuts can help:

[GIF by author]

Exploring Python libraries and applying Machine Learning

We need different libraries for loading datasets, visualization, and modeling. We will go through each one and install them in the environment. You can have a look at my notebook; feel free to download it, import it into your environment, and play around with it.

Jupyter Contrib Nbextensions

We often need to share our notebooks with different stakeholders or present them, and this library provides a lot of different extensions. I will not go through the extensions here, but I do recommend using it. My favorite ones are:

  1. Collapsible Headings.
  2. Table of Contents.
  3. Execution Time.

You can install it using:

conda install -c conda-forge jupyter_contrib_nbextensions
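If you installed with pip instead of conda-forge, the extension files also need to be copied into Jupyter's data directory (the conda-forge package handles this step for you):

```shell
jupyter contrib nbextension install --user
```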

Here is a short demo of how it can help:

[GIF by author]

Pandas: Python Data Analysis Library

This is the heart of data science with Python and provides many different capabilities, such as:

  • Data structures to work with the data.
  • Operations you can perform on the data.
  • Loading and saving data in different formats.

and many more. Many of the other libraries we use for machine learning with Python have pandas as a dependency. Install it using:

conda install -c conda-forge pandas

The above command will also install libraries such as NumPy, which pandas uses under the hood.

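A tiny sketch of those three capabilities; the column names and values here are purely illustrative:

```python
import io

import pandas as pd

# Data structure: build a small DataFrame from a dict of columns.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})

# Operation: aggregate a column.
print(df["sepal_length"].mean())

# Load/save: round-trip the frame through CSV
# (an in-memory buffer stands in for a file here).
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2.shape)  # (3, 2)
```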
Sklearn (Scikit-Learn)

We will use this library to download test datasets and apply different machine learning algorithms. Install it using the following command:

conda install -c conda-forge scikit-learn

In machine learning classification problems, the task can be understood as: given the features X (input variables), predict y (the target value). Sklearn provides a few test datasets we can play with; we will take the Iris dataset for this exercise, but if you would like to try others, you can refer to this.

Scikit-learn 0.23.1 added a feature that returns a test dataset directly as X and y dataframes. Make sure you are running version 0.23.1 or later.

from sklearn.datasets import load_iris
[Image by author]
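The snippet shown in the image boils down to this (using `as_frame=True`, the 0.23 feature mentioned above):

```python
from sklearn.datasets import load_iris

# return_X_y gives (features, target); as_frame returns a pandas
# DataFrame / Series instead of NumPy arrays (scikit-learn >= 0.23).
X, y = load_iris(return_X_y=True, as_frame=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```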

We will now go through the other libraries and use Sklearn for modeling later.

Pandas Profiling

I am sure many of you are aware of this library, but if you are not, please do give it a try. It produces a rich profiling report for the data, covering everything from missing values to correlations. You need to install it with pip, as the conda package downloads an old version.

pip install --user pandas-profiling
[GIF by author]

This report provides many details, a few of which are:

  1. An overview of the different variables in the dataset.
  2. Correlations between variables.
  3. Interactions between variables.
  4. Details about each variable.

The following commands can be used to generate and save the profile report:

Plotly Express

Although pandas-profiling provides a lot of useful information, we still need to visualize other things; for example, how the target variable is distributed across multiple input variables. Many visualization libraries exist; Matplotlib and Seaborn are the famous ones you will have heard about. Where Plotly stands out is interactive plots, i.e. you can interact with the generated plots. Install it using the following command:

conda install -c conda-forge plotly

Below, we plot a scatter plot of sepal length against petal length and use 'color' to show how the target variable is related.

You can see below how we can filter out different targets.

[GIF by author]

This library provides a lot of additional functionality, maybe we can cover that in a different story.

Training and Test dataset

The point of building models is to predict values that are not known. If we train the model on the entire dataset, we cannot evaluate how it performs on unseen data. To achieve this, we split the dataset into a training set and a test set: the training set is used to train the model, and the test set is used to evaluate it. Sklearn provides a function, 'train_test_split', which splits the dataset into train and test sets. The following code can be used to split a dataset.

from sklearn.model_selection import train_test_split
[Image by author]
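The code in the image is along these lines; the 80/20 split and the random seed are common choices, not requirements:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 20% of the rows for evaluation; stratify keeps the
# class balance identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```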

Tuning Hyperparameters

One of the important tasks in machine learning is tuning hyperparameters: attributes of the algorithm that control the learning process. Different values suit different learning problems, and it is important to find the best ones. Sklearn provides two main utilities, 'GridSearchCV' and 'RandomizedSearchCV', to find the best parameters. For large training sets we might need RandomizedSearchCV, as trying every combination would take a lot of time. The Iris dataset has only 150 rows, so we use 'GridSearchCV'.

For this story, we will train a LogisticRegression model, which is well suited to classification problems and has hyperparameters such as 'solver', 'C', 'penalty', and 'l1_ratio'. Not every solver supports every parameter, so we create separate dictionaries for the different solvers.

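The dictionaries themselves were lost in this copy; a sketch of per-solver grids and the search could look like this (the solvers and C values shown are illustrative, not exhaustive):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One dictionary per solver, because each solver supports a
# different subset of penalties.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
    {"solver": ["saga"], "penalty": ["elasticnet"],
     "l1_ratio": [0.2, 0.5, 0.8], "C": [0.1, 1, 10]},
]

grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```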
The above code searches different combinations of parameters and finds the one that best generalizes the problem.

Evaluating the model

As mentioned, we need to evaluate the model on the test dataset, and many different metrics are available. The most common one for classification problems is accuracy. Here we will show the accuracy score, classification report, and confusion matrix that Sklearn provides.

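Retraining from scratch so the snippet stands alone, the three metrics can be produced like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```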
The Iris dataset is classified as an easy dataset, meaning the data is already well suited for machine learning, so we were able to get a perfect score, i.e. an accuracy of 1.0, with our model. This means our model predicted every sample in the test dataset correctly. Results will vary with the different problems you try to solve.

Conclusion

The idea of this story was to give you a head start on machine learning and a glimpse of the different libraries you can use to speed up the process. I provided a simple overview of many things to keep this story short and precise. There is still a lot to explore, such as different types of machine learning problems, different models, different metrics, and where to use them. You can try different things in the same way I did here, so that you can see how they work. I will try to add more stories that dive deep into specific areas to help accelerate your learning.

Update, 16th June 2020: Recently I found a way to combine a Sklearn Pipeline with GridSearchCV to search for the best preprocessing steps. If interested, check out: Are you using Pipeline in Scikit-Learn?

Translated from: https://medium.com/swlh/start-your-data-science-journey-today-37366ee463f
