4 Pandas Plotting Functions You Should Know

Pandas is a powerful package for data scientists. We use it for many reasons, e.g. data wrangling, data cleaning, and data manipulation. There is, however, one capability of the Pandas package that is rarely talked about: data plotting.

Data plotting, just like the name implies, is the process of plotting data into a graph or chart to visualise it. While there are much fancier visualisation packages out there, some methods are available only in the pandas plotting API.

Let’s look at a few methods I have selected.

1. radviz

RadViz is a method for visualising an N-dimensional data set in a 2D plot. The problem with data that has more than three dimensions (features) is that we cannot visualise it directly, but RadViz makes it possible.

According to the pandas documentation, radviz allows us to project an N-dimensional data set into a 2D space where the influence of each dimension can be interpreted as a balance between the importance of all dimensions. In simpler terms, it means we can project multi-dimensional data into a 2D space in a straightforward way.

Let’s try the function on a sample dataset.

# RadViz example
import pandas as pd
import seaborn as sns

# To use pd.plotting.radviz, you need a multidimensional data set with all
# numerical columns but one, which is the class column (should be categorical).
mpg = sns.load_dataset('mpg')
pd.plotting.radviz(mpg.drop(['name'], axis=1), 'origin')
[Figure: RadViz result]

Above is the result of the RadViz function, but how would you interpret the plot?

Each Series in the DataFrame is represented as an evenly distributed slice on a circle. In the example above, there is a circle labelled with the Series names.

Each data point is then plotted inside the circle according to its value on each Series. According to the pandas documentation, highly correlated Series in the DataFrame are placed closer together on the unit circle. In the example, we can see that the japan and europe car data sit closer to model_year while the usa cars sit closer to displacement. This suggests that japan and europe cars are more strongly associated with model_year, i.e. they have relatively high model_year values compared with their other (normalised) features, while usa cars are more strongly associated with displacement.

If you want to know more about RadViz, you could check the paper here.

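To make the interpretation above more concrete, here is a minimal sketch of the RadViz projection itself. It assumes min-max scaling of every feature and a weighted average of evenly spaced anchors; the function name radviz_coords and the toy data are made up for illustration and are not part of the pandas API.

```python
import numpy as np
import pandas as pd


def radviz_coords(features: pd.DataFrame) -> pd.DataFrame:
    """Project each row of an all-numeric DataFrame onto the RadViz plane."""
    # Min-max scale every column to [0, 1] so that units do not matter.
    scaled = (features - features.min()) / (features.max() - features.min())

    # One anchor per feature, evenly spaced on the unit circle.
    m = scaled.shape[1]
    angles = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (m, 2)

    # Each point is the weighted average of the anchors, with the row's
    # scaled values as weights (a row of all zeros would divide by zero).
    weights = scaled.to_numpy()
    xy = weights @ anchors / weights.sum(axis=1, keepdims=True)
    return pd.DataFrame(xy, columns=["x", "y"], index=features.index)


# Toy data (hypothetical): three features, three rows.
toy = pd.DataFrame({
    "a": [10.0, 0.0, 5.0],
    "b": [0.0, 10.0, 5.0],
    "c": [0.0, 0.0, 5.0],
})
coords = radviz_coords(toy)
print(coords.round(3))
```

The first row is high only on feature a, so it lands exactly on the anchor of a; the third row is relatively highest on c, so it is pulled toward the anchor of c. This is why a point's position reflects which features dominate it after normalisation.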

2. bootstrap_plot

According to the pandas documentation, the bootstrap plot is used to estimate the uncertainty of a statistic by relying on random sampling with replacement. In simpler words, it tries to determine the uncertainty in fundamental statistics such as the mean and median by resampling the data with replacement (the same data point can be sampled multiple times). You could read more about the bootstrap here.

The bootstrap_plot function generates bootstrapping plots for the mean, median and mid-range statistics for the given number of samples of the given size. Let’s try the function with an example dataset.

For example, I have the mpg dataset and already have summary information about the mpg feature.

mpg['mpg'].describe()

We can see that the mpg mean is 23.51 and the median is 23. However, this is just a snapshot of the real-world data. The actual values in the population are unknown, which is why we measure the uncertainty with bootstrap methods.

# bootstrap_plot example
pd.plotting.bootstrap_plot(mpg['mpg'], size=50, samples=500)

Above is an example result of the bootstrap_plot function. Note that your result may differ from the example because it relies on random resampling.

The first set of plots (first row) shows the sampling results, where the x-axis is the repetition and the y-axis is the statistic. The second set shows the distribution of each statistic (mean, median and midrange).

Take the mean as an example: most of the results are around 23, but they range between roughly 22.5 and 25. This expresses the uncertainty that the population mean could be anywhere between 22.5 and 25. Note that one way to estimate the uncertainty is to take the values at the 2.5% and 97.5% quantiles (a 95% confidence interval), although the choice is still up to your judgement.
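The quantile approach mentioned above can be sketched with NumPy. This is an illustration of the percentile method under stated assumptions, not pandas' internal implementation: the data is a synthetic stand-in rather than the real mpg column, and the names n_resamples and resample_size are hypothetical, chosen to play the role of the samples and size arguments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a measured column (not the real mpg data).
data = rng.normal(loc=23.5, scale=7.8, size=398)

n_resamples = 500   # plays the role of `samples`
resample_size = 50  # plays the role of `size`

# Draw each resample with replacement and record its mean.
boot_means = np.array([
    rng.choice(data, size=resample_size, replace=True).mean()
    for _ in range(n_resamples)
])

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap means.
lower, upper = np.quantile(boot_means, [0.025, 0.975])
print(f"sample mean: {data.mean():.2f}")
print(f"95% percentile interval: [{lower:.2f}, {upper:.2f}]")
```

Because each resample here holds only 50 values, the interval describes the spread of a 50-observation mean. Resampling at the full length of the data is the more common choice when the goal is an interval for the statistic of the whole sample.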

3. lag_plot

A lag plot is a scatter plot of a time series against the same data lagged. The lag itself is a fixed amount of elapsed time; for example, with a lag of 1 on daily data, the value on day 1 (Y1) is paired with the value one day later (Y1+1, i.e. Y2).

A lag plot is used to check whether time series data is random and whether the data is correlated with itself. Random data should not show any identifiable pattern, such as a linear one. Why do we bother with randomness or correlation? Because many time series models are based on linear regression, and one of its assumptions is no correlation (specifically, no autocorrelation).

Let’s try with some example data. In this case, I will use a package called yahoo_historical to scrape stock data from Yahoo Finance.

pip install yahoo_historical

With this package, we can scrape the price history of a specific stock. Let’s try it.

from yahoo_historical import Fetcher

# Scrape the Apple stock data between 1 January 2007 and 1 January 2017
data = Fetcher("AAPL", [2007,1,1], [2017,1,1])
apple_df = data.getHistorical()

# Set the date as the index
apple_df['Date'] = pd.to_datetime(apple_df['Date'])
apple_df = apple_df.set_index('Date')

Above is our Apple stock dataset with the date as the index. We can plot the data with a simple method to see the pattern over time.

apple_df['Adj Close'].plot()

We can see that Adj Close increases over time, but does the data itself show any pattern with its lag? To find out, we use lag_plot.

#Try lag 1 day
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 1)

As we can see in the plot above, the relationship is nearly linear. It means there is a correlation between consecutive daily Adj Close values. This is expected, as the price of the stock does not vary much from one day to the next.

How about a weekly basis? Let’s try to plot it.

# The data only consists of work days, so one week is 5 days
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 5)

We can see the pattern is similar to the lag 1 plot. How about a lag of 365? Would it make any difference? (Since the data only contains work days, 365 rows cover more than one calendar year.)

pd.plotting.lag_plot(apple_df['Adj Close'], lag = 365)

We can now see that the pattern becomes more random, although a non-linear pattern still exists.
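To see what a lag plot encodes without downloading any stock data, here is a minimal sketch on synthetic series. The helper lag_pairs is a hypothetical name introduced for illustration; it builds the (y(t), y(t + lag)) pairs that a lag plot scatters, and the correlation of those pairs summarises how linear the plot would look.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps = rng.normal(size=1000)

noise = pd.Series(steps)           # white noise: no dependence on the past
walk = pd.Series(steps.cumsum())   # random walk: each value builds on the last


def lag_pairs(series: pd.Series, lag: int) -> pd.DataFrame:
    """Return the (y(t), y(t + lag)) pairs that a lag plot would scatter."""
    pairs = pd.DataFrame({"y(t)": series, "y(t + lag)": series.shift(-lag)})
    return pairs.dropna()  # the last `lag` rows have no partner


for name, series in [("noise", noise), ("walk", walk)]:
    pairs = lag_pairs(series, lag=1)
    corr = pairs["y(t)"].corr(pairs["y(t + lag)"])
    print(f"{name}: lag-1 correlation = {corr:.3f}")
```

The random walk behaves like the daily stock price above, with consecutive values that are almost perfectly correlated, while the white noise shows no such structure.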

4. scatter_matrix

The scatter_matrix function does just what the name implies: it creates a matrix of scatter plots. Let’s try it with an example right away.

import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
pd.plotting.scatter_matrix(tips, figsize = (8,8))
plt.show()

We can see that the scatter_matrix function automatically detects the numerical features within the DataFrame we passed to it and creates a matrix of scatter plots.

In the example above, each pair of numerical features is plotted together as a scatter plot (total_bill and size, total_bill and tip, and tip and size), whereas the diagonal shows the histogram of each numerical feature.
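The automatic selection of numerical features can be checked with a small sketch. The DataFrame below is made up for illustration (it only borrows column names from the tips dataset), and the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Made-up data: three numeric columns and one text column.
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=100),
    "tip": rng.uniform(1, 10, size=100),
    "size": rng.integers(1, 7, size=100),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=100),
})

# scatter_matrix returns a 2D array of matplotlib Axes, one per feature pair.
axes = pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")

numeric_cols = df.select_dtypes(include="number").columns
print(list(numeric_cols))
print(axes.shape)
```

The text column day is skipped, so the grid has one row and one column per numeric feature. Passing diagonal="kde" instead of "hist" draws density estimates on the diagonal, which requires SciPy to be installed.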

This is a simple but powerful function, as we get a lot of information from a single line of code.

Conclusion

Here I have shown you 4 different pandas plotting functions that you should know:

  1. radviz
  2. bootstrap_plot
  3. lag_plot
  4. scatter_matrix

I hope it helps!


Translated from: https://towardsdatascience.com/4-pandas-plotting-function-you-should-know-5a788d848963
