Mastering Data Aggregation with Pandas

Data aggregation is the process of gathering data and expressing it in a summary form. This typically corresponds to summary statistics for numerical and categorical variables in a data set. In this post we will discuss how to aggregate data using pandas and generate insightful summary statistics.

Let’s get started!

For our purposes, we will be working with the Wine Reviews data set, which is available on Kaggle.

To start, let’s read our data into a Pandas data frame:

import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv")

Next, let’s print the first five rows of data:

print(df.head())

USING THE DESCRIBE() METHOD

The ‘describe()’ method is a basic method that will allow us to pull summary statistics for columns in our data. Let’s use the ‘describe()’ method on the prices of wines:

print(df['price'].describe())

We see that the ‘count’, the number of non-null values, for wine prices is 120,975. The mean price is $35, with a standard deviation of $41. The minimum price is $4 and the maximum is $3,300. The ‘describe()’ method also provides percentiles: 25% of wine prices are at or below $17, 50% are at or below $25, and 75% are at or below $42.
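To make each of these statistics concrete, here is a minimal sketch on a small made-up price column. The values are invented for illustration and are not taken from the wine data. It shows that ‘count’ skips missing values and that the percentile rows of ‘describe()’ agree with ‘quantile()’.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value (not from the wine data set)
prices = pd.Series([10.0, 20.0, 30.0, 40.0, np.nan], name="price")

summary = prices.describe()
print(summary)

# 'count' ignores the missing value, so it is smaller than the series length
print(summary["count"], len(prices))

# The 25%/50%/75% rows hold the same values that quantile() returns
quartiles = prices.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

Because ‘count’ only counts non-null entries, comparing it with the length of the column is a quick way to spot missing data.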

Let’s look at the summary statistics using ‘describe()’ on the ‘points’ column:

print(df['points'].describe())

We see that the number of non-null values of points is 129,971, which matches the length of the data frame, so no review is missing a score. The mean is 88 points with a standard deviation of 3. The minimum is 80 and the maximum is 100. For the percentiles, 25% of wines score 86 or below, 50% score 88 or below, and 75% score 91 or below.
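The comparison between the non-null count and the length of the data frame can also be checked directly. The sketch below uses a made-up frame whose column names mirror the wine data; the values themselves are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: 'price' has a gap, 'points' does not
toy = pd.DataFrame({
    "points": [87, 90, 85, 92],
    "price": [15.0, np.nan, 22.0, 60.0],
})

n_rows = len(toy)           # number of rows in the data frame
non_null = toy.count()      # non-null values per column
missing = toy.isna().sum()  # missing values per column

print(n_rows)
print(non_null)
print(missing)
```

A column whose non-null count equals the number of rows, like ‘points’ here, has no missing values.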

USING THE GROUPBY() METHOD

You can also use the ‘groupby()’ method to aggregate data. For example, to look at the average price for each variety of wine, we can do the following:

# Mean price for each variety (first five rows)
print(df.groupby('variety')['price'].mean().head())

We see that the ‘Abouriou’ variety has a mean price of $35, ‘Agiorgitiko’ has a mean price of $23, and so forth. We can also display the values sorted from highest to lowest:

# Mean price for each variety, most expensive first
print(df.groupby('variety')['price'].mean().sort_values(ascending=False).head())
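To see exactly what the grouped and sorted mean computes, here is a minimal sketch on a few made-up rows. The variety names and prices are illustrative only. The grouping is written as groupby("variety")["price"], which gives the same result as grouping the price column by the variety column.

```python
import pandas as pd

# Hypothetical rows; varieties and prices are made up for illustration
toy = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Riesling", "Riesling", "Syrah"],
    "price": [20.0, 30.0, 12.0, 18.0, 40.0],
})

# Mean price per variety, highest first
mean_price = toy.groupby("variety")["price"].mean().sort_values(ascending=False)
print(mean_price)

# The same number computed by hand for a single group
merlot_mean = toy.loc[toy["variety"] == "Merlot", "price"].mean()
print(merlot_mean)
```

Each entry in the result is simply the mean of the prices that share a variety label, and ‘sort_values(ascending=False)’ only reorders those entries.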

Let’s look at the sorted mean prices for each ‘province’:

# Mean price for each province, most expensive first
print(df.groupby('province')['price'].mean().sort_values(ascending=False).head())

We can also aggregate more than one column at a time. Let’s look at the mean price and mean points for each ‘province’:

# Mean price and mean points for each province (first five rows)
print(df.groupby('province')[['price', 'points']].mean().head())
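As a starting point for further experiments, the sketch below repeats the multi-column aggregation on a made-up frame and then uses ‘agg()’ to request several statistics in one call. The province names and numbers are invented for illustration, and ‘agg()’ is one possible extension rather than something used earlier in this post.

```python
import pandas as pd

# Hypothetical reviews; provinces and values are made up for illustration
toy = pd.DataFrame({
    "province": ["Oregon", "Oregon", "Tuscany", "Tuscany"],
    "price": [30.0, 50.0, 20.0, 40.0],
    "points": [88, 92, 85, 91],
})

# Mean of two columns per province
means = toy.groupby("province")[["price", "points"]].mean()
print(means)

# Several statistics for one column in a single call
stats = toy.groupby("province")["price"].agg(["mean", "min", "max", "count"])
print(stats)
```

The first result has one column per aggregated variable, while the second has one column per requested statistic.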

I’ll stop here but I encourage you to play around with the data and code yourself.

CONCLUSION

To summarize, in this post we discussed how to aggregate data using pandas. First, we went over how to use the ‘describe()’ method to generate summary statistics such as mean, standard deviation, minimum, maximum and percentiles for data columns. We then went over how to use the ‘groupby()’ method to generate statistics for specific categorical variables, such as the mean price in each province and the mean price for each variety. I hope you found this post useful/interesting. The code from this post is available on GitHub. Thank you for reading!

Source: https://towardsdatascience.com/mastering-data-aggregation-with-pandas-36d485fb613c
