2025/9/22 18:58:21/
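4.6 Advanced processing: handling missing values
Dataset: https://download.csdn.net/download/weixin_44827418/12548095

Pandas advanced processing covers:
    Handling missing values
    Data discretization
    Merging
    Crosstabs and pivot tables
    Grouping and aggregation
    A comprehensive case study

4.6 Advanced processing: handling missing values
    1) How to handle missing values. There are two approaches:
        1. Delete the samples that contain missing values
        2. Replace / impute
    4.6.1 How to handle NaN
        1. Check whether the data contains NaN
            pd.isnull(df)
            pd.notnull(df)
        2. Delete the samples that contain missing values
            df.dropna(inplace=False)
           Replace / impute
            df.fillna(value, inplace=False)
    4.6.2 Missing values that are not NaN but carry a default marker
        1. Replace the marker with np.nan
            df.replace(to_replace="?", value=np.nan)
        2. Then follow the steps for handling np.nan
    2) Worked example of missing-value handling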
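The two approaches in the outline above (delete, or replace/impute) can be tried on a tiny frame before touching a real dataset. This is only a sketch: the frame and its column names ("rating", "revenue") are made up for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "rating": [8.1, 7.0, 6.2, 5.6],
    "revenue": [333.13, np.nan, 58.01, np.nan],
})

# 1) Detect: which columns contain at least one NaN?
has_nan = df.isnull().any()

# 2a) Delete: drop every row that contains a NaN (returns a new frame).
dropped = df.dropna()

# 2b) Replace: fill the NaN values of a column with that column's mean.
filled = df.copy()
filled["revenue"] = filled["revenue"].fillna(filled["revenue"].mean())

print(has_nan)
print(len(dropped))
print(filled["revenue"].tolist())
```

Deleting is the simpler option but discards whole rows; filling keeps every row at the cost of inventing values, so which one fits depends on how many rows are affected.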
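4.7 Advanced processing: data discretization

Original data, with categories stored as integer codes:

       gender  age
    A       1   23
    B       2   30
    C       1   18

       species  fur
    A        1    2
    B        2    1
    C        3    1

After encoding each category as its own 0/1 column:

       male  female  age
    A     1       0   23
    B     0       1   30
    C     1       0   18

       dog  pig  mouse  fur
    A    1    0      0    2
    B    0    1      0    1
    C    0    0      1    1

This is called one-hot encoding (dummy variables).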
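The species table above can be reproduced with pd.get_dummies. A minimal sketch, assuming the three animals are stored as strings rather than integer codes (the Series below is constructed for illustration):

```python
import pandas as pd

# Illustrative data mirroring the species table: one animal per row.
species = pd.Series(["dog", "pig", "mouse"], index=["A", "B", "C"], name="species")

# One column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(species, dtype=int)

print(dummies)
```

Note that get_dummies orders the new columns by sorted category value (dog, mouse, pig), not by order of appearance.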
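4.7.1 What is data discretization
    Raw height data: 165, 174, 160, 180, 159, 163, 192, 184
4.7.2 Why discretize
4.7.3 How to discretize data
    1) Binning
        Automatic binning: sr = pd.qcut(data, q), where q is the number of quantile bins
        Custom binning: sr = pd.cut(data, bins), where bins is a list of bin edges
    2) Convert the binned result to one-hot encoding
        pd.get_dummies(sr, prefix=...)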
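4.8 Advanced processing: merging
    numpy
        np.concatenate((a, b), axis=...)
        Horizontal stacking: np.hstack()
        Vertical stacking: np.vstack()
    1) Concatenate along an axis
        pd.concat([data1, data2], axis=1)
    2) Join on keys with pd.merge
        pd.merge(left, right, how="inner", on=[key columns])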
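The numpy functions listed above are related: np.hstack and np.vstack behave like np.concatenate along axis 1 and axis 0 for 2-D arrays. A short sketch with two made-up arrays:

```python
import numpy as np

# Two small illustrative arrays of the same shape.
a = np.arange(6).reshape(2, 3)
b = np.arange(10, 16).reshape(2, 3)

# axis=1 places b to the right of a; axis=0 places b below a.
h = np.concatenate((a, b), axis=1)
v = np.concatenate((a, b), axis=0)

print(h.shape, np.array_equal(h, np.hstack((a, b))))
print(v.shape, np.array_equal(v, np.vstack((a, b))))
```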
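4.9 Advanced processing: crosstabs and pivot tables
    Used to find and explore the relationship between two variables
    4.9.1 What crosstabs and pivot tables are for
    4.9.2 Using crosstab
        pd.crosstab(value1, value2)
    4.9.3 pivot_table
4.10 Advanced processing: grouping and aggregation
    4.10.1 What grouping and aggregation are
    4.10.2 Grouping and aggregation API
        Available on both DataFrame and Series (sr)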
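As a sketch of the grouping API, the frame used in section 4.10.2 can be aggregated with several functions at once through agg. The use of agg with a list of function names is an addition for illustration; the notes themselves only call max().

```python
import pandas as pd

# Same data as the col frame in section 4.10.2.
col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
    "price2": [4.75, 4.12, 1.68, 0.75, 3.15],
})

# Group by color, then compute both the max and the mean of price1 per group.
summary = col.groupby("color")["price1"].agg(["max", "mean"])

print(summary)
```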
4.6.1 How to handle NaN
import pandas as pd

movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie

index | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0
996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0
997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0
998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0
999 | 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0

1000 rows × 12 columns
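# 1. Check for NaN missing values; True marks a missing value
movie.isnull()

index | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | False | False | False | False | False | False | False | False | False | False | False | False
1 | False | False | False | False | False | False | False | False | False | False | False | False
2 | False | False | False | False | False | False | False | False | False | False | False | False
3 | False | False | False | False | False | False | False | False | False | False | False | False
4 | False | False | False | False | False | False | False | False | False | False | False | False
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | False | False | False | False | False | False | False | False | False | False | True | False
996 | False | False | False | False | False | False | False | False | False | False | False | False
997 | False | False | False | False | False | False | False | False | False | False | False | False
998 | False | False | False | False | False | False | False | False | False | False | True | False
999 | False | False | False | False | False | False | False | False | False | False | False | False

1000 rows × 12 columns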
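import numpy as np

# any() returns True as soon as a single value is True
# A result of True means the data contains missing values
np.any(movie.isnull())
True

# With notnull, False marks a missing value
pd.notnull(movie)

index | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | True | True | True | True | True | True | True | True | True | True | True | True
1 | True | True | True | True | True | True | True | True | True | True | True | True
2 | True | True | True | True | True | True | True | True | True | True | True | True
3 | True | True | True | True | True | True | True | True | True | True | True | True
4 | True | True | True | True | True | True | True | True | True | True | True | True
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | True | True | True | True | True | True | True | True | True | True | False | True
996 | True | True | True | True | True | True | True | True | True | True | True | True
997 | True | True | True | True | True | True | True | True | True | True | True | True
998 | True | True | True | True | True | True | True | True | True | True | False | True
999 | True | True | True | True | True | True | True | True | True | True | True | True

1000 rows × 12 columns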
# all() returns False as soon as a single value is False
# A result of False means the data contains missing values
np.all(pd.notnull(movie))
False

pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool

pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool

# Handling the missing values
# Method 1: delete the samples that contain missing values
movie_full = movie.dropna()
movie_full.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool

# Method 2: replace
movie.head()
Output: the first five rows (index 0 to 4) of the movie table displayed at the start of section 4.6.1.

movie["Revenue (Millions)"].mean()
82.95637614678897

# Columns that contain missing values (False in the notnull().all() result):
# Revenue (Millions)    False
# Metascore             False
# fillna(..., inplace=True) fills the original data in place, but on a selected
# column that is chained assignment and may not update movie in recent pandas
# versions, so the filled column is assigned back instead.
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False

movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
movie["Metascore"].isnull().any()
False

movie.isnull().any()  # all missing values have been handled
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool

Handling missing values that are not NaN but carry a default marker
data = pd.read_csv("./datas/GBvideos.csv", encoding="GBK")
data
Output: rows 0 to 4 and 1595 to 1599, with the columns video_id, title, channel_title, category_id, tags, views, likes, dislikes, comment_total, thumbnail_link, date. In the last row (1599, "The Child in Time: Trailer - BBC One"), the dislikes cell holds the marker "?" instead of a number.
1600 rows × 11 columns
# 1. Replace "?" with np.nan
new_data = data.replace(to_replace="?", value=np.nan)
new_data
Output: the same rows and columns as data, except that the dislikes cell of row 1599 now shows NaN instead of "?".
1600 rows × 11 columns
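The marker replacement can be checked on a few lines of CSV text. This is a sketch under assumptions: the three-row CSV below is invented, and the use of pd.to_numeric and of the read_csv parameter na_values are additions that the original notes do not use.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content in which "?" marks a missing dislikes count.
csv_text = "video_id,likes,dislikes\na,10,3\nb,20,?\nc,30,7\n"

# Route 1: read as-is, then replace the marker with NaN, as in the notes.
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace(to_replace="?", value=np.nan)
# The column was read as text because of the "?", so convert it to numbers.
cleaned["dislikes"] = pd.to_numeric(cleaned["dislikes"])

# Route 2: tell read_csv up front which marker means "missing".
direct = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(cleaned["dislikes"].tolist())
print(direct["dislikes"].tolist())
```

Route 2 avoids the separate conversion step, because the column is parsed as numeric from the start.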
new_data.isnull().any()  # the "?" in the dislikes column has been replaced with NaN
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes           True
comment_total     False
thumbnail_link    False
date              False
dtype: bool

new_data.dropna(inplace=True)
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes          False
comment_total     False
thumbnail_link    False
date              False
dtype: bool

4.7 Advanced processing: data discretization
import pandas as pd

# Prepare the data
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184],
                 index=["No1:165", "No2:174", "No3:160", "No4:180",
                        "No5:159", "No6:163", "No7:192", "No8:184"])
data
No1:165    165
No2:174    174
No3:160    160
No4:180    180
No5:159    159
No6:163    163
No7:192    192
No8:184    184
dtype: int64

Automatic binning
# 1. Binning
# Automatic binning
# qcut(data, number of groups)
sr = pd.qcut(data, 3)
sr
No1:165      (163.667, 178.0]
No2:174      (163.667, 178.0]
No3:160    (158.999, 163.667]
No4:180        (178.0, 192.0]
No5:159    (158.999, 163.667]
No6:163    (158.999, 163.667]
No7:192        (178.0, 192.0]
No8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]

# Inspect the group sizes
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64

type(sr)
pandas.core.series.Series

# 2. Convert the binned result to one-hot encoding
# prefix sets the prefix of the column names
pd.get_dummies(sr, prefix="height")

index   | height_(158.999, 163.667] | height_(163.667, 178.0] | height_(178.0, 192.0]
No1:165 | 0 | 1 | 0
No2:174 | 0 | 1 | 0
No3:160 | 1 | 0 | 0
No4:180 | 0 | 0 | 1
No5:159 | 1 | 0 | 0
No6:163 | 1 | 0 | 0
No7:192 | 0 | 0 | 1
No8:184 | 0 | 0 | 1

Custom binning
# Custom binning
# pd.cut(data, list containing all the bin edges)
sr = pd.cut(data, [150, 165, 180, 195])
sr
No1:165    (150, 165]
No2:174    (165, 180]
No3:160    (150, 165]
No4:180    (165, 180]
No5:159    (150, 165]
No6:163    (150, 165]
No7:192    (180, 195]
No8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]

sr.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64

pd.get_dummies(sr, prefix="身高")  # the prefix "身高" means "height"

index   | 身高_(150, 165] | 身高_(165, 180] | 身高_(180, 195]
No1:165 | 1 | 0 | 0
No2:174 | 0 | 1 | 0
No3:160 | 1 | 0 | 0
No4:180 | 0 | 1 | 0
No5:159 | 1 | 0 | 0
No6:163 | 1 | 0 | 0
No7:192 | 0 | 0 | 1
No8:184 | 0 | 0 | 1
4.8 Advanced processing: merging
4.8.1 Merging with pd.concat (concatenating along an axis)
data1 = np.arange(0, 20, 1).reshape(4, 5)
data1 = pd.DataFrame(data1)
data1
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

data2 = np.arange(100, 120, 1).reshape(4, 5)
data2 = pd.DataFrame(data2)
data2
     0    1    2    3    4
0  100  101  102  103  104
1  105  106  107  108  109
2  110  111  112  113  114
3  115  116  117  118  119
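# Concatenate data1 and data2 horizontally
data_concat = pd.concat([data1, data2], axis=1)
data_concat
    0   1   2   3   4    0    1    2    3    4
0   0   1   2   3   4  100  101  102  103  104
1   5   6   7   8   9  105  106  107  108  109
2  10  11  12  13  14  110  111  112  113  114
3  15  16  17  18  19  115  116  117  118  119

data2.T
     0    1    2    3
0  100  105  110  115
1  101  106  111  116
2  102  107  112  117
3  103  108  113  118
4  104  109  114  119

# Concatenate data1 and data2.T vertically
# data2.T has no column 4, so those cells become NaN and the column turns into floats
data_concat1 = pd.concat([data1, data2.T], axis=0)
data_concat1
     0    1    2    3     4
0    0    1    2    3   4.0
1    5    6    7    8   9.0
2   10   11   12   13  14.0
3   15   16   17   18  19.0
0  100  105  110  115   NaN
1  101  106  111  116   NaN
2  102  107  112  117   NaN
3  103  108  113  118   NaN
4  104  109  114  119   NaN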
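4.8.2 Merging with pd.merge (joining on keys)
left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["Co", "C1", "C2", "C3"],
                      "D": ["DO", "D1", "D2", "D3"]})
right
  key1 key2   C   D
0   K0   K0  Co  DO
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3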
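# The default is an inner join
# inner keeps only the key combinations present in both tables
result = pd.merge(left, right, on=["key1", "key2"], how="inner")
result
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  Co  DO
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2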
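# left: left join
# Every key in the left table is kept; the merge is driven by the left table
result_left = pd.merge(left, right, on=["key1", "key2"], how="left")
result_left
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   Co   DO
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

# right: right join
# Every key in the right table is kept; the merge is driven by the right table
result_right = pd.merge(left, right, on=["key1", "key2"], how="right")
result_right
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  Co  DO
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3

# outer: outer join
# Every key from both tables is kept
result_outer = pd.merge(left, right, on=["key1", "key2"], how="outer")
result_outer
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   Co   DO
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3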
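When comparing the four join types it can help to see which table each output row came from. A minimal sketch using the indicator parameter of pd.merge, which the notes do not use; the two small frames and their single "key" column are simplified stand-ins for left and right above.

```python
import pandas as pd

# Simplified illustrative frames with one key column.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "C": ["C0", "C1", "C3"]})

# indicator=True adds a _merge column: "both", "left_only" or "right_only".
merged = pd.merge(left, right, on="key", how="outer", indicator=True)

print(merged)
```

Rows marked "both" are exactly what an inner join would keep; "left_only" and "right_only" rows are the ones that a left or right join adds with NaN in the other table's columns.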
4.9 Advanced processing: crosstabs and pivot tables
Used to explore the relationship between two variables
4.9.2 Using crosstab
data = pd.read_excel("./datas/szfj_baoan.xls")
data
     district  roomnum  hall   AREA C_floor  floor_num  school  subway  per_price
0       baoan        3     2   89.3  middle         31       0       0     7.0773
1       baoan        4     2  127.0    high         31       0       0     6.9291
2       baoan        1     1   28.0     low         39       0       0     3.9286
3       baoan        1     1   28.0  middle         30       0       0     3.3568
4       baoan        2     2   78.0  middle          8       1       1     5.0769
...       ...      ...   ...    ...     ...        ...     ...     ...        ...
1246    baoan        4     2   89.3     low          8       0       0     4.2553
1247    baoan        2     1   67.0  middle         30       0       0     3.8060
1248    baoan        2     2   67.4  middle         29       1       0     5.3412
1249    baoan        2     2   73.1     low         15       1       0     5.9508
1250    baoan        3     2   86.2  middle         32       0       1     4.5244

1251 rows × 9 columns
time = "2020-06-23"
# pandas datetime type
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')

type(date)
pandas._libs.tslibs.timestamps.Timestamp

date.year
2020

date.month
6

# weekday is a method, so it has to be called
data["week"] = date.weekday()
data.drop("week", axis=1, inplace=True)
data
Output: unchanged from the previous display of data, because the week column was added and then dropped again.
1251 rows × 9 columns
# feature is 1 where per_price is above 5.0000, otherwise 0
data["feature"] = np.where(data["per_price"] > 5.0000, 1, 0)
data
     district  roomnum  hall   AREA C_floor  floor_num  school  subway  per_price  feature
0       baoan        3     2   89.3  middle         31       0       0     7.0773        1
1       baoan        4     2  127.0    high         31       0       0     6.9291        1
2       baoan        1     1   28.0     low         39       0       0     3.9286        0
3       baoan        1     1   28.0  middle         30       0       0     3.3568        0
4       baoan        2     2   78.0  middle          8       1       1     5.0769        1
...       ...      ...   ...    ...     ...        ...     ...     ...        ...      ...
1246    baoan        4     2   89.3     low          8       0       0     4.2553        0
1247    baoan        2     1   67.0  middle         30       0       0     3.8060        0
1248    baoan        2     2   67.4  middle         29       1       0     5.3412        1
1249    baoan        2     2   73.1     low         15       1       0     5.9508        1
1250    baoan        3     2   86.2  middle         32       0       1     4.5244        0

1251 rows × 10 columns
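The feature column is built with np.where(condition, value_if_true, value_if_false), applied element by element. A small sketch on four per_price values taken from the rows shown above (the frame name prices is made up):

```python
import numpy as np
import pandas as pd

# Four per_price values copied from the displayed rows 0, 2, 4 and 1250.
prices = pd.DataFrame({"per_price": [7.0773, 3.9286, 5.0769, 4.5244]})

# 1 where the price is above 5.0, otherwise 0.
prices["feature"] = np.where(prices["per_price"] > 5.0, 1, 0)

print(prices)
```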
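# Crosstab
# Examine the relationship between the floor number and whether the price per square metre is above 50000
# The result holds, for each floor, the number of 0s and the number of 1s
data0 = pd.crosstab(data["floor_num"], data["feature"])
data0
feature      0    1
floor_num
1            6    8
3            0    1
4            0   10
6            3    7
7           16   25
8           19   32
9            2   11
10           4    9
11           8   11
12           1    3
13           4   20
14           0    5
15           8   33
16           9   19
17          20   21
18          17   35
19          11    5
20           2    4
21           1    6
22           0    1
23           4    8
24          10   26
25           4   37
26           9   57
27           5   38
28           6   35
29          26   68
30          30   78
31           4  151
32          21  126
33          34   20
34           1    5
35           1    2
36           0    4
37           1    1
38           0    1
39           5   10
40           1    3
43           0    1
44           0    6
45           0    7
47           0    1
50           0    1
51           0    3
52           0    2
53           0    1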
data0.sum(axis=1)  # row sums
floor_num
1      14
3       1
4      10
6      10
7      41
8      51
9      13
10     13
11     19
12      4
13     24
14      5
15     41
16     28
17     41
18     52
19     16
20      6
21      7
22      1
23     12
24     36
25     41
26     66
27     43
28     41
29     94
30    108
31    155
32    147
33     54
34      6
35      3
36      4
37      2
38      1
39     15
40      4
43      1
44      6
45      7
47      1
50      1
51      3
52      2
53      1
dtype: int64

data0.div(data0.sum(axis=1), axis=0)  # divide each row by its row sum
feature           0         1
floor_num
1          0.428571  0.571429
3          0.000000  1.000000
4          0.000000  1.000000
6          0.300000  0.700000
7          0.390244  0.609756
8          0.372549  0.627451
9          0.153846  0.846154
10         0.307692  0.692308
11         0.421053  0.578947
12         0.250000  0.750000
13         0.166667  0.833333
14         0.000000  1.000000
15         0.195122  0.804878
16         0.321429  0.678571
17         0.487805  0.512195
18         0.326923  0.673077
19         0.687500  0.312500
20         0.333333  0.666667
21         0.142857  0.857143
22         0.000000  1.000000
23         0.333333  0.666667
24         0.277778  0.722222
25         0.097561  0.902439
26         0.136364  0.863636
27         0.116279  0.883721
28         0.146341  0.853659
29         0.276596  0.723404
30         0.277778  0.722222
31         0.025806  0.974194
32         0.142857  0.857143
33         0.629630  0.370370
34         0.166667  0.833333
35         0.333333  0.666667
36         0.000000  1.000000
37         0.500000  0.500000
38         0.000000  1.000000
39         0.333333  0.666667
40         0.250000  0.750000
43         0.000000  1.000000
44         0.000000  1.000000
45         0.000000  1.000000
47         0.000000  1.000000
50         0.000000  1.000000
51         0.000000  1.000000
52         0.000000  1.000000
53         0.000000  1.000000
data_percent = data0.div(data0.sum(axis=1), axis=0)
data_percent
Output: identical to the row-proportion table produced by the previous expression.
# stacked=True draws the bars for 0 and 1 stacked on top of each other
data_percent.plot(kind="bar", stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
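4.9.3 Using pivot_table
# With a pivot table the whole process becomes simpler
# The result is directly the proportion of rows whose value is 1
data.pivot_table(["feature"], index=["floor_num"])
            feature
floor_num
1          0.571429
3          1.000000
4          1.000000
6          0.700000
...
50         1.000000
51         1.000000
52         1.000000
53         1.000000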
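The pivot table gives the proportion of 1s because its default aggregation is the mean, and the mean of a 0/1 column is the share of 1s. A sketch on made-up data checking that this matches the crosstab route; the normalize="index" argument of pd.crosstab is used here as a shortcut for the manual div step and does not appear in the notes.

```python
import pandas as pd

# Hypothetical miniature of the housing data: floor number and a 0/1 feature.
df = pd.DataFrame({
    "floor_num": [1, 1, 1, 3, 3, 8, 8, 8, 8],
    "feature":   [1, 0, 1, 1, 1, 0, 0, 1, 0],
})

# Crosstab route: row-normalised counts, then take the column for feature == 1.
by_crosstab = pd.crosstab(df["floor_num"], df["feature"], normalize="index")[1]

# Pivot-table route: the default aggfunc is the mean of feature per floor.
by_pivot = df.pivot_table(values="feature", index="floor_num")["feature"]

print(by_crosstab)
print(by_pivot)
```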
4.10 Advanced processing: grouping and aggregation
4.10.2 Grouping and aggregation API
col = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
                    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
                    "price2": [4.75, 4.12, 1.68, 0.75, 3.15]})
col
   color   object  price1  price2
0  white      pen    4.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.68
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15

# Group by color and aggregate price1
# Grouping with the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64

# Grouping with the Series method
col["price1"].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>

col["price1"].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64

4.11 Comprehensive case study
# 1. Prepare the data
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
Output: the same table that was displayed for movie at the start of section 4.6.1.
1000 rows × 12 columns
# Question 1: we want the average rating, the number of directors, and similar information for these movies.
# How do we get them?
movie["Rating"].mean()
6.723200000000003

movie["Director"]
0                James Gunn
1              Ridley Scott
2        M. Night Shyamalan
3      Christophe Lourdelet
4                David Ayer
               ...
995               Billy Ray
996                Eli Roth
997              Jon M. Chu
998          Scot Armstrong
999        Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object

# np.unique() removes duplicates, because one director may have directed several movies
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay',
       'Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh',
       'Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes',
       'Alejandro Amenábar', 'Alejandro González Iñárritu',
       ...
       'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight',
       'Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill',
       'Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball',
       'Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe',
       'William Brent Bell', 'William Oldroyd', 'Woody Allen',
       'Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder',
       'Zackary Adler'], dtype=object)

# Number of directors
np.unique(movie["Director"]).size
644

# Question 2: for this movie data, how should we present the distribution of rating and runtime?
movie["Rating"].plot(kind="hist", figsize=(20, 8), fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>

import matplotlib.pyplot as plt

# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=100)
# 2. Draw the histogram with 20 bins
plt.hist(movie["Rating"], 20)
# Set the ticks to the 21 bin edges
plt.xticks(np.linspace(movie["Rating"].min(), movie["Rating"].max(), 21))
# Add a grid
plt.grid(linestyle="--", alpha=0.5)
# 3. Show the figure
plt.show()

movie["Rating"]
0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
      ...
995    6.2
996    5.5
997    6.2
998    5.6
999    5.3
Name: Rating, Length: 1000, dtype: float64

# Question 3: for this movie data, how should we process the data to count the movies in each genre?
# First find out which genres exist
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],
 ['Adventure', 'Mystery', 'Sci-Fi'],
 ['Horror', 'Thriller'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Adventure', 'Fantasy'],
 ...
 ['Horror'],
 ['Drama', 'Music', 'Romance'],
 ['Adventure', 'Comedy'],
 ['Comedy', 'Family', 'Fantasy']]

[j for i in movie_genre for j in i]
['Action', 'Adventure', 'Sci-Fi', 'Adventure', 'Mystery', 'Sci-Fi',
 ...
 'Animation', 'Action', 'Adventure', 'Action', 'Adventure', 'Drama', ...]

movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western'], dtype='<U9')

len(movie_class)  # 20 movie genres
20

# Count how many movies fall in each genre
# First create an all-zero DataFrame
count = pd.DataFrame(np.zeros(shape=[1000, 20], dtype="int32"), columns=movie_class)
count.head()
Output: rows 0 to 4, with all 20 genre columns (Action through Western, in the order of movie_class) equal to 0.

count.loc[0, movie_genre[0]]
Action       0
Adventure    0
Sci-Fi       0
Name: 0, dtype: int32

movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']

# Fill in the counts: set the columns of each movie's genres to 1
for i in range(1000):
    count.loc[i, movie_genre[i]] = 1
count

Columns, in order: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
0    1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1    0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
2    0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
3    0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
4    1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
...
995  0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
996  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
997  0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
998  0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
999  0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0

1000 rows × 20 columns
# Column sums
count.sum(axis=0)
Action       303
Adventure    259
Animation     49
Biography     81
Comedy       279
Crime        150
Drama        513
Family        51
Fantasy      101
History       29
Horror       119
Music         16
Musical        5
Mystery      106
Romance      141
Sci-Fi       120
Sport         18
Thriller     195
War           13
Western        7
dtype: int64

count.sum(axis=0).sort_values(ascending=False)
Drama        513
Action       303
Comedy       279
Adventure    259
Thriller     195
Crime        150
Romance      141
Sci-Fi       120
Horror       119
Mystery      106
Fantasy      101
Biography     81
Family        51
Animation     49
History       29
Sport         18
Music         16
War           13
Western        7
Musical        5
dtype: int64

count.sum(axis=0).sort_values(ascending=False).plot(kind="bar", fontsize=20, figsize=(20, 9), colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
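The zero table plus loop above can also be written with Series.str.get_dummies, which splits each string on a separator and one-hot encodes the pieces in one step. This is an alternative sketch, not the approach taken in the notes; the three Genre strings are copied from rows 0, 2 and 4 of the movie table.

```python
import pandas as pd

# Three Genre strings taken from the displayed movie rows.
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Horror,Thriller",
    "Action,Adventure,Fantasy",
])

# One 0/1 column per genre, then column sums give the number of movies per genre.
counts = genre.str.get_dummies(sep=",").sum(axis=0).sort_values(ascending=False)

print(counts)
```

On the full dataset the same two calls would be applied to movie["Genre"] directly, with no need to know the number of rows or genres in advance.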