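Pandas advanced processing: missing-value handling, data discretization, merging, crosstabs and pivot tables, grouping and aggregation, case study

4.6 Advanced processing - missing-value handling
1) How to handle missing values
Two approaches:
1) Delete the samples that contain missing values
2) Replace / impute
4.6.1 How to handle NaN
1) Check whether the data contains NaN
pd.isnull(df)
pd.notnull(df)
2) Delete the samples that contain missing values
df.dropna(inplace=False)
Replace / impute
df.fillna(value, inplace=False)
4.6.2 Missing values that are not NaN but carry a default marker
1) Replace ? -> np.nan
df.replace(to_replace="?", value=np.nan)
2) Then follow the steps for handling np.nan missing values
2) Missing-value handling example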
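The steps in this outline can be sketched on a tiny made-up DataFrame. The column names `score` and `votes` and all values below are illustrative assumptions, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real NaN and one "?" placeholder.
df = pd.DataFrame({"score": [1.0, np.nan, 3.0], "votes": ["10", "?", "30"]})

# 1) Detect: which columns contain at least one NaN?
has_nan = pd.isnull(df).any()

# 2a) Delete the rows that contain NaN (returns a new DataFrame).
dropped = df.dropna()

# 2b) Or impute the NaN with the column mean.
filled = df.assign(score=df["score"].fillna(df["score"].mean()))

# 3) Turn the "?" marker into NaN first, then treat it like any other NaN.
marked = df.replace(to_replace="?", value=np.nan)
```

Whether to drop or impute depends on how much data is missing and on what the downstream analysis can tolerate; the sketch only shows the mechanics.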
4.7 Advanced processing - data discretization

     Gender  Age
A    1       23
B    2       30
C    1       18

     Species  Fur
A    1
B    2
C    3

     Male  Female  Age
A    1     0       23
B    0     1       30
C    1     0       18

     Dog  Pig  Mouse  Fur
A    1    0    0      2
B    0    1    0      1
C    0    0    1      1

One-hot encoding & dummy variables
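The gender table above can be reproduced with `pd.get_dummies`. This is a minimal sketch; the string labels `male` and `female` are an assumption standing in for the codes 1 and 2 used in the table.

```python
import pandas as pd

# Same three samples as the first table, with gender written as a label.
people = pd.DataFrame(
    {"gender": ["male", "female", "male"], "age": [23, 30, 18]},
    index=["A", "B", "C"],
)

# One 0/1 column per category; the untouched "age" column is kept as-is.
encoded = pd.get_dummies(people, columns=["gender"], dtype=int)
```

Encoding categories as separate 0/1 columns avoids implying an order or distance between codes such as 1 and 2.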
4.7.1 What is data discretization
Raw height data: 165, 174, 160, 180, 159, 163, 192, 184
4.7.2 Why discretize
4.7.3 How to discretize data
1) Binning
Automatic binning: sr = pd.qcut(data, bins)
Custom binning: sr = pd.cut(data, [])
2) Convert the binned result to one-hot encoding
pd.get_dummies(sr, prefix=)
4.8 Advanced processing - merging
numpy
np.concatenate((a, b), axis=)
Horizontal concatenation: np.hstack()
Vertical concatenation: np.vstack()
1) Concatenate along an axis
pd.concat([data1, data2], axis=1)
2) Join on keys
pd.merge performs the join
pd.merge(left, right, how="inner", on=[key columns])
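For the numpy side of this outline, a short sketch with two made-up 2x3 arrays shows how `axis` relates to the `hstack` and `vstack` shortcuts.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

# axis=1 joins columns side by side, like np.hstack.
side_by_side = np.concatenate((a, b), axis=1)

# axis=0 stacks rows on top of each other, like np.vstack.
stacked = np.concatenate((a, b), axis=0)
```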
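4.9 Advanced processing - crosstabs and pivot tables
Purpose: find and explore the relationship between two variables
4.9.1 What crosstabs and pivot tables are for
4.9.2 Using crosstab
pd.crosstab(value1, value2)
4.9.3 pivot_table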
4.10 Advanced processing - grouping and aggregation
4.10.1 What grouping and aggregation are
4.10.2 Grouping and aggregation API
DataFrame
sr (Series)
import pandas as pd
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
 
 
     Rank  Title  Genre  Description  Director  Actors  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
0    1  Guardians of the Galaxy  Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...  James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014  121  8.1  757074  333.13  76.0
1    2  Prometheus  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012  124  7.0  485820  126.46  65.0
2    3  Split  Horror,Thriller  Three girls are kidnapped by a man with a diag...  M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016  117  7.3  157606  138.12  62.0
3    4  Sing  Animation,Comedy,Family  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016  108  7.2  60545  270.32  59.0
4    5  Suicide Squad  Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016  123  6.2  393727  325.02  40.0
...
995  996  Secret in Their Eyes  Crime,Drama,Mystery  A tight-knit team of rising investigators, alo...  Billy Ray  Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...  2015  111  6.2  27585  NaN  45.0
996  997  Hostel: Part II  Horror  Three American college students studying abroa...  Eli Roth  Lauren German, Heather Matarazzo, Bijou Philli...  2007  94  5.5  73152  17.54  46.0
997  998  Step Up 2: The Streets  Drama,Music,Romance  Romantic sparks occur between two dance studen...  Jon M. Chu  Robert Hoffman, Briana Evigan, Cassie Ventura,...  2008  98  6.2  70699  58.01  50.0
998  999  Search Party  Adventure,Comedy  A pair of friends embark on a mission to reuni...  Scot Armstrong  Adam Pally, T.J. Miller, Thomas Middleditch,Sh...  2014  93  5.6  4881  NaN  22.0
999  1000  Nine Lives  Comedy,Family,Fantasy  A stuffy businessman finds himself trapped ins...  Barry Sonnenfeld  Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...  2016  87  5.3  12435  19.64  11.0
 
1000 rows × 12 columns
 
movie.isnull()

      Rank  Title  Genre  Description  Director  Actors   Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
0    False  False  False        False     False   False  False              False   False  False               False      False
1    False  False  False        False     False   False  False              False   False  False               False      False
2    False  False  False        False     False   False  False              False   False  False               False      False
3    False  False  False        False     False   False  False              False   False  False               False      False
4    False  False  False        False     False   False  False              False   False  False               False      False
...
995  False  False  False        False     False   False  False              False   False  False                True      False
996  False  False  False        False     False   False  False              False   False  False               False      False
997  False  False  False        False     False   False  False              False   False  False               False      False
998  False  False  False        False     False   False  False              False   False  False                True      False
999  False  False  False        False     False   False  False              False   False  False               False      False
 
1000 rows × 12 columns
 
import numpy as np
np.any(movie.isnull())
True
pd.notnull(movie)

      Rank  Title  Genre  Description  Director  Actors   Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
0     True   True   True         True      True    True   True               True    True   True                True       True
1     True   True   True         True      True    True   True               True    True   True                True       True
2     True   True   True         True      True    True   True               True    True   True                True       True
3     True   True   True         True      True    True   True               True    True   True                True       True
4     True   True   True         True      True    True   True               True    True   True                True       True
...
995   True   True   True         True      True    True   True               True    True   True               False       True
996   True   True   True         True      True    True   True               True    True   True                True       True
997   True   True   True         True      True    True   True               True    True   True                True       True
998   True   True   True         True      True    True   True               True    True   True               False       True
999   True   True   True         True      True    True   True               True    True   True                True       True
 
1000 rows × 12 columns
 
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool
movie_full = movie.dropna()
movie_full.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
movie.head()

   Rank  Title  Genre  Description  Director  Actors  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
0  1  Guardians of the Galaxy  Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...  James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014  121  8.1  757074  333.13  76.0
1  2  Prometheus  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012  124  7.0  485820  126.46  65.0
2  3  Split  Horror,Thriller  Three girls are kidnapped by a man with a diag...  M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016  117  7.3  157606  138.12  62.0
3  4  Sing  Animation,Comedy,Family  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016  108  7.2  60545  270.32  59.0
4  5  Suicide Squad  Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016  123  6.2  393727  325.02  40.0
 
movie["Revenue (Millions)"].mean()
82.95637614678897
# Assign the filled column back; fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
movie["Metascore"].isnull().any()
False
movie.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
data = pd.read_csv("./datas/GBvideos.csv", encoding="GBK")
data

      video_id  title  channel_title  category_id  tags  views  likes  dislikes  comment_total  thumbnail_link  date
0     jt2OHQh0HoQ  Live Apple Event - Apple September Event 2017 ...  Apple Event  28  apple events|apple event|iphone 8|iphone x|iph...  7426393  78240  13548  705  https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...  13.09
1     AqokkXoa7uE  Holly and Phillip Meet Samantha the Sex Robot ...  This Morning  24  this morning|interview|holly willoughby|philli...  494203  2651  1309  0  https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg  13.09
2     YPVcg45W0z4  My DNA Test Results? I'm WHAT??  emmablackery  24  emmablackery|emma blackery|emma|blackery|briti...  142819  13119  151  1141  https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg  13.09
3     T_PuZBdT2iM  getting into a conversation in a language you ...  ProZD  1  skit|korean|language|conversation|esl|japanese...  1580028  65729  1529  3598  https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg  13.09
4     NsjsmgmbCfc  Baby Name Challenge?  Sprinkleofglitter  26  sprinkleofglitter|sprinkle of glitter|baby gli...  40592  5019  57  490  https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg  13.09
...
1595  w8fAellnPns  Juicy Chicken Breast - You Suck at Cooking (ep...  You Suck At Cooking  26  how to|cooking|recipe|kitchen|chicken|chicken ...  788466  31945  945  2274  https://i.ytimg.com/vi/w8fAellnPns/default.jpg  20.09
1596  RsG37JcEQNw  Weezer - Beach Boys  weezer  10  weezer|pacific daydream|pacificdaydream|beach ...  107927  2435  412  641  https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg  20.09
1597  htSiIA2g7G8  Berry Frozen Yogurt Bark Recipe  SORTEDfood  26  frozen yogurt bark|frozen yoghurt bark|frozen ...  109222  4840  35  212  https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg  20.09
1598  ZQK1F0wz6z4  What Do You Want to Eat??  Wong Fu Productions  24  panda|what should we eat|buzzfeed|comedy|boyfr...  626223  22962  532  1559  https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg  20.09
1599  DuPXdnSWoLk  The Child in Time: Trailer - BBC One  BBC  24  BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...  99228  1699  ?  135  https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg  20.09
 
1600 rows × 11 columns
 
new_data = data.replace(to_replace="?", value=np.nan)
new_data

      video_id  title  channel_title  category_id  tags  views  likes  dislikes  comment_total  thumbnail_link  date
0     jt2OHQh0HoQ  Live Apple Event - Apple September Event 2017 ...  Apple Event  28  apple events|apple event|iphone 8|iphone x|iph...  7426393  78240  13548  705  https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...  13.09
1     AqokkXoa7uE  Holly and Phillip Meet Samantha the Sex Robot ...  This Morning  24  this morning|interview|holly willoughby|philli...  494203  2651  1309  0  https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg  13.09
2     YPVcg45W0z4  My DNA Test Results? I'm WHAT??  emmablackery  24  emmablackery|emma blackery|emma|blackery|briti...  142819  13119  151  1141  https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg  13.09
3     T_PuZBdT2iM  getting into a conversation in a language you ...  ProZD  1  skit|korean|language|conversation|esl|japanese...  1580028  65729  1529  3598  https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg  13.09
4     NsjsmgmbCfc  Baby Name Challenge?  Sprinkleofglitter  26  sprinkleofglitter|sprinkle of glitter|baby gli...  40592  5019  57  490  https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg  13.09
...
1595  w8fAellnPns  Juicy Chicken Breast - You Suck at Cooking (ep...  You Suck At Cooking  26  how to|cooking|recipe|kitchen|chicken|chicken ...  788466  31945  945  2274  https://i.ytimg.com/vi/w8fAellnPns/default.jpg  20.09
1596  RsG37JcEQNw  Weezer - Beach Boys  weezer  10  weezer|pacific daydream|pacificdaydream|beach ...  107927  2435  412  641  https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg  20.09
1597  htSiIA2g7G8  Berry Frozen Yogurt Bark Recipe  SORTEDfood  26  frozen yogurt bark|frozen yoghurt bark|frozen ...  109222  4840  35  212  https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg  20.09
1598  ZQK1F0wz6z4  What Do You Want to Eat??  Wong Fu Productions  24  panda|what should we eat|buzzfeed|comedy|boyfr...  626223  22962  532  1559  https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg  20.09
1599  DuPXdnSWoLk  The Child in Time: Trailer - BBC One  BBC  24  BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...  99228  1699  NaN  135  https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg  20.09
 
1600 rows × 11 columns
 
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes           True
comment_total     False
thumbnail_link    False
date              False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes          False
comment_total     False
thumbnail_link    False
date              False
dtype: bool
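One follow-up worth considering: a column that held a "?" marker was most likely read in as text, so after replacing and dropping, its remaining values may still be strings. The sketch below uses a made-up two-column frame (the names `likes` and `dislikes` mirror the dataset, the values are invented) and converts the cleaned column with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the situation: "?" forces the column to hold strings.
raw = pd.DataFrame({"likes": [10, 20, 30], "dislikes": ["1", "?", "3"]})

# Replace the marker with NaN and drop the affected rows.
clean = raw.replace(to_replace="?", value=np.nan).dropna()

# Convert the remaining strings to numbers so arithmetic works as expected.
clean = clean.assign(dislikes=pd.to_numeric(clean["dislikes"]))
```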
import pandas as pd
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184], index=["No1:165", "No2:174", "No3:160", "No4:180", "No5:159", "No6:163", "No7:192", "No8:184"])
data
No1:165    165
No2:174    174
No3:160    160
No4:180    180
No5:159    159
No6:163    163
No7:192    192
No8:184    184
dtype: int64
sr = pd.qcut(data, 3)
sr
No1:165      (163.667, 178.0]
No2:174      (163.667, 178.0]
No3:160    (158.999, 163.667]
No4:180        (178.0, 192.0]
No5:159    (158.999, 163.667]
No6:163    (158.999, 163.667]
No7:192        (178.0, 192.0]
No8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
type(sr)
pandas.core.series.Series
pd.get_dummies(sr, prefix="height")

         height_(158.999, 163.667]  height_(163.667, 178.0]  height_(178.0, 192.0]
No1:165                          0                        1                      0
No2:174                          0                        1                      0
No3:160                          1                        0                      0
No4:180                          0                        0                      1
No5:159                          1                        0                      0
No6:163                          1                        0                      0
No7:192                          0                        0                      1
No8:184                          0                        0                      1
 
sr = pd.cut(data, [150, 165, 180, 195])
sr
No1:165    (150, 165]
No2:174    (165, 180]
No3:160    (150, 165]
No4:180    (165, 180]
No5:159    (150, 165]
No6:163    (150, 165]
No7:192    (180, 195]
No8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64
pd.get_dummies(sr, prefix="height")

         height_(150, 165]  height_(165, 180]  height_(180, 195]
No1:165                  1                  0                  0
No2:174                  0                  1                  0
No3:160                  1                  0                  0
No4:180                  0                  1                  0
No5:159                  1                  0                  0
No6:163                  1                  0                  0
No7:192                  0                  0                  1
No8:184                  0                  0                  1
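Interval names such as `(150, 165]` make awkward column names. As a hedged variation on the custom binning above, `pd.cut` also accepts a `labels` argument; the label names `short`, `medium` and `tall` below are illustrative assumptions.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Same bin edges as above; each bin is right-inclusive, so 165 falls in the first bin.
groups = pd.cut(heights, bins=[150, 165, 180, 195], labels=["short", "medium", "tall"])

# One-hot encode the labelled bins.
onehot = pd.get_dummies(groups, prefix="height", dtype=int)
```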
 
data1 = np.arange(0, 20, 1).reshape(4, 5)
data1 = pd.DataFrame(data1)
data1

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
 
data2 = np.arange(100, 120, 1).reshape(4, 5)
data2 = pd.DataFrame(data2)
data2

     0    1    2    3    4
0  100  101  102  103  104
1  105  106  107  108  109
2  110  111  112  113  114
3  115  116  117  118  119
 
data_concat = pd.concat([data1, data2], axis=1)
data_concat

    0   1   2   3   4    0    1    2    3    4
0   0   1   2   3   4  100  101  102  103  104
1   5   6   7   8   9  105  106  107  108  109
2  10  11  12  13  14  110  111  112  113  114
3  15  16  17  18  19  115  116  117  118  119
 
data2.T

     0    1    2    3
0  100  105  110  115
1  101  106  111  116
2  102  107  112  117
3  103  108  113  118
4  104  109  114  119
 
data_concat1 = pd.concat([data1, data2.T], axis=0)
data_concat1

     0    1    2    3     4
0    0    1    2    3   4.0
1    5    6    7    8   9.0
2   10   11   12   13  14.0
3   15   16   17   18  19.0
0  100  105  110  115   NaN
1  101  106  111  116   NaN
2  102  107  112  117   NaN
3  103  108  113  118   NaN
4  104  109  114  119   NaN
 
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
left

  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
 
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['Co', 'C1', 'C2', 'C3'],
                      'D': ['DO', 'D1', 'D2', 'D3']})
right

  key1 key2   C   D
0   K0   K0  Co  DO
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
 
result = pd.merge(left, right, on=['key1', 'key2'], how="inner")
result

  key1 key2   A   B   C   D
0   K0   K0  A0  B0  Co  DO
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
 
result_left = pd.merge(left, right, on=['key1', 'key2'], how="left")
result_left

  key1 key2   A   B    C    D
0   K0   K0  A0  B0   Co   DO
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
 
result_right = pd.merge(left, right, on=['key1', 'key2'], how="right")
result_right

  key1 key2    A    B   C   D
0   K0   K0   A0   B0  Co  DO
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
 
result_outer = pd.merge(left, right, on=['key1', 'key2'], how="outer")
result_outer

  key1 key2    A    B    C    D
0   K0   K0   A0   B0   Co   DO
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
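When comparing the four `how` modes it can help to see where each merged row came from. A minimal sketch, using two small made-up frames with a single `key` column (not the frames above), passes `indicator=True` to `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical frames: K2 exists only on the left, K3 only on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "C": ["C0", "C1", "C3"]})

# An outer join keeps every key; _merge records the origin of each row.
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
origin = dict(zip(merged["key"], merged["_merge"].astype(str)))
```

Rows marked `both` are what an inner join would keep; `left_only` and `right_only` rows are the ones that pick up NaN in an outer join.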
 
data = pd.read_excel("./datas/szfj_baoan.xls")
data

      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price
0        baoan        3     2   89.3   middle         31       0       0     7.0773
1        baoan        4     2  127.0     high         31       0       0     6.9291
2        baoan        1     1   28.0      low         39       0       0     3.9286
3        baoan        1     1   28.0   middle         30       0       0     3.3568
4        baoan        2     2   78.0   middle          8       1       1     5.0769
...
1246     baoan        4     2   89.3      low          8       0       0     4.2553
1247     baoan        2     1   67.0   middle         30       0       0     3.8060
1248     baoan        2     2   67.4   middle         29       1       0     5.3412
1249     baoan        2     2   73.1      low         15       1       0     5.9508
1250     baoan        3     2   86.2   middle         32       0       1     4.5244
 
1251 rows × 9 columns
 
time = "2020-06-23"
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
# weekday is a method, so it has to be called to get the day number
data["week"] = date.weekday()
data.drop("week", axis=1, inplace=True)
data

      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price
0        baoan        3     2   89.3   middle         31       0       0     7.0773
1        baoan        4     2  127.0     high         31       0       0     6.9291
2        baoan        1     1   28.0      low         39       0       0     3.9286
3        baoan        1     1   28.0   middle         30       0       0     3.3568
4        baoan        2     2   78.0   middle          8       1       1     5.0769
...
1246     baoan        4     2   89.3      low          8       0       0     4.2553
1247     baoan        2     1   67.0   middle         30       0       0     3.8060
1248     baoan        2     2   67.4   middle         29       1       0     5.3412
1249     baoan        2     2   73.1      low         15       1       0     5.9508
1250     baoan        3     2   86.2   middle         32       0       1     4.5244
 
1251 rows × 9 columns
 
data["feature"] = np.where(data["per_price"] > 5.0000, 1, 0)
data

      district  roomnum  hall   AREA  C_floor  floor_num  school  subway  per_price  feature
0        baoan        3     2   89.3   middle         31       0       0     7.0773        1
1        baoan        4     2  127.0     high         31       0       0     6.9291        1
2        baoan        1     1   28.0      low         39       0       0     3.9286        0
3        baoan        1     1   28.0   middle         30       0       0     3.3568        0
4        baoan        2     2   78.0   middle          8       1       1     5.0769        1
...
1246     baoan        4     2   89.3      low          8       0       0     4.2553        0
1247     baoan        2     1   67.0   middle         30       0       0     3.8060        0
1248     baoan        2     2   67.4   middle         29       1       0     5.3412        1
1249     baoan        2     2   73.1      low         15       1       0     5.9508        1
1250     baoan        3     2   86.2   middle         32       0       1     4.5244        0
 
1251 rows × 10 columns
 
data0 = pd.crosstab(data["floor_num"], data["feature"])
data0

feature      0    1
floor_num
1            6    8
3            0    1
4            0   10
6            3    7
7           16   25
8           19   32
9            2   11
10           4    9
11           8   11
12           1    3
13           4   20
14           0    5
15           8   33
16           9   19
17          20   21
18          17   35
19          11    5
20           2    4
21           1    6
22           0    1
23           4    8
24          10   26
25           4   37
26           9   57
27           5   38
28           6   35
29          26   68
30          30   78
31           4  151
32          21  126
33          34   20
34           1    5
35           1    2
36           0    4
37           1    1
38           0    1
39           5   10
40           1    3
43           0    1
44           0    6
45           0    7
47           0    1
50           0    1
51           0    3
52           0    2
53           0    1
 
data0.sum(axis=1)
floor_num
1      14
3       1
4      10
6      10
7      41
8      51
9      13
10     13
11     19
12      4
13     24
14      5
15     41
16     28
17     41
18     52
19     16
20      6
21      7
22      1
23     12
24     36
25     41
26     66
27     43
28     41
29     94
30    108
31    155
32    147
33     54
34      6
35      3
36      4
37      2
38      1
39     15
40      4
43      1
44      6
45      7
47      1
50      1
51      3
52      2
53      1
dtype: int64
data0.div(data0.sum(axis=1), axis=0)

feature           0         1
floor_num
1          0.428571  0.571429
3          0.000000  1.000000
4          0.000000  1.000000
6          0.300000  0.700000
7          0.390244  0.609756
8          0.372549  0.627451
9          0.153846  0.846154
10         0.307692  0.692308
11         0.421053  0.578947
12         0.250000  0.750000
13         0.166667  0.833333
14         0.000000  1.000000
15         0.195122  0.804878
16         0.321429  0.678571
17         0.487805  0.512195
18         0.326923  0.673077
19         0.687500  0.312500
20         0.333333  0.666667
21         0.142857  0.857143
22         0.000000  1.000000
23         0.333333  0.666667
24         0.277778  0.722222
25         0.097561  0.902439
26         0.136364  0.863636
27         0.116279  0.883721
28         0.146341  0.853659
29         0.276596  0.723404
30         0.277778  0.722222
31         0.025806  0.974194
32         0.142857  0.857143
33         0.629630  0.370370
34         0.166667  0.833333
35         0.333333  0.666667
36         0.000000  1.000000
37         0.500000  0.500000
38         0.000000  1.000000
39         0.333333  0.666667
40         0.250000  0.750000
43         0.000000  1.000000
44         0.000000  1.000000
45         0.000000  1.000000
47         0.000000  1.000000
50         0.000000  1.000000
51         0.000000  1.000000
52         0.000000  1.000000
53         0.000000  1.000000
 
data_percent = data0.div(data0.sum(axis=1), axis=0)
data_percent

feature           0         1
floor_num
1          0.428571  0.571429
3          0.000000  1.000000
4          0.000000  1.000000
6          0.300000  0.700000
7          0.390244  0.609756
8          0.372549  0.627451
9          0.153846  0.846154
10         0.307692  0.692308
11         0.421053  0.578947
12         0.250000  0.750000
13         0.166667  0.833333
14         0.000000  1.000000
15         0.195122  0.804878
16         0.321429  0.678571
17         0.487805  0.512195
18         0.326923  0.673077
19         0.687500  0.312500
20         0.333333  0.666667
21         0.142857  0.857143
22         0.000000  1.000000
23         0.333333  0.666667
24         0.277778  0.722222
25         0.097561  0.902439
26         0.136364  0.863636
27         0.116279  0.883721
28         0.146341  0.853659
29         0.276596  0.723404
30         0.277778  0.722222
31         0.025806  0.974194
32         0.142857  0.857143
33         0.629630  0.370370
34         0.166667  0.833333
35         0.333333  0.666667
36         0.000000  1.000000
37         0.500000  0.500000
38         0.000000  1.000000
39         0.333333  0.666667
40         0.250000  0.750000
43         0.000000  1.000000
44         0.000000  1.000000
45         0.000000  1.000000
47         0.000000  1.000000
50         0.000000  1.000000
51         0.000000  1.000000
52         0.000000  1.000000
53         0.000000  1.000000
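The row-wise division used here can also be requested from `pd.crosstab` directly through its `normalize` argument. The sketch below checks that on a tiny made-up frame; the five rows are invented and only reuse the column names `floor_num` and `feature`.

```python
import pandas as pd

# Hypothetical miniature: floor 1 has one cheap and two expensive flats, floor 3 has two expensive ones.
df = pd.DataFrame({"floor_num": [1, 1, 1, 3, 3], "feature": [0, 1, 1, 1, 1]})

# Counts, then each row divided by its own total (the approach used above).
counts = pd.crosstab(df["floor_num"], df["feature"])
by_hand = counts.div(counts.sum(axis=1), axis=0)

# The same proportions in one call.
built_in = pd.crosstab(df["floor_num"], df["feature"], normalize="index")
```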
 
data_percent.plot(kind="bar", stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
 
data.pivot_table(["feature"], index=["floor_num"])

            feature
floor_num
1          0.571429
3          1.000000
4          1.000000
6          0.700000
...
50         1.000000
51         1.000000
52         1.000000
53         1.000000
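The pivot table output matches the `1` column of the proportion table because `pivot_table` aggregates with the mean by default, and the mean of a 0/1 column is the share of ones. A minimal sketch on made-up data (same column names, invented rows) compares it with an explicit `groupby`.

```python
import pandas as pd

df = pd.DataFrame({"floor_num": [1, 1, 1, 3, 3], "feature": [0, 1, 1, 1, 1]})

# aggfunc is left at its default, which is the mean.
pivot = df.pivot_table(values="feature", index="floor_num")

# The same numbers from an explicit group-then-average.
grouped = df.groupby("floor_num")["feature"].mean()
```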
 
col = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                    'object': ["pen", "pencil", "pencil", "ashtray", "pen"],
                    'price1': [4.56, 4.20, 1.30, 0.56, 2.75],
                    'price2': [4.75, 4.12, 1.68, 0.75, 3.15]})
col

   color   object  price1  price2
0  white      pen    4.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.68
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15
 
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
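Building on the same `col` frame, a grouped object can also compute several statistics at once or cover several columns. This is a short sketch of two common variations, not something shown in the original notebook.

```python
import pandas as pd

col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
    "price2": [4.75, 4.12, 1.68, 0.75, 3.15],
})

# Several aggregations of one column in a single pass.
summary = col.groupby("color")["price1"].agg(["max", "mean"])

# One aggregation applied to several columns.
both = col.groupby("color")[["price1", "price2"]].max()
```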
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie

     Rank  Title  Genre  Description  Director  Actors  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
0    1  Guardians of the Galaxy  Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...  James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014  121  8.1  757074  333.13  76.0
1    2  Prometheus  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012  124  7.0  485820  126.46  65.0
2    3  Split  Horror,Thriller  Three girls are kidnapped by a man with a diag...  M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016  117  7.3  157606  138.12  62.0
3    4  Sing  Animation,Comedy,Family  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016  108  7.2  60545  270.32  59.0
4    5  Suicide Squad  Action,Adventure,Fantasy  A secret government agency recruits some of th...  David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016  123  6.2  393727  325.02  40.0
...
995  996  Secret in Their Eyes  Crime,Drama,Mystery  A tight-knit team of rising investigators, alo...  Billy Ray  Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...  2015  111  6.2  27585  NaN  45.0
996  997  Hostel: Part II  Horror  Three American college students studying abroa...  Eli Roth  Lauren German, Heather Matarazzo, Bijou Philli...  2007  94  5.5  73152  17.54  46.0
997  998  Step Up 2: The Streets  Drama,Music,Romance  Romantic sparks occur between two dance studen...  Jon M. Chu  Robert Hoffman, Briana Evigan, Cassie Ventura,...  2008  98  6.2  70699  58.01  50.0
998  999  Search Party  Adventure,Comedy  A pair of friends embark on a mission to reuni...  Scot Armstrong  Adam Pally, T.J. Miller, Thomas Middleditch,Sh...  2014  93  5.6  4881  NaN  22.0
999  1000  Nine Lives  Comedy,Family,Fantasy  A stuffy businessman finds himself trapped ins...  Barry Sonnenfeld  Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...  2016  87  5.3  12435  19.64  11.0
 
1000 rows × 12 columns
 
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0                James Gunn
1              Ridley Scott
2        M. Night Shyamalan
3      Christophe Lourdelet
4                David Ayer
                ...
995               Billy Ray
996                Eli Roth
997              Jon M. Chu
998          Scot Armstrong
999        Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay','Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh','Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes','Alejandro Amenábar', 'Alejandro González Iñárritu',...'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight','Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill','Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball','Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe','William Brent Bell', 'William Oldroyd', 'Woody Allen','Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder','Zackary Adler'], dtype=object)
np.unique(movie["Director"]).size
644
movie["Rating"].plot(kind="hist", figsize=(20, 8), fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>
 
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=100)
plt.hist(movie["Rating"], 20)
plt.xticks(np.linspace(movie["Rating"].min(), movie["Rating"].max(), 21))
plt.grid(linestyle="--", alpha=0.5)
plt.show()
 
movie["Rating"]
0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
      ...
995    6.2
996    5.5
997    6.2
998    5.6
999    5.3
Name: Rating, Length: 1000, dtype: float64
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],['Adventure', 'Mystery', 'Sci-Fi'],['Horror', 'Thriller'],['Animation', 'Comedy', 'Family'],['Action', 'Adventure', 'Fantasy'],...['Horror'],['Drama', 'Music', 'Romance'],['Adventure', 'Comedy'],['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action','Adventure','Sci-Fi','Adventure','Mystery','Sci-Fi',
...'Animation','Action','Adventure','Action','Adventure','Drama',...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime','Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller','War', 'Western'], dtype='<U9')
len(movie_class)
20
count = pd.DataFrame(np.zeros(shape=[1000, 20], dtype="int32"), columns=movie_class)
count.head()

       Action Adventure Animation Biography    Comedy     Crime     Drama    Family   Fantasy   History    Horror     Music   Musical   Mystery   Romance    Sci-Fi     Sport  Thriller       War   Western
0           0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
1           0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
2           0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
3           0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
4           0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
 
count.loc[0, movie_genre[0]]
Action       0
Adventure    0
Sci-Fi       0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
for i in range(1000):
    count.loc[i, movie_genre[i]] = 1
count

       Action Adventure Animation Biography    Comedy     Crime     Drama    Family   Fantasy   History    Horror     Music   Musical   Mystery   Romance    Sci-Fi     Sport  Thriller       War   Western
0           1         1         0         0         0         0         0         0         0         0         0         0         0         0         0         1         0         0         0         0
1           0         1         0         0         0         0         0         0         0         0         0         0         0         1         0         1         0         0         0         0
2           0         0         0         0         0         0         0         0         0         0         1         0         0         0         0         0         0         1         0         0
3           0         0         1         0         1         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0
4           1         1         0         0         0         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0
...
995         0         0         0         0         0         1         1         0         0         0         0         0         0         1         0         0         0         0         0         0
996         0         0         0         0         0         0         0         0         0         0         1         0         0         0         0         0         0         0         0         0
997         0         0         0         0         0         0         1         0         0         0         0         1         0         0         1         0         0         0         0         0
998         0         1         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
999         0         0         0         0         1         0         0         1         1         0         0         0         0         0         0         0         0         0         0         0
 
1000 rows × 20 columns
 
count.sum(axis=0)
Action       303
Adventure    259
Animation     49
Biography     81
Comedy       279
Crime        150
Drama        513
Family        51
Fantasy      101
History       29
Horror       119
Music         16
Musical        5
Mystery      106
Romance      141
Sci-Fi       120
Sport         18
Thriller     195
War           13
Western        7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama        513
Action       303
Comedy       279
Adventure    259
Thriller     195
Crime        150
Romance      141
Sci-Fi       120
Horror       119
Mystery      106
Fantasy      101
Biography     81
Family        51
Animation     49
History       29
Sport         18
Music         16
War           13
Western        7
Musical        5
dtype: int64
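The zero matrix plus row-by-row loop used above works, and pandas also offers a one-call alternative for delimiter-separated labels: `Series.str.get_dummies`. The sketch below applies it to three made-up genre strings rather than the real dataset, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical sample in the same "a,b,c" format as the Genre column.
genres = pd.Series(["Action,Adventure,Sci-Fi", "Horror,Thriller", "Action,Horror"])

# Split on "," and build one 0/1 column per distinct genre.
onehot = genres.str.get_dummies(sep=",")

# Count films per genre and sort from most to least frequent.
totals = onehot.sum(axis=0).sort_values(ascending=False)
```

On the real data this would be applied to `movie["Genre"]` in place of the hypothetical `genres` Series.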
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar", fontsize=20, figsize=(20, 9), colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>