Create two tables: one stored as plain Parquet, the other as Parquet with snappy compression.
Creating the tables
Using snappy
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');
Using gzip
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='GZIP');
Using uncompressed
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='UNCOMPRESSED');
Using the default
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
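STORED AS PARQUET;
Alternatively, run set parquet.compression=SNAPPY; before the statement that writes the data. The setting applies to data written after it; data that already exists is not recompressed with snappy.
Use desc formatted tableName to inspect the table structure.
Using parquet snappy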
Table Type:             EXTERNAL_TABLE           
Table Parameters:
    EXTERNAL                TRUE
    numFiles                25
    numPartitions           1
    numRows                 0
    parquet.compression     SNAPPY
    rawDataSize             0
    totalSize               4570350557
    transient_lastDdlTime   1552269085
# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:
    field.delim             \u0001
    serialization.format    \u0001
Using parquet default
Table Type:             EXTERNAL_TABLE           
Table Parameters:
    EXTERNAL                TRUE
    numFiles                25
    numPartitions           1
    numRows                 0
    rawDataSize             0
    totalSize               4570650197
    transient_lastDdlTime   1552269039
# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:
    field.delim             \u0001
    serialization.format    \u0001
Test data volume: 20208432
totalSize per table, in bytes:
UNCOMPRESSED    : 4570325699
parquet default : 4570650197
parquet gzip    : 4570314033
parquet snappy  : 4570350557
textfile        : 10356207038
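The snappy and default figures in this list match the totalSize values in the two desc formatted outputs. A minimal sketch of pulling those values out of the output automatically follows. It assumes the text has one key/value pair per line under "Table Parameters:", as the Hive CLI prints it. The function name parse_table_parameters is hypothetical and not part of any Hive tooling.

```python
def parse_table_parameters(desc_output: str) -> dict:
    """Collect key/value pairs from the 'Table Parameters:' block
    of `desc formatted` text (hypothetical helper, not a Hive API)."""
    params = {}
    in_params = False
    for line in desc_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Table Parameters:"):
            in_params = True
            continue
        if in_params:
            # The block ends at a blank line or at the next '# ...' section header.
            if not stripped or stripped.startswith("#"):
                break
            parts = stripped.split()
            if len(parts) >= 2:
                params[parts[0]] = parts[1]
    return params


# Sample text copied from the snappy table's output in this post.
sample = """\
Table Type:             EXTERNAL_TABLE
Table Parameters:
    EXTERNAL                TRUE
    numFiles                25
    numPartitions           1
    numRows                 0
    parquet.compression     SNAPPY
    rawDataSize             0
    totalSize               4570350557
    transient_lastDdlTime   1552269085
# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
"""

params = parse_table_parameters(sample)
codec = params.get("parquet.compression", "(not set)")
total_size = int(params["totalSize"])
print(codec, total_size)
```

For the default table the parquet.compression key is absent, so the lookup falls back to "(not set)". That is the only difference between the two Table Parameters blocks apart from the size and the timestamp.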
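The comparison shows that, with a data set this small, the Parquet compression settings make almost no difference: the four Parquet tables differ in size by less than 0.01%. Compared with TEXTFILE, however, every Parquet table is less than half the size (about 4.57 GB against 10.36 GB). A performance comparison will follow in a later test.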
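The ratios behind this conclusion can be recomputed from the measured sizes. The sketch below uses only the byte counts listed above. The names sizes, relative_sizes, and spread are chosen for illustration.

```python
# Measured table sizes in bytes, copied from the list above.
sizes = {
    "UNCOMPRESSED": 4570325699,
    "parquet default": 4570650197,
    "parquet gzip": 4570314033,
    "parquet snappy": 4570350557,
    "textfile": 10356207038,
}


def relative_sizes(sizes: dict, baseline: str = "textfile") -> dict:
    """Return each table's size as a fraction of the baseline table's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


ratios = relative_sizes(sizes)
for name, ratio in ratios.items():
    print(f"{name:<16}{sizes[name]:>12}  {ratio:6.1%} of textfile")

# Relative gap between the largest and smallest Parquet table.
parquet_sizes = [size for name, size in sizes.items() if name != "textfile"]
spread = (max(parquet_sizes) - min(parquet_sizes)) / min(parquet_sizes)
print(f"spread across Parquet tables: {spread:.4%}")
```

Each Parquet table comes out at roughly 44% of the TEXTFILE size. The gap between the Parquet variants is a few hundred kilobytes out of about 4.57 GB.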