
group by
sql 查询时,我们常将聚合函数和group by 结合起来对某一个或多个字段进行分组查询,例如:
 select addcode,count(distinct sbtid)uv from tb_hive_window group by addcode;+----------+-----+| addcode  | uv  |+----------+-----+| 0002     | 2   || 000201   | 1   || 000202   | 1   || 000205   | 1   || 000206   | 1   || 000208   | 2   |+----------+-----+group by fields ... grouping sets ()
有时候因业务需要,我们需要group by 多个字段,例如:
select addcode,count(distinct sbtid)uv from tb_hive_window group by addcode;select rscode,count(distinct sbtid)uv from tb_hive_window group by rscode;select addcode,rscode,count(distinct sbtid)uv from tb_hive_window group by addcode,rscode;//这种情况我们需要写3条sql 语句通过grouping sets (), 我们可以通过一条sql 完成,相当于是将上面三条语句执行的结果通过union all 组合起来。
select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode grouping sets ((addcode,rscode),(addcode),(rscode));+----------+---------+-----+-----+---------------+| addcode  | rscode  | pv  | uv  | grouping__id  |+----------+---------+-----+-----+---------------+| NULL     | 34      | 1   | 1   | 2             || NULL     | 35      | 5   | 3   | 2             || NULL     | 40      | 1   | 1   | 2             || NULL     | 65      | 1   | 1   | 2             || NULL     | 351     | 1   | 1   | 2             || NULL     | 352     | 1   | 1   | 2             || NULL     | 395     | 1   | 1   | 2             || 0002     | NULL    | 2   | 2   | 1             || 0002     | 34      | 1   | 1   | 3             || 0002     | 352     | 1   | 1   | 3             || 000201   | NULL    | 1   | 1   | 1             || 000201   | 35      | 1   | 1   | 3             || 000202   | NULL    | 1   | 1   | 1             || 000202   | 35      | 1   | 1   | 3             || 000205   | NULL    | 3   | 1   | 1             || 000205   | 35      | 2   | 1   | 3             || 000205   | 395     | 1   | 1   | 3             || 000206   | NULL    | 2   | 1   | 1             || 000206   | 40      | 1   | 1   | 3             || 000206   | 65      | 1   | 1   | 3             || 000208   | NULL    | 2   | 2   | 1             || 000208   | 35      | 1   | 1   | 3             || 000208   | 351     | 1   | 1   | 3             |+----------+---------+-----+-----+---------------+注意,
- 当我们没有统计某一列值时,此时此列的值用null 表示,这可能与该列本身就是null 冲突,没关系下面我们通过grouping__id 就可以区分。
- 此处需要说明的是grouping __id(两个下划线) 是十进制数,将其转换成二进制表示后可以明确的知道此grouping __id 所对应的是哪些group by 字段。
grouping by 后面的字段排序处理成二进制数,靠近group by 的是低位,远离group by 的是高位。查询出的每一行结果中,如果有统计此列即此列值不为null,二进制位用1表示,否则用0表示。
+----------+---------+-----+-----+---------------+| addcode  | rscode  | pv  | uv  | grouping__id  |+----------+---------+-----+-----+---------------+| NULL     | 34      | 1   | 1   | 2             |--->> 转换成二进制数:01 -->2 (注意远离group by 的字段是高位)| 0002     | NULL    | 2   | 2   | 1             |--->> 转换成二进制数:10 -->1| 0002     | 34      | 1   | 1   | 3             | --->> 转换成二进制数:11 -->1+2=3| 000201   | NULL    | 1   | 1   | 1             || 000201   | 35      | 1   | 1   | 3             |注意:此处开源版hive 的grouping__id 计算方式与华为云的MRS 集群hive的grouping__id计算方式不同。mrs 中的grouping__id 计算时是以靠近group by 的字段为高位,远离的为低位,并且字段值为null 时 二进制位为1,否则为0
| 20200114        | 2020011406      | fengmizhibonew      | NULL              | NULL            | NULL           | NULL              | NULL                 | NULL                | 836          | 613          | 63                    group by fields ... with cube
通过group by fields ... with cube 可是让hive 实现所有组合维度的查询,例如
select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d with cube;根据排列组合计算,最终的组合有 C(4,1) + C(4,2) + C(4,3) + C(4,4) + 1 = 4 + 6 + 4 + 1 + 1 = 16 种组合,如果是通过单条group by 来实现,需要写16 条sql 才能完成,因此这种方案极大地降低了程序复杂度。
select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode with cube;+----------+---------+-----+-----+---------------+| addcode  | rscode  | pv  | uv  | grouping__id  |+----------+---------+-----+-----+---------------+| NULL     | NULL    | 11  | 6   | 0             || NULL     | 34      | 1   | 1   | 2             || NULL     | 35      | 5   | 3   | 2             || NULL     | 40      | 1   | 1   | 2             || NULL     | 65      | 1   | 1   | 2             || NULL     | 351     | 1   | 1   | 2             || NULL     | 352     | 1   | 1   | 2             || NULL     | 395     | 1   | 1   | 2             || 0002     | NULL    | 2   | 2   | 1             || 0002     | 34      | 1   | 1   | 3             || 0002     | 352     | 1   | 1   | 3             || 000201   | NULL    | 1   | 1   | 1             || 000201   | 35      | 1   | 1   | 3             || 000202   | NULL    | 1   | 1   | 1             || 000202   | 35      | 1   | 1   | 3             || 000205   | NULL    | 3   | 1   | 1             || 000205   | 35      | 2   | 1   | 3             || 000205   | 395     | 1   | 1   | 3             || 000206   | NULL    | 2   | 1   | 1             || 000206   | 40      | 1   | 1   | 3             || 000206   | 65      | 1   | 1   | 3             || 000208   | NULL    | 2   | 2   | 1             || 000208   | 35      | 1   | 1   | 3             || 000208   | 351     | 1   | 1   | 3             |+----------+---------+-----+-----+---------------+group by fields ... with rollup
rollup 是cube 的子集,通过group by fields ... with rollup 可以实现以左侧维度为准,计算某一层次维度的聚合
select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d with rollup;//等效于select a,b,c,d,count(e) pv,count(distinct e) uv, grouping__id from tb_test group by a,b,c,d grouping sets((a,b,c,d),(a,b,c),(a,b),(a),());select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode with rollup;+----------+---------+-----+-----+---------------+| addcode  | rscode  | pv  | uv  | grouping__id  |+----------+---------+-----+-----+---------------+| NULL     | NULL    | 11  | 6   | 0             || 0002     | NULL    | 2   | 2   | 1             || 0002     | 34      | 1   | 1   | 3             || 0002     | 352     | 1   | 1   | 3             || 000201   | NULL    | 1   | 1   | 1             || 000201   | 35      | 1   | 1   | 3             || 000202   | NULL    | 1   | 1   | 1             || 000202   | 35      | 1   | 1   | 3             || 000205   | NULL    | 3   | 1   | 1             || 000205   | 35      | 2   | 1   | 3             || 000205   | 395     | 1   | 1   | 3             || 000206   | NULL    | 2   | 1   | 1             || 000206   | 40      | 1   | 1   | 3             || 000206   | 65      | 1   | 1   | 3             || 000208   | NULL    | 2   | 2   | 1             || 000208   | 35      | 1   | 1   | 3             || 000208   | 351     | 1   | 1   | 3             |+----------+---------+-----+-----+---------------+//通过grouping sets 实现得到同样的结果select addcode,rscode,count(*)pv,count(distinct sbtid)uv,grouping__id from tb_hive_window group by addcode,rscode grouping sets((addcode,rscode),(addcode),());+----------+---------+-----+-----+---------------+| addcode  | rscode  | pv  | uv  | grouping__id  |+----------+---------+-----+-----+---------------+| NULL     | NULL    | 11  | 6   | 0             || 0002     | NULL    | 2   | 2   | 1             || 0002     | 34      | 1   | 1   | 3             || 0002     | 352     | 1   | 1   | 3             || 000201   | NULL    | 1   | 1   | 1             || 000201   | 35      | 1   | 1   | 3             || 000202   | NULL    | 1   | 1   | 1             || 000202   | 35      | 1   | 1   | 3             || 000205   | NULL    | 3   | 1   | 1             || 000205   | 35      | 2   | 1   | 3             || 000205   | 395     | 1   | 1   | 3             || 000206   | NULL    | 2   | 1   | 1             || 000206   | 40      | 1   | 1   | 3             || 000206   | 65      | 1   | 1   | 3             || 000208   | NULL    | 2   | 2   | 1             || 000208   | 35      | 1   | 1   | 3             || 000208   | 351     | 1   | 1   | 3             |+----------+---------+-----+-----+---------------+