如有一张表stock_hq,表格中的数据如下所示:
TDATE | SECCODE | BLOCK | AMOUNT | PRICE |
---|---|---|---|---|
20221113 | 000001.sz | A01 | 210 | 10.10 |
20221113 | 000002.sz | A02 | 210 | 9.10 |
20221113 | 000003.sz | A01 | 210 | 8.10 |
20221114 | 000001.sz | A01 | 210 | 10.10 |
20221114 | 000002.sz | A02 | 210 | 9.10 |
20221114 | 000003.sz | A01 | 210 | 8.10 |
20221115 | 000001.sz | A01 | 210 | 10.10 |
20221115 | 000002.sz | A02 | 210 | 9.10 |
20221115 | 000003.sz | A01 | 210 | 8.10 |
20221116 | 000001.sz | A01 | 210 | 10.10 |
20221116 | 000002.sz | A02 | 210 | 9.10 |
20221116 | 000003.sz | A01 | 210 | 8.10 |
20221117 | 000001.sz | A01 | 110 | 10.10 |
20221117 | 000002.sz | A02 | 110 | 9.10 |
20221117 | 000003.sz | A01 | 110 | 9.10 |
20221118 | 000001.sz | A01 | 100 | 10.10 |
20221118 | 000002.sz | A02 | 100 | 10.10 |
20221118 | 000003.sz | A01 | 100 | 10.10 |
窗口其实就是一个数据范围,它指定了我们统计计算分些数据范围。在Spark和Flink中我们知道,窗口有全局窗口和滚动窗口之分,同样在SQL的窗口中也有类似的概念。在SQL中,窗口是通过Over来实现的。
OVER( [PARTITION BY xx] [ORDER BY XX] [Window specifications ])
在Over()中可以由上述3者进行不同的组合,或者3者都可以不指定。
PARTITON BY 选项是可选的,它可以指定一个或者多个字段进行开窗,如果不指定开窗字段,则只有“一个窗口”。
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over() AS SUM_AMOUNT FROM stock_hq;
输出:
TDATE | SECCODE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
20221113 | 000001.sz | 210 | 3150 |
20221113 | 000002.sz | 210 | 3150 |
20221113 | 000003.sz | 210 | 3150 |
20221114 | 000001.sz | 210 | 3150 |
20221114 | 000002.sz | 210 | 3150 |
20221114 | 000003.sz | 210 | 3150 |
20221115 | 000001.sz | 210 | 3150 |
20221115 | 000002.sz | 210 | 3150 |
20221115 | 000003.sz | 210 | 3150 |
20221116 | 000001.sz | 210 | 3150 |
20221116 | 000002.sz | 210 | 3150 |
20221116 | 000003.sz | 210 | 3150 |
20221117 | 000001.sz | 110 | 3150 |
20221117 | 000002.sz | 110 | 3150 |
20221117 | 000003.sz | 110 | 3150 |
20221118 | 000001.sz | 100 | 3150 |
20221118 | 000002.sz | 100 | 3150 |
20221118 | 000003.sz | 100 | 3150 |
很明显,上述数据中Sum(Amount)统计的是所有记录的和,也就是说所有数据在同一窗口中。
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY SECCODE) AS SUM_AMOUNT FROM stock_hq;
上述语句中指定了一个开窗字段,没有指定排序和窗口大小,则分成多个窗口,每个窗口中的数据是每个SECCODE的所有数据。
SECCODE | TDATE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
000001.sz | 20221113 | 210 | 1050 |
000001.sz | 20221114 | 210 | 1050 |
000001.sz | 20221118 | 100 | 1050 |
000001.sz | 20221115 | 210 | 1050 |
000001.sz | 20221117 | 110 | 1050 |
000001.sz | 20221116 | 210 | 1050 |
000002.sz | 20221113 | 210 | 1050 |
000002.sz | 20221118 | 100 | 1050 |
000002.sz | 20221117 | 110 | 1050 |
000002.sz | 20221116 | 210 | 1050 |
000002.sz | 20221115 | 210 | 1050 |
000002.sz | 20221114 | 210 | 1050 |
000003.sz | 20221115 | 210 | 1050 |
000003.sz | 20221116 | 210 | 1050 |
000003.sz | 20221114 | 210 | 1050 |
000003.sz | 20221117 | 110 | 1050 |
000003.sz | 20221113 | 210 | 1050 |
000003.sz | 20221118 | 100 | 1050 |
select BLOCK,SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY BLOCK,SECCODE) AS SUM_AMOUNT FROM stock_hq;
输出:
BLOCK | SECCODE | TDATE | AMOUNT | SUM_AMOUNT |
---|---|---|---|---|
A01 | 000001.sz | 20221113 | 210 | 1050 |
A01 | 000001.sz | 20221118 | 100 | 1050 |
A01 | 000001.sz | 20221114 | 210 | 1050 |
A01 | 000001.sz | 20221117 | 110 | 1050 |
A01 | 000001.sz | 20221115 | 210 | 1050 |
A01 | 000001.sz | 20221116 | 210 | 1050 |
A01 | 000003.sz | 20221118 | 100 | 1050 |
A01 | 000003.sz | 20221117 | 110 | 1050 |
A01 | 000003.sz | 20221116 | 210 | 1050 |
A01 | 000003.sz | 20221115 | 210 | 1050 |
A01 | 000003.sz | 20221114 | 210 | 1050 |
A01 | 000003.sz | 20221113 | 210 | 1050 |
A02 | 000002.sz | 20221116 | 210 | 1050 |
A02 | 000002.sz | 20221115 | 210 | 1050 |
A02 | 000002.sz | 20221117 | 110 | 1050 |
A02 | 000002.sz | 20221114 | 210 | 1050 |
A02 | 000002.sz | 20221118 | 100 | 1050 |
A02 | 000002.sz | 20221113 | 210 | 1050 |
开窗字段值相同的数据分在同一个窗口中(类似于group by).
指定了Order By后,那么窗口就不再是“静态”的了,而成了一个动态滚动的窗口,废话不多说,上菜:
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(order by SECCODE) AS SUM_AMOUNT FROM stock_hq;
输出:
TDATE | SECCODE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
20221113 | 000001.sz | 210 | 1050 |
20221114 | 000001.sz | 210 | 1050 |
20221118 | 000001.sz | 100 | 1050 |
20221115 | 000001.sz | 210 | 1050 |
20221117 | 000001.sz | 110 | 1050 |
20221116 | 000001.sz | 210 | 1050 |
20221113 | 000002.sz | 210 | 2100 |
20221118 | 000002.sz | 100 | 2100 |
20221117 | 000002.sz | 110 | 2100 |
20221116 | 000002.sz | 210 | 2100 |
20221115 | 000002.sz | 210 | 2100 |
20221114 | 000002.sz | 210 | 2100 |
20221115 | 000003.sz | 210 | 3150 |
20221116 | 000003.sz | 210 | 3150 |
20221114 | 000003.sz | 210 | 3150 |
20221117 | 000003.sz | 110 | 3150 |
20221113 | 000003.sz | 210 | 3150 |
20221118 | 000003.sz | 100 | 3150 |
如果只单独指定了Order By,Order By字段相同的数据会先分成一组做一个统计,然后再到下一个组如000002.sz的数据时会将000002.sz的所有的数据先做个统计,再累加上一个分组000001.sz的统计结果。
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(partition by SECCODE order by TDATE) AS SUM_AMOUNT FROM stock_hq;
TDATE | SECCODE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
20221113 | 000001.sz | 210 | 210 |
20221114 | 000001.sz | 210 | 420 |
20221115 | 000001.sz | 210 | 630 |
20221116 | 000001.sz | 210 | 840 |
20221117 | 000001.sz | 110 | 950 |
20221118 | 000001.sz | 100 | 1050 |
20221113 | 000002.sz | 210 | 210 |
20221114 | 000002.sz | 210 | 420 |
20221115 | 000002.sz | 210 | 630 |
20221116 | 000002.sz | 210 | 840 |
20221117 | 000002.sz | 110 | 950 |
20221118 | 000002.sz | 100 | 1050 |
20221113 | 000003.sz | 210 | 210 |
20221114 | 000003.sz | 210 | 420 |
20221115 | 000003.sz | 210 | 630 |
20221116 | 000003.sz | 210 | 840 |
20221117 | 000003.sz | 110 | 950 |
20221118 | 000003.sz | 100 | 1050 |
在每一个窗口中,每一行的统计结果为上一行的统计结果加上当前行的值。
窗口的定义主要用于指定窗口的大小,有如下几种语义进行指定:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
UNBOUNDED 无边界的
PRECEDING 当前行的前
FOLLOWING 当前行后跟多少行
示例组合:
由于我们上面的日期是连续的,所以需要删除某一天的数据,让效果看起来更明显。
delete from stock_hq where tdate = 20221115;
如:ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
在000001.sz窗口中,假如当前行的日期是20221117:到当前行时统计的是当前行前3行的值加当前行的值也就是740。统计的范围是固定的,与当前行的值无关。
示例:
select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE ROWS BETWEEN 3 PRECEDING AND CURRENT Row) AS SUM_AMOUNT FROM stock_hq;
输出:
SECCODE | TDATE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
000001.sz | 20221113 | 210 | 210 |
000001.sz | 20221114 | 210 | 420 |
000001.sz | 20221116 | 210 | 630 |
000001.sz | 20221117 | 110 | 740 |
000001.sz | 20221118 | 100 | 630 |
000002.sz | 20221113 | 210 | 210 |
000002.sz | 20221114 | 210 | 420 |
000002.sz | 20221116 | 210 | 630 |
000002.sz | 20221117 | 110 | 740 |
000002.sz | 20221118 | 100 | 630 |
000003.sz | 20221113 | 210 | 210 |
000003.sz | 20221114 | 210 | 420 |
000003.sz | 20221116 | 210 | 630 |
000003.sz | 20221117 | 110 | 740 |
000003.sz | 20221118 | 100 | 630 |
如:RANGE BETWEEN 3 PRECEDING AND CURRENT ROW.
假如当前行的日期是20221117,到当前行时统计的是日期大于等于20221114到当前行20221117这一范围上的值,如下示例中是530,是一个逻辑上的窗口设定,与当前值有关。
select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE RANGE BETWEEN 3 PRECEDING AND CURRENT Row) AS SUM_AMOUNT FROM stock_hq;
输出:
SECCODE | TDATE | AMOUNT | SUM_AMOUNT |
---|---|---|---|
000001.sz | 20221113 | 210 | 210 |
000001.sz | 20221114 | 210 | 420 |
000001.sz | 20221116 | 210 | 630 |
000001.sz | 20221117 | 110 | 530 |
000001.sz | 20221118 | 100 | 420 |
000002.sz | 20221113 | 210 | 210 |
000002.sz | 20221114 | 210 | 420 |
000002.sz | 20221116 | 210 | 630 |
000002.sz | 20221117 | 110 | 530 |
000002.sz | 20221118 | 100 | 420 |
000003.sz | 20221113 | 210 | 210 |
000003.sz | 20221114 | 210 | 420 |
000003.sz | 20221116 | 210 | 630 |
000003.sz | 20221117 | 110 | 530 |
000003.sz | 20221118 | 100 | 420 |
示例:每个窗口的数据整体往上移一行,空缺的值默认为NULL
select TDATE,SECCODE, AMOUNT,LEAD(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;
输出:
TDATE | SECCODE | AMOUNT | LEAD_AMOUNT |
---|---|---|---|
20221113 | 000001.sz | 210 | 210 |
20221114 | 000001.sz | 210 | 210 |
20221116 | 000001.sz | 210 | 110 |
20221117 | 000001.sz | 110 | 100 |
20221118 | 000001.sz | 100 | |
20221113 | 000002.sz | 210 | 210 |
20221114 | 000002.sz | 210 | 210 |
20221116 | 000002.sz | 210 | 110 |
20221117 | 000002.sz | 110 | 100 |
20221118 | 000002.sz | 100 | |
20221113 | 000003.sz | 210 | 210 |
20221114 | 000003.sz | 210 | 210 |
20221116 | 000003.sz | 210 | 110 |
20221117 | 000003.sz | 110 | 100 |
20221118 | 000003.sz | 100 |
select TDATE,SECCODE, AMOUNT,LAG(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;
输出:
TDATE | SECCODE | AMOUNT | LAG_AMOUNT |
---|---|---|---|
20221113 | 000001.sz | 210 | |
20221114 | 000001.sz | 210 | 210 |
20221116 | 000001.sz | 210 | 210 |
20221117 | 000001.sz | 110 | 210 |
20221118 | 000001.sz | 100 | 110 |
20221113 | 000002.sz | 210 | |
20221114 | 000002.sz | 210 | 210 |
20221116 | 000002.sz | 210 | 210 |
20221117 | 000002.sz | 110 | 210 |
20221118 | 000002.sz | 100 | 110 |
20221113 | 000003.sz | 210 | |
20221114 | 000003.sz | 210 | 210 |
20221116 | 000003.sz | 210 | 210 |
20221117 | 000003.sz | 110 | 210 |
20221118 | 000003.sz | 100 | 110 |
最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。
Tips:有些关系型数据库中只有一个参数,请注意。
最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。