
Hive SQL: Reading the Window Function Source Code

news 2025/7/15 3:47:45

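Preface

While using the window function row_number() over() in the StarRocks engine to deduplicate a dataset of 1 billion rows, I ran into frequent BE out-of-memory errors (I no longer remember what BE memory limit was configured at the time). Only then did I realize that window operations, when used carelessly, can easily cause performance problems. What follows is a first look at the source code behind window functions in Hive. Corrections are welcome.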

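1. Execution steps of a window function

(1) Split the data into multiple partitions;

(2) Invoke the window function on each partition.

Because a window function returns not a single aggregate value but another table (table-in, table-out), the Hive community introduced the Partitioned Table Function (PTF).

[Figure: simplified code-flow diagram]

Hive translates a QueryBlock into an operator tree (OperatorTree) for execution. Every operator has three important methods:

initializeOp(): initializes the operator
process(): processes each row
forward(): sends each processed row to the next Operator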
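
To make the three-method contract concrete, here is a minimal Java sketch. ToyOperator, UpperCaseOperator, CollectOperator and ToyOperatorDemo are hypothetical names invented for illustration; they are not Hive classes and only mimic the initializeOp/process/forward shape described above, with a single child per operator as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Toy operator: NOT Hive's Operator class, only the same three-method shape.
abstract class ToyOperator {
    private ToyOperator child; // next operator in the chain (one child, by assumption)

    ToyOperator then(ToyOperator next) {
        this.child = next;
        return next;
    }

    // Called once before any row arrives; propagates down the chain.
    void initializeOp() {
        if (child != null) {
            child.initializeOp();
        }
    }

    // Called once per input row.
    abstract void process(Object row);

    // Hands a finished row to the next operator.
    protected void forward(Object row) {
        if (child != null) {
            child.process(row);
        }
    }
}

// Transforms each row and forwards it.
class UpperCaseOperator extends ToyOperator {
    @Override
    void process(Object row) {
        forward(row.toString().toUpperCase());
    }
}

// Terminal operator that simply collects what it receives.
class CollectOperator extends ToyOperator {
    final List<Object> rows = new ArrayList<>();

    @Override
    void initializeOp() {
        rows.clear();
        super.initializeOp();
    }

    @Override
    void process(Object row) {
        rows.add(row);
    }
}

class ToyOperatorDemo {
    static List<Object> run(List<String> input) {
        ToyOperator root = new UpperCaseOperator();
        CollectOperator sink = new CollectOperator();
        root.then(sink);
        root.initializeOp();
        for (String row : input) {
            root.process(row); // rows are pushed through the chain one at a time
        }
        return sink.rows;
    }
}
```

Calling ToyOperatorDemo.run on a small list pushes every row through both operators in order, which is the same push-based flow that the operators in the tree follow.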

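When a window function is encountered, a PTFOperator is generated. PTFOperator relies on PTFInvocation to read the already-sorted data and build the corresponding input partition: PTFPartition inputPart;

WindowingTableFunction manages the window frames, invokes the window functions (UDAFs), and writes the results to the output partition: PTFPartition outputPart.

2. Source code analysis

2.1 The PTFOperator class

It is the operator for a PartitionedTableFunction and extends the abstract class Operator (the base class of Hive operators).

It overrides process(Object row, int tag), which handles a single row.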

@Override
public void process(Object row, int tag) throws HiveException {
  if (!isMapOperator) {
    /*
     * check if current row belongs to the current accumulated Partition:
     * - If not:
     *  - process the current Partition
     *  - reset input Partition
     * - set currentKey to the newKey if it is null or has changed.
     */
    newKeys.getNewKey(row, inputObjInspectors[0]);
    // Does the key of the incoming row (newKeys) equal the key of the
    // partition currently being accumulated (currentKeys)?
    boolean keysAreEqual = (currentKeys != null && newKeys != null) ?
        newKeys.equals(currentKeys) : false;

    // If not, stop accumulating the current partition and trigger the window computation
    if (currentKeys != null && !keysAreEqual) {
      // Close the partition being accumulated
      ptfInvocation.finishPartition();
    }

    // If currentKeys is null or has changed, assign newKeys to currentKeys
    if (currentKeys == null || !keysAreEqual) {
      // Start a new partition
      ptfInvocation.startPartition();
      if (currentKeys == null) {
        currentKeys = newKeys.copyKey();
      } else {
        currentKeys.copyKey(newKeys);
      }
    }
  } else if (firstMapRow) { // this is the first row to arrive
    ptfInvocation.startPartition();
    firstMapRow = false;
  }
  // Append the row to the partition to accumulate data
  ptfInvocation.processRow(row);
}

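The code above shows that rows must reach process already sorted by partition key. When the key of the incoming row differs from that of the current partition, the current partition can be closed and the next one opened.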
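
The boundary detection can be illustrated with a small standalone Java sketch. PartitionBoundaryDemo and the String[]{key, value} row layout are assumptions made for this example; the real operator compares key wrappers and delegates to PTFInvocation rather than building lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class PartitionBoundaryDemo {
    // Splits rows into consecutive partitions by comparing each row's key
    // with the key of the partition currently being accumulated.
    // Each row is a String[]{key, value} (hypothetical layout for the demo).
    static List<List<String[]>> split(List<String[]> rows) {
        List<List<String[]>> finished = new ArrayList<>();
        List<String[]> current = null; // partition being accumulated
        String currentKey = null;

        for (String[] row : rows) {
            String newKey = row[0];
            boolean keysAreEqual = current != null && Objects.equals(newKey, currentKey);
            if (current != null && !keysAreEqual) {
                finished.add(current);        // plays the role of finishPartition()
            }
            if (current == null || !keysAreEqual) {
                current = new ArrayList<>();  // plays the role of startPartition()
                currentKey = newKey;
            }
            current.add(row);                 // plays the role of processRow(row)
        }
        if (current != null) {
            finished.add(current);            // flush the last partition at end of input
        }
        return finished;
    }
}
```

With input sorted by key (a, a, b, c, c) the sketch yields three partitions. With unsorted input (a, b, a) it also yields three, because the second run of a is treated as a new partition, which is why the upstream sort is required.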

2.2 The PTFInvocation class

PTFInvocation is an inner class of PTFOperator.

An instance is created in PTFOperator's initialization method.

@Override
protected void initializeOp(Configuration jobConf) throws HiveException {
  ...
  ptfInvocation = setupChain();
  ptfInvocation.initializeStreaming(jobConf, isMapOperator);
  ...
}

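Its main job is to drive rows through the PTF chain: ptfInvocation.processRow(row) passes each row down the chain, while ptfInvocation.startPartition() and ptfInvocation.finishPartition() signal when a partition begins and ends.

The class holds a TableFunction, which processes the partition data.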

PTFPartition inputPart; // the partition object; a single inputPart is reused throughout
TableFunctionEvaluator tabFn; // the window (table) function instance

// Append one row to the partition
void processRow(Object row) throws HiveException {
  if (isStreaming()) {
    // tabFn is the window function instance
    handleOutputRows(tabFn.processRow(row));
  } else {
    // inputPart is the partition currently accumulating rows
    inputPart.append(row);
  }
}

// Start a partition
void startPartition() throws HiveException {
  if (isStreaming()) {
    tabFn.startPartition();
  } else {
    if (prev == null || prev.isOutputIterator()) {
      if (inputPart == null) {
        // Create a new PTFPartition object
        createInputPartition();
      } else {
        // Reset the existing partition
        inputPart.reset();
      }
    }
  }
  if (next != null) {
    next.startPartition();
  }
}

// Finish a partition
void finishPartition() throws HiveException {
  if (isStreaming()) {
    handleOutputRows(tabFn.finishPartition());
  } else {
    if (tabFn.canIterateOutput()) {
      outputPartRowsItr = inputPart == null ? null :
          tabFn.iterator(inputPart.iterator());
    } else {
      // execute runs the window function logic; the returned outputPart is again a partition object
      outputPart = inputPart == null ? null : tabFn.execute(inputPart);
      outputPartRowsItr = outputPart == null ? null : outputPart.iterator();
    }
    if (next != null) {
      if (!next.isStreaming() && !isOutputIterator()) {
        next.inputPart = outputPart;
      } else {
        if (outputPartRowsItr != null) {
          while (outputPartRowsItr.hasNext()) {
            next.processRow(outputPartRowsItr.next());
          }
        }
      }
    }
  }
  if (next != null) {
    next.finishPartition();
  } else {
    if (!isStreaming()) {
      if (outputPartRowsItr != null) {
        while (outputPartRowsItr.hasNext()) {
          // Emit the window function results row by row to the next Operator
          forward(outputPartRowsItr.next(), outputObjInspector);
        }
      }
    }
  }
}

2.3 The PTFPartition class

This class represents the collection of rows processed by a TableFunction or WindowFunction; it stores the data in a PTFRowContainer.

private final PTFRowContainer<List<Object>> elems; // container that stores the rows

public void append(Object o) throws HiveException {
  // Throw if the partition already holds Integer.MAX_VALUE rows (about 2.1 billion)
  if (elems.rowCount() == Integer.MAX_VALUE) {
    throw new HiveException(String.format("Cannot add more than %d elements to a PTFPartition",
        Integer.MAX_VALUE));
  }

  @SuppressWarnings("unchecked")
  List<Object> l = (List<Object>)
      ObjectInspectorUtils.copyToStandardObject(o, inputOI, ObjectInspectorCopyOption.WRITABLE);
  elems.addRow(l);
}

2.4 The TableFunctionEvaluator class

This class performs the actual window computation on the data within a partition.

public abstract class TableFunctionEvaluator {

  transient protected PTFPartition outputPartition; // transient: the field is excluded from serialization

  // iPart is the input partition object
  public PTFPartition execute(PTFPartition iPart) throws HiveException {
    if (ptfDesc.isMapSide()) {
      return transformRawInput(iPart);
    }
    PTFPartitionIterator<Object> pItr = iPart.iterator();
    PTFOperator.connectLeadLagFunctionsToPartition(ptfDesc.getLlInfo(), pItr);

    if (outputPartition == null) {
      outputPartition = PTFPartition.create(ptfDesc.getCfg(),
          tableDef.getOutputShape().getSerde(),
          OI, tableDef.getOutputShape().getOI());
    } else {
      outputPartition.reset();
    }

    // Argument 1: iterator over the input PTFPartition; argument 2: the output PTFPartition
    execute(pItr, outputPartition);
    return outputPartition;
  }

  protected abstract void execute(PTFPartitionIterator<Object> pItr, PTFPartition oPart) throws HiveException;
}

The abstract method execute(PTFPartitionIterator pItr, PTFPartition oPart) is implemented in the subclass WindowingTableFunction.

public class WindowingTableFunction extends TableFunctionEvaluator {

  @Override
  public void execute(PTFPartitionIterator<Object> pItr, PTFPartition outP) throws HiveException {
    ArrayList<List<?>> oColumns = new ArrayList<List<?>>();
    PTFPartition iPart = pItr.getPartition();
    StructObjectInspector inputOI = iPart.getOutputOI();

    WindowTableFunctionDef wTFnDef = (WindowTableFunctionDef) getTableDef();
    for (WindowFunctionDef wFn : wTFnDef.getWindowFunctions()) {
      // processWindow is false when the frame is unbounded on both ends
      // (it spans the whole partition, first row to last row), true otherwise
      boolean processWindow = processWindow(wFn.getWindowFrame());
      pItr.reset();
      if (!processWindow) {
        Object out = evaluateFunctionOnPartition(wFn, iPart);
        if (!wFn.isPivotResult()) {
          out = new SameList(iPart.size(), out);
        }
        oColumns.add((List<?>) out);
      } else {
        oColumns.add(executeFnwithWindow(wFn, iPart));
      }
    }

    /*
     * Output Columns in the following order
     * - the columns representing the output from Window Fns
     * - the input Rows columns
     */
    for (int i = 0; i < iPart.size(); i++) {
      ArrayList oRow = new ArrayList();
      Object iRow = iPart.getAt(i);

      for (int j = 0; j < oColumns.size(); j++) {
        oRow.add(oColumns.get(j).get(i));
      }
      for (StructField f : inputOI.getAllStructFieldRefs()) {
        oRow.add(inputOI.getStructFieldData(iRow, f));
      }
      // Finally, append each finished row to the output PTFPartition
      outP.append(oRow);
    }
  }

  // Evaluate the function result for each row in the partition
  ArrayList<Object> executeFnwithWindow(WindowFunctionDef wFnDef, PTFPartition iPart)
      throws HiveException {
    ArrayList<Object> vals = new ArrayList<Object>();
    for (int i = 0; i < iPart.size(); i++) {
      // Arguments: 1. the window function, 2. the index of the current row, 3. the input PTFPartition
      Object out = evaluateWindowFunction(wFnDef, i, iPart);
      vals.add(out);
    }
    return vals;
  }

  // Evaluate the result given a partition and the row number to process
  private Object evaluateWindowFunction(WindowFunctionDef wFn, int rowToProcess, PTFPartition partition)
      throws HiveException {
    BasePartitionEvaluator partitionEval = wFn.getWFnEval()
        .getPartitionWindowingEvaluator(wFn.getWindowFrame(), partition, wFn.getArgs(), wFn.getOI(), nullsLast);
    // Given the current row, compute the aggregate over its window
    return partitionEval.iterate(rowToProcess, ptfDesc.getLlInfo());
  }
}

Note: I have not yet fully understood the execute method in WindowingTableFunction; to be expanded later.
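
As a rough aid to reading execute, the two branches can be imitated in plain Java. This is only a sketch under assumptions: FrameEvalDemo is a hypothetical class, sum stands in for any UDAF, and the bounded branch naively rescans the frame for every row, which is not necessarily how Hive's partition evaluators compute it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FrameEvalDemo {
    // Frame unbounded on both ends: evaluate once over the whole partition,
    // then repeat the same value for every row (the role SameList plays).
    static List<Long> sumUnbounded(List<Long> partition) {
        long total = 0;
        for (long v : partition) {
            total += v;
        }
        return Collections.nCopies(partition.size(), total);
    }

    // Bounded frame, here ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW:
    // the function is evaluated separately for each row index.
    static List<Long> sumRowsPreceding(List<Long> partition, int preceding) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            int start = Math.max(0, i - preceding); // clamp the frame at the partition start
            long sum = 0;
            for (int j = start; j <= i; j++) {
                sum += partition.get(j);
            }
            out.add(sum);
        }
        return out;
    }
}
```

For the partition [1, 2, 3, 4], sumUnbounded gives one column holding the same total on every row, while sumRowsPreceding with preceding = 1 gives a different value per row. In both cases the result is one output column with exactly as many entries as the partition has rows, matching how oColumns is later zipped with the input rows.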

3. How Hive SQL window functions are implemented

Window function syntax:

select col1, col2, row_number() over (partition by col1 order by col2 <window clause>) as rn
from tableA

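The statement above has two main parts:

the window function part (window_func)
the window definition part

3.1 The window function part

The window function part is the function evaluated over the window. The main ones are aggregate-style window functions such as count, sum, and avg, plus the commonly used ranking functions such as row_number and rank.

3.2 The window definition part

This is the content inside over, which has three components (each of which may be omitted):

partition by: partitioning
order by: ordering
(rows | range) between ... and ...: the window clause (frame)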

PS: for a detailed introduction to Hive window functions, see:

(07)Hive——窗口函数详解_hive 窗口函数-CSDN博客

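3.3 How window functions are implemented

Window functions are implemented mainly through the Partitioned Table Function (PTF):

(1) A PTF's input can be a table, a subquery, or the output of another PTF;

(2) A PTF's output is a table.

Here is a relatively complex SQL statement that shows how data flows when window functions are executed: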

select id, sq, cell_type, rank,
       row_number() over (partition by id order by rank) as rn,
       rank() over (partition by id order by rank) as r,
       dense_rank() over (partition by cell_type order by id) as dr
from window_test_table
group by id, sq, cell_type, rank;

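[Figure: data flow of the query above]

The query is executed in three main stages:

Stage 1: compute everything other than the window functions, such as group by, join, and having. For the query above, the first stage is: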

select id, sq, cell_type, rank
from window_test_table
group by id, sq, cell_type, rank;

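Stage 2: feed the output of stage 1 into the first PTF and compute the corresponding window function values. For the query above, the second stage is: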

select id, sq, cell_type, rank, rn, r
from
window(
  <w>,                          -- output of stage 1, denoted w
  partition by id,              -- partitioning
  order by rank,                -- ordering for the window functions
  [rn: row_number(), r: rank()] -- window function calls
)

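Because row_number() and rank() share the same window (partition by id order by rank), both can be computed in a single shuffle.

Stage 3: feed the output of stage 2 into the second PTF and compute the corresponding window function value. For the query above, the third stage is: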

select id, sq, cell_type, rank, rn, r, dr
from
window(
  <w1>,                     -- output of stage 2, denoted w1
  partition by cell_type,   -- partitioning
  order by id,              -- ordering for the window function
  [dr: dense_rank()]        -- window function call
)

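Because dense_rank() uses a different window from the first two functions, the data must be partitioned once more to produce the final output.

Summary: the query above needs three shuffles to produce its final result (the group by in stage 1, plus the windowing in stages 2 and 3). In MapReduce terms that is three map->reduce rounds; in Spark SQL it is three Exchanges. Add the intermediate sorts, and with large data volumes the impact on efficiency is indeed significant.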
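
The rule that window functions sharing a window definition share one stage can be sketched as a simple grouping. WindowSpecGroupingDemo and the {alias, partitionBy, orderBy} array layout are hypothetical; the sketch only counts stages and does not reflect how Hive's planner actually represents window specifications.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WindowSpecGroupingDemo {
    // Groups window function aliases by their (partition by, order by) spec.
    // Each call is {alias, partitionBy, orderBy}; functions that share a spec
    // can be evaluated together in the same PTF stage.
    static Map<String, List<String>> groupBySpec(List<String[]> calls) {
        Map<String, List<String>> stages = new LinkedHashMap<>(); // keeps first-seen order
        for (String[] call : calls) {
            String spec = "partition by " + call[1] + " order by " + call[2];
            stages.computeIfAbsent(spec, k -> new ArrayList<>()).add(call[0]);
        }
        return stages;
    }
}
```

Feeding in the three calls from the example query (rn and r over id/rank, dr over cell_type/id) produces two groups, which corresponds to the two windowing stages; the group by stage brings the shuffle count to three.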

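4. Performance problems with window functions

When processing data with Hive, window functions make it easy to group and sort data. However, window functions such as row_number tend to be slow, that is, more expensive to run than ordinary aggregate functions (sum, min, max, etc.). Why?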

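4.1 Causes of the performance problems

4.1.1 First explanation

The answer given by an uploader on Bilibili:

Reasons:

(1) Window functions cannot be pre-aggregated, so a large volume of data is shuffled; both the shuffle and the computation are slow, and there is a risk of data skew;

(2) Windowing adds an order by step, which takes extra time.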

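4.1.2 Second explanation

Reasons:

(1) Depending on the function, an ordinary aggregate statement can run in partial + merge mode, that is, with map-side pre-aggregation; a window statement, however, can only be aggregated in one pass on the reduce side, so it has only the complete execution mode.

(2) The physical plan for ordinary aggregates can be either SortBased or HashBased, whereas window is SortBased only.

(3) A window statement operates on rows and returns an aggregate result for every row, which means it needs a larger buffer during execution.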
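
The partial + merge versus complete distinction can be shown with a small Java sketch. PartialMergeDemo and its method names are hypothetical, and the splits are simulated with in-memory lists rather than real map tasks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PartialMergeDemo {
    // "partial": each map task collapses its own split to a single number.
    static long partialSum(List<Long> split) {
        long s = 0;
        for (long v : split) {
            s += v;
        }
        return s;
    }

    // "merge": the reduce side combines one partial value per split.
    static long mergeSums(List<Long> partials) {
        long s = 0;
        for (long v : partials) {
            s += v;
        }
        return s;
    }

    // "complete": row_number needs every row of the partition, sorted,
    // in one buffer. Returns {value, rowNumber} pairs.
    static List<long[]> rowNumberComplete(List<List<Long>> splits) {
        List<Long> buffer = new ArrayList<>();
        for (List<Long> split : splits) {
            buffer.addAll(split); // every row is shipped, nothing is pre-aggregated
        }
        Collections.sort(buffer);
        List<long[]> out = new ArrayList<>();
        for (int i = 0; i < buffer.size(); i++) {
            out.add(new long[] {buffer.get(i), i + 1});
        }
        return out;
    }
}
```

For sum, the reduce side sees one value per split regardless of how many rows each split held. For row_number, the buffer grows with the number of rows in the partition, which is the larger buffer mentioned in reason (3).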

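4.2 Optimization methods

4.2.1 Replace ranking window functions with aggregate functions

For example, suppose you need the sku name or transaction amount of each user's most recent transaction over all history. You can concatenate the transaction time with the sku name (time first, in a fixed-width format so that string order matches time order), take the max, and then split the sku name back out to get the expected result.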
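
The concatenate, max, then split trick is sketched below in Java rather than SQL. LastSkuDemo, the {userId, tradeTime, skuName} row layout, and the "|" separator are assumptions for the example; it also assumes tradeTime is a fixed-width sortable string that never contains the separator.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LastSkuDemo {
    // Each row is {userId, tradeTime, skuName}. tradeTime must be a fixed-width,
    // lexicographically sortable string such as "2024-08-12 15:05:43".
    static Map<String, String> lastSkuPerUser(List<String[]> rows) {
        Map<String, String> maxPacked = new HashMap<>();
        for (String[] r : rows) {
            // Time goes first so that it dominates the string comparison.
            String packed = r[1] + "|" + r[2];
            // Plain max aggregation per user: no sort, and it can be merged from partial results.
            maxPacked.merge(r[0], packed, (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : maxPacked.entrySet()) {
            String packed = e.getValue();
            // Split the sku name back out after the first separator.
            result.put(e.getKey(), packed.substring(packed.indexOf('|') + 1));
        }
        return result;
    }
}
```

Because max is an ordinary aggregate, this formulation can use map-side pre-aggregation, whereas row_number() ... where rn = 1 has to buffer and sort each user's rows.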

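In Hive, row_number is a commonly used window function that assigns each row a sequential number, unique within its partition. It is usually paired with an over clause that specifies the window's scope and ordering. For example: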

select col1, col2, row_number() over (partition by col1 order by col2 <window clause>) as rn
from tableA

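In the example above, row_number partitions the rows by col1, orders them by col2, and assigns each row a unique row number within its group. On large datasets, however, row_number can degrade performance, because it has to sort and number the data, and those operations consume considerable compute resources at scale.

Note: the following are several ways to optimize the row_number() over () window function: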

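4.2.2 Reduce the data volume

The most direct optimization is to reduce the amount of data that row_number has to process. Adding conditions in the where clause, partitioning the data, and similar measures shrink the input and improve performance.

PS: I have used this approach in production.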

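4.2.3 Avoid repeated sorting

When using row_number, avoid sorting more than once where possible. Apply row_number in a subquery and then sort in the outer query, so that the sort is not repeated.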

select col1, col2, rn
from
( select col1, col2, row_number() over (partition by col1 order by col2) as rn
  from tableA ) tmp1
order by col1, col2;

References:

常用的SQL优化方式, 用聚合函数替代排序开窗求最值, sparksql, hivesql_哔哩哔哩_bilibili

https://blog.51cto.com/u_16213435/9877979

Hive学习（一）窗口函数源码阅读_hive 源码阅读-CSDN博客

https://mp.weixin.qq.com/s/WBryrbpHGO9jmzMp0e7jhw
