PostgreSQL: combine LAG and LEAD to query n previous and following rows

【Title】: PostgreSQL combine LAG and LEAD to query n previous and following rows 【Posted】: 2018-11-01 09:40:56 【Question】:

I have a PostgreSQL table, let's call it tokens, that holds a grammatical annotation for each token in a line of text, basically like this:

idx | line | tno | token   | annotation      | lemma
----+------+-----+---------+-----------------+---------
  1 | I.01 | 1   | This    | DEM.PROX        | this
  2 | I.01 | 2   | is      | VB.COP.3SG.PRES | be
  3 | I.01 | 3   | an      | ART.INDEF       | a
  4 | I.01 | 4   | example | NN.INAN         | example

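I would like to write a query that lets me search for grammatical contexts, in this case a query that checks whether a certain annotation exists within a window of size n before and after the current row. From what I have read, PostgreSQL's window functions LEAD and LAG are suitable for this. As a first attempt, I wrote the following query based on the documentation I could find for these functions: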

SELECT *
FROM (
    SELECT token, annotation, lemma,
        -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
        LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;

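However, this only searches the 2 following rows. My question is: how can I rephrase the query so that the window includes both the preceding and the following rows in the table? Obviously, I can't have 2 WINDOW statements or do something like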

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING


【Answer 1】:

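I'm not sure I understood your use case correctly: you want to check whether a given annotation occurs in one of 5 rows (the 2 preceding rows, the current row, and the 2 following rows). Is that right?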


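    It is possible to define a window like BETWEEN 2 PRECEDING AND 2 FOLLOWING.
    LEAD and LAG each return only a single value, in this case the one value after or before the current row (if the window supports it), no matter how many rows your window contains. But you want to check any of these five rows.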
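
To illustrate the point about LEAD and LAG returning a single value each, here is a minimal sketch. It uses the four sample rows from the question inlined in a CTE; the CTE name demo_tokens and the output column names are made up for this illustration. Covering a window of n rows this way needs one LAG or LEAD call per offset, and as far as I know both functions look at the partition rather than the frame.

```sql
-- Sample rows copied from the question; demo_tokens is a hypothetical name.
WITH demo_tokens(line, tno, token, annotation, lemma) AS (
    VALUES
        ('I.01', 1, 'This',    'DEM.PROX',        'this'),
        ('I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
        ('I.01', 3, 'an',      'ART.INDEF',       'a'),
        ('I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT token,
       LAG(annotation, 2)  OVER w AS prev2_anno,  -- value exactly 2 rows back, NULL if there is none
       LAG(annotation)     OVER w AS prev1_anno,  -- the default offset is 1
       LEAD(annotation)    OVER w AS next1_anno,
       LEAD(annotation, 2) OVER w AS next2_anno   -- value exactly 2 rows ahead, NULL if there is none
FROM demo_tokens
WINDOW w AS (ORDER BY line, tno)
ORDER BY line, tno;
```

Each output column holds exactly one neighbouring annotation, so a condition over the whole window would have to be OR-ed across all four columns plus the current row.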

One way to achieve this:


demo: db<>fiddle
SELECT *
FROM (
    SELECT token, annotation, lemma,
        unnest(array_agg(annotation) OVER w) as surrounded_annos      -- 2
    FROM tokens
    WINDOW w AS (                                                     -- 1
        ORDER BY line, tno ASC
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".surrounded_annos LIKE '...'
;
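    1. Define the window as described above.
    2. array_agg aggregates all annotations in these five rows (where they exist), which gives an array. unnest then expands this array into one row per element, because IMHO there is no way to search array elements with LIKE. This gives you the following result (which can be filtered in the next step):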

Result of the subquery:

token     annotation        lemma     surrounded_annos
This      DEM.PROX          this      DEM.PROX
This      DEM.PROX          this      VB.COP.3SG.PRES
This      DEM.PROX          this      ART.INDEF
is        VB.COP.3SG.PRES   be        DEM.PROX
is        VB.COP.3SG.PRES   be        VB.COP.3SG.PRES
is        VB.COP.3SG.PRES   be        ART.INDEF
is        VB.COP.3SG.PRES   be        NN.INAN
an        ART.INDEF         a         DEM.PROX
an        ART.INDEF         a         VB.COP.3SG.PRES
an        ART.INDEF         a         ART.INDEF
an        ART.INDEF         a         NN.INAN
example   NN.INAN           example   VB.COP.3SG.PRES
example   NN.INAN           example   ART.INDEF
example   NN.INAN           example   NN.INAN
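
As a variant of the same window, here is a hedged sketch that swaps array_agg plus unnest for the bool_or aggregate used as a window function. It keeps one row per token and yields a single boolean for the whole five-row frame, which makes a negative condition ("no row in the window matches") straightforward. The CTE names demo_tokens and ctx, the column cop_in_window, and the patterns are assumptions chosen for illustration; the sample rows are the ones from the question.

```sql
-- Sample rows copied from the question; demo_tokens, ctx and cop_in_window are hypothetical names.
WITH demo_tokens(line, tno, token, annotation, lemma) AS (
    VALUES
        ('I.01', 1, 'This',    'DEM.PROX',        'this'),
        ('I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
        ('I.01', 3, 'an',      'ART.INDEF',       'a'),
        ('I.01', 4, 'example', 'NN.INAN',         'example')
),
ctx AS (
    SELECT token, annotation, lemma,
           -- true if ANY row in the frame (current row included) matches the pattern
           bool_or(annotation LIKE '%VB.COP%') OVER w AS cop_in_window
    FROM demo_tokens
    WINDOW w AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
)
SELECT token, annotation, lemma
FROM ctx
WHERE lemma LIKE 'a'
  AND NOT cop_in_window;   -- write "AND cop_in_window" for the positive search
```

With the sample data this returns no rows, because the token "an" has "is" (VB.COP.3SG.PRES) within two rows; the positive form returns the single row for "an".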

【Comments】:

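This seems to do basically what I want (thanks!). However, when the filter condition on surrounded_annos is negative (NOT LIKE), is there a way to eliminate a token whenever the predicate matches anywhere in its window? So: WHERE lemma LIKE 'an' AND "window".surrounded_annos NOT LIKE '%VB.COP%' should return an empty result, because the row containing "an" has a neighbour for which LIKE '%VB.COP%' is true.

【Answer 2】: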

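Another approach is to compute the relative position of each token within its sentence and perform a self-join of the tokens table (this also lets you select skip-grams based on distance):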


WITH www AS (   -- enumerate word positions within sentences
    SELECT line, tno    -- candidate key
        , row_number() OVER sentence AS rn
    FROM tokens
    WINDOW sentence AS ( ORDER BY line ASC, tno ASC)
        )
SELECT t0.line AS line
        , t0.token AS this
        , t1.tno AS tno
        , w1.rn - w0.rn AS rel  -- relative position
        , t1.token AS that
        , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line     -- same sentence
JOIN www w0 ON t0.line = w0.line AND t0.tno= w0.tno -- PK1
JOIN www w1 ON t1.line = w1.line AND t1.tno= w1.tno -- PK2
WHERE 1=1
AND t0.lemma LIKE 'be'
    -- AND t1.annotation LIKE '%.PROX' AND w1.rn - w0.rn  = -1
        ;

-- But if your tno is consecutive (gapless) within lines,
-- you can omit the enumeration step and do a plain self-join:

SELECT t0.line AS line
        , t0.token AS this
        , t1.tno AS tno
        , t1.tno - t0.tno AS rel        -- relative position
        , t1.token AS that
        , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line     -- same sentence
WHERE 1=1
AND t0.lemma LIKE 'be'
    -- AND t1.annotation LIKE '%.PROX' AND t1.tno - t0.tno = -1
        ;
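
To tie the self-join back to the "n rows before and after" requirement, here is a minimal sketch that bounds the relative position in the join condition. It assumes, as noted above, that tno is gapless within each line; the CTE name demo_tokens, the window size of 2, and the patterns are illustrative assumptions, and the sample rows come from the question.

```sql
-- Sample rows copied from the question; demo_tokens is a hypothetical name.
WITH demo_tokens(line, tno, token, annotation, lemma) AS (
    VALUES
        ('I.01', 1, 'This',    'DEM.PROX',        'this'),
        ('I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
        ('I.01', 3, 'an',      'ART.INDEF',       'a'),
        ('I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT t0.line          AS line
     , t0.token         AS this
     , t1.token         AS that
     , t1.tno - t0.tno  AS rel     -- relative position, negative = before
     , t1.annotation    AS anno
FROM demo_tokens t0
JOIN demo_tokens t1
  ON t1.line = t0.line                            -- same sentence
 AND t1.tno BETWEEN t0.tno - 2 AND t0.tno + 2     -- n = 2 on both sides
 AND t1.tno <> t0.tno                             -- skip the token itself
WHERE t0.lemma LIKE 'be'
  AND t1.annotation LIKE '%.PROX'
ORDER BY t0.line, t0.tno, rel;
```

On the sample data this yields one row: "is" paired with "This" at rel = -1. To express the negative case, the join could be replaced by a NOT EXISTS subquery using the same distance condition.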
