Using a subquery in the where clause to select the second-highest date from a table

Posted 2023-03-31

[Original title]: Using a subquery in the where clause to select 2nd highest date from a table

[Asked]: 2020-11-11 18:21:45

[Question]:

I need to do the following (in pseudocode):

where yyyy_mm_dd >= '2019-02-02'
and yyyy_mm_dd <= second highest date in a table

To do this, I used the following code:

where
    p.yyyy_mm_dd >= "2019-02-02"
    and p.yyyy_mm_dd <= (select max(yyyy_mm_dd) from schema.table1 where yyyy_mm_dd < (select max(yyyy_mm_dd) from schema.table1 where yyyy_mm_dd is not null))

The above works when wrapped in spark.sql(), but when I run the query without Spark, i.e. as raw HQL, I get this error:

Error while compiling statement: FAILED: ParseException line 102:25 cannot identify input near 'select' 'max' '(' in expression specification

I tried to fix it by aliasing all the columns in the subquery, like this:

where
    p.yyyy_mm_dd >= "2019-02-02"
    and p.yyyy_mm_dd <= (select max(t1.yyyy_mm_dd) from schema.table1 t1 where t1.yyyy_mm_dd < (select max(t2.yyyy_mm_dd) from schema.table1 t2 where t2.yyyy_mm_dd is not null))

However, I still get the same error.

Edit to include sample data and the query:

table1:

| yyyy_mm_dd | company_id | account_manager |
|------------|------------|-----------------|
| 2020-11-10 | 321        | Peter           |
| 2020-11-09 | 632        | John            |
| 2020-11-08 | 598        | Doe             |
| 2020-11-07 | 104        | Bob             |
| ...        | ...        | ...             |
| ...        | ...        | ...             |

table2:

| yyyy_mm_dd        | company_id | tier   |
|-------------------|------------|--------|
| 2020-11-10        | 321        | Bronze |
| 2020-11-09        | 632        | Silver |
| 2020-11-08        | 598        | Gold   |
| 2020-11-07        | 104        | Bob    |
| ...               | ...        | ...    |
| ...               | ...        | ...    |
| backup_2019_12_31 | 321        | Bronze |
| backup_2019_12_31 | 632        | Silver |
| ...               |            |        |

Query:

select
    p.yyyy_mm_dd,
    p.company_id,
    p.account_manager,
    t.tier
from
    table1 p
left join(
    select
        yyyy_mm_dd,
        company_id,
        max(tier) as tier
    from 
        table2
    where
        yyyy_mm_dd >= "2019-02-02"
    group by
        1,2
) t on (t.company_id = p.company_id and t.yyyy_mm_dd = p.yyyy_mm_dd)

where
    p.yyyy_mm_dd >= "2019-02-02"
    and p.yyyy_mm_dd <= (select max(yyyy_mm_dd) from table2 where yyyy_mm_dd < (select max(yyyy_mm_dd) from table2 where yyyy_mm_dd is not null))

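Because table2 contains backup_2019_12_31 in the yyyy_mm_dd column, running max() on the table returns those rows. So I need the second-highest value, which in this dataset is 2020-11-10. Each yyyy_mm_dd has multiple company_ids.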
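
The reason max() picks the backup marker is plain string ordering. A minimal Python sketch, assuming the column is stored as a string and using a handful of the values mentioned above (the list itself is illustrative, not the real table):

```python
# A few yyyy_mm_dd values as strings, mirroring the ones named in the question.
values = [
    "2020-11-10", "2020-11-09", "2020-11-08", "2020-11-07",
    "backup_2019_12_31", "backup_2019_12_31",
]

# Strings compare character by character, and "b" sorts after every digit,
# so the backup marker wins a plain max().
naive_max = max(values)

# Second-highest distinct value: drop duplicates, sort descending, take index 1.
distinct_desc = sorted(set(values), reverse=True)
second_highest = distinct_desc[1]

print(naive_max)
print(second_highest)
```

Here the second-highest distinct value is the latest real date only because there is exactly one distinct non-date marker sorting above the dates.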

Essentially, I want to query table1 where yyyy_mm_dd falls between table1's start date (hard-coded as 2019-02-02) and the true max date of table2.
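
If the goal is the true max date rather than literally the second-highest value, one alternative (not proposed in the thread) is to filter on the date pattern before taking max(). The sketch below uses Python's built-in sqlite3 as a stand-in for Hive, which is an assumption since the dialects differ; the backup_2019_12_13 row is hypothetical and only there to show where the second-highest approach would break.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The backup_2019_12_13 row is a hypothetical second marker, not from the question.
conn.executescript("""
    create table table2 (yyyy_mm_dd text, company_id integer, tier text);
    insert into table2 values
        ('2020-11-10', 321, 'Bronze'),
        ('2020-11-09', 632, 'Silver'),
        ('backup_2019_12_31', 321, 'Bronze'),
        ('backup_2019_12_13', 632, 'Silver');
""")

# Keep only values shaped like yyyy-mm-dd, then take the max.
true_max = conn.execute("""
    select max(yyyy_mm_dd)
    from table2
    where yyyy_mm_dd glob '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
""").fetchone()[0]

# For comparison: the second-highest distinct value over all rows.
second_highest = conn.execute("""
    select max(yyyy_mm_dd)
    from table2
    where yyyy_mm_dd < (select max(yyyy_mm_dd) from table2)
""").fetchone()[0]

print(true_max)
print(second_highest)
```

With two distinct markers, the second-highest value is itself a marker, while the pattern filter still lands on a real date. GLOB is SQLite syntax; a Hive version would need that dialect's own pattern-matching operator.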

[Comments]:

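Please provide the complete select so it is clearer which table you are selecting from.

I have updated the question with sample data and the query. I hope it is clearer now.

[Answer 1]: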

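To get the second-highest date from table2, you can use dense_rank. All rows with the second-highest date are assigned rn=2. Use LIMIT to get a single row (or use max() or another aggregate), then cross join your table with max_date and filter.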

with max_date as (
    select yyyy_mm_dd
    from (
        select yyyy_mm_dd,
               dense_rank() over (order by yyyy_mm_dd desc) rn
        from table2
    ) s
    where rn = 2  -- second max date
    limit 1       -- need only one record
)

select t1.*
from table1 t1
cross join max_date t2
where t1.yyyy_mm_dd <= t2.yyyy_mm_dd
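
To check the logic of this query on the sample rows, here is a minimal sketch that runs it through Python's built-in sqlite3. SQLite is only a stand-in for Hive (an assumption; window functions need SQLite 3.25 or later), and the 2020-11-11 row in table1 is hypothetical, added so the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question, plus one hypothetical table1 row (2020-11-11).
conn.executescript("""
    create table table1 (yyyy_mm_dd text, company_id integer, account_manager text);
    create table table2 (yyyy_mm_dd text, company_id integer, tier text);
    insert into table1 values
        ('2020-11-11', 999, 'Hypothetical'),
        ('2020-11-10', 321, 'Peter'), ('2020-11-09', 632, 'John'),
        ('2020-11-08', 598, 'Doe'),   ('2020-11-07', 104, 'Bob');
    insert into table2 values
        ('2020-11-10', 321, 'Bronze'), ('2020-11-09', 632, 'Silver'),
        ('2020-11-08', 598, 'Gold'),
        ('backup_2019_12_31', 321, 'Bronze'), ('backup_2019_12_31', 632, 'Silver');
""")

query = """
with max_date as (
    select yyyy_mm_dd
    from (
        select yyyy_mm_dd,
               dense_rank() over (order by yyyy_mm_dd desc) as rn
        from table2
    ) s
    where rn = 2   -- second-highest distinct value
    limit 1        -- one row is enough
)
select t1.yyyy_mm_dd, t1.company_id
from table1 t1
cross join max_date t2
where t1.yyyy_mm_dd <= t2.yyyy_mm_dd
order by t1.yyyy_mm_dd desc
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The backup marker takes rank 1, so rn = 2 resolves to 2020-11-10 and the hypothetical 2020-11-11 row is filtered out.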

[Comments]:

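My Hive is in strict mode, so cross joins are not allowed. Is there another way? Also, if I'm not mistaken, I think you should left join table2 on table2.yyyy_mm_dd <= max_date.yyyy_mm_dd so that the tier data is retained for each day. Sorry if I wasn't clear.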
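
One possible direction, offered as an untested sketch rather than a confirmed fix: replace the cross join with an equi-join on a constant key, and left join table2 for the tier. The join_key and cutoff names are made up for illustration. The code runs in Python's built-in sqlite3 (SQLite 3.25 or later), which only checks the logic; whether Hive's strict mode accepts a constant-key join is an assumption and has not been verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; company 104 has no table2 row in this sketch.
conn.executescript("""
    create table table1 (yyyy_mm_dd text, company_id integer, account_manager text);
    create table table2 (yyyy_mm_dd text, company_id integer, tier text);
    insert into table1 values
        ('2020-11-10', 321, 'Peter'), ('2020-11-09', 632, 'John'),
        ('2020-11-08', 598, 'Doe'),   ('2020-11-07', 104, 'Bob');
    insert into table2 values
        ('2020-11-10', 321, 'Bronze'), ('2020-11-09', 632, 'Silver'),
        ('2020-11-08', 598, 'Gold'),
        ('backup_2019_12_31', 321, 'Bronze'), ('backup_2019_12_31', 632, 'Silver');
""")

query = """
with max_date as (
    -- join_key is a hypothetical constant column used instead of a cross join
    select 1 as join_key, yyyy_mm_dd as cutoff
    from (
        select yyyy_mm_dd,
               dense_rank() over (order by yyyy_mm_dd desc) as rn
        from table2
    ) s
    where rn = 2
    limit 1
)
select p.yyyy_mm_dd, p.company_id, p.account_manager, t.tier
from (
    select 1 as join_key, yyyy_mm_dd, company_id, account_manager
    from table1
) p
join max_date m
    on p.join_key = m.join_key
left join table2 t
    on t.company_id = p.company_id and t.yyyy_mm_dd = p.yyyy_mm_dd
where p.yyyy_mm_dd >= '2019-02-02'
  and p.yyyy_mm_dd <= m.cutoff
order by p.yyyy_mm_dd desc
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left join keeps table1 rows that have no matching tier (company 104 here comes back with a NULL tier), and the backup rows in table2 never match because their yyyy_mm_dd is not a date present in table1.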
