Redshift sample from table based on count of another table

Posted 2023-03-31

Asked: 2020-06-30 15:41:44

Question:

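My TableA has 3000 rows (it could be any number below 10000). I need to create TableX with 10000 rows, so I need to pick 10000 - (number of rows in TableA) random rows from TableB and add them to the rows of TableA to build TableX. Any ideas? Something like this (which obviously does not work):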

Create table TableX as
select * from TableA
union
select * from TableB limit (10000 - count(*) from TableA);

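Comments:

select * from tablex limit 10000?

TableX does not exist. TableB has millions of rows. I need a random sample from TableB, and the size of that sample depends on the number of rows in TableA. The random selection from TableB is then combined with TableA into the new table TableX.

If you don't use order by, limit is somewhat random.

Answer 1: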

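You can use union all together with a window function. You did not list the table columns, so I am assuming col1 and col2 (table1 and table2 below stand in for your TableA and TableB):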

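insert into tableX (col1, col2)
select col1, col2 from table1
union all
select t2.col1, t2.col2
-- rn numbers the rows of table2 in random order
from (select t2.*, row_number() over(order by random()) rn from table2 t2) t2
-- keep only as many random rows as are needed to reach 10000 in total
inner join (select count(*) cnt from table1) t1 on t2.rn <= 10000 - t1.cnt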

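The first query in the union all selects every row from table1. The second query assigns a random row number to each row of table2, then keeps as many of those rows as are needed to bring the total to 10000.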
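To try the row-number approach without a Redshift cluster, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. The helper name build_sample, the small row counts, and the use of SQLite are assumptions made for illustration; SQLite needs version 3.25 or later for window functions, and its random() is not the same function as Redshift's, although ordering by it shuffles rows in the same way.

```python
import sqlite3


def build_sample(n_a, n_b, target):
    """Hypothetical helper: fill tableX with all of table1 plus random table2 rows."""
    conn = sqlite3.connect(":memory:")
    for name in ("table1", "table2", "tableX"):
        conn.execute(f"create table {name} (col1 integer, col2 text)")
    # col2 records which table a row came from, so the result is easy to check.
    conn.executemany("insert into table1 values (?, ?)", [(i, "a") for i in range(n_a)])
    conn.executemany("insert into table2 values (?, ?)", [(i, "b") for i in range(n_b)])
    conn.execute(
        """
        insert into tableX (col1, col2)
        select col1, col2 from table1
        union all
        select t2.col1, t2.col2
        from (select t2.*, row_number() over (order by random()) as rn
              from table2 t2) t2
        inner join (select count(*) as cnt from table1) t1
                on t2.rn <= ? - t1.cnt
        """,
        (target,),
    )
    # Return the number of rows per source table.
    return conn.execute(
        "select col2, count(*) from tableX group by col2 order by col2"
    ).fetchall()


if __name__ == "__main__":
    print(build_sample(3, 100, 10))
```

If table1 already holds at least the target number of rows, the join condition matches nothing and only the table1 rows are inserted.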

It may actually be simpler to select all rows from both tables and then apply order by and limit in the outer query:

insert into tableX (col1, col2)
select col1, col2
from (
    select col1, col2, 't1' which from table1
    union all 
    select col1, col2, 't2' from table2
) t
order by which, random()
limit 10000
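The order by / limit variant can be checked the same way. The sketch below is an illustration only, again using an in-memory SQLite database from Python rather than Redshift; the function name sample_with_limit and the tiny table sizes are assumptions. It returns the selected rows together with the which tag so the ordering is visible.

```python
import sqlite3


def sample_with_limit(n_a, n_b, target):
    """Hypothetical helper: table1 rows first, then random table2 rows, cut at target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table1 (col1 integer, col2 text)")
    conn.execute("create table table2 (col1 integer, col2 text)")
    conn.executemany("insert into table1 values (?, ?)", [(i, "a") for i in range(n_a)])
    conn.executemany("insert into table2 values (?, ?)", [(i, "b") for i in range(n_b)])
    return conn.execute(
        """
        select col1, col2, which
        from (
            select col1, col2, 't1' as which from table1
            union all
            select col1, col2, 't2' as which from table2
        ) t
        -- 't1' sorts before 't2', so every table1 row precedes the shuffled table2 rows
        order by which, random()
        limit ?
        """,
        (target,),
    ).fetchall()


if __name__ == "__main__":
    for row in sample_with_limit(3, 100, 10):
        print(row)
```

One difference from the row-number query: if table1 ever held more rows than the target, this version would truncate table1 to the target as well, whereas the row-number query would keep every table1 row.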

Answer 2:

with inparms as (
  -- total number of rows wanted in the result
  select 10000 as target_rows
), acount as (
  select count(*) as acount, inparms.target_rows
    from tablea
   cross join inparms
   group by inparms.target_rows
), btag as (
  -- tag every tableb row and number the rows in random order
  select b.*, 'tableb' as tabsource,
         row_number() over (order by random()) as rnum
    from tableb b
)
select a.*, 'tablea' as tabsource, row_number() over (order by 1) as rnum
  from tablea a
union all
select b.*
  from btag b
  join acount a on b.rnum <= a.target_rows - a.acount
;
