BigQuery: finding text in a column based on another table
[Title]: BigQuery finding text in a column based on another table
[Posted]: 2021-08-30 11:43:28
[Question]: I want to return all blocklist words that appear in the product description.
with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select 'whatsapp web'
),
products as (
select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select
seller
,product
,blocklists
from
?
The result would look like this:
seller | product | blocklists |
---|---|---|
seller1 | Tenis Nike 43 call me on instagram or facebook | instagram,facebook |
seller1 | TV 42 sansung link whatsapp WEB or INSTAGRAM | whatsapp web,instagram |
seller2 | TV 42 sansung link | null |
Do I need to convert the blocklist into an array, or use a regular expression in the select?
[Question comments]:
[Answer 1]: This works for your example:
with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select 'whatsapp web'
),
products as (
select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select p.*,
(select array_agg(bl.blocklist)
from blocklist bl
where lower(p.product) like concat('%', lower(bl.blocklist), '%')
) as blocklists
from products p
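For a comma-separated string instead of an array, the same query can be written with `string_agg` in place of `array_agg`. This is a minimal sketch on the sample data from the question; the `blocklists` alias and the `','` separator are choices made here to match the desired output, not part of the original answer.

with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select 'whatsapp web'
),
products as (
select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select p.*,
-- string_agg joins the matching terms; with no matching rows it yields NULL
(select string_agg(bl.blocklist, ',')
from blocklist bl
where lower(p.product) like concat('%', lower(bl.blocklist), '%')
) as blocklists
from products p

Without an `order by` inside `string_agg`, the order of the terms in the resulting string is not guaranteed.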
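[Comments]:
Hi Gordon! Thanks a lot. Is it possible to convert this array into a comma-separated string?
I changed array_agg to string_agg and it worked! Thank you very much!
[Answer 2]: Consider the approach below: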
select p.*,
lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
from products p, (select string_agg(b.blocklist, '|') list from blocklist b)
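If applied to the sample data in your question, the output is:
[output image not preserved]
You can try it yourself with the query below: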
with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select 'whatsapp web'
), products as (
select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select p.*,
lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
from products p, (select string_agg(b.blocklist, '|') list from blocklist b)
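One difference from the desired output in the question: when nothing matches, `regexp_extract_all` returns an empty array, so `array_to_string` should produce an empty string rather than null. A hedged variant that wraps the expression in `nullif` to get null for such rows (the `nullif` wrapper is an addition made here, not part of the original answer):

with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select 'whatsapp web'
), products as (
select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select p.*,
-- nullif turns the empty string produced for non-matching rows into NULL
nullif(lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')), '') blocklists
from products p, (select string_agg(b.blocklist, '|') list from blocklist b)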
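[Comments]:
I tried using regexp_extract_all, but BigQuery tells me: Cannot parse regular expression: no argument for repetition operator: ?
Check back in a few minutes; I will add a test example to my answer for you to play with.
Added the example to my answer!
Please note: my products table has more than 1 billion rows. I want to avoid a Cartesian product (products x blocklist), because BigQuery gave me the message "Cannot query rows larger than 100MB limit". Is there a way to do the matching row by row? For example: select p.*, p.product => in list or blocklist array from products p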
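The parse error quoted in the comments ("no argument for repetition operator: ?") is the kind of message the regex engine gives when a `?` has nothing in front of it to repeat. One possible cause, assumed here rather than confirmed in the thread, is a blocklist entry that contains regex metacharacters. A sketch that escapes each entry with `regexp_replace` before building the alternation; the `'?promo'` entry and the shortened sample rows are hypothetical, added only to trigger the problem:

with blocklist as (
select 'instagram' as blocklist union all
select 'facebook' as blocklist union all
select '?promo' -- hypothetical entry starting with a regex metacharacter
), products as (
select 'seller1' as seller, 'call me on INSTAGRAM ?promo' as product union all
select 'seller2' as seller, 'TV 42 sansung link' as product
)
select p.*,
lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
from products p,
-- prefix every metacharacter with a backslash (\0 is the whole match) so each term is matched literally
(select string_agg(regexp_replace(b.blocklist, r'[\\.^$|?*+()\[\]{}]', r'\\\0'), '|') list
from blocklist b)

This only addresses the parse error; it does not change how the query scales on very large tables.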