BigQuery comma delimited string evaluation

【Title】: BigQuery comma delimited string evaluation 【Posted】: 2018-10-03 18:20:44 【Question】:

I'm using BigQuery and trying to parse a comma-delimited string to find specific numbers in it.

The sample table is as follows:

|--------|-----------------------------------------------------------|
| userID | sequence                                                  |
|--------|-----------------------------------------------------------|
| 123abc | 1,2,3,4,5,6,7,8                                           |                                          
|--------|-----------------------------------------------------------|
| 456bcd | 1,2,3,4,5,6,7,8,9,10,11                                   |
|--------|-----------------------------------------------------------|
| 789def | 1,2                                                       |
|--------|-----------------------------------------------------------|

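I need to create CASE statements in which each value in the string "sequence" is evaluated according to the following logic, with each result output to its own column.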

SELECT userID
,sequence
,CASE WHEN sequence CONTAINS '1' THEN 1 ELSE 0 END AS action1 
,CASE WHEN sequence CONTAINS '2' THEN 1 ELSE 0 END AS action2 
,CASE WHEN sequence CONTAINS '3' THEN 1 ELSE 0 END as action3
....
,CASE WHEN sequence CONTAINS '9' AND '11' THEN 1 ELSE 0 END as action10

This would produce the following output.

|--------|-------------------------|-------|-------|-------|---------|
| userID | sequence                |action1|action2|action3|action10 |
|--------|-------------------------|-------|-------|-------|---------|
| 123abc | 1,2,3,4,5,6,7,8         |   1   |   1   |   1   |    0    |                                          
|--------|-------------------------|-------|-------|-------|---------|
| 456bcd | 1,2,3,4,5,6,7,8,9,10,11 |   1   |   1   |   1   |    1    |
|--------|-------------------------|-------|-------|-------|---------|
| 789def | 1,2                     |   1   |   1   |   0   |    0    |
|--------|-------------------------|-------|-------|-------|---------|

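Please note that the last CASE WHEN statement is very important, because I need to treat this very specific combination of string values as its own distinct action.

I believe this could be done in SQL Server using something like: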

CASE WHEN CHARINDEX('1', sequence)>0 THEN 1 ELSE 0 END as action1
,CASE WHEN CHARINDEX('2', sequence)>0 THEN 1 ELSE 0 END as action2
,CASE WHEN CHARINDEX('3', sequence)>0 THEN 1 ELSE 0 END as action3
...
,CASE WHEN CHARINDEX('9', sequence)>0 AND CHARINDEX('11', sequence)>0 THEN 1 ELSE 0 END as action10

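However, I couldn't find an equivalent function in BigQuery that achieves the same result, and my attempts with regex have also failed.

I would greatly appreciate some guidance here. Thanks in advance.

【Comments on the question】:

【Answer 1】:

See the example below (for BigQuery Standard SQL)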

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '123abc' userID, '1,2,3,4,5,6,7,8' sequence UNION ALL
  SELECT '456bcd', '1,2,3,4,5,6,7,8,9,10,11' UNION ALL
  SELECT '789def', '1,2' 
)
SELECT userID, 
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '1' ) > 0, 1, 0)  action1,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '2' ) > 0, 1, 0)  action2,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '3' ) > 0, 1, 0)  action3,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '4' ) > 0, 1, 0)  action4,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '5' ) > 0, 1, 0)  action5,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '6' ) > 0, 1, 0)  action6,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '7' ) > 0, 1, 0)  action7,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '8' ) > 0, 1, 0)  action8,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value = '9' ) > 0, 1, 0)  action9,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value IN ('10', '11' )) > 0, 1, 0)  action10
FROM `project.dataset.table`
-- ORDER BY userID  

This will give you a result like the one below

Row userID  action1 action2 action3 action4 action5 action6 action7 action8 action9 action10     
1   123abc  1       1       1       1       1       1       1       1       0       0    
2   456bcd  1       1       1       1       1       1       1       1       1       1    
3   789def  1       1       0       0       0       0       0       0       0       0      

It is simplified, but it gives you some guidance, as you asked :o)
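To make the mechanics of the query above easier to follow, here is a minimal sketch that only lists the elements SPLIT and UNNEST produce for each row. It reuses the dummy data pattern from the answer; the comma join and the ORDER BY are added here for illustration and are not part of the original solution. The sketch assumes there is no whitespace around the commas in sequence.

```sql
#standardSQL
-- Illustration only: show each (userID, value) pair that the membership checks compare against.
WITH `project.dataset.table` AS (
  SELECT '123abc' userID, '1,2,3,4,5,6,7,8' sequence UNION ALL
  SELECT '789def', '1,2'
)
SELECT userID, value
FROM `project.dataset.table`,
  UNNEST(SPLIT(sequence)) AS value  -- for STRING input, SPLIT uses ',' as the default delimiter
ORDER BY userID
```

Because the comparison is made against whole elements, '1' does not match '10' or '11', which is a difference from a substring search such as CHARINDEX.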

See below for a refactoring idea (refactoring is usually an endless process), so that the query is at least less verbose

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '123abc' userID, '1,2,3,4,5,6,7,8' sequence UNION ALL
  SELECT '456bcd', '1,2,3,4,5,6,7,8,9,10,11' UNION ALL
  SELECT '789def', '1,2' 
)
SELECT userID, 
  IF('1' IN UNNEST(SPLIT(sequence)), 1, 0) AS action1,
  IF('2' IN UNNEST(SPLIT(sequence)), 1, 0) AS action2,
  IF('3' IN UNNEST(SPLIT(sequence)), 1, 0) AS action3,
  IF('4' IN UNNEST(SPLIT(sequence)), 1, 0) AS action4,
  IF('5' IN UNNEST(SPLIT(sequence)), 1, 0) AS action5,
  IF('6' IN UNNEST(SPLIT(sequence)), 1, 0) AS action6,
  IF('7' IN UNNEST(SPLIT(sequence)), 1, 0) AS action7,
  IF('8' IN UNNEST(SPLIT(sequence)), 1, 0) AS action8,
  IF('9' IN UNNEST(SPLIT(sequence)), 1, 0) AS action9,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value IN ('10', '11' )) > 0, 1, 0)  action10
FROM `project.dataset.table`
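If the repeated SPLIT(sequence) calls still feel noisy, one possible further step is to split the string once in a CTE and test membership against the resulting array. This is only a sketch: the names parsed and parts are made up for illustration, and only a few of the action columns are shown.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT '123abc' userID, '1,2,3,4,5,6,7,8' sequence UNION ALL
  SELECT '456bcd', '1,2,3,4,5,6,7,8,9,10,11' UNION ALL
  SELECT '789def', '1,2'
),
-- Hypothetical helper CTE: split each sequence once into an ARRAY<STRING>.
parsed AS (
  SELECT userID, SPLIT(sequence) AS parts
  FROM `project.dataset.table`
)
SELECT userID,
  IF('1' IN UNNEST(parts), 1, 0) AS action1,
  IF('2' IN UNNEST(parts), 1, 0) AS action2,
  IF('9' IN UNNEST(parts), 1, 0) AS action9
FROM parsed
```

This changes only how the query is written; no claim is made here about any difference in cost or speed.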

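Update to address the comments about UNION ALL

The above uses the dummy data from your question so that you can test and play with it, whereas the solution itself is actually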

#standardSQL
SELECT userID, 
  IF('1' IN UNNEST(SPLIT(sequence)), 1, 0) AS action1,
  IF('2' IN UNNEST(SPLIT(sequence)), 1, 0) AS action2,
  IF('3' IN UNNEST(SPLIT(sequence)), 1, 0) AS action3,
  IF('4' IN UNNEST(SPLIT(sequence)), 1, 0) AS action4,
  IF('5' IN UNNEST(SPLIT(sequence)), 1, 0) AS action5,
  IF('6' IN UNNEST(SPLIT(sequence)), 1, 0) AS action6,
  IF('7' IN UNNEST(SPLIT(sequence)), 1, 0) AS action7,
  IF('8' IN UNNEST(SPLIT(sequence)), 1, 0) AS action8,
  IF('9' IN UNNEST(SPLIT(sequence)), 1, 0) AS action9,
  IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value IN ('10', '11' )) > 0, 1, 0)  action10
FROM `project.dataset.table`
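Since the question mentions that regex attempts failed, here is a hedged sketch of how the same whole-element check might look with REGEXP_CONTAINS. It is an alternative to the SPLIT approach above, not part of the original answer, and it assumes the numbers are separated by bare commas with no spaces.

```sql
#standardSQL
SELECT userID,
  -- (^|,) and (,|$) require the number to be a complete element,
  -- so the pattern for 1 does not match inside 10 or 11.
  IF(REGEXP_CONTAINS(sequence, r'(^|,)1(,|$)'), 1, 0) AS action1,
  IF(REGEXP_CONTAINS(sequence, r'(^|,)9(,|$)'), 1, 0) AS action9
FROM `project.dataset.table`
```

As in the answer, project.dataset.table stands in for your own project, dataset and table names.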

【Comments】:

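Thanks @Mikhail Berlyant, this is a good solution, though not ideal: I have many, many rows of user IDs, so I would prefer a solution that doesn't require an extensive UNION ALL list

If you are referring to the UNION ALL in the WITH clause: that is not part of the solution, just a way to use sample data. See the update in my answer in a few seconds!

See the update in the answer. If it is still confusing, check and let me know! Obviously you should use your own project, dataset and table names instead of project.dataset.table

Thanks @Mikhail Berlyant. It has been a long day, so I misunderstood your answer, but it makes perfect sense. I had to modify the CASE statement to the following: ,IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value IN ('10') AND value NOT IN ('11'))>0, 1, 0) AS action10,IF((SELECT COUNT(1) FROM UNNEST(SPLIT(sequence)) value WHERE value IN ('10') AND value IN ('11')) > 0, 1, 0) AS action11, because otherwise IN('','') is essentially an OR, which produces incorrect results. But your logic is clear and your approach is easy to follow. Thanks!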
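As a sketch of the "both values must be present" requirement raised in the comments, the check can be written as two separate membership tests joined with AND. The example uses '9' and '11' from the original question; this is an illustration, not the answerer's code. Note that a WHERE clause inside the UNNEST subquery is evaluated one element at a time, so a condition such as value IN ('10') AND value IN ('11') cannot be true for any single element.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT '123abc' userID, '1,2,3,4,5,6,7,8' sequence UNION ALL
  SELECT '456bcd', '1,2,3,4,5,6,7,8,9,10,11' UNION ALL
  SELECT '789def', '1,2'
)
SELECT userID,
  -- 1 only when BOTH '9' and '11' appear as elements of the sequence.
  IF('9' IN UNNEST(SPLIT(sequence)) AND '11' IN UNNEST(SPLIT(sequence)), 1, 0) AS action10
FROM `project.dataset.table`
```

With the dummy data, only 456bcd contains both elements, which agrees with the action10 column in the expected output shown in the question.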
