How can I check how many words two strings have in common in BigQuery?

Posted: 2018-05-02 15:05:34

Question:

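I have a list of documents in a BigQuery table. Some of their names are very similar. I need to check every pair of documents to see how many words they have in common, so that I can suggest deleting one of them.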

For example:

 Spreadsheets
 Quality Control.xlsx
 Product Structure.xlsx
 Invoices Sent April.xslx
 Invoices Sent March.xlsx
 Total Costs April.xlsx
 Total Costs March.xlsx
 Process of Quality Control.xlsx

I would get a result like this:

 Spreadsheet                        |Matching Spreadsheet             |Words
 Quality Control.xlsx               |Process of Quality Control.xlsx  |2
 Product Structure.xlsx             |null                             |null
 Invoices Sent April.xslx           |Invoices Sent March.xlsx         |2
 Invoices Sent March.xlsx           |Invoices Sent April.xlsx         |2
 Total Costs April.xlsx             |Total Costs March.xlsx           |2
 Total Costs March.xlsx             |Total Costs April.xlsx           |2
 Process of Quality Control.xlsx    |Quality Control.xlsx             |2

Comments:

You should provide more details!

I just updated the description. I hope my question is clearer now.

Answer 1:

Below is an example for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.spreadsheets`  AS (
  SELECT 1 AS id, 'Quality Control.xlsx' AS spreadsheet UNION ALL
  SELECT 2, 'Product Structure.xlsx' UNION ALL
  SELECT 3, 'Invoices Sent April.xslx' UNION ALL
  SELECT 4, 'Invoices Sent March.xlsx' UNION ALL
  SELECT 5, 'Total Costs April.xlsx' UNION ALL
  SELECT 6, 'Total Costs March.xlsx' UNION ALL
  SELECT 7, 'Process of Quality Control.xlsx' 
)
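SELECT 
  -- Show NULL instead of a match when the best candidate shares no words
  id, s1 spreadsheet, IF(words = 0, NULL, s2) matching_spreadsheet, words 
FROM (
  -- For each spreadsheet, keep only the candidate with the most shared words
  SELECT 
    id, s1,
    ARRAY_AGG(STRUCT(s2, words) ORDER BY words DESC LIMIT 1)[OFFSET(0)].* 
  FROM (
    -- Pair every name with every other name and count the words they share
    SELECT t1.id, t1.spreadsheet s1, t2.spreadsheet s2,
      -- Split both names into words, join on equal words, skip the extension
      ( SELECT COUNTIF(word != 'xlsx') 
        FROM UNNEST(REGEXP_EXTRACT_ALL(t1.spreadsheet, r'\w+')) word
        JOIN UNNEST(REGEXP_EXTRACT_ALL(t2.spreadsheet, r'\w+')) word
        USING(word)) words
    FROM `project.dataset.spreadsheets` t1
    CROSS JOIN `project.dataset.spreadsheets` t2
    WHERE t1.spreadsheet != t2.spreadsheet
  )
  GROUP BY id, s1
)
-- ORDER BY id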

The result is:

Row id  spreadsheet                     matching_spreadsheet            words
1   1   Quality Control.xlsx            Process of Quality Control.xlsx 2
2   2   Product Structure.xlsx          null                            0
3   3   Invoices Sent April.xslx        Invoices Sent March.xlsx        2
4   4   Invoices Sent March.xlsx        Invoices Sent April.xslx        2
5   5   Total Costs April.xlsx          Total Costs March.xlsx          2 
6   6   Total Costs March.xlsx          Total Costs April.xlsx          2
7   7   Process of Quality Control.xlsx Quality Control.xlsx            2
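
Note that the word join in the query above is case-sensitive and counts a repeated word once per matching pair. If that matters for your file names, a minimal sketch of a variant that lowercases both names and counts distinct shared words could look like this. The `pairs` CTE and its single sample row are hypothetical, made up only to illustrate the difference; in practice the inner subquery would replace the word-counting subquery above.

```sql
#standardSQL
-- Hypothetical sample row with mixed case and a repeated word
WITH pairs AS (
  SELECT 'Total Costs April.xlsx' AS a, 'total costs TOTAL.xlsx' AS b
)
SELECT a, b,
  -- Count each shared word once, ignoring case and the file extension
  ( SELECT COUNT(DISTINCT word)
    FROM UNNEST(REGEXP_EXTRACT_ALL(LOWER(a), r'\w+')) AS word
    WHERE word != 'xlsx'
      AND word IN UNNEST(REGEXP_EXTRACT_ALL(LOWER(b), r'\w+'))
  ) AS words
FROM pairs
```

For this sample row the shared words are "total" and "costs", so the count should come out as 2, whereas the case-sensitive join would find no shared words besides the extension.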
