UDF worker timed out during execution when using a JavaScript UDF in BigQuery for TF-IDF calculation

Posted 2023-04-13

Asked: 2019-07-30 05:17:10

Question:

I am trying to implement a query in BigQuery that uses TF-IDF scores to find the top keywords of each document within a larger collection of documents.

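Before computing the TF-IDF scores of the keywords, I clean the documents (e.g. remove stop words and punctuation), then create 1-, 2-, 3- and 4-grams from the documents, and then stem the tokens inside the n-grams.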
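To illustrate the n-gram step, here is a minimal sketch in plain JavaScript. It is not the compromise API; the function name `countNgrams` and the whitespace tokenization are assumptions made for illustration only. It returns an array of `{ngram, count}` objects, one per distinct n-gram.

```javascript
// Illustrative only: count every 1- to maxN-gram in a whitespace-tokenized
// string. Hypothetical helper, not part of compromise or any other library.
function countNgrams(text, maxN) {
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const counts = new Map();
  for (let n = 1; n <= maxN; n++) {
    // Slide a window of n tokens over the token list.
    for (let i = 0; i + n <= tokens.length; i++) {
      const gram = tokens.slice(i, i + n).join(" ");
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  // One {ngram, count} object per distinct n-gram.
  return Array.from(counts, ([ngram, count]) => ({ ngram, count }));
}
```

For the input "fox fox fox jumped" with `maxN = 2`, this yields four entries: "fox" (3), "jumped" (1), "fox fox" (2) and "fox jumped" (1).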

To perform the cleaning, n-gram creation and stemming, I use JavaScript libraries and a JS UDF. Here is the sample query:

CREATE TEMP FUNCTION nlp_compromise_tokens(str STRING)
RETURNS ARRAY<STRUCT<ngram STRING, count INT64>> LANGUAGE js AS '''
  // Create 1-, 2-, 3- and 4-grams using compromise.js.
  // Before that, remove stop words using .removeStopWords(),
  // a function borrowed from remove_stop_words.js.
  var tokens_from_compromise = nlp(str.removeStopWords()).normalize().ngrams({max: 4}).data();

  // The stemming function that stems
  // each space-separated token inside an n-gram.
  // I use snowball.babel.js here.
  function stems_from_space_separated_string(tokens_string) {
    var stem = snowballFactory.newStemmer('english').stem;
    var splitted_tokens = tokens_string.split(" ");
    var splitted_stems = splitted_tokens.map(x => stem(x));
    return splitted_stems.join(" ");
  }

  // Return the n-grams from compromise, which are
  // stemmed internally and at least length of 2,
  // alongside the count of each n-gram inside the document.
  var ngram_count = tokens_from_compromise.map(function(item) {
    return {
      ngram: stems_from_space_separated_string(item.normal),
      count: item.count
    };
  });
  return ngram_count;
'''
OPTIONS (
  library=["gs://fh-bigquery/js/compromise.min.11.14.0.js","gs://syed_mag/js/snowball.babel.js","gs://syed_mag/js/remove_stop_words.js"]);

with doc_table as (
  SELECT 1 id, "A quick brown 20 fox fox fox jumped over the lazy-dog" doc UNION ALL
  SELECT 2, "another 23rd quicker browner fox jumping over Lazier broken! dogs." UNION ALL
  SELECT 3, "This dog is more than two-feet away." #UNION ALL
),
  ngram_table as(
  select
    id,
    doc,
    nlp_compromise_tokens(doc) as compromise_tokens
  from
    doc_table),
n_docs_table as (
  select count(*) as n_docs from ngram_table
),
df_table as (
SELECT
  compromise_token.ngram,
  count(*) as df
FROM
  ngram_table, UNNEST(compromise_tokens) as compromise_token
GROUP BY
  ngram
),

idf_table as(
SELECT
  ngram,
  df,
  n_docs,
  LN((1+n_docs)/(1+df)) + 1 as idf_smooth
FROM
  df_table
CROSS JOIN
  n_docs_table),

tf_idf_table as (  
SELECT
  id,
  doc,
  compromise_token.ngram,
  compromise_token.count as tf,
  idf_table.ngram as idf_ngram,
  idf_table.idf_smooth,
  compromise_token.count * idf_table.idf_smooth as tf_idf
FROM
  ngram_table, UNNEST(compromise_tokens) as compromise_token
JOIN
  idf_table
ON
  compromise_token.ngram = idf_table.ngram)

SELECT
  id,
  ARRAY_AGG(STRUCT(ngram,tf_idf)) as top_keyword,
  doc
FROM(
  SELECT
    id,
    doc,
    ngram,
    tf_idf,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY tf_idf DESC) AS rn
  FROM
    tf_idf_table)
WHERE
  rn < 5
group by
  id,
  doc

The sample output looks like this:

This example uses only three handcrafted sample rows.
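To make the scoring in the query above concrete, the following is a minimal JavaScript sketch of the same arithmetic: the smoothed IDF from `idf_table` and the per-document product from `tf_idf_table`. The function names `idfSmooth` and `tfIdf` and the input shape are hypothetical. The sketch assumes each n-gram appears at most once per document's token list, as the `count(*)` in `df_table` implicitly does.

```javascript
// Smoothed IDF, mirroring LN((1 + n_docs) / (1 + df)) + 1 in idf_table.
function idfSmooth(nDocs, df) {
  return Math.log((1 + nDocs) / (1 + df)) + 1;
}

// docs: array of { id, tokens: [{ ngram, count }] }, analogous to the rows
// of ngram_table. Assumes each ngram occurs once per document's token list.
function tfIdf(docs) {
  // Document frequency: number of documents containing each n-gram.
  const df = new Map();
  for (const doc of docs) {
    for (const { ngram } of doc.tokens) {
      df.set(ngram, (df.get(ngram) || 0) + 1);
    }
  }
  // One output row per (document, n-gram) pair, as in tf_idf_table.
  const rows = [];
  for (const doc of docs) {
    for (const { ngram, count } of doc.tokens) {
      const idf = idfSmooth(docs.length, df.get(ngram));
      rows.push({ id: doc.id, ngram, tf_idf: count * idf });
    }
  }
  return rows;
}
```

With three documents, an n-gram that occurs in all of them has an IDF of exactly 1, since ln(4/4) = 0, so its TF-IDF equals its raw count. Rarer n-grams are weighted up from there.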

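When I run the same code on a slightly larger table with 1,000 rows, it again runs fine, although it takes considerably longer to finish (about 6 minutes for just 1,000 rows). This sample table (1 MB) can be found here in JSON format.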

Now, when I try the query on a larger dataset (159K rows, 155 MB), the query fails after about 30 minutes with the following message:

Error: User-defined function: UDF worker timed out during execution. Unexpected abort triggered for worker worker-109498: job_timeout (error code: timeout)

Can I improve my UDF or the overall query structure so that it runs smoothly on even larger datasets (124,783,298 rows, 244 GB)?

Note: I have given the appropriate permissions to the JS files in Google Cloud Storage, so that anyone can access these JavaScript files to run the sample query.

Answer 1:

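BigQuery UDFs are very handy, but they are computationally expensive and can make your query slow or exhaust its resources. See the doc reference for limits and best practices. In general, any UDF logic that you can convert into native SQL will be faster and use fewer resources.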

I would split your analysis into multiple steps, saving the result of each step into a new table:

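- Clean the documents (e.g. remove stop words and punctuation).
- Create 1-, 2-, 3- and 4-grams from the documents, then stem inside the n-grams.
- Compute the scores.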

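Side note: you might be able to run it with multiple CTEs that hold the stages instead of saving each step into a native table, but I don't know whether that would push the query over the resource limits.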
