REGEXP_REPLACE pattern has to be const? Comparing strings in BigQuery
Posted: 2016-03-22 02:05:20
Question:
I am trying to measure the similarity between strings using the Dice coefficient (a.k.a. pair similarity) in BigQuery. For a second I thought I could do it using just standard functions.
Suppose I need to compare "gana" and "gano". I would pre-"cook" the two strings into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do something like this:
REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')
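Then, based on the change in length, I can calculate my coefficient.
But as soon as I apply this to a table, I get:
REGEXP_REPLACE second argument must be const and non-null
Is there any workaround? With plain REPLACE() the second argument can be a field.
Maybe there is a better way to do this? I know I could use a UDF instead, but I would like to avoid them here. We are running big jobs, and UDFs are generally slower (at least in my experience) and are subject to different concurrency limits.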
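The length-change idea described in the question can be tried outside BigQuery. The following is a minimal sketch in plain JavaScript, not BigQuery code; the function names `bigrams` and `diceByReplace` are hypothetical. It assumes both words have at least two characters and contain no regex metacharacters or '|' characters.

```javascript
// Split a word into its list of 2-grams, e.g. 'gana' -> ['ga', 'an', 'na'].
function bigrams(word) {
  var grams = [];
  for (var i = 0; i < word.length - 1; ++i) {
    grams.push(word.substring(i, i + 2));
  }
  return grams;
}

// Estimate the Dice coefficient from how much shorter the first bigram list
// becomes once every bigram of the second word has been removed from it.
function diceByReplace(word1, word2) {
  var grams1 = bigrams(word1);
  var grams2 = bigrams(word2);
  var joined = grams1.join('|');                     // 'ga|an|na'
  var pattern = new RegExp(grams2.join('|'), 'g');   // /ga|an|no/g
  var stripped = joined.replace(pattern, '');        // '||na'
  // Every removed bigram shortens the string by exactly 2 characters.
  var matched = (joined.length - stripped.length) / 2;
  return 2 * matched / (grams1.length + grams2.length);
}

console.log(diceByReplace('gana', 'gano')); // 2 * 2 / (3 + 3)
```

Note that this shortcut removes every occurrence of a shared bigram, so words with repeated bigrams can be over-counted.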
Answer 1:
You can include JavaScript code inside a BigQuery SQL query.
To measure similarity, you can use the Levenshtein distance with a query like this (from https://***.com/a/33443564/132438):
SELECT *
FROM js(
(
  SELECT title,target FROM
  (SELECT 'hola' title, 'hello' target), (SELECT 'this is beautiful' title, 'that is fantastic' target)
),
title, target,
// Output schema.
"[{name: 'title', type:'string'},
  {name: 'target', type:'string'},
  {name: 'distance', type:'integer'}]",
// The function
"function(r, emit) {
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;
          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };
  var the_title;
  // fall back to the raw title if it is not a valid URI-encoded string
  try {
    the_title = decodeURI(r.title).toLowerCase();
  } catch (ex) {
    the_title = r.title.toLowerCase();
  }
  emit({title: the_title, target: r.target,
        distance: Levenshtein.get(the_title, r.target)});
}")
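To check the distance logic without running a query, the same two-row dynamic-programming idea can be written as a standalone function. This is a sketch only; the name `levenshtein` is hypothetical and the code is a simplified restatement, not the exact function embedded in the query.

```javascript
// Levenshtein distance between strings a and b, keeping only two rows of
// the dynamic-programming table in memory at any time.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  var prev = [];
  for (var j = 0; j <= b.length; ++j) prev[j] = j;

  for (var i = 1; i <= a.length; ++i) {
    var cur = [i]; // turning i chars of a into the empty string costs i deletions
    for (j = 1; j <= b.length; ++j) {
      var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      cur[j] = Math.min(prev[j - 1] + cost, // substitution (or match)
                        cur[j - 1] + 1,     // insertion
                        prev[j] + 1);       // deletion
    }
    prev = cur;
  }
  return prev[b.length];
}

console.log(levenshtein('hola', 'hello')); // 3
```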
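Comments:
Hi Felipe, so JavaScript is the only option? I wanted to avoid it because of the speed limitations. But if it is not possible otherwise, I will use JS.
OMG, inline JavaScript?! I was hoping something like this was possible, but I did not know how to do it. Thanks for posting this!
Can you tell me where I can find the documentation for this JS() function? Even now that I know it exists, I am having a hard time finding it. Thanks!
Answer 2:
Below is a version tailored for similarity. It was used in How to perform trigram operations in Google BigQuery? and is based on @thomaspark's https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js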
SELECT text1, text2, similarity FROM
JS(
  // input table
  (
    SELECT * FROM
    (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
    (SELECT 'mikhail' AS text1, 'mike' AS text2),
    (SELECT 'mikhail' AS text1, 'michael' AS text2),
    (SELECT 'mikhail' AS text1, 'javier' AS text2),
    (SELECT 'mikhail' AS text1, 'thomas' AS text2)
  ) ,
  // input columns
  text1, text2,
  // output schema
  "[{name: 'text1', type:'string'},
    {name: 'text2', type:'string'},
    {name: 'similarity', type:'float'}]
  ",
  // function
  "function(r, emit) {
    var _extend = function(dst) {
      var sources = Array.prototype.slice.call(arguments, 1);
      for (var i=0; i<sources.length; ++i) {
        var src = sources[i];
        for (var p in src) {
          if (src.hasOwnProperty(p)) dst[p] = src[p];
        }
      }
      return dst;
    };
    var Levenshtein = {
      /**
       * Calculate levenshtein distance of the two strings.
       *
       * @param str1 String the first string.
       * @param str2 String the second string.
       * @return Integer the levenshtein distance (0 and above).
       */
      get: function(str1, str2) {
        // base cases
        if (str1 === str2) return 0;
        if (str1.length === 0) return str2.length;
        if (str2.length === 0) return str1.length;
        // two rows
        var prevRow = new Array(str2.length + 1),
            curCol, nextCol, i, j, tmp;
        // initialise previous row
        for (i=0; i<prevRow.length; ++i) {
          prevRow[i] = i;
        }
        // calculate current row distance from previous row
        for (i=0; i<str1.length; ++i) {
          nextCol = i + 1;
          for (j=0; j<str2.length; ++j) {
            curCol = nextCol;
            // substitution
            nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
            // insertion
            tmp = curCol + 1;
            if (nextCol > tmp) {
              nextCol = tmp;
            }
            // deletion
            tmp = prevRow[j + 1] + 1;
            if (nextCol > tmp) {
              nextCol = tmp;
            }
            // copy current col value into previous (in preparation for next iteration)
            prevRow[j] = curCol;
          }
          // copy last col value into previous (in preparation for next iteration)
          prevRow[j] = nextCol;
        }
        return nextCol;
      }
    };
    // both variables are declared here so that the_text2 is not an implicit global
    var the_text1, the_text2;
    try {
      the_text1 = decodeURI(r.text1).toLowerCase();
    } catch (ex) {
      the_text1 = r.text1.toLowerCase();
    }
    try {
      the_text2 = decodeURI(r.text2).toLowerCase();
    } catch (ex) {
      the_text2 = r.text2.toLowerCase();
    }
    emit({text1: the_text1, text2: the_text2,
          similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});
  }"
)
ORDER BY similarity DESC
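The similarity in the query above is the distance normalized by the length of text1 only, so it is asymmetric and can drop below zero when text2 is much longer. The sketch below restates that normalization and, purely as an assumption and not part of the answer, shows a variant normalized by the longer string. Both function names are hypothetical, and the distances are passed in as precomputed numbers.

```javascript
// Normalization used in the query: distance relative to the first string.
function similarityAsInQuery(distance, text1) {
  return 1 - distance / text1.length;
}

// Hypothetical alternative: distance relative to the longer string,
// which keeps the result between 0 and 1.
function similarityByLongest(distance, text1, text2) {
  var longest = Math.max(text1.length, text2.length);
  return longest === 0 ? 1 : 1 - distance / longest;
}

// 'mikhail' vs 'michael' differ by 2 substitutions (k -> c, i -> e).
console.log(similarityAsInQuery(2, 'mikhail'));

// 'ab' vs 'abcdefgh' differ by 6 insertions.
console.log(similarityAsInQuery(6, 'ab'));             // negative
console.log(similarityByLongest(6, 'ab', 'abcdefgh')); // stays in [0, 1]
```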
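Comments:
Can you tell me where I can find the documentation for this JS() function? Even now that I know it exists, I am having a hard time finding it. Thanks!
The link is in the answer.
I guess I just do not see it. The link you mention is for a UDF. I am interested in the "FROM JS(...)" part, where you declare the JavaScript function inline along with the rest of the SQL SELECT statement. Where can I find documentation on this?
@fxm27 You can find the documentation on BigQuery user-defined functions here --> cloud.google.com/bigquery/user-defined-functions
Again, it looks like those examples use either the UDF editor in the BigQuery UI ("bigquery.defineFunction(...") or the --udf_resource option of the bq command line. I would like to know more about how to define a UDF inline in a SQL query, as you do in the code posted above ("FROM JS(...)"). What is JS()? Is it a BigQuery SQL function? Or a BigQuery SQL language construct? Where is it described in the documentation? Again, I just do not see it. Which section are you looking at?
Answer 3:
"REGEXP_REPLACE second argument must be const and non-null. Is there any workaround?"
Below is just an idea/direction for addressing the question above, applied to the logic you described:
"I would pre-"cook" the two strings into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do something like this: REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''). Then, based on the change in length, I can calculate my coefficient."
The "workaround" is: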
SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gano' AS w, 'ga|an|no' AS p),
(SELECT 'gamo' AS w, 'ga|am|mo' AS p),
(SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2
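The query above splits each pre-cooked string on '|', joins the pieces by position, and reports the fraction of positions where the bigrams agree. A rough JavaScript equivalent for a single pair is sketched below; the function name `positionalMatchRatio` is hypothetical, and the sketch assumes the pieces are compared in their original order.

```javascript
// Fraction of positions at which two '|'-separated bigram lists agree.
// Only positions present in both lists are counted, like the JOIN on pos.
function positionalMatchRatio(p1, p2) {
  var x1 = p1.split('|');
  var x2 = p2.split('|');
  var n = Math.min(x1.length, x2.length);
  var same = 0;
  for (var i = 0; i < n; ++i) {
    if (x1[i] === x2[i]) ++same;
  }
  return same / n;
}

console.log(positionalMatchRatio('ga|an|na', 'ga|an|no')); // 2 of 3 positions match
console.log(positionalMatchRatio('ga|an|na', 'ga|am|mo')); // 1 of 3 positions match
console.log(positionalMatchRatio('ga|an|na', 'ka|an|na')); // 2 of 3 positions match
```

Because the comparison is positional, a bigram shared by both words at different positions does not count as a match.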
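"Maybe there is a better way to do this?"
Below is a simple example of how Pair Similarity can be approached here (including building the bigrams and calculating the coefficient):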
SELECT
a.word AS word1, b.word AS word2,
2 * SUM(a.bigram = b.bigram) /
(EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gana' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) a
CROSS JOIN (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gano' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) b
GROUP BY 1, 2
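The coefficient computed by the query above is twice the number of shared bigrams divided by the sum of the distinct bigram counts of the two words. A small JavaScript sketch of the same arithmetic can be used to sanity-check results; the names `distinctBigrams` and `dice` are hypothetical, and both words are assumed to have at least two characters.

```javascript
// Distinct 2-grams of a word, e.g. 'gana' -> ['ga', 'an', 'na'].
function distinctBigrams(word) {
  var seen = {};
  for (var i = 0; i < word.length - 1; ++i) {
    seen[word.substring(i, i + 2)] = true;
  }
  return Object.keys(seen);
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|).
function dice(word1, word2) {
  var a = distinctBigrams(word1);
  var b = distinctBigrams(word2);
  var shared = a.filter(function (gram) {
    return b.indexOf(gram) !== -1;
  }).length;
  return 2 * shared / (a.length + b.length);
}

console.log(dice('gana', 'gano')); // shares 'ga' and 'an': 2 * 2 / (3 + 3)
```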