google-bigquery UDF for jaro_winkle_distance
【Title】google-bigquery UDF for jaro_winkle_distance
【Posted】2016-10-05 23:46:09
【Question】: I have the UDF code below, which calculates jaro_winkler_distance. It seems to work when I test it with JSON test data, but when I call it from the google-bigquery UI it consistently gives me a score of zero, even on a self-join. For example:
Input:
[
  {a: "Liu", b: "Lau"},
  {a: "John", b: "Jone"}
]
Output:
[
  {"scr": 80},
  {"scr": 87}
]
SQL:
CREATE TEMP FUNCTION
jwd(a STRING,
b STRING)
RETURNS INT64
LANGUAGE js AS """
// Assumes 'doInterestingStuff' is defined in one of the library files.
//return doInterestingStuff(a, b);
return jaro_winkler_distance(a,b);
""" OPTIONS ( library="gs://kayama808/javascript/jaro_winkler_google_UDF.js" );
SELECT
x.name name1,
jwd(x.name,
x.name) scr
FROM
babynames.usa_1910_2013_copy x
WHERE
x.gender = 'F' and x.number >= 1000 and x.state = 'CA'
ORDER BY
scr DESC;
http://storage.googleapis.com/bigquery-udf-test-tool/testtool.html
https://storage.cloud.google.com/kayama808/javascript/jaro_winkler_google_UDF.js?_ga=1.184402278.1320598031.1475534357
【Comments】:
I tried using float64 instead of int64, but it still does not work. bigquery.cloud.google.com:443/savedquery/… storage.googleapis.com/kayama808/javascript/…
【Answer 1】: Try the code below. It works as expected, with this result:
name1 name2 scr
Liu Liu 100
John Jone 87
Liu Lau 80
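As a rough check, the three scores in this table can be reproduced by hand from the weight formula used in the function below. The sketch assumes the match counts worked out by inspection (for example, "John" and "Jone" share J, O and N, with no transpositions and a common prefix of 2). It leaves out the similar-character adjustment, which contributes nothing for these pairs with the table as posted. `scoreFromCounts` is a hypothetical helper name, not part of the answer's code.

```javascript
// Hypothetical helper: recompute the score from already-known match counts.
function scoreFromCounts(aLen, bLen, numCommon, numTrans, prefixLen) {
  // Base weight: average of the three ratios.
  var weight = (numCommon / aLen + numCommon / bLen +
                (numCommon - numTrans) / numCommon) / 3;
  // Prefix boost (at most 4 characters), applied only above 0.7.
  if (weight > 0.7 && prefixLen > 0) {
    weight += Math.min(prefixLen, 4) * 0.1 * (1.0 - weight);
  }
  return Math.round(weight * 100);
}

console.log(scoreFromCounts(3, 3, 3, 0, 3)); // Liu / Liu   -> 100
console.log(scoreFromCounts(4, 4, 3, 0, 2)); // John / Jone -> 87
console.log(scoreFromCounts(3, 3, 2, 0, 1)); // Liu / Lau   -> 80
```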
Hope you will be able to put it back into your external library file :o)
CREATE TEMP FUNCTION jwd(a STRING, b STRING)
RETURNS INT64
LANGUAGE js AS """
/* JS implementation of the strcmp95 C function written by
Bill Winkler, George McLaughlin, Matt Jaro and Maureen Lynch,
released in 1994 (http://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c).
a and b should be strings. Always performs case-insensitive comparisons
and always adjusts for long strings. */
// NOTE: several keys repeat in this object literal, and JavaScript keeps
// only the last value written for a key (e.g. 'A' ends up mapped to 'U' only).
// The table is kept as posted because the scores shown above depend on it.
var jaro_winkler_adjustments = {
  'A': 'E',
  'A': 'I',
  'A': 'O',
  'A': 'U',
  'B': 'V',
  'E': 'I',
  'E': 'O',
  'E': 'U',
  'I': 'O',
  'I': 'U',
  'O': 'U',
  'I': 'Y',
  'E': 'Y',
  'C': 'G',
  'E': 'F',
  'W': 'U',
  'W': 'V',
  'X': 'K',
  'S': 'Z',
  'X': 'S',
  'Q': 'C',
  'U': 'V',
  'M': 'N',
  'L': 'I',
  'Q': 'O',
  'P': 'R',
  'I': 'J',
  '2': 'Z',
  '5': 'S',
  '8': 'B',
  '1': 'I',
  '1': 'L',
  '0': 'O',
  '0': 'Q',
  'C': 'K',
  'G': 'J',
  'E': ' ',
  'Y': ' ',
  'S': ' '
};
if (!a || !b) return 0.0;
a = a.trim().toUpperCase();
b = b.trim().toUpperCase();
var a_len = a.length;
var b_len = b.length;
var a_flag = []; var b_flag = [];
var search_range = Math.floor(Math.max(a_len, b_len) / 2) - 1;
var minv = Math.min(a_len, b_len);
// Looking only within the search range, count and flag the matched pairs.
var Num_com = 0;
var yl1 = b_len - 1;
for (var i = 0; i < a_len; i++) {
  var lowlim = (i >= search_range) ? i - search_range : 0;
  var hilim = ((i + search_range) <= yl1) ? (i + search_range) : yl1;
  for (var j = lowlim; j <= hilim; j++) {
    // i indexes a and j indexes b. The posted version compared a[j] with b[i]
    // and set a_flag[j] / b_flag[i], which misses matches when lengths differ.
    if (b_flag[j] !== 1 && b[j] === a[i]) {
      a_flag[i] = 1;
      b_flag[j] = 1;
      Num_com++;
      break;
    }
  }
}
// Return if no characters in common
if (Num_com === 0) return 0.0;
// Count the number of transpositions
var k = 0; var N_trans = 0;
for (var i = 0; i < a_len; i++) {
  if (a_flag[i] === 1) {
    var j;
    for (j = k; j < b_len; j++) {
      if (b_flag[j] === 1) {
        k = j + 1;
        break;
      }
    }
    if (a[i] !== b[j]) N_trans++;
  }
}
N_trans = Math.floor(N_trans / 2);
// Adjust for similarities in nonmatched characters
var N_simi = 0; var adjwt = jaro_winkler_adjustments;
if (minv > Num_com) {
  for (var i = 0; i < a_len; i++) {
    if (!a_flag[i]) {
      for (var j = 0; j < b_len; j++) {
        if (!b_flag[j]) {
          if (adjwt[a[i]] === b[j]) {
            N_simi += 3;
            b_flag[j] = 2;
            break;
          }
        }
      }
    }
  }
}
var Num_sim = (N_simi / 10.0) + Num_com;
// Main weight computation
var weight = Num_sim / a_len + Num_sim / b_len + (Num_com - N_trans) / Num_com;
weight = weight / 3;
// Continue to boost the weight if the strings are similar
if (weight > 0.7) {
  // Adjust for having up to the first 4 characters in common
  var j = (minv >= 4) ? 4 : minv;
  var i;
  for (i = 0; (i < j) && a[i] === b[i]; i++) { }
  if (i) { weight += i * 0.1 * (1.0 - weight); }
  // Adjust for long strings.
  // After agreeing beginning chars, at least two more must agree
  // and the agreeing characters must be more than half of the
  // remaining characters.
  if (minv > 4 && Num_com > i + 1 && 2 * Num_com >= minv + i) {
    // The denominator uses the sum of the two lengths; the posted version
    // multiplied them (a_len * b_len).
    weight += (1 - weight) * ((Num_com - i - 1) / (a_len + b_len - i*2 + 2));
  }
}
return Math.round(weight*100);
""";
SELECT
name1, name2,
jwd(name1, name2) scr
FROM -- babynames.usa_1910_2013_copy x
(
select "Liu" as name1, "Lau" as name2 union all
select "Liu" as name1, "Liu" as name2 union all
select "John" as name1, "Jone" as name2
) x
ORDER BY scr DESC
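One detail of the function above is worth checking before reusing it: `jaro_winkler_adjustments` repeats keys such as 'A', and a JavaScript object literal keeps only the last value written for a key. The sketch below shows that effect and one possible alternative, a list of pairs with a symmetric lookup. `similarPairs` and `isSimilar` are hypothetical names, the pair list is abbreviated, and switching to this lookup may change scores compared with the result table in the answer.

```javascript
// Duplicate keys: the later entry silently replaces the earlier one.
var adjustments = { 'A': 'E', 'A': 'I', 'A': 'O', 'A': 'U' };
console.log(adjustments['A']); // only the last mapping survives

// Alternative sketch (an assumption, not the answer's code): keep every pair.
var similarPairs = [['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V']];

// True when x and y appear together in any pair, in either order.
function isSimilar(x, y) {
  return similarPairs.some(function (p) {
    return (p[0] === x && p[1] === y) || (p[0] === y && p[1] === x);
  });
}

console.log(isSimilar('I', 'A')); // the pair ['A', 'I'] matches in reverse order
```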
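Added: I just double-checked your jaro_winkler_google_UDF2.js file and can clearly see the problem in it. Fix the file using the code from my answer.
Or simply remove the following lines from it:
var a = r.a;
var b = r.b;
then uncomment
//jaro_winkler.distance = function(a, b) {
//return Math.round(weight*100)
and comment out everything that uses emit:
jaro_winkler_distance=function(r, emit) {
emit(weight);
Then you should be fine!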
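The change described above amounts to switching between two function shapes. As a minimal sketch, assume the legacy style receives a row object `r` and an `emit` callback (as in the lines quoted above), while the standard SQL style receives plain arguments and returns a value. Under that assumption, a thin wrapper can adapt one shape to the other. `legacyStyle`, `toScalar` and the placeholder scoring are hypothetical and for illustration only.

```javascript
// Legacy-style shape: reads inputs from the row, hands the result to emit.
function legacyStyle(r, emit) {
  var weight = (r.a === r.b) ? 100 : 0; // placeholder scoring, not Jaro-Winkler
  emit(weight);
}

// Wrap a legacy-style function so it can be called as fn(a, b) and return a value.
function toScalar(legacyFn) {
  return function (a, b) {
    var result = null;
    legacyFn({ a: a, b: b }, function (value) { result = value; });
    return result;
  };
}

var scalar = toScalar(legacyStyle);
console.log(scalar('Liu', 'Liu'));
```

Editing the library file as the answer suggests is the simpler route; the wrapper only helps when the same file has to keep serving both styles.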
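【Comments】:
We used the inline JavaScript sample you provided and it works. However, we then tried putting the JavaScript in Google Storage instead of inline, and it looks like it requires emit to work. Error: ReferenceError: emit is not defined at gs://kayama808/javascript/jaro_winkler_google_UDF3.js line 154, columns 4-5. Is there an end-to-end example for google-bigquery that uses a JavaScript UDF from storage? The documentation shows an example, but it does not include the JS code. In the test tool, the JavaScript uses emit. Why can't Google just use the JS UDF "AS-IS"??
I believe the test tool is for the legacy version of UDFs, which are table-valued UDFs, as opposed to the scalar UDFs in standard SQL. I will check/test with the lib, but in the meantime, can you use the inline version?
Also, here is my jsbin for testing UDF JS code: jsbin.com/wawamex/6/edit?html,js,output
It looks like emit is not needed, as you said, so why does the test tool put emit in there?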
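The `emit` error discussed in these comments can be reproduced outside BigQuery. This is a minimal sketch that assumes a scalar UDF body behaves like an ordinary function body whose parameters are the declared argument names. That is an assumption made for local testing, not a description of BigQuery internals. The hypothetical `runScalarBody` helper builds such a function with the standard `Function` constructor.

```javascript
// Hypothetical local harness: treat a UDF body string as a plain function body.
function runScalarBody(argNames, body, args) {
  var fn = new Function(argNames.join(','), body);
  return fn.apply(null, args);
}

// A body that returns a value works as a scalar function.
console.log(runScalarBody(['a', 'b'], 'return a.length + b.length;', ['Liu', 'Lau']));

// A body written in the legacy style calls emit, which nothing defines here.
var errorName = null;
try {
  runScalarBody(['a', 'b'], 'emit(a + b);', ['Liu', 'Lau']);
} catch (e) {
  errorName = e.name; // same kind of error as reported in the comment above
}
console.log(errorName);
```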