How to count letter differences of two strings in BigQuery?
[Posted]: 2018-09-09 05:45:50
[Question]: For example, I have:
1: 6c71d997ba39
2: 6c71d997d269
I need to get 4.
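[Answer 1]: You can consider using the Levenshtein distance for your use case.
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
The example below is for BigQuery Standard SQL: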
#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  // Helper kept from the original snippet; it is not used below.
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // only the previous row is stored; the current row is tracked
      // through curCol and nextCol
      var prevRow = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }
  };

  // Both inputs are URI-decoded when possible and lowercased,
  // so the comparison is case-insensitive.
  var the_string1, the_string2;

  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }

  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }

  return Levenshtein.get(the_string1, the_string2);
""";
WITH strings AS (
SELECT '1: 6c71d997ba39' string1, '2: 6c71d997d269' string2
)
SELECT string1, string2, EDIT_DISTANCE(string1, string2) changes
FROM strings
Result:
Row   string1           string2           changes
1     1: 6c71d997ba39   2: 6c71d997d269   4
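To check the number outside BigQuery, here is a minimal standalone sketch of the Levenshtein distance in plain JavaScript. It uses a full distance matrix for readability and is not the UDF from the answer; the function name `levenshtein` is only illustrative. The string literals in the sample query include the `1: ` and `2: ` labels, so the differing label accounts for one of the 4 edits. Without the labels, the two hex strings are 3 edits apart.

```javascript
// Full-matrix Levenshtein distance; a readability-first sketch (hypothetical helper).
function levenshtein(a, b) {
  // dist[i][j] = edits needed to turn the first i chars of a
  // into the first j chars of b
  var dist = [];
  for (var i = 0; i <= a.length; ++i) {
    dist.push([i]);                // first column: delete i characters
  }
  for (var j = 1; j <= b.length; ++j) {
    dist[0][j] = j;                // first row: insert j characters
  }
  for (i = 1; i <= a.length; ++i) {
    for (j = 1; j <= b.length; ++j) {
      var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      dist[i][j] = Math.min(
        dist[i - 1][j] + 1,        // deletion
        dist[i][j - 1] + 1,        // insertion
        dist[i - 1][j - 1] + cost  // substitution (or match)
      );
    }
  }
  return dist[a.length][b.length];
}

console.log(levenshtein('kitten', 'sitting'));                   // 3
console.log(levenshtein('1: 6c71d997ba39', '2: 6c71d997d269'));  // 4
console.log(levenshtein('6c71d997ba39', '6c71d997d269'));        // 3
```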
[Answer 2]:
-- Assumes s1 and s2 are STRING columns of equal length.
SELECT
  (SELECT COUNTIF(c != SPLIT(s2, '')[OFFSET(off)])
   FROM UNNEST(SPLIT(s1, '')) AS c WITH OFFSET off) AS count
FROM dataset.table
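The query compares the two strings position by position and counts the mismatches. Unlike the Levenshtein distance, this does not account for insertions or deletions. The same idea can be sketched in plain JavaScript as follows; the function name `countMismatches` and the equal-length check are assumptions added for illustration and are not part of the answer.

```javascript
// Position-wise mismatch count; assumes both strings have the same length.
function countMismatches(s1, s2) {
  if (s1.length !== s2.length) {
    throw new Error('strings must have the same length');
  }
  var count = 0;
  for (var i = 0; i < s1.length; ++i) {
    // compare the characters at the same offset in both strings
    if (s1.charAt(i) !== s2.charAt(i)) {
      count++;
    }
  }
  return count;
}

console.log(countMismatches('6c71d997ba39', '6c71d997d269'));        // 3
console.log(countMismatches('1: 6c71d997ba39', '2: 6c71d997d269'));  // 4
```

For the strings in the question, the count is 4 only when the `1: ` and `2: ` labels are part of the compared values.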
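[Comments]:
While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please include an explanation for your code, as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
[Answer 3]: Source: https://***.com/a/57499387/11059644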
A ready-to-use shared UDF for the Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa'), fhoffa.x.levenshtein('googgle', 'goggles'), fhoffa.x.levenshtein('is this the', 'Is This The')