SQL Regex substr pattern matching
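OK, so I have asked the programmers I know about the following, and nobody could come up with a way to do it. Please help if you can!
I'm doing pattern matching for a hospital program. In this example, it matches ¾ of the words from one concept to another. Basically, I want it to match "x, a, y, z" (keep in mind that I have already removed all non-alphanumeric characters, so I can do this). Below is a long-hand example; I need a way to make it dynamic based on the number of words, rather than writing out every iteration. For example: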
'Spinal Fusion' = 'Fusion of the Spine'
'Mammogram-bilateral' = 'bilateral mammogram scan'
'Echocardiogram (ECG)' = 'ECG'
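I have written out how it might work, but some of these run to dozens of iterations, so I would need a separate CASE WHEN branch for each combination of word counts. If anyone knows how to make this dynamic, I would be forever grateful.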
WHEN regexp_count(x.y,'(\w+)+') = 4 AND regexp_count(a.b,'(\w+)+') = 3 -- when the word counts are 4 and 3
AND (
-- words 1, 2 and 3 of x.y each equal some word of a.b
(
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
)
or
-- words 2, 3 and 4 of x.y each equal some word of a.b
(
( regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
)
or
-- words 1, 3 and 4 of x.y each equal some word of a.b
(
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
)
or
-- words 1, 2 and 4 of x.y each equal some word of a.b
(
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
)
)
THEN x.y = a.b
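The four branches above all test the same thing: at least three of the four words of x.y occur somewhere in a.b. As a rough sketch outside the database (Python, with hypothetical helper names and a made-up sample string, not part of the original query), that condition reduces to a single count that works for any number of words:

```python
import re

def words(text):
    """Return the runs of word characters in text, in order."""
    return re.findall(r"\w+", text)

def matched_word_count(x, a):
    """Count the words of x that also occur somewhere in a (case-sensitive, like SQL '=')."""
    a_words = set(words(a))
    return sum(1 for w in words(x) if w in a_words)

def is_match(x, a, ratio=0.75):
    """True when at least `ratio` of the words of x occur in a."""
    x_words = words(x)
    return bool(x_words) and matched_word_count(x, a) >= ratio * len(x_words)

# Made-up example: 3 of the 4 words on the left occur on the right.
print(is_match("left knee joint replacement", "replacement knee joint"))  # True
# Without stemming, 'Spinal' and 'Spine' are different words, so this fails.
print(is_match("Spinal Fusion", "Fusion of the Spine"))  # False
```

The second call shows why plain word equality is not enough for the examples in the question: 'Spinal' and 'Spine' only line up after some form of stemming.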
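Answer
Try Vertica's text index package.
Here is an approach that creates a helper table, which you can then join back to the base table to get the matching strings: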
DROP TABLE IF EXISTS textbase CASCADE;
CREATE TABLE textbase(
id INT NOT NULL PRIMARY KEY
, txt VARCHAR(32)
) UNSEGMENTED ALL NODES;
INSERT INTO textbase
SELECT 0,'Spinal Fusion'
UNION ALL SELECT 1,'Fusion of the Spine'
UNION ALL SELECT 2,'Mammogram - bilateral'
UNION ALL SELECT 3,'bilateral mammogram scan'
UNION ALL SELECT 4,'Echocardiogram (ECG)'
UNION ALL SELECT 5,'ECG'
;
COMMIT;
-- Work with the Vertica standard Text Index package
-- either write your own stemmer, which removes articles and prepositions
-- and typical suffixes, or do the below - adding a pre-stemmed column.
ALTER TABLE textbase ADD prestemmed VARCHAR(32) DEFAULT
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
txt
-- remove articles
, ' the\b'
, ''
, 1
, 1
,'i'
)
-- remove prepositions
, ' of\b'
, ''
, 1
, 1
,'i'
)
-- remove "al" and "e" suffixes
, 'e\b|al\b'
, ''
, 1
, 1
,'i'
);
-- Create your text index
CREATE TEXT INDEX textindex ON textbase(id,prestemmed)
TOKENIZER v_txtindex.BasicLogTokenizer (LONG VARCHAR)
STEMMER v_txtindex.Stemmer(LONG VARCHAR)
;
-- The text index table joins to the INTEGER primary key of the base table using "doc_id"
-- and has one row per token / keyword
SELECT * FROM textbase JOIN textindex ON id=doc_id ORDER BY doc_id;
-- out  id |           txt            |       prestemmed       |     token      | doc_id
-- out ----+--------------------------+------------------------+----------------+--------
-- out   0 | Spinal Fusion            | Spin Fusion            | spin           |      0
-- out   0 | Spinal Fusion            | Spin Fusion            | fusion         |      0
-- out   1 | Fusion of the Spine      | Fusion Spin            | spin           |      1
-- out   1 | Fusion of the Spine      | Fusion Spin            | fusion         |      1
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | mammogram      |      2
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | bilat          |      2
-- out   3 | bilateral mammogram scan | bilater mammogram scan | scan           |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | mammogram      |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | bilat          |      3
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | echocardiogram |      4
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | ecg            |      4
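To see what the pre-stemming column does to each string, the three nested REGEXP_REPLACE calls can be mimicked outside the database. This is only a sketch in Python with a hypothetical function name; it reproduces the `prestemmed` column shown above, not Vertica's own Stemmer step (which is what later turns "bilater" into "bilat"):

```python
import re

def prestem(txt):
    """Apply the same three replacements as the DEFAULT expression:
    first occurrence only, case-insensitive, innermost call first."""
    txt = re.sub(r" the\b", "", txt, count=1, flags=re.IGNORECASE)    # remove article
    txt = re.sub(r" of\b", "", txt, count=1, flags=re.IGNORECASE)     # remove preposition
    txt = re.sub(r"e\b|al\b", "", txt, count=1, flags=re.IGNORECASE)  # strip "e" / "al" suffix
    return txt

for s in ["Spinal Fusion", "Fusion of the Spine",
          "Mammogram - bilateral", "bilateral mammogram scan",
          "Echocardiogram (ECG)", "ECG"]:
    print(repr(s), "->", repr(prestem(s)))
```

Because each replacement touches only the first occurrence, a string with two suffixed words keeps the second suffix; that mirrors the occurrence argument of 1 in the SQL.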
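With the text index above, you can apply the 3-out-of-4 keyword match by counting the words and the matching words per document, building an inline table that you can again join back to the base table.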
WITH -- count number of tokens per doc_id ...
wcount AS (
SELECT
doc_id
, count(*) AS wcount
FROM textindex
GROUP BY 1
)
,
-- count how many tokens two different documents (unequal "doc_id") share ...
-- and keep the pairs where at least 75% of the first document's tokens match
matchcount AS (
SELECT
a.doc_id AS a_doc_id
, b.doc_id AS b_doc_id
, count(*) AS matchcount
FROM textindex a
JOIN textindex b USING (token)
WHERE a.doc_id <> b.doc_id
GROUP BY
1
, 2
HAVING count(*) >= (SELECT wcount * .75 FROM wcount WHERE doc_id = a.doc_id)
)
SELECT
QUOTE_LITERAL(a.txt) ||' is probably equal to '||QUOTE_LITERAL(b.txt) AS assumption
FROM matchcount
JOIN textbase a ON a.id=a_doc_id
JOIN textbase b ON b.id=b_doc_id
;
-- out assumption
-- out -------------------------------------------------------------------------
-- out 'Spinal Fusion' is probably equal to 'Fusion of the Spine'
-- out 'Fusion of the Spine' is probably equal to 'Spinal Fusion'
-- out 'Mammogram - bilateral' is probably equal to 'bilateral mammogram scan'
-- out 'ECG' is probably equal to 'Echocardiogram (ECG)'
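The counting logic of the query can be checked by hand on the sample data. The sketch below is Python with hypothetical names, not Vertica code. The token sets are copied from the text index output, except the entry for doc_id 5, which is an assumption inferred from the final result. It uses an "at least 75%" threshold (`>=`), which yields the same pairs on this sample as a strict `>` would:

```python
from itertools import permutations

# Tokens per doc_id, taken from the text index output.
# The entry for doc_id 5 is assumed (it is implied by the 'ECG' result row).
tokens = {
    0: {"spin", "fusion"},
    1: {"spin", "fusion"},
    2: {"mammogram", "bilat"},
    3: {"scan", "mammogram", "bilat"},
    4: {"echocardiogram", "ecg"},
    5: {"ecg"},
}

def probable_matches(tokens, ratio=0.75):
    """Return ordered (a, b) pairs where at least `ratio` of a's tokens occur in b."""
    pairs = []
    for a, b in permutations(sorted(tokens), 2):
        shared = len(tokens[a] & tokens[b])  # same as the self-join on token
        if shared and shared >= ratio * len(tokens[a]):
            pairs.append((a, b))
    return pairs

print(probable_matches(tokens))  # [(0, 1), (1, 0), (2, 3), (5, 4)]
```

The match is directional: document 2 matches document 3 because both of its tokens are found there, while document 3 does not match document 2 because only two of its three tokens are shared.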