SQL regex substr pattern matching


OK, so I have asked the programmers I know about the following, and nobody could come up with a way to do it. Please help if you can!

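I am doing pattern matching for a hospital program, and in this example it should match ¾ of the words from one concept to another. Basically, I want it to match "x, a, y, z" (bear in mind that I have already stripped out all non-alphanumeric characters, so I can do this). Below is a long-winded example; I need a way to make it dynamic based on the number of words, rather than writing out every iteration by hand. For example: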

'Spinal Fusion' = 'Fusion of the Spine' 
'Mammogram-bilateral' = 'bilateral mammogram scan' 
'Echocardiogram (ECG)' = 'ECG'

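I have written out how it might work, but some of these iterations run to dozens of combinations, and each one needs its own CASE WHEN branch. If anyone knows how to make this dynamic, I would be eternally grateful.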

    WHEN regexp_count(x.y,'(\w+)+') = 4 and regexp_count(a.b,'(\w+)+') = 3 -- when the word counts are 4 and 3
    AND (
            -- words 1, 2 and 3 of x.y each match some word of a.b
            (
                (  regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
            )
        or
            -- words 2, 3 and 4 of x.y each match some word of a.b
            (
                (  regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
            )
        or
            -- words 1, 3 and 4 of x.y each match some word of a.b
            (
                (  regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
            )
        or
            -- words 1, 2 and 4 of x.y each match some word of a.b
            (
                (  regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
            and(
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
            )
    )
    THEN x.y = a.b
Answer

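Try Vertica's text index package.

The documentation is here: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm?tocpath=Administrator%27s%20Guide%7CUsing%20Text%20Search%7C_____0

Here is one approach you can use to create a helper table, which you can then join with the base table to get the matching strings: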

DROP TABLE IF EXISTS textbase CASCADE;
CREATE TABLE textbase(
  id INT NOT NULL PRIMARY KEY
, txt VARCHAR(32)
) UNSEGMENTED ALL NODES;

INSERT INTO textbase
          SELECT 0,'Spinal Fusion'
UNION ALL SELECT 1,'Fusion of the Spine' 
UNION ALL SELECT 2,'Mammogram - bilateral'
UNION ALL SELECT 3,'bilateral mammogram scan' 
UNION ALL SELECT 4,'Echocardiogram (ECG)'
UNION ALL SELECT 5,'ECG'
;
COMMIT;

-- Work with the Vertica standard Text Index package

-- either write your own stemmer, which removes articles and prepositions
-- and typical suffixes, or do the below - adding a pre-stemmed column.
ALTER TABLE textbase ADD prestemmed VARCHAR(32) DEFAULT 
 REGEXP_REPLACE(
   REGEXP_REPLACE(
     REGEXP_REPLACE(
       txt
     -- remove articles
     , ' the\b'
     , ''
     , 1
     , 1
     ,'i'
     )
   -- remove prepositions
   , ' of\b'
   , ''
   , 1
   , 1
   ,'i'
   )
 -- remove "al" and "e" suffixes
 , 'e\b|al\b'
 , ''
 , 1
 , 1
 ,'i'
);

-- Create your text index
CREATE TEXT INDEX textindex ON textbase(id,prestemmed) 
TOKENIZER v_txtindex.BasicLogTokenizer (LONG VARCHAR)
STEMMER v_txtindex.Stemmer(LONG VARCHAR)
;

-- The text index table joins to the INTEGER primary key of the base table using "doc_id"
-- and has one row per token / keyword
SELECT * FROM textbase JOIN textindex ON id=doc_id ORDER BY doc_id;

-- out  id |           txt            |       prestemmed       |     token      | doc_id 
-- out ----+--------------------------+------------------------+----------------+--------
-- out   0 | Spinal Fusion            | Spin Fusion            | spin           |      0
-- out   0 | Spinal Fusion            | Spin Fusion            | fusion         |      0
-- out   1 | Fusion of the Spine      | Fusion Spin            | spin           |      1
-- out   1 | Fusion of the Spine      | Fusion Spin            | fusion         |      1
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | mammogram      |      2
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | bilat          |      2
-- out   3 | bilateral mammogram scan | bilater mammogram scan | scan           |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | mammogram      |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | bilat          |      3
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | echocardiogram |      4
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | ecg            |      4
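To check the pre-stemming rules outside the database, the three nested REGEXP_REPLACE calls can be mirrored in a few lines of Python. This is only a sketch under assumptions: the function name `prestem` is made up for illustration, and it assumes that position 1 with occurrence 1 means "replace the first match only", which is how the `prestemmed` column in the output above reads.

```python
import re

def prestem(txt):
    """Apply the three rewrites in the same order as the nested SQL calls.

    Each step removes only the first match (count=1), case-insensitively.
    """
    txt = re.sub(r" the\b", "", txt, count=1, flags=re.IGNORECASE)    # article
    txt = re.sub(r" of\b", "", txt, count=1, flags=re.IGNORECASE)     # preposition
    txt = re.sub(r"e\b|al\b", "", txt, count=1, flags=re.IGNORECASE)  # "e" / "al" suffix
    return txt

samples = [
    "Spinal Fusion",
    "Fusion of the Spine",
    "Mammogram - bilateral",
    "bilateral mammogram scan",
    "Echocardiogram (ECG)",
    "ECG",
]
for s in samples:
    print(f"{s!r} -> {prestem(s)!r}")
```

Because each step touches only the first match, a phrase with two suffixed words would lose just one suffix in this sketch; a real stemmer, as the comment in the SQL suggests, would be the more robust choice.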

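With the text index above, you can apply the 3-out-of-4 keyword match by counting the words and the matching words, creating an inline table that you can again join with the base table.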

WITH -- count number of tokens per doc_id ...
wcount AS (
   SELECT 
     doc_id
   , count(*) AS wcount
   FROM textindex
   GROUP BY 1
) 
, 
-- count how many matches in tokens we have, where the "doc_id" is not equal ...
-- and, counting these, we have over 75% of the total tokens matching
matchcount AS (
   SELECT 
     a.doc_id AS a_doc_id
   , b.doc_id AS b_doc_id
   , count(*) AS matchcount
   FROM textindex a
   JOIN textindex b USING (token)
   WHERE a.doc_id <> b.doc_id
   GROUP BY 
     1
   , 2
   HAVING count(*) > (SELECT wcount * .75 FROM wcount WHERE doc_id = a.doc_id)
)
SELECT
  QUOTE_LITERAL(a.txt) ||' is probably equal to '||QUOTE_LITERAL(b.txt) AS assumption
FROM matchcount
JOIN textbase a ON a.id=a_doc_id
JOIN textbase b ON b.id=b_doc_id
;
-- out                                 assumption
-- out -------------------------------------------------------------------------
-- out  'Spinal Fusion' is probably equal to 'Fusion of the Spine'
-- out  'Fusion of the Spine' is probably equal to 'Spinal Fusion'
-- out  'Mammogram - bilateral' is probably equal to 'bilateral mammogram scan'
-- out  'ECG' is probably equal to 'Echocardiogram (ECG)'
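The counting logic of the query can also be sketched in Python, which makes the "share of matching words" rule easy to experiment with. The names `tokens` and `probable_matches` are hypothetical, the input strings are the pre-stemmed values from the earlier output, and a plain lower-cased word split stands in for the Vertica tokenizer and stemmer. One deliberate difference: the sketch compares with `>=` so that exactly 3 of 4 words qualifies, whereas the query above uses a strict `>`.

```python
import re

def tokens(text):
    """Lower-cased word tokens as a set (a rough stand-in for the text index)."""
    return set(re.findall(r"\w+", text.lower()))

def probable_matches(docs, threshold=0.75):
    """Return (a, b) id pairs where at least `threshold` of a's tokens occur in b."""
    toks = {doc_id: tokens(txt) for doc_id, txt in docs.items()}
    pairs = []
    for a, ta in toks.items():
        for b, tb in toks.items():
            # Skip self-comparison and empty token sets.
            if a != b and ta and len(ta & tb) >= threshold * len(ta):
                pairs.append((a, b))
    return pairs

# Pre-stemmed strings, keyed by the ids used in textbase.
docs = {
    0: "Spin Fusion",
    1: "Fusion Spin",
    2: "Mammogram - bilater",
    3: "bilater mammogram scan",
    4: "Echocardiogram (ECG)",
    5: "ECG",
}

for a, b in probable_matches(docs):
    print(f"{docs[a]!r} is probably equal to {docs[b]!r}")
```

The rule is directional, like the query: the share is measured against the token count of the first document, so 'ECG' matches 'Echocardiogram (ECG)' but not the other way round.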
