How can I identify text template patterns in a dataset of strings?


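I am trying to find an efficient way to process a list of text records and identify the text templates commonly used in them: keep the fixed parts, abstract the variable parts, and count how many records match each identified template.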

——

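My most successful attempt so far splits each text record into an array of words, compares arrays that have the same number of words position by position, and writes each resulting template to a list of templates.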

As you might expect, it is far from perfect, and it struggles to run on datasets of more than 50,000 records.

I would like to know whether a text classification library or a faster algorithm could improve performance; my current code is very naive.

——

Here is my first attempt in Python, using very simple logic.

samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

# Split every record into a list of words.
samples_split = [x.split() for x in samples]
identified_templates = []

# Compare each record against every other record with the same word count.
for words_list in samples_split:
    for words_list_ref in samples_split:
        if len(words_list) != len(words_list_ref) or words_list == words_list_ref:
            continue
        template = ''
        for i, word in enumerate(words_list):
            # Keep words that match; replace differing words with '%'.
            if word == words_list_ref[i]:
                template += ' ' + word
            else:
                template += ' %'
        identified_templates.append(template)

# Deduplicate the candidate templates.
templates = dict()
for template in identified_templates:
    if template not in templates:
        templates[template] = 1

# Discard candidates that contain three or more consecutive placeholders.
templates_2 = dict()
for key, value in templates.items():
    if '% % %' not in key:
        templates_2[key] = 1

print(templates_2)

Ideally, the code would take input like the following:

- “Your order tracking number is 123” 
- “Thank you for creating an account with us” 
- “Your order tracking number is 888”
- “Thank you for creating an account with us” 
- “Hello Jim, what is your issue?”
- “Hello Jack, what is your issue?”

and output a list of templates together with the number of records each one matches:

- “Your order tracking number is {}”, 2
- “Thank you for creating an account with us”, 2
- “Hello {}, what is your issue?”, 2

Answer

You can try the following code. I hope the output matches what you expect.

import re
templates_2 = {}
samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

# Replace every run of digits with a "{}" placeholder.
identified_templates = [re.sub(r'[0-9]+', '{}', asample) for asample in samples]
# Count how many records produced each distinct template.
unique_identified_templates = list(set(identified_templates))
for atemplate in unique_identified_templates:
    templates_2[atemplate] = identified_templates.count(atemplate)
for k, v in templates_2.items():
    print(k, ':', v)

Output (line order may vary between runs, because the templates pass through a set):

The code for your gardening purchase is {} : 1
Your order {} has been confirmed. Thank you : 5
The code for your bakery purchase is {} : 2
The code for your butcher purchase is {} : 2
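
The regular expression above only abstracts runs of digits, so a variable word such as the name in “Hello Jim, what is your issue?” would stay in place. As a rough sketch of one way to generalize it, and not part of the original answer, the code below counts how often each word occurs at each position among records with the same number of tokens, then treats words seen fewer than `min_support` times as variables. The names `find_templates`, `min_support`, `TOKEN_RE` and `WORD_RE` are hypothetical, and the whole thing is a heuristic: a record with no similar peers of the same length collapses into placeholders, and unrelated templates with the same token count share their position counts.

```python
import re
from collections import Counter, defaultdict

# Hypothetical helper names; this is a heuristic sketch, not a library API.
TOKEN_RE = re.compile(r"\w+|\W+")  # alternating runs of word / non-word characters
WORD_RE = re.compile(r"\w+")


def find_templates(records, min_support=2):
    """Return a Counter mapping each inferred template to its record count."""
    tokenized = [TOKEN_RE.findall(record) for record in records]

    # For each token-count group, count every (position, token) pair.
    position_counts = defaultdict(Counter)
    for tokens in tokenized:
        for i, token in enumerate(tokens):
            position_counts[len(tokens)][(i, token)] += 1

    templates = Counter()
    for tokens in tokenized:
        counts = position_counts[len(tokens)]
        parts = []
        for i, token in enumerate(tokens):
            rare = counts[(i, token)] < min_support
            # Only words become placeholders; separators are kept verbatim.
            if rare and WORD_RE.fullmatch(token):
                parts.append("{}")
            else:
                parts.append(token)
        templates["".join(parts)] += 1
    return templates


records = [
    "Your order tracking number is 123",
    "Thank you for creating an account with us",
    "Your order tracking number is 888",
    "Thank you for creating an account with us",
    "Hello Jim, what is your issue?",
    "Hello Jack, what is your issue?",
]

for template, count in find_templates(records).items():
    print(template, ":", count)
```

On the six records from the question this should yield the three desired templates, each with a count of 2. Every record is tokenized once and looked up once, so the work grows linearly with the total number of tokens rather than with the number of record pairs, which may help on datasets larger than 50,000 records.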
