深度学习前沿应用文本审核

Posted 2022-09-07 灵彧universe

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了深度学习前沿应用文本审核相关的知识，希望对你有一定的参考价值。

【深度学习前沿应用】文本审核

(文章目录)

前言

1. 为什么要内容审核?

网络世界，内容参差不齐。优质内容是“流量天使”，背后的商业价值不言而喻，而劣质甚至违规内容一旦触碰法律红线，对社会和平台本身都是威胁。要想守护内容平台的一片清净，内容审核不可或缺。

2. 文本检测模型的作用？

文本检测模型可自动判别文本是否涉及敏感词，并给出相应的置信度，对文本中的各种描述和文案进行识别。

3. 如何实现文本检测模型？

PaddleHub提供了使用多种网络结构进行的色情文本检测预训练模块，比如，porn_detection_lstm采用LSTM网络结构并按字粒度切词进行文本检测，该模型最大句子长度为256字，但是仅仅支持预测，无法微调，相似功能的模型还有porn_detection_gru及porn_detection_cnn，分别使用GRU、CNN模型框架进行文本检测。

本实验的目的是简单地演示如何使用PaddleHub工具，实现文本审核（仅推理过程，暂不支持微调），实验平台为百度AI Studio，实验环境为Python3.7，Paddle2.0，PaddleHub2.0。

一、定义文本审核接口函数

(一)、数据加载及预处理

导入相关包

# from __future__ import print_function
import json
import six
# import paddlehub as hub
import paddlehub as hub

参数配置

test_text = ["打击色情犯罪，是每一个人的责任", 引导未成年人远离黄赌毒] #文本内容
use_gpu=True #是否调用GPU
batch_size=2 #批处理大小

(二)、通过LSTM实现

def porn_detection_lstm(test_text, use_gpu ,batch_size):
    # Load porn_detection_lstm module
    porn_detection_lstm = hub.Module(name="porn_detection_lstm")
    input_dict = "text": test_text
    results = porn_detection_lstm.detection(data=input_dict, use_gpu=use_gpu, batch_size=batch_size)
    return results

(三)、通过GRU实现

def porn_detection_gru(test_text, use_gpu, batch_size):
    # Load porn_detection_gru module
    porn_detection_gru = hub.Module(name="porn_detection_gru")
    input_dict = "text": test_text
    results = porn_detection_gru.detection(data=input_dict, use_gpu=use_gpu, batch_size=batch_size)
    return results

(四)、通过CNN实现

def porn_detection_cnn(test_text, use_gpu, batch_size):
    # Load porn_detection_cnn module
    porn_detection_cnn = hub.Module(name="porn_detection_cnn")
    results = porn_detection_cnn.detection(texts=test_text, use_gpu=use_gpu, batch_size=batch_size)
    return results

二、使用不同的模型进行审核

lstm = porn_detection_lstm(test_text, use_gpu, batch_size) #调用lstm
gru = porn_detection_gru(test_text, use_gpu, batch_size)   #调用gru
cnn = porn_detection_cnn(test_text, use_gpu, batch_size)   #调用cnn

三、输出审核结果

(一)、定义格式化输出函数

Tips:label的值越高则涉及色情的可能性越高

def output_dict(dic,pre=LSTM):
    print(------\\nPorn detection with .format(pre))
    for line in dic:
        for k,v in line.items():
            print(:20s: .format(k,v))

(二)、分别格式化输出三种审核结果

output_dict(lstm)
output_dict(gru,GRU)
output_dict(cnn,CNN)

输出结果如下图1所示：

(三)、求三种审核结果的平均结果作为输出

for index, text in enumerate(test_text):
    lstm[index]["text"] = text
    print("文本内容:",text)
    label = (lstm[index]["porn_detection_label"] + gru[index]["porn_detection_label"] + cnn[index]["porn_detection_label"])
    porn_probs = (lstm[index]["porn_probs"] + gru[index]["porn_probs"] + cnn[index]["porn_probs"])/3
    not_porn_probs = (lstm[index]["not_porn_probs"] + gru[index]["not_porn_probs"] + cnn[index]["not_porn_probs"])/3
    print(label:%0.0f, porn_probs:%0.5f, not_porn_probs:%0.5f % (label, porn_probs, not_porn_probs))

输出结果如下图2所示：

总结

本系列文章内容为根据清华社出版的《机器学习实践》所作的相关笔记和感悟，其中代码均为基于百度飞桨开发，若有任何侵权和不妥之处，请私信于我，定积极配合处理，看到必回！！！

最后，引用本次活动的一句话，来作为文章的结语～(￣▽￣～)~：

【**学习的最大理由是想摆脱平庸，早一天就多一份人生的精彩；迟一天就多一天平庸的困扰。**】

以上是关于深度学习前沿应用文本审核的主要内容，如果未能解决你的问题，请参考以下文章