Correcting Noisy Labeled Data (To Be Completed)
Posted by 小基基o_O
When the labeled data is noisy (some labels are wrong), the models trained on it perform poorly.
Correction methods
Machine learning
"""
Reference: https://blog.csdn.net/Yellow_python/article/details/97677183

News classification, 9 classes (samples per class):
1987 car
1961 education
1910 entertainment
1909 fashion
1996 finance
1906 military
1925 politics
1960 science
1989 sports
Total: 17543
Test split ratio: 0.25

Model                      | Accuracy | Seconds
MultinomialNB              | 0.8201   | 0.13
LogisticRegression         | 0.8518   | 25.44
DecisionTreeClassifier     | 0.7321   | 18.93
AdaBoostClassifier         | 0.6979   | 23.44
GradientBoostingClassifier | 0.8381   | 793.17
RandomForestClassifier     | 0.8399   | 129.67
SVC                        | 0.8450   | 492.10

Recommendations:
1. Logistic regression: high and stable accuracy, fast; C=10 is more accurate but slower
2. Naive Bayes: extremely fast
3. Random forest: high but unstable accuracy, medium speed
"""
from collections import Counter
from numpy import argmax
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from segment import tk, clean, corpus
from warnings import filterwarnings
filterwarnings('ignore')  # suppress warnings

N = 24000  # vocabulary size limit, passed to TfidfVectorizer as max_features
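

def clear(text):
    # Strip URLs, e-mail addresses and IP addresses, normalise punctuation,
    # then split the text (all helpers come from the author's `segment` package)
    text = clean.re_url.sub('', text)
    text = clean.re_email.sub('', text)
    text = clean.re_ip.sub('', text)
    text = clean.replace_punctuation(text)
    return clean.SEP45(text)


def cut(text):
    # Tokenize each cleaned piece and drop stop words
    for sentence in clear(text):
        for word in tk.cut(sentence):
            if word not in corpus.STOP_WORDS:
                yield word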
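

def clf_text_self(X, y):
    """Classify the training texts themselves and save the misclassified samples to Excel."""
    # Build the model
    vec = TfidfVectorizer(tokenizer=cut, max_features=N)
    clf = LogisticRegression(C=2.0)
    XX = vec.fit_transform(X)
    clf.fit(XX, y)
    classes = clf.classes_
    # Predict class probabilities on the same data the model was trained on
    proba = clf.predict_proba(XX)
    y_pred = [classes[argmax(i)] for i in proba]
    # 1 where the prediction agrees with the given label, 0 otherwise
    y01 = [1 if actual == pred else 0 for actual, pred in zip(y, y_pred)]
    proba = [max(i) for i in proba]  # probability of the predicted class
    # Confusion matrix
    matrix = confusion_matrix(y, y_pred)
    ls_of_df = [corpus.pd.DataFrame(matrix, index=classes, columns=classes)]
    # Keep only the samples whose prediction disagrees with the label
    df = corpus.pd.DataFrame({
        'X': X,
        'y': y,
        'y_pred': y_pred,
        'probability': proba,
        'y01': y01,
    })
    df = df[df['y01'] == 0][['X', 'y', 'y_pred', 'probability']]
    ls_of_df.append(df)
    # Save to Excel (df2sheets is a helper from the author's `segment` package)
    corpus.df2sheets(ls_of_df, ['matrix', 'self'], 'clf_self.xlsx')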


if __name__ == '__main__':
    from segment.app.data10 import X, Y
    clf_text_self(X, Y)
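
The script above depends on the author's own `segment` package, so it cannot be run as posted. Below is a minimal self-contained sketch of the same idea using only scikit-learn and pandas: fit on all samples, predict on those same samples, and list the ones where the prediction disagrees with the given label. The function name `find_suspect_labels`, the default word tokenizer (instead of `cut`) and the tiny English toy corpus are assumptions made for illustration, not part of the original.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


def find_suspect_labels(texts, labels, C=2.0):
    """Fit on all samples, then list the samples whose label the model disagrees with."""
    vec = TfidfVectorizer()  # default word tokenizer; the original uses a custom `cut`
    clf = LogisticRegression(C=C, max_iter=1000)
    features = vec.fit_transform(texts)
    clf.fit(features, labels)

    # Probabilities on the training data itself, as in the original script
    proba = clf.predict_proba(features)
    y_pred = clf.classes_[proba.argmax(axis=1)]

    # Rows are given labels, columns are predicted labels
    matrix = pd.DataFrame(
        confusion_matrix(labels, y_pred, labels=clf.classes_),
        index=clf.classes_, columns=clf.classes_)

    df = pd.DataFrame({
        'X': texts,
        'y': labels,
        'y_pred': y_pred,
        'probability': proba.max(axis=1),  # probability of the predicted class
    })
    suspects = df[df['y'] != df['y_pred']]
    # Most confident disagreements first: these are the first to review by hand
    return matrix, suspects.sort_values('probability', ascending=False)


# Made-up toy corpus; the last sample is deliberately given a questionable label.
texts = [
    'the team scored a late goal in the match',
    'the coach praised the team after the match',
    'the striker scored another goal for the team',
    'the stock market fell as interest rates rose',
    'the bank raised interest rates again',
    'investors sold stock as the market dropped',
    'the team lost the match despite an early goal',
]
labels = ['sports', 'sports', 'sports',
          'finance', 'finance', 'finance', 'finance']

matrix, suspects = find_suspect_labels(texts, labels)
print(matrix)
print(suspects)
```

Because the model is scored on the data it was trained on, it can memorise some of the wrong labels, so an empty `suspects` table does not show that the labels are clean.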
Rule-based
Word combinations
Regular expressions
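
The post leaves the rule-based part unfinished. As one possible reading of the two items above, the sketch below flags a sample when a word-combination rule or a regular-expression rule fires for some class but not for the class the sample is labeled with. Every rule, keyword and pattern here is a hypothetical example, as are the names `KEYWORD_RULES`, `REGEX_RULES`, `rule_labels` and `flag_conflicts`; real rules would have to be written for the actual corpus.

```python
import re

# Hypothetical word-combination rules: a rule fires when ALL words of one
# combination occur in the text.
KEYWORD_RULES = {
    'sports': [('match', 'goal'), ('team', 'coach')],
    'finance': [('stock', 'market'), ('interest', 'rate')],
}

# Hypothetical regular-expression rules.
REGEX_RULES = {
    'finance': re.compile(r'\d+(?:\.\d+)?%'),  # percentages such as 3.5%
    'sports': re.compile(r'\b\d+-\d+\b'),      # score lines such as 3-1
}


def rule_labels(text):
    """Return the set of labels whose rules fire on the text."""
    words = set(re.findall(r'\w+', text.lower()))
    hits = set()
    for label, combos in KEYWORD_RULES.items():
        if any(all(word in words for word in combo) for combo in combos):
            hits.add(label)
    for label, pattern in REGEX_RULES.items():
        if pattern.search(text):
            hits.add(label)
    return hits


def flag_conflicts(samples):
    """Yield (text, label, hits) when rules fire but none supports the given label."""
    for text, label in samples:
        hits = rule_labels(text)
        if hits and label not in hits:
            yield text, label, hits


# Made-up samples; the first one is labeled 'finance' on purpose.
samples = [
    ('The team won the match with a late goal', 'finance'),
    ('The stock market fell as the interest rate rose', 'finance'),
    ('A quiet day with nothing to report', 'sports'),
]

for text, label, hits in flag_conflicts(samples):
    print(label, sorted(hits), text)
```

Samples on which no rule fires are left alone, so rules of this kind only surface candidates for manual review and say nothing about the rest of the data.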