The Road to Machine Learning: Predicting News Categories with a Naive Bayes Classifier in Python
Posted by 稀里糊涂林老冷
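This post uses Python 3 to learn the naive Bayes classification API in scikit-learn.
It also involves extracting feature vectors from strings.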
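To make the feature-extraction step concrete, here is a minimal sketch of what CountVectorizer does to raw strings. The two-sentence corpus is a hypothetical toy example, not taken from the 20 newsgroups data; with default settings the vectorizer lowercases the text, keeps tokens of two or more characters, and counts how often each token appears in each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (an assumption for illustration only).
docs = [
    "the pens beat the devils",
    "the devils lost the game",
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix of shape (n_docs, n_tokens)

# vocabulary_ maps each token to its column index; list the tokens in column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is one document and each column is one vocabulary token, so a word that appears twice (here "the") gets a count of 2. This word-count matrix is the kind of input MultinomialNB is designed for.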
You are welcome to download the source code from my git: https://github.com/linyi0604/kaggle
from sklearn.datasets import fetch_20newsgroups
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# Text-to-feature-vector conversion module
from sklearn.feature_extraction.text import CountVectorizer
# Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Model evaluation module
from sklearn.metrics import classification_report

'''
Naive Bayes models are widely used for large-scale Internet text classification.
Because the features are assumed to be conditionally independent, the number of
parameters to estimate drops from exponential order to nearly linear order,
which saves memory and computation time.
However, the model cannot take relationships between features into account,
so it performs poorly on classification tasks whose features are strongly correlated.
'''

'''
1 Load the data
'''
# This API downloads the data over the network on demand
news = fetch_20newsgroups(subset="all")
# Check the size and details of the data
# print(len(news.data))
# print(news.data[0])
'''
18846

From: Mamatha Devineni Ratnam <[email protected]>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
'''

'''
2 Split the data
'''
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

'''
3 Predict news categories with the naive Bayes classifier
'''
# Convert the text into features
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)
# Initialize the naive Bayes model
mnb = MultinomialNB()
# Train on the training set to estimate the parameters
mnb.fit(x_train, y_train)
# Predict on the test set and keep the predictions
y_predict = mnb.predict(x_test)

'''
4 Evaluate the model
'''
print("Accuracy:", mnb.score(x_test, y_test))
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
'''
Accuracy: 0.8397707979626485
Other metrics:
                           precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
'''
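For readers who want to try the vectorize-then-classify flow without downloading the full dataset, the sketch below runs the same two steps on a tiny in-memory corpus. The texts and the labels "sport" and "space" are hypothetical, and wrapping the steps with make_pipeline is a convenience chosen for this illustration, not something the original script does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (an assumption for illustration only).
train_texts = [
    "the goalie made a great save in the hockey game",
    "the team scored in overtime to win the game",
    "the rocket launched a satellite into orbit",
    "nasa plans a new mission to the moon",
]
train_labels = ["sport", "sport", "space", "space"]

# The pipeline vectorizes the raw strings and then fits the classifier,
# so the same vocabulary is reused automatically at prediction time.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_texts = ["the satellite reached orbit", "a great game for the team"]
predicted = model.predict(new_texts)
print(list(predicted))
```

Words that never appeared during training (such as "reached") are simply ignored by the vectorizer, and the default Laplace smoothing in MultinomialNB (alpha=1.0) keeps words seen in only one class from driving the other class's probability to zero.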