NaiveBayes Classifier in R predicting only one class

【Posted】: 2015-11-19 13:55:09 【Question】:

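I am classifying Arduino posts into hardware and software categories. I prepared the training set by hand. However, when I move on to the test set, every post is predicted as "Hardware". Is there an error in the training set format? Is NaiveBayes unable to take whole sentences as the input for prediction? The training set format is: class "\t" pred "\t" set. The classifier uses the set column as the label and the pred column as the predictor. The class column is only used to create the set column.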

# programmed in R
library(e1071)
train <- read.table("train_set.csv", sep = "\t", header = TRUE)
test <- read.table("test_one.csv", sep = "\t", header = TRUE)
# derive the label column "set" from the 0/1 "class" column
train$set <- "Hardware"
train$set[train$class == 0] <- "Software"
train$set <- as.factor(train$set)
model <- naiveBayes(set ~ pred, data = train)
pred <- predict(model, train[495:510, ])  # displays train set prediction
pred1 <- predict(model, test[1:10, ])     # displays incorrect prediction for test set

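Training dataset (delimiter = \t; only 4 of the 1000 rows are shown)

1 stands for hardware and 0 for software. The program appends a column named "set" that holds "Hardware" or "Software" according to the 1 or 0.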

class   pred
1    Im making a simple Arduino web server and I want to keep it turned on all the time. So it must endure to stay working continuously. Im using an Arduino Uno with a Ethernet Shield.Its powered with a simple outlet power supply 5V @ 1A. My Questions: Will I have any problems leaving the Arduino turned on all the time? Is there some other Arduino board better recommended for this? Are there any precautions that I need to heed regarding this? 
1    Put plainly: is there a way to get an HTTPS connection on the Arduino? I have been looking in to it and I have found it is impossible with the standard library and the Ethernet shield but is there a custom library that can do it? What about a coprocessor i.e. like the WiFi shield has? Anyone know if the Arduino yn has ssl? 
0    The use of malloc and free seems pretty rare in the Arduino world. It is used in pure AVR C much more often but still with caution. Is it a really bad idea to use malloc and free with Arduino? 
0    What do I need to build a shield capable of receiving 1080p video from USB camera timestamp each frame and send the frame to memory card? 

Test dataset

 pred
arduino-uno web-server ethernet i'm making a simple arduino web server and i want to keep it turned on all the time. so it must endure to stay working continuously. i'm using an arduino uno with a ethernet shield.it's powered with a simple outlet power supply 5v @ 1a. my questions: will i have any problems leaving the arduino turned on all the time? is there some other arduino board better recommended for this? are there any precautions that i need to heed regarding this?    
I made a circuit which in my intentions would allow me to toggle a LED dimming loop. Problem is that once I push the button the first time pushing it a second time doesnt toggle the LED loop off. Here is the code: const int LED = 9; // the pin for the LEDconst int BUTTON = 7;int val = LOW;int old_val = LOW;int state = 0;int i = 0;void setup pinModeLED OUTPUT; pinModeBUTTON INPUT;void loop val = digitalReadBUTTON; if val == HIGH &amp;&amp; old_val==LOW  state = 1 - state; delay10;  old_val = val; if state == 1  for i = 0; i &lt; 255; i++ // loop from 0 to 254 fade in  analogWriteLED i; // set the LED brightness delay10; // Wait 10ms because analogWrite // is instantaneous and we would // not see any change  for i = 255; i &gt; 0; i-- // loop from 255 to 1 fade out  analogWriteLED i; // set the LED brightness delay10; // Wait 10m

Expected output: Hardware, Software
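
A possible reason every test post gets the same label: with the formula `set ~ pred`, the whole post text acts as a single categorical predictor, so each training post is its own level and no test post matches any of them. The following is a minimal base-R sketch of that effect. The posts in it are made up for illustration and do not come from train_set.csv or test_one.csv.

```r
# Hypothetical toy posts; not taken from the question's files.
train <- data.frame(
  pred = c("my ethernet shield overheats",
           "is malloc safe on avr",
           "power supply for arduino uno"),
  set  = factor(c("Hardware", "Software", "Hardware")),
  stringsAsFactors = TRUE   # the read.table() default before R 4.0
)
test <- data.frame(
  pred = c("ethernet shield power question",
           "malloc and free in a sketch"),
  stringsAsFactors = TRUE
)

# Every training post is a distinct level of one categorical variable.
levels_equal_rows <- nlevels(train$pred) == nrow(train)

# No test post appears among the training levels, so the model is asked
# to classify values it has never seen.
seen_in_training <- as.character(test$pred) %in% levels(train$pred)

print(levels_equal_rows)
print(seen_in_training)
```

Splitting each post into words, so that individual terms become the predictors, is what lets training and test posts share features.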

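【Comments】:

Ummmm... where is your document-term matrix? What does head(train) show? (If it is the snippet you posted, then you are missing a crucial step.)

I am missing the document-term matrix. Thanks for mentioning it. I will post the document-term matrix as soon as possible. I am new to this field. I could use tf-idf weighting on the document-term matrix. @Vlo

Thanks a lot @Vlo. Your question about the document-term matrix led me to the right path.

【Answer 1】: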
library(e1071)
library(tm)
library(MASS)
library(SnowballC)

train = read.table("train_set.csv", sep="\t", header=T)
test = read.table("test_set.csv", sep="\t", header=T)

#stopwords
mystopwords <- c(stopwords("english"),"week","arduino","words","need","get","will","want","know","work","also")

#corpus for train set
train.corpus <- Corpus(VectorSource(train$pred))
train.corpus <- tm_map(train.corpus, content_transformer(tolower))
train.corpus <- tm_map(train.corpus, removePunctuation)
train.corpus <- tm_map(train.corpus, stripWhitespace)
train.corpus <- tm_map(train.corpus, removeNumbers)
train.corpus <- tm_map(train.corpus, removeWords, mystopwords)
train.corpus <- tm_map(train.corpus, stemDocument)
train.corpus <- tm_map(train.corpus, removeWords, "(http)\\w+")
train.corpus <- tm_map(train.corpus, removeWords, "\\b[a-zA-Z0-9]{10,100}\\b")
train.corpus.dtm <- DocumentTermMatrix(train.corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, removePunctuation=TRUE))
train.corpus.dtms <- removeSparseTerms(train.corpus.dtm, 0.98)

#Debugging
#TermDocumentMatrix(train.corpus)
#inspect(train.corpus.dtm)
#findFreqTerms(train.corpus.dtm, N)   #N <- freq

#corpus for test set
test.corpus <- Corpus(VectorSource(test$pred))
test.corpus <- tm_map(test.corpus, content_transformer(tolower))
test.corpus <- tm_map(test.corpus, removePunctuation)
test.corpus <- tm_map(test.corpus, stripWhitespace)
test.corpus <- tm_map(test.corpus, removeNumbers)
test.corpus <- tm_map(test.corpus, removeWords, mystopwords)
test.corpus <- tm_map(test.corpus, stemDocument)
test.corpus <- tm_map(test.corpus, removeWords, "(http)\\w+")
test.corpus <- tm_map(test.corpus, removeWords, "\\b[a-zA-Z0-9]{10,100}\\b")
test.corpus.dtm <- DocumentTermMatrix(test.corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, removePunctuation=TRUE))
test.corpus.dtms <- removeSparseTerms(test.corpus.dtm, 0.98) 


m <- as.matrix(train.corpus.dtms)
n <- as.matrix(test.corpus.dtms)

#Train model
model <- naiveBayes(m,as.factor(train$class));

#Prediction
results <- predict(model,n[1:10,])
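
One thing to watch in the code above: the train and test matrices are built from separate corpora, so their columns (terms) generally differ. Below is a minimal base-R sketch, on toy data, of padding and reordering a test matrix to the training vocabulary. The names `align_to_vocab`, `m_toy` and `n_toy` are hypothetical and not part of the answer. tm's `DocumentTermMatrix` also accepts a `dictionary` entry in its `control` list, which may serve the same purpose when the test matrix is built.

```r
# Hypothetical helper: reorder and zero-pad a test matrix so that its
# columns match the training vocabulary exactly.
align_to_vocab <- function(test_mat, vocab) {
  out <- matrix(0, nrow = nrow(test_mat), ncol = length(vocab),
                dimnames = list(rownames(test_mat), vocab))
  shared <- intersect(colnames(test_mat), vocab)
  out[, shared] <- test_mat[, shared, drop = FALSE]
  out
}

# Toy matrices standing in for m (train) and n (test).
m_toy <- matrix(c(1, 0, 2, 0, 1, 1), nrow = 2,
                dimnames = list(NULL, c("shield", "malloc", "power")))
n_toy <- matrix(c(3, 1, 0, 2), nrow = 2,
                dimnames = list(NULL, c("power", "button")))

# "button" is dropped (unknown to the model); "shield" and "malloc"
# are added as all-zero columns.
n_aligned <- align_to_vocab(n_toy, colnames(m_toy))
print(n_aligned)
```

This only aligns the columns. The tf-idf weights of the test set are still computed from the test corpus alone, which is a separate design decision.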

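The next step is to add 10-fold cross-validation to this classifier to check its performance; that is where I am stuck now.

【Comments】:

You can use library(caret) together with the naive Bayes implementation from library(klaR): topepo.github.io/caret/modelList.html

Doesn't "caret" work on R 3.1.2? It depends on "car", which requires R >= 3.2. R 3.2 has incompatibility issues with some packages, so I downgraded to R 3.1. Any suggestions?

Create your own test and training datasets. It is not hard.

I processed the dataset from the arduino.stackexchange.com dump according to my requirements. The training and test datasets were refined by me.

And how do I create my own dataset related to this query?

@KHANirfan My problem was related to the DocumentTerm matrix. I was passing the strings directly as input to the model, which is incorrect. The first step is to create the DocumentTerm matrix, which is just a vectorization of the strings plus some basic normalization.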
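
A minimal sketch of 10-fold cross-validation in base R, in case it helps with the step above. `cv_accuracy` and the majority-class stand-in classifier are hypothetical names introduced here so that the block runs without extra packages; the toy data is random and only illustrates the mechanics.

```r
# Hypothetical helper: k-fold cross-validated accuracy for any classifier
# supplied as a pair of functions (fit, pred).
cv_accuracy <- function(x, y, k = 10, fit, pred, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k folds of near-equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))
  acc <- numeric(k)
  for (i in seq_len(k)) {
    held_out <- fold == i
    model    <- fit(x[!held_out, , drop = FALSE], y[!held_out])
    guess    <- pred(model, x[held_out, , drop = FALSE])
    acc[i]   <- mean(as.character(guess) == as.character(y[held_out]))
  }
  mean(acc)   # average accuracy over the k held-out folds
}

# Stand-in classifier: always predict the most frequent training class.
fit_majority  <- function(x, y) names(which.max(table(y)))
pred_majority <- function(model, x) rep(model, nrow(x))

# Toy data: 100 rows, 70 "Hardware" and 30 "Software".
set.seed(42)
x_toy <- matrix(rnorm(200), nrow = 100)
y_toy <- factor(rep(c("Hardware", "Software"), times = c(70, 30)))

baseline <- cv_accuracy(x_toy, y_toy, k = 10,
                        fit = fit_majority, pred = pred_majority)
print(baseline)
```

With e1071 loaded, the model from the answer could presumably be plugged in as `fit = function(x, y) naiveBayes(x, y)` and `pred = function(model, x) predict(model, x)`, called on `m` and `as.factor(train$class)`. Comparing the result against the majority-class baseline shows whether the classifier learns anything beyond the class proportions.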
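
The vectorization described in the last comment can be sketched in base R without tm, to show what a document-term matrix contains. This is an illustration under simplifying assumptions (lower-casing and letters-only tokens, no stemming or stop-word removal); `tokenize`, `posts` and `dtm` are hypothetical names and the posts are made up.

```r
# Hypothetical toy posts standing in for train$pred.
posts <- c("Ethernet shield and power supply",
           "malloc and free on AVR",
           "power supply for the shield")

# Basic normalization: lower-case, keep letters only, split on spaces.
tokenize <- function(text) {
  text  <- tolower(text)
  text  <- gsub("[^a-z]+", " ", text)
  words <- strsplit(text, " +")[[1]]
  words[nchar(words) > 0]
}

tokens <- lapply(posts, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# One row per post, one column per term; each cell is a term count.
dtm <- t(vapply(tokens,
                function(w) as.integer(table(factor(w, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)
```

Each column of `dtm` is then a numeric predictor shared by all posts, which is the form of input the answer's `naiveBayes(m, ...)` call works with.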
