Naive Bayes classifier in R predicts only one class
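【Title】: NaiveBayes Classifier in R predicting only one class
【Posted】: 2015-11-19 13:55:09
【Question】: I am classifying Arduino posts into Hardware and Software categories. I prepared the training set by hand. However, on the test set every post is predicted as "Hardware". Is there something wrong with the format of the training set? Is naiveBayes unable to take sentences as input for prediction?
The training set format is: class "\t" pred "\t" set. The classifier uses the set column as the label and the pred column as the predictor. The class column is only used to create the set column.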
# Programmed in R
library(e1071)
train <- read.table("train_set.csv", sep = "\t", header = TRUE)
test <- read.table("test_one.csv", sep = "\t", header = TRUE)
# Derive a text label from the 0/1 class column
train$set <- "Hardware"
train$set[train$class == 0] <- "Software"
train$set <- as.factor(train$set)
model <- naiveBayes(set ~ pred, data = train)
pred <- predict(model, train[495:510, ])  # displays train set prediction
pred1 <- predict(model, test[1:10, ])  # displays incorrect prediction for test set
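Training dataset (separator = \t; only 4 of the 1000 rows are shown)
1 stands for hardware and 0 for software. In the program, a column named "set" is appended to hold "Hardware" or "Software" for 1 and 0 respectively.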
class pred
1 Im making a simple Arduino web server and I want to keep it turned on all the time. So it must endure to stay working continuously. Im using an Arduino Uno with a Ethernet Shield.Its powered with a simple outlet power supply 5V @ 1A. My Questions: Will I have any problems leaving the Arduino turned on all the time? Is there some other Arduino board better recommended for this? Are there any precautions that I need to heed regarding this?
1 Put plainly: is there a way to get an HTTPS connection on the Arduino? I have been looking in to it and I have found it is impossible with the standard library and the Ethernet shield but is there a custom library that can do it? What about a coprocessor i.e. like the WiFi shield has? Anyone know if the Arduino yn has ssl?
0 The use of malloc and free seems pretty rare in the Arduino world. It is used in pure AVR C much more often but still with caution. Is it a really bad idea to use malloc and free with Arduino?
0 What do I need to build a shield capable of receiving 1080p video from USB camera timestamp each frame and send the frame to memory card?
Test dataset
pred
arduino-uno web-server ethernet i'm making a simple arduino web server and i want to keep it turned on all the time. so it must endure to stay working continuously. i'm using an arduino uno with a ethernet shield.it's powered with a simple outlet power supply 5v @ 1a. my questions: will i have any problems leaving the arduino turned on all the time? is there some other arduino board better recommended for this? are there any precautions that i need to heed regarding this?
I made a circuit which in my intentions would allow me to toggle a LED dimming loop. Problem is that once I push the button the first time pushing it a second time doesnt toggle the LED loop off. Here is the code: const int LED = 9; // the pin for the LEDconst int BUTTON = 7;int val = LOW;int old_val = LOW;int state = 0;int i = 0;void setup pinModeLED OUTPUT; pinModeBUTTON INPUT;void loop val = digitalReadBUTTON; if val == HIGH && old_val==LOW state = 1 - state; delay10; old_val = val; if state == 1 for i = 0; i < 255; i++ // loop from 0 to 254 fade in analogWriteLED i; // set the LED brightness delay10; // Wait 10ms because analogWrite // is instantaneous and we would // not see any change for i = 255; i > 0; i-- // loop from 255 to 1 fade out analogWriteLED i; // set the LED brightness delay10; // Wait 10m
Expected output: Hardware Software
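【Comments】:
Ummmm... where is your document-term matrix? What does head(train) show? (If it is the snippet you posted, then you are missing a crucial step.)
I am missing the document-term matrix. Thanks for mentioning it. I will post the document-term matrix as soon as possible. I am very new to this field. I can use a tf-idf weighting for the document-term matrix. @Vlo
Thanks a lot @Vlo. Your question about the document-term matrix led me to the right path.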
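For readers unfamiliar with the term, here is a minimal base-R sketch of what a document-term matrix with tf-idf weighting looks like. The three posts are made up for illustration and the variable names (`posts`, `dtm`, `tfidf`) are assumptions, not part of the thread. The weighting shown is the unnormalised form, raw count times log2(N / document frequency), which is intended to mirror what tm's `weightTfIdf(x, normalize = FALSE)` does as far as I understand it.

```r
# Toy posts, invented for this illustration (not taken from the real dataset).
posts <- c("arduino ethernet shield power supply",
           "malloc free memory arduino",
           "ethernet shield https library")

# Split each post into lower-case word tokens and collect the vocabulary.
tokens <- strsplit(tolower(posts), "[^a-z0-9]+")
vocab <- sort(unique(unlist(tokens)))

# Count matrix: one row per document, one column per term.
dtm <- t(vapply(tokens,
                function(tk) as.numeric(table(factor(tk, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Unnormalised tf-idf: raw count times log2(number of docs / docs containing the term).
idf <- log2(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(dtm, 2, idf, `*`)

print(round(tfidf, 3))
```

Each row is now a numeric vector of equal length, which is the kind of input a classifier can compare across documents. Terms that occur in fewer documents (such as "malloc" here) receive a larger weight than terms shared by several posts.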
【Answer 1】:
library(e1071)
library(tm)
library(MASS)
library(SnowballC)  # stemmer backend used by stemDocument
train <- read.table("train_set.csv", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
test <- read.table("test_set.csv", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
# Stopwords: the standard English list plus frequent, uninformative domain words
mystopwords <- c(stopwords("english"), "week", "arduino", "words", "need", "get", "will", "want", "know", "work", "also")
# Corpus for the training set
# VCorpus is used explicitly so the behaviour does not depend on which corpus type Corpus() returns in the installed tm version
train.corpus <- VCorpus(VectorSource(train$pred))
train.corpus <- tm_map(train.corpus, content_transformer(tolower))
train.corpus <- tm_map(train.corpus, removePunctuation)
train.corpus <- tm_map(train.corpus, stripWhitespace)
train.corpus <- tm_map(train.corpus, removeNumbers)
train.corpus <- tm_map(train.corpus, removeWords, mystopwords)
train.corpus <- tm_map(train.corpus, stemDocument)
train.corpus <- tm_map(train.corpus, removeWords, "(http)\\w+")  # leftover URLs
train.corpus <- tm_map(train.corpus, removeWords, "\\b[a-zA-Z0-9]{10,100}\\b")  # very long tokens
train.corpus.dtm <- DocumentTermMatrix(train.corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, removePunctuation = TRUE))
train.corpus.dtms <- removeSparseTerms(train.corpus.dtm, 0.98)
# Debugging
# TermDocumentMatrix(train.corpus)
# inspect(train.corpus.dtm)
# findFreqTerms(train.corpus.dtm, N)  # N <- freq
# Corpus for the test set (same cleaning steps as for the training set)
test.corpus <- VCorpus(VectorSource(test$pred))
test.corpus <- tm_map(test.corpus, content_transformer(tolower))
test.corpus <- tm_map(test.corpus, removePunctuation)
test.corpus <- tm_map(test.corpus, stripWhitespace)
test.corpus <- tm_map(test.corpus, removeNumbers)
test.corpus <- tm_map(test.corpus, removeWords, mystopwords)
test.corpus <- tm_map(test.corpus, stemDocument)
test.corpus <- tm_map(test.corpus, removeWords, "(http)\\w+")
test.corpus <- tm_map(test.corpus, removeWords, "\\b[a-zA-Z0-9]{10,100}\\b")
# Restrict the test matrix to the training vocabulary so its columns match the features the model was trained on.
# Note that the idf part of the weights is still computed from the test documents themselves.
test.corpus.dtm <- DocumentTermMatrix(test.corpus, control = list(dictionary = Terms(train.corpus.dtms), weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, removePunctuation = TRUE))
m <- as.matrix(train.corpus.dtms)
n <- as.matrix(test.corpus.dtm)
# Train model
model <- naiveBayes(m, as.factor(train$class))
# Prediction
results <- predict(model, n[1:10, ])
The next step is to add 10-fold cross-validation to this classifier to check its performance; that is where I am stuck now.
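A minimal sketch of how the 10-fold cross-validation mentioned above might be done by hand. The helper `kfold_accuracy` and its `fit`/`guess` arguments are hypothetical names introduced here, and `iris` is only a stand-in for the document-term matrix `m` and the labels `train$class`. The default model is `e1071::naiveBayes`, as in the answer; the demo call is skipped if e1071 is not installed.

```r
# Hypothetical helper (not from the original post): k-fold cross-validated accuracy.
kfold_accuracy <- function(x, y, k = 10,
                           fit = function(x, y) e1071::naiveBayes(x, y),
                           guess = function(model, newx) predict(model, newx)) {
  stopifnot(nrow(x) == length(y))
  y <- as.factor(y)
  # Randomly assign every row to one of k folds of (almost) equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))
  acc <- vapply(seq_len(k), function(i) {
    held_out <- fold == i
    # Train on the other k - 1 folds, then score on the held-out fold.
    model <- fit(x[!held_out, , drop = FALSE], y[!held_out])
    pred <- guess(model, x[held_out, , drop = FALSE])
    mean(as.character(pred) == as.character(y[held_out]))
  }, numeric(1))
  list(per_fold = acc, mean = mean(acc))
}

set.seed(1)
# Stand-in data: in the answer above these would be `m` and `train$class`.
toy_x <- as.matrix(iris[, 1:4])
toy_y <- iris$Species
if (requireNamespace("e1071", quietly = TRUE)) {
  cv <- kfold_accuracy(toy_x, toy_y, k = 10)
  print(cv$mean)
}
```

One caveat with text data: if the vocabulary and idf weights are computed once on the whole training set, the held-out fold has already influenced the features. A stricter variant would rebuild the document-term matrix inside each fold.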
【Discussion】:
You can use library(caret) together with the naive Bayes implementation from library(klaR): topepo.github.io/caret/modelList.html
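A rough sketch of what that suggestion could look like, assuming both caret and klaR are installed. `caret_nb_cv` is a hypothetical wrapper name, and `iris` again stands in for the document-term matrix and its labels. In caret, `method = "nb"` selects the klaR-based naive Bayes model and `trainControl(method = "cv", number = 10)` requests 10-fold cross-validation.

```r
# Hypothetical wrapper, not from the thread: 10-fold CV of caret's "nb" model.
caret_nb_cv <- function(x, y, folds = 10) {
  # Return NULL instead of failing when the optional packages are absent.
  if (!requireNamespace("caret", quietly = TRUE) ||
      !requireNamespace("klaR", quietly = TRUE)) {
    return(NULL)
  }
  ctrl <- caret::trainControl(method = "cv", number = folds)
  caret::train(x = x, y = y, method = "nb", trControl = ctrl)
}

set.seed(1)
# iris is only a stand-in for the real features and labels.
fit <- caret_nb_cv(iris[, 1:4], iris$Species)
if (!is.null(fit)) print(fit$results)
```

The `results` table reports cross-validated accuracy for each tuning setting that caret tried.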
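Does "caret" not work on R 3.1.2? It depends on "car", which requires R >= 3.2. R 3.2 has incompatibilities with some packages, which is why I downgraded to R 3.1. Any suggestions?
Create your own test and training datasets. It is not that hard.
I processed the arduino.stackexchange.com data dump to fit my requirements. The training and test datasets were refined by me. And how does creating my own datasets relate to this query?
@KHANirfan My problem was related to the document-term matrix. I was passing the strings directly to the model as input, which is incorrect. The first step is to create the document-term matrix, which is just a vectorization of the strings plus some basic normalization.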
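To make that last point concrete, here is a small base-R illustration with made-up posts (the variable names are assumptions). When whole posts are used as a single predictor, every post is its own distinct value and a new post never matches anything seen in training. That leaves the model with essentially no evidence for unseen text, which is consistent with every test post receiving the same label. Once the posts are split into words, train and test share features.

```r
# Made-up posts for illustration only.
train_posts <- c("my ethernet shield needs a power supply",
                 "is malloc safe on arduino")
test_posts <- c("ethernet shield power question")

# Treated as whole strings, each training post is its own factor level ...
all_unique <- nlevels(factor(train_posts)) == length(train_posts)
# ... and the new post matches none of them.
seen_before <- any(test_posts %in% train_posts)

# Split into words, and there is overlap a classifier can learn from.
words <- function(x) unique(unlist(strsplit(tolower(x), "[^a-z0-9]+")))
shared <- intersect(words(test_posts), words(train_posts))

print(all_unique)   # TRUE
print(seen_before)  # FALSE
print(shared)       # "ethernet" "shield" "power"
```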