Mahout Naive Bayes model: "Unable to find cached files" issue

【Title】: Mahout Naive Bayes Model Unable to find cached files issue 【Posted】: 2014-08-23 05:31:56 【Question】:

I have imported:

import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

from Mahout-Core-0.9-job.

But when I try to call it as follows:

String[] trainerArgs = { "-i", vectorsDirectory + "tfidf-vectors/",
                         "-o", modelDirectory,
                         "-l", labelIndex,
                         "-el", "-ow" };

TrainNaiveBayesJob thisTrainer = new TrainNaiveBayesJob();
thisTrainer.run(trainerArgs);

I get the following error:

java.lang.Exception: java.lang.IllegalStateException: Unable to find cached files!
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.IllegalStateException: Unable to find cached files!
    at com.google.common.base.Preconditions.checkState(Preconditions.java:176)
    at org.apache.mahout.common.HadoopUtil.getCachedFiles(HadoopUtil.java:300)
    at org.apache.mahout.common.HadoopUtil.getSingleCachedFile(HadoopUtil.java:281)
    at org.apache.mahout.classifier.naivebayes.BayesUtils.readIndexFromCache(BayesUtils.java:146)
    at org.apache.mahout.classifier.naivebayes.training.IndexInstancesMapper.setup(IndexInstancesMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

What exactly am I missing here? Can anyone point me in the right direction?

【Answer 1】:

Simply changing the call to the Naive Bayes trainer to:

ToolRunner.run(new Configuration(), new TrainNaiveBayesJob(), trainerArgs);

solved the problem.
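
A likely reason, offered as an assumption rather than something stated in this answer: ToolRunner.run calls setConf on the job before invoking run, so every step of the job shares one Configuration. When run is called directly on a freshly constructed job, no Configuration has been set. If the base class then hands back a new Configuration on each getConf() call (which Mahout's AbstractJob appears to do, though this is worth checking against the 0.9 source), the label index registered in the distributed cache lands on an object the mapper never sees. The stdlib-only sketch below models that behavior; Conf, Job, and runWithInjectedConf are hypothetical stand-ins, not Hadoop or Mahout classes.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfSketch {
    // Hypothetical stand-in for a Hadoop Configuration: a plain key/value store.
    static class Conf {
        final Map<String, String> values = new HashMap<>();
    }

    // Hypothetical stand-in for a job base class. ASSUMPTION: when no
    // configuration was injected, getConf() returns a fresh Conf on every call.
    static class Job {
        private Conf conf;

        void setConf(Conf c) {
            conf = c;
        }

        Conf getConf() {
            return conf != null ? conf : new Conf();
        }

        // Registers a "cached file" via one getConf() call, then looks it up
        // via a second call, the way separate steps of a job would.
        boolean run() {
            getConf().values.put("cache.files", "labelindex");
            return getConf().values.containsKey("cache.files");
        }
    }

    // Models the relevant part of ToolRunner.run: inject the configuration
    // first, then run the job.
    static boolean runWithInjectedConf(Job job) {
        job.setConf(new Conf());
        return job.run();
    }

    public static void main(String[] args) {
        System.out.println("direct run finds cached file:   " + new Job().run());
        System.out.println("injected run finds cached file: " + runWithInjectedConf(new Job()));
    }
}
```

In this model the direct call loses the cached entry because each getConf() call yields a different object, while the injected variant keeps it. If the same mechanism is at work in Mahout, calling setConf(new Configuration()) on the job before run should behave like the ToolRunner call, although ToolRunner additionally parses the generic Hadoop command-line options.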
