Why does WEKA's StringToWordVector keep the wrong number of words?

Posted 2023-03-12


Asked: 2016-03-31 11:35:12

Question:

I want to apply StringToWordVector to my dataset in the WEKA application. I set the wordsToKeep parameter to 50, but the result contains 78 words. I want 50 words, not 78. How can I correct this?

My dataset: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection

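Comments:

Please attach the code you are using; without it, it is hard to help.

WEKA is an application for data mining. weka.sourceforge.net/doc.dev/weka/filters/unsupervised/…

No, WEKA is a library that can be used from code as well as through the WEKA GUI. If you only use the GUI, you should describe the exact steps you performed before the problem appeared, so that we can reproduce it easily.

I opened the Preprocess tab and loaded my dataset file in ".arff" format. Then I clicked the filter Choose button and selected StringToWordVector. Once the filter was set, I clicked on it and the StringToWordVector editor opened. wordsToKeep defaults to 1000; I set it to 50 and clicked Apply. The Attributes panel shows 78 attributes, but I set 50. Why are there 78 attributes?

Answer 1: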

The -W option limits the number of words to keep per class, so with 2 classes, setting -W 50 can allow up to 100 words.
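To make the per-class limit concrete, here is a minimal sketch, not WEKA's actual code. It assumes each class selects its own most frequent words and the selections are then merged into a single dictionary. The class `PerClassWordLimit`, its methods, and the word counts are all hypothetical, and the selection is a strict top-n, which is simpler than what WEKA does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not part of WEKA.
class PerClassWordLimit {

    /** Returns the n most frequent words of one class (ties broken alphabetically). */
    static Set<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Set<String> top = new TreeSet<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Merges the per-class selections; a word shared by several classes counts once. */
    static Set<String> dictionary(List<Map<String, Integer>> countsPerClass, int n) {
        Set<String> merged = new TreeSet<>();
        for (Map<String, Integer> counts : countsPerClass) {
            merged.addAll(topWords(counts, n));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up word counts for two classes.
        Map<String, Integer> spam = Map.of("free", 5, "win", 4, "call", 3);
        Map<String, Integer> ham = Map.of("call", 6, "ok", 5, "home", 2);
        // A limit of 2 per class with 2 classes yields 4 words here.
        System.out.println(dictionary(List.of(spam, ham), 2)); // prints [call, free, ok, win]
    }
}
```

Under this assumption the merged dictionary holds between n and n times the number of classes words, depending on how many top words the classes share, which is consistent with a result of 78 for a limit of 50 and two classes.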

From the source, the tip text for this option:

public String wordsToKeepTipText() {
    return "The number of words (per class if there is a class attribute "+
    "assigned) to attempt to keep.";
}

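Also, based on the source, this is not a strict limit: it only determines where the sorted list of word counts is pruned, so the number of words actually kept can vary.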

// sort the array
sortArray(array);
if (array.length < m_WordsToKeep) {
    // if there aren't enough words, set the threshold to
    // minFreq
    prune[z] = m_minTermFreq;
} else {
    // otherwise set it to be at least minFreq
    prune[z] = Math.max(m_minTermFreq,
        array[array.length - m_WordsToKeep]);
}
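
The excerpt above only computes a frequency threshold; it does not count words. The following standalone sketch mirrors that computation to show why the threshold is a soft limit. It assumes, as an illustration, that every word whose count is at or above the threshold is kept. The class `PruneThresholdSketch`, its methods, and the counts are hypothetical, and `Arrays.sort` stands in for the `sortArray` call.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not part of WEKA.
class PruneThresholdSketch {

    /** Mirrors the excerpt: the count at position (length - wordsToKeep) of the ascending array. */
    static int pruneThreshold(int[] counts, int wordsToKeep, int minTermFreq) {
        int[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending order
        if (sorted.length < wordsToKeep) {
            // not enough words: fall back to the minimum term frequency
            return minTermFreq;
        }
        return Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }

    /** Assumption: a word is kept when its count is at or above the threshold. */
    static int keptWords(int[] counts, int threshold) {
        int kept = 0;
        for (int count : counts) {
            if (count >= threshold) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up counts for one class; four words tie at a count of 4.
        int[] counts = {9, 7, 4, 4, 4, 4, 2, 1};
        int threshold = pruneThreshold(counts, 3, 1);
        System.out.println("threshold = " + threshold);              // prints threshold = 4
        System.out.println("kept = " + keptWords(counts, threshold)); // prints kept = 6
    }
}
```

With wordsToKeep set to 3, the threshold lands on a count shared by four words, so six words pass instead of three. Ties at the cut-off are one way a class can contribute more words than the requested limit.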

