CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Posted 2023-03-12


【Posted】: 2014-12-09 14:48:10

【Question】:

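I have a one-dimensional array in which every element holds a large string. I am trying to use CountVectorizer to convert the text data into numerical vectors. However, I get the following error: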

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray holds a large string in each element, and there are 5000 such samples. I am trying to vectorize it as follows:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  # (1, 1) is the default
    dtype='double',
)
data = vectorizer.fit_transform(mealarray)  # mealarray: the array of 5000 strings

Full stack trace:

  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

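【Comments】:

Someone (without the full stack trace it is hard to tell whether it is scikit-learn or NumPy) is trying to treat a NumPy array as a string ("FOO".lower() returns "foo"). Are you sure the contents of mealarray are strings, or does CountVectorizer want an array of strings?

@AhmedFasih, I just added the full stack trace to the question!

【Answer 1】: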

I ran into the same error:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

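To fix it, I did the following.

First, check the dimensions of the array:

name_of_array1.shape

If the array turns out to be two-dimensional, use flatten() to convert it to one dimension:

flat_array = name_of_array1.flatten()

The flattened array can then be passed to CountVectorizer().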
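The steps in this answer can be sketched end to end as follows. The two sample documents are made up for illustration, and `name_of_array1` simply reuses the placeholder name from the answer; this is a minimal sketch, not the answerer's actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for name_of_array1: a column of documents, shape (n, 1).
name_of_array1 = np.array([["red apple pie"], ["green apple tart"]], dtype=object)
print(name_of_array1.shape)  # (2, 1)

# flatten() returns a one-dimensional copy, so each element is a plain string.
flat_array = name_of_array1.flatten()
print(flat_array.shape)  # (2,)

# Each element is now a str, which is what the vectorizer's preprocessor expects.
counts = CountVectorizer().fit_transform(flat_array)
print(counts.shape)  # (2, 5): two documents, five distinct tokens
```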

【Comments】:

【Answer 2】:

A better solution is to explicitly select a pandas Series and pass it to CountVectorizer():

>>> tex = df4['Text']
>>> type(tex)
<class 'pandas.core.series.Series'>
>>> X_train_counts = count_vect.fit_transform(tex)

The next one does not work, because it is a DataFrame rather than a Series:

>>> tex2 = (df4.ix[0:,[11]])  # note: .ix was removed in pandas 1.0
>>> type(tex2)
<class 'pandas.core.frame.DataFrame'>
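The Series versus DataFrame difference can be illustrated with a small sketch. The frame contents and the `Label` column are invented for the example; only the names `df4`, `Text`, and `count_vect` come from the answer. With string column labels, passing the DataFrame does not raise here but silently vectorizes the column names, because iterating over a DataFrame yields its column labels rather than its rows.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame standing in for df4.
df4 = pd.DataFrame({
    "Text": ["good movie", "bad movie", "great plot"],
    "Label": [1, 0, 1],
})

series = df4["Text"]    # single brackets select a Series
frame = df4[["Text"]]   # a list of columns selects a DataFrame

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(series)
print(X_train_counts.shape)  # (3, 5): three documents, five distinct tokens

# Iterating over a DataFrame yields column labels, so the only "document"
# the vectorizer sees is the string "Text".
wrong = CountVectorizer().fit_transform(frame)
print(wrong.shape)  # (1, 1)
```

If the column labels are not strings (for example the integer label 11 in the answer), the same call fails with an AttributeError instead, since the label has no lower() method.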

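【Comments】:

【Answer 3】:

I found the answer to my question. Basically, CountVectorizer takes a list (with string contents) as its argument rather than an array. That solved my problem.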

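【Comments】:

Hi @ashu, could you share the changes you made in your code, if you still have them?

Close, but not quite: it has to be a one-dimensional array or list.

【Answer 4】:

Check the shape of mealarray. If the argument to fit_transform is an array of strings, it must be a one-dimensional array. (That is, mealarray.shape must be of the form (n,).) For example, if mealarray has a shape such as (n, 1), you will get the "no attribute" error.

You could try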

data = vectorizer.fit_transform(mealarray.ravel())
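To see why the shape matters, here is a minimal reproduction under assumed data: the two sample strings are hypothetical, and only the names `mealarray` and `vectorizer` come from the question. Iterating over an (n, 1) array yields length-1 arrays rather than strings, so the vectorizer's preprocessor ends up calling lower() on an ndarray.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data standing in for mealarray, shaped (n, 1) as in the question.
mealarray = np.array([["rice and beans"], ["beans on toast"]], dtype=object)
vectorizer = CountVectorizer()

# Each "document" seen by the vectorizer is a 1-element ndarray, not a str.
try:
    vectorizer.fit_transform(mealarray)
    failed = False
except AttributeError as exc:
    failed = True
    print(exc)

# ravel() gives a one-dimensional view (or a copy when a view is not possible),
# so the vectorizer now receives plain strings.
data = vectorizer.fit_transform(mealarray.ravel())
print(data.shape)
print(sorted(vectorizer.vocabulary_))
```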

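【Comments】:

I tried it with ravel and got the following error: AttributeError: 'NoneType' object has no attribute 'lower'. The shape of mealarray is (5000,1) because I created it with "mealarray = np.empty((plen,1), dtype=object)".

OK, and then you fill the array. In that case you must have counted the actual number of words in mealarray, right? Suppose it is nwords. Then pass mealarray[:nwords].ravel() to fit_transform(). (Though I wonder why you created the array with shape (plen,1) rather than just (plen,).)

Note: in my previous comment I assumed that you fill mealarray from the beginning, with no None-containing indices in between the indices that contain words.

@WarrenWeckesser I had a similar problem, and your ravel() solution worked for me. Thanks!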
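The 'NoneType' variant of the error discussed in these comments can be sketched as follows. The values of `plen` and `nwords` and the sample strings are hypothetical; the filtering step at the end is an additional assumption for the case where unfilled slots are scattered, not something proposed in the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

plen = 4  # hypothetical capacity
mealarray = np.empty((plen, 1), dtype=object)  # every slot starts out as None
mealarray[0, 0] = "chicken soup"
mealarray[1, 0] = "tomato soup"
nwords = 2  # number of slots actually filled, counted from the start

# Slice off the unfilled tail before flattening, as suggested in the comments.
docs = mealarray[:nwords].ravel()
data = CountVectorizer().fit_transform(docs)
print(data.shape)  # (2, 3): two documents, three distinct tokens

# Assumption: if None entries may be scattered, drop them explicitly instead.
docs_filtered = [d for d in mealarray.ravel() if d is not None]
print(len(docs_filtered))  # 2
```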
