numpy sum gives an error


Posted: 2015-07-02 03:37:05

Question:

How can I fix the following error?

    dist = np.sum(train_data_features, axis=0)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/fromnumeric.py", line 1711, in sum
        return sum(axis=axis, dtype=dtype, out=out)
    TypeError: sum() got an unexpected keyword argument 'dtype'

Here is my code:

import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from KaggleWord2VecUtility import KaggleWord2VecUtility
import pandas as pd
import numpy as np

if __name__ == '__main__':
    train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTrain.csv'), header=0)
    test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTest.csv'), header=0)
    train["Abstract"].fillna(0)
    print 'A sample Abstract is:'
    print train["Abstract"][0]
    #raw_input("Press Enter to continue...")


    #print 'Download text data sets. If you already have NLTK datasets downloaded, just close the Python download window...'
    #nltk.download()  # Download text data sets, including stop words

    # Initialize an empty list to hold the clean reviews
    clean_train_reviews = []
    # Loop over each review; create an index i that goes from 0 to the length
    # of the movie review list
    print "Cleaning and parsing the training set abstracts...\n"
    #for i in xrange( 0, len(train["Abstract"])):
    for i in xrange( 0, 10):
        if pd.isnull(train["Abstract"][i])==False:
            clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["Abstract"][i], True)))
        else:
            clean_train_reviews.append(" ")
    print clean_train_reviews  

    # ****** Create a bag of words from the training set
    #
    print "Creating the bag of words...\n"


    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.
    vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000)

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of
    # strings.
    print clean_train_reviews
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    print 'train_data_features'
    print train_data_features
    print train_data_features.shape
    # Take a look at the words in the vocabulary
    vocab = vectorizer.get_feature_names()
    print vocab

    # Sum up the counts of each vocabulary word
    dist = np.sum(train_data_features, axis=0)


Answer 1:

It looks like you can't np.sum what the vectorizer gives you. You will need a different way to do the summation, which you should find in scipy's sparse library, most likely by simply calling

dist = train_data_features.sum(axis=0)

This comes from the documentation on coo_sparse matrix sum. Details below.

From the sklearn documentation:

This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.

From a Google search for this type of error:

This never worked, because numpy knew nothing about scipy.sparse.
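As a quick check, here is a minimal sketch of the suggested fix. A small hand-built scipy.sparse matrix stands in for the CountVectorizer output (the vectorizer returns the same kind of sparse matrix; the sample counts below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for the CountVectorizer output:
# rows are documents, columns are vocabulary words.
train_data_features = csr_matrix(np.array([[1, 0, 2],
                                           [0, 3, 0],
                                           [1, 1, 0]]))

# Sum the counts of each vocabulary word with the sparse matrix's own
# sum method instead of np.sum.
dist = train_data_features.sum(axis=0)

# The result is a 1 x n matrix; flatten it to a plain 1-D array.
dist = np.asarray(dist).ravel()
print(dist)  # column totals, one per vocabulary word
```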

Comments:

- But it worked for me before, on a different dataset.

- Give the suggested code I posted above a try; it should do what you need. From what I've read, I don't think your code should be able to pass the output of CountVectorizer (a scipy.sparse type) to numpy's sum.
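As a side note on why it may have worked on other datasets: for inputs that are not plain ndarrays, np.sum delegates to the object's own sum method and forwards keywords such as dtype and out (visible in the traceback above). A hypothetical object whose sum only accepts axis, as older scipy.sparse matrices did, triggers exactly this kind of TypeError, while a plain dense ndarray sums fine:

```python
import numpy as np

class CountsLikeOldSparse(object):
    """Hypothetical stand-in for an old scipy.sparse matrix whose sum()
    accepts only an axis argument (no dtype/out keywords)."""
    def sum(self, axis=None):
        return 0

obj = CountsLikeOldSparse()

# np.sum forwards extra keywords to obj.sum(), which rejects them.
try:
    np.sum(obj, axis=0)
except TypeError as err:
    print("TypeError:", err)

# A plain ndarray is reduced directly, so the same call succeeds,
# which is why the code ran on earlier (dense) data.
print(np.sum(np.array([[1, 2], [3, 4]]), axis=0))
```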
