Data mining competition: a brain-dead mistake when building the feature matrix

Posted 2021-01-11 smartwhite


scipy.sparse.hstack(blocks, format=None, dtype=None)

Stack sparse matrices horizontally (column wise)

Parameters:
    blocks : sequence of sparse matrices with compatible shapes
    format : str. Sparse format of the result (e.g. "csr"); by default an appropriate sparse matrix format is returned. This choice is subject to change.
    dtype : dtype, optional. The data-type of the output matrix. If not given, the dtype is determined from that of blocks.
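As a quick illustration of the parameters above, here is a minimal sketch with made-up toy matrices (the names left, right and combined are purely illustrative). It stacks two blocks with the same number of rows and leaves dtype unset, so the result type is determined from the blocks.

```python
import numpy as np
from scipy import sparse

# Toy blocks with the same number of rows (hypothetical data).
left = sparse.csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))    # integer block, shape (3, 2)
right = sparse.csr_matrix(np.array([[0.5], [0.0], [1.5]]))      # float64 block, shape (3, 1)

# Columns are appended side by side; dtype is left unset on purpose.
combined = sparse.hstack((left, right), format='csr')

print(combined.shape)   # number of rows is unchanged, columns add up
print(combined.format)  # the format requested above
print(combined.dtype)   # determined from the blocks
```

With an integer block and a float64 block, the stacked matrix comes out as float64, so nothing is truncated.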

The function above is where things went wrong.

///////////////////////////////////////////////////////////////////////////////////////////////////

In the competition I turned the features into a sparse matrix, adapting code from an open-source baseline:

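import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Numeric features go in first, as float64.
base_train_csr = np.float64(train_x[num_feature])
base_predict_csr = np.float64(predict_x[num_feature])

# Low-cardinality categorical features: one-hot encode and append.
enc = OneHotEncoder()
for feature in short_cate_feature:
    enc.fit(data[feature].values.reshape(-1, 1))
    # Problem line: dtype 'bool' is applied to the whole stacked result.
    base_train_csr = sparse.hstack((base_train_csr, enc.transform(train_x[feature].values.reshape(-1, 1))), 'csr', 'bool')
    base_predict_csr = sparse.hstack((base_predict_csr, enc.transform(predict_x[feature].values.reshape(-1, 1))), 'csr', 'bool')
print('one-hot prepared !')

# Long categorical features: token counts via CountVectorizer.
cv = CountVectorizer(min_df=20)
for feature in long_cate_feature:
    cv.fit(data[feature])
    base_train_csr = sparse.hstack((base_train_csr, cv.transform(train_x[feature])), 'csr', 'int')
    base_predict_csr = sparse.hstack((base_predict_csr, cv.transform(predict_x[feature])), 'csr', 'int')
print('cv prepared !')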

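When I fed these features into lgb (LightGBM), the loss dropped shockingly fast. I spent a whole evening on it and could not find the cause.

Today I redid everything from scratch with simple experiments and found it.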

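In the code above, I first convert the numeric features directly with np, then one-hot encode the categorical features that have few distinct values. The problem is right here: sparse.hstack( , 'csr', 'bool')

I stacked a float64 matrix directly onto a bool matrix and then had the whole result converted to bool. Brain-dead: every numeric feature in front became useless...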
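The effect is easy to reproduce on toy data. The sketch below is not the competition code: the matrices numeric and onehot are made up, and the dense block is converted to CSR explicitly to keep the example simple. It compares the dtype='bool' call with the same call that leaves dtype unset.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: two numeric columns and a two-column one-hot block.
numeric = sparse.csr_matrix(np.array([[0.5, 120.0],
                                      [3.2, 0.0],
                                      [7.0, 45.5]]))
onehot = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 1],
                                     [1, 0]], dtype=bool))

# Same call shape as in the post: the dtype applies to the WHOLE result,
# so every non-zero numeric value is reduced to True.
buggy = sparse.hstack((numeric, onehot), format='csr', dtype='bool')

# Leaving dtype unset lets hstack determine it from the blocks.
safe = sparse.hstack((numeric, onehot), format='csr')

print(buggy.dtype)
print(buggy.toarray().astype(int))  # numeric columns are now only 0/1
print(safe.dtype)
print(safe.toarray())               # numeric values are intact
```

In the buggy result 0.5, 120.0 and 45.5 are indistinguishable, which is exactly the information the model needed.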

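Takeaway: when using hstack this way, add blocks from coarse-grained to fine-grained types, e.g. bool -> int32 -> float32 -> float64, so that the dtype passed at each step is at least as wide as everything already stacked. Otherwise the fine-grained features get squeezed into the narrower type and most of their information is lost. Leaving dtype unset also avoids the problem, since hstack then determines the dtype from the blocks.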
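A minimal sketch of the coarse-to-fine ordering, again on made-up toy blocks (onehot, counts and numeric are illustrative names, not the competition data). Note that it assumes the column order may change, since the numeric block now goes last. The final check compares the stacked result against a plain dense concatenation.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy blocks, listed from the narrowest dtype to the widest.
onehot = sparse.csr_matrix(np.array([[1, 0], [0, 1]], dtype=bool))
counts = sparse.csr_matrix(np.array([[2, 0], [0, 5]], dtype=np.int32))
numeric = sparse.csr_matrix(np.array([[0.5, 120.0], [3.2, 7.0]]))

# Each step passes a dtype at least as wide as what is already stacked.
stacked = onehot
for block, dtype in ((counts, np.int32), (numeric, np.float64)):
    stacked = sparse.hstack((stacked, block), format='csr', dtype=dtype)

# Sanity check: the sparse result should match a dense concatenation.
expected = np.hstack([onehot.toarray(), counts.toarray(), numeric.toarray()])
lossless = np.array_equal(stacked.toarray(), expected)

print(stacked.dtype, lossless)
```

A check like the last two lines, run on a small slice of the real data, would have caught the original mistake in seconds.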
