Create a sparse matrix with pandas and fill it with values from one column of a .dat file, indexed [x, y] by two other columns of the same file

Question

I have a .dat file with three columns: userID, artistID, and weight. Using Python, I read the data into a pandas DataFrame with data = pd.read_table('train.dat').

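I want to create a sparse matrix (or 2D array) that uses the values in the first two columns of the DataFrame ('userID', 'artistID') as indices and the values in the third column ('weight') as the entries. Index combinations that do not appear in the DataFrame should be NaN.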

I tried creating an empty NumPy array and filling it in a for loop, but that takes a long time (train.dat has about 100,000 rows).

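import csv
import numpy as np

# Read the tab-separated file, skipping the header row
with open("train.dat", "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)
    data = [d for d in reader]

data = np.array(data, dtype=float)
row = int(data[:, 0].max()) + 1  # userID indexes the rows
col = int(data[:, 1].max()) + 1  # artistID indexes the columns

# Start from an all-NaN array so missing (userID, artistID) pairs stay NaN
empty = np.empty((row, col))
empty[:] = np.nan

# Fill one entry per input row (this loop is the slow part)
for d in data:
    empty[int(d[0]), int(d[1])] = d[2]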
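
For comparison, the loop above can be replaced by a single NumPy fancy-indexing assignment. This is only a sketch: the small DataFrame is a stand-in built from three rows of the data sample (the real input would come from pd.read_table('train.dat')), and the names rows, cols, and dense are introduced here for illustration. It assumes the IDs are non-negative integers and that a dense array of size (max userID + 1) x (max artistID + 1) fits in memory.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_table('train.dat'): three rows copied from the data sample
data = pd.DataFrame({
    "userID": [45, 204, 36],
    "artistID": [7, 144, 650],
    "weight": [0.7114779874213837, 0.46399999999999997, 2.4232887490165225],
})

rows = data["userID"].to_numpy()    # row index of each entry
cols = data["artistID"].to_numpy()  # column index of each entry

# All-NaN array large enough to hold the biggest userID and artistID
dense = np.full((rows.max() + 1, cols.max() + 1), np.nan)

# Assign every weight in one vectorized step instead of a Python loop
dense[rows, cols] = data["weight"].to_numpy()
```

If a (userID, artistID) pair occurs more than once, one of its weights ends up in the array; the others are silently overwritten.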

I also tried creating a coo_matrix and converting it to a csr_matrix (so that I can access the data by index), but the indices get reset.

import scipy.sparse as sps
import pandas as pd

data = pd.read_table('train.dat')
matrix = sps.coo_matrix((data.weight, (data.index.labels[0], data.index.labels[1])))
matrix = matrix.tocsr()
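
One possible variation, shown here only as a sketch, passes the userID and artistID columns themselves as the row and column coordinates instead of going through data.index, so that matrix[userID, artistID] addresses the original IDs. The DataFrame below is a stand-in built from three rows of the data sample, and it assumes the IDs are non-negative integers. Note that SciPy sparse matrices treat unstored entries as 0, not NaN, so this does not by itself satisfy the NaN requirement.

```python
import pandas as pd
import scipy.sparse as sps

# Stand-in for pd.read_table('train.dat'): three rows copied from the data sample
data = pd.DataFrame({
    "userID": [45, 204, 36],
    "artistID": [7, 144, 650],
    "weight": [0.7114779874213837, 0.46399999999999997, 2.4232887490165225],
})

# (values, (row_indices, col_indices)); the shape is inferred as max index + 1
matrix = sps.coo_matrix(
    (
        data["weight"].to_numpy(),
        (data["userID"].to_numpy(), data["artistID"].to_numpy()),
    )
).tocsr()

print(matrix[45, 7])  # weight stored for userID 45, artistID 7
print(matrix[0, 0])   # unstored entries read back as 0.0, not NaN
```

When the COO matrix is converted to CSR, duplicate (userID, artistID) pairs are summed rather than overwritten.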

Sample data:

userID    artistID  weight
    45           7      0.7114779874213837
   204         144      0.46399999999999997
    36         650      2.4232887490165225
   140         146      1.0146699266503667
   170          31      1.4124783362218372
   240         468      0.6529992406985573
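
To make the desired result concrete, here is a sketch of a label-indexed table built from the six sample rows above with DataFrame.pivot. It is an illustration, not necessarily the right tool for the full file: the result is dense, it only has rows and columns for IDs that actually occur, and pivot raises an error if a (userID, artistID) pair appears twice. The name wide is introduced here for illustration.

```python
import pandas as pd

# The six rows from the data sample
data = pd.DataFrame({
    "userID": [45, 204, 36, 140, 170, 240],
    "artistID": [7, 144, 650, 146, 31, 468],
    "weight": [
        0.7114779874213837, 0.46399999999999997, 2.4232887490165225,
        1.0146699266503667, 1.4124783362218372, 0.6529992406985573,
    ],
})

# Rows are labeled by userID, columns by artistID; missing pairs become NaN
wide = data.pivot(index="userID", columns="artistID", values="weight")

print(wide.loc[45, 7])    # the weight for userID 45, artistID 7
print(wide.loc[45, 144])  # nan, because this pair is not in the data
```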