Scipy sparse from json file

Posted 2023-03-12


[Title]: Scipy sparse from json file [Posted]: 2019-04-08 03:33:32 [Question]:

I am trying to create a matrix from a JSON file using scipy.sparse.

My JSON file has entries of this form:

"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"

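That is my JSON format, with many more elements like this one (it is based on an Amazon reviews file).

I want to build a scipy sparse matrix with this layout: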

    count            
object       a   b   c   d
id                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This is what I am trying:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df= pd.read_json('C:\\Users\\anto-\\Desktop\\university\\Big Data computing\\Ex. Resource\\test2.json',lines=True)


a= df['reviewerID']
b= df['asin']
data= df.groupby(["reviewerID"]).size()



row = df.reviewerID.astype('category', categories=a).cat.codes
col = df.asin.astype('category', categories=b).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(a), len(b)))

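I adapted this from an older example:

Efficiently create sparse pivot tables in pandas?

My code raises messages about deprecated elements, and I don't understand how to build this matrix.

This is the error log: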

 FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  from ipykernel import kernelapp as app

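I'm a bit confused. Can anyone give me some advice or a similar example?

[Comments]:

Unless your input is not proper JSON, loading it into pandas as a full matrix somewhat defeats the purpose of making it sparse, doesn't it? I expect you would have better luck with the standard Python json module.

Normally we ask for the actual error and traceback, so that we can see exactly what the problem is and where it occurs. You haven't shown us a real JSON string or file. You don't show the resulting dataframe, nor the resulting data, row, col arrays.

@MadPhysicist I want the sparse representation in order to compute a similarity function. You are right, my JSON was misrepresented; I have corrected it.

@hpaulj I have been experimenting, so I ran the code several times and got different errors at different times. I'm not interested in just fixing this one problem; I want to understand the procedure so that I can apply it to all my files. My data, row and col were created following the example in the link... I didn't think they were needed either.

A FutureWarning is not really an error. It warns you about a potential problem later on, but it does not stop the current code from running. The place to explore is the pandas documentation (its astype, etc.).

But is there something wrong with the sparse_matrix variable?

[Answer 1]: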

To produce a sparse matrix that looks like

    count            
object       a   b   c   d
id                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

you need to generate 3 arrays, for example:

In [215]: from scipy import sparse
In [216]: data = np.array([1,1,1,1,1,1])
In [217]: row = np.array([1,2,0,2,0,1])
In [218]: col = np.array([0,0,1,2,3,3])
In [219]: M = sparse.csr_matrix((data, (row, col)), shape=(3,4))
In [220]: M
Out[220]: 
<3x4 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [221]: M.A
Out[221]: 
array([[0, 1, 0, 1],
       [1, 0, 0, 1],
       [1, 0, 1, 0]], dtype=int64)

Categories such as 'him', 'me', 'you' have to be mapped onto unique indices such as 0, 1, 2. The same goes for 'a', 'b', 'c', 'd'.
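One way to get that label-to-index mapping is pandas.factorize. The following is a minimal sketch rather than part of the original answer; the `ids` and `objects` lists are an assumed long-form listing of the six non-empty cells of the table above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed long-form listing of the table: one (id, object) pair per stored cell.
ids = np.array(["me", "you", "him", "you", "him", "me"])
objects = np.array(["a", "a", "b", "c", "d", "d"])

# factorize(sort=True) returns integer codes plus the sorted unique labels,
# so 'him', 'me', 'you' become 0, 1, 2 and 'a'..'d' become 0..3.
row, row_labels = pd.factorize(ids, sort=True)
col, col_labels = pd.factorize(objects, sort=True)

# Every listed pair is stored as a 1; cells that are not listed stay implicit zeros.
data = np.ones(len(row), dtype=np.int64)

M = sparse.csr_matrix((data, (row, col)),
                      shape=(len(row_labels), len(col_labels)))

print(list(row_labels), list(col_labels))
print(M.toarray())
```

The `row` and `col` codes produced this way match the hand-written arrays in the session above, and `row_labels` / `col_labels` let you translate a matrix position back to its original label.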

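[Comments]:

So I need to convert my JSON fields into arrays, like data, row and col? Thanks for the answer.

Supposedly that is what the dataframe grouping and categorizing does for you. But I'll let someone else explain that process.

OK @hpaulj, I get the structure, but how do I set the rows for my elements, such as "reviewerID" in my JSON, and the columns, such as "asin" in the JSON? Also, I don't have a "data" array: the value is set when the "id" has the "asin", otherwise I have a NaN value, as I wrote in the output structure. Do I need to preprocess these values?

Pandas sparse allows a fill value like nan, but scipy sparse only stores the nonzero values, as my example shows.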
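For the review data in the question, the same idea might look like the sketch below. It is not from the thread: the three-line `sample` is made-up stand-in data, and storing a 1 for every (reviewerID, asin) pair is an assumption about what the matrix should hold. Passing a CategoricalDtype to astype is the replacement that the FutureWarning itself suggests.

```python
import io

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the reviews file: one JSON object per line.
sample = io.StringIO(
    '{"reviewerID": "R1", "asin": "B1", "overall": 5.0}\n'
    '{"reviewerID": "R2", "asin": "B1", "overall": 4.0}\n'
    '{"reviewerID": "R1", "asin": "B2", "overall": 3.0}\n'
)
df = pd.read_json(sample, lines=True)

# Categories must be unique, so build them from the distinct values
# rather than from the full columns.
reviewer_type = CategoricalDtype(categories=sorted(df["reviewerID"].unique()))
item_type = CategoricalDtype(categories=sorted(df["asin"].unique()))

# cat.codes gives one integer index per review, aligned with the rows of df.
row = df["reviewerID"].astype(reviewer_type).cat.codes.to_numpy()
col = df["asin"].astype(item_type).cat.codes.to_numpy()

# Assumption: store a 1 for each review; duplicate pairs would be summed.
data = np.ones(len(df), dtype=np.int64)

matrix = csr_matrix(
    (data, (row, col)),
    shape=(len(reviewer_type.categories), len(item_type.categories)),
)
print(matrix.toarray())
```

Note that `data`, `row` and `col` all have one entry per review, and the shape comes from the number of distinct reviewers and items. No NaN preprocessing is needed, because pairs that never occur are simply not stored. If ratings are wanted instead of presence flags, `df["overall"].to_numpy()` could be used as `data`, keeping in mind that duplicate pairs are summed.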
