Is there any way to expand a column in a pandas DataFrame containing lists and fetch the column names from the list values themselves?

Posted 2023-03-12

Asked: 2017-07-07 17:24:09

Question:

I have converted a nested JSON file into a pandas DataFrame. Some of the columns now contain lists. They look like this:

0         [BikeParking: True, BusinessAcceptsBitcoin: Fa...
1         [BusinessAcceptsBitcoin: False, BusinessAccept...
2         [Alcohol: none, Ambience: 'romantic': False, ...
3         [AcceptsInsurance: False, BusinessAcceptsCredi...
4         [BusinessAcceptsCreditCards: True, Restaurants...
5         [BusinessAcceptsCreditCards: True, ByAppointme...
6         [BikeParking: True, BusinessAcceptsCreditCards...
7         [Alcohol: none, Ambience: 'romantic': False, ...
8                        [BusinessAcceptsCreditCards: True]
9         [BikeParking: True, BusinessAcceptsCreditCards...
10                                                     None
.
.
.
144070    [Alcohol: none, Ambience: 'romantic': False, ...
144071    [BikeParking: True, BusinessAcceptsCreditCards...
Name: attributes, dtype: object

And this:

0         [Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...
1         [Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...
2         [Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...
3         [Tuesday 10:0-21:0, Wednesday 10:0-21:0, Thurs...
4                                                      None

144066                                                 None
144067    [Tuesday 8:0-16:0, Wednesday 8:0-16:0, Thursda...
144068    [Tuesday 10:0-17:30, Wednesday 10:0-17:30, Thu...
144069                                                 None
144070    [Monday 11:0-20:0, Tuesday 11:0-20:0, Wednesda...
144071    [Monday 10:0-21:0, Tuesday 10:0-21:0, Wednesda...
Name: hours, dtype: object

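Is there any way to automatically extract the labels (BikeParking, AcceptsInsurance, etc.) and use them as column names, filling the cells with the True/False values? For the Ambience dict, I would like something like an Ambience_romantic column with True/False in the cells. Similarly, I want to extract the days of the week as column names and use the hours to fill the cells.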

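Or is there a way to flatten the JSON data beforehand? I tried passing the JSON data line by line to json_normalize and building a DataFrame from the output, but it produces the same result. Maybe I am doing something wrong?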

Format of the original JSON (yelp_academic_dataset_business.json):

{
    "business_id":"encrypted business id",
    "name":"business name",
    "neighborhood":"hood name",
    "address":"full address",
    "city":"city",
    "state":"state -- if applicable --",
    "postal code":"postal code",
    "latitude":latitude,
    "longitude":longitude,
    "stars":star rating, rounded to half-stars,
    "review_count":number of reviews,
    "is_open":0/1 (closed/open),
    "attributes":["an array of strings: each array element is an attribute"],
    "categories":["an array of strings of business categories"],
    "hours":["an array of strings of business hours"],
    "type": "business"
}

My initial attempt with json_normalize:

import json

import pandas as pd

with open('yelp_academic_dataset_business.json') as f:
    # Normalize the JSON data to flatten it and store the output in a DataFrame
    # (pd.json_normalize needs pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize)
    frame = pd.json_normalize([json.loads(line) for line in f])

# Write the DataFrame to a CSV file
frame.to_csv('yelp_academic_dataset_business.csv', encoding='utf-8', index=False)

What I am currently trying:

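from io import StringIO

import pandas as pd

with open(json_filename) as f:
    # Remove the trailing "\n" from each line
    data = [line.rstrip() for line in f]

# Join the one-object-per-line records into a single JSON array
data_json_str = "[" + ",".join(data) + "]"

# Wrap the string in StringIO; newer pandas deprecates passing literal JSON text
df = pd.read_json(StringIO(data_json_str))
# Now looking to expand df['attributes'] and others here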

I should also mention that my goal is to convert this to a .csv so that I can load it into a database. I do not want lists in my database columns.

You can get the original JSON data from the Yelp Dataset Challenge site: https://www.yelp.ca/dataset_challenge/dataset

Comments:

Can we see the original JSON and your attempt?

Added the JSON format, the data link, and my attempts.

Answer 1:

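You are trying to convert "documents" (semi-structured data) into a table. This can be problematic if, for example, one record contains 100 attributes that the other records do not have: you probably do not want to add 100 columns to the main table and leave those cells empty for every other record.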
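To make the trade-off concrete, here is a minimal sketch of the wide expansion the question asks about. It assumes every list element has the form `Name: value`; the helper `parse_pairs` and the three sample rows are hypothetical and only mimic the shape of the `attributes` column shown in the question.

```python
import pandas as pd

def parse_pairs(items):
    """Turn ['Name: value', ...] into {'Name': 'value', ...}; None becomes {}."""
    if not items:
        return {}
    # Split on the first ": " only, so values that themselves contain colons stay intact.
    # Assumption: every element contains ": "; anything else raises ValueError.
    return dict(item.split(": ", 1) for item in items)

# Hypothetical sample rows shaped like the `attributes` column in the question
df = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "attributes": [
        ["BikeParking: True", "BusinessAcceptsBitcoin: False"],
        ["AcceptsInsurance: False"],
        None,
    ],
})

# One dict per row becomes one column per distinct attribute name
wide = pd.DataFrame([parse_pairs(x) for x in df["attributes"]], index=df.index)
result = df.drop(columns="attributes").join(wide)

# Every attribute a record lacks shows up as an empty (NaN) cell
print(result)
print("empty cells:", int(wide.isna().sum().sum()))
```

The values stay as text ("True", "False") in this sketch. Even with three records and three attribute names, most of the new cells are empty, which is the sparsity problem described above.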

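But at the end you explain that you intend to do this:

Load JSON. Convert to pandas. Export CSV. Import into database.

I am here to tell you that this makes no sense at all. Shuffling the data through all these intermediate formats will only cause problems.

Instead, let's go back to basics:

Load JSON. Write to database.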

Now the first step is to come up with a schema. Or, if you are using a NoSQL database, you can load the JSON directly without any further steps.
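As one possible illustration of what a schema could look like, the sketch below splits each record into rows for three related tables: a main business table, an attributes table (one row per attribute), and an hours table (one row per day). The table layout, the chosen columns, and the function names `split_record` and `convert` are assumptions made for this example, not something prescribed by the dataset or by this answer. It also assumes hours strings look like `Monday 11:0-21:0`, as in the question.

```python
import csv
import io
import json

def split_record(record):
    """Split one business record into rows for three hypothetical tables."""
    bid = record["business_id"]
    business = (bid, record.get("name"), record.get("city"), record.get("stars"))
    # One row per attribute string: (business_id, name, value)
    attributes = [(bid, *item.split(": ", 1)) for item in record.get("attributes") or []]
    # One row per opening-hours string: (business_id, day, opens, closes)
    hours = []
    for item in record.get("hours") or []:
        day, span = item.split(" ", 1)
        opens, closes = span.split("-", 1)
        hours.append((bid, day, opens, closes))
    return business, attributes, hours

def convert(lines, business_out, attributes_out, hours_out):
    """Stream JSON lines into three CSV outputs, one per table."""
    business_csv = csv.writer(business_out)
    attributes_csv = csv.writer(attributes_out)
    hours_csv = csv.writer(hours_out)
    for line in lines:
        business, attributes, hours = split_record(json.loads(line))
        business_csv.writerow(business)
        attributes_csv.writerows(attributes)
        hours_csv.writerows(hours)

# Hypothetical one-line sample; real input would be the open dataset file
sample = ['{"business_id": "b1", "name": "Cafe", "city": "Toronto", "stars": 4.5, '
          '"attributes": ["BikeParking: True"], "hours": ["Monday 11:0-21:0"]}']
b_out, a_out, h_out = io.StringIO(), io.StringIO(), io.StringIO()
convert(sample, b_out, a_out, h_out)
print(a_out.getvalue())
print(h_out.getvalue())
```

With this layout a record that has 100 unusual attributes simply contributes 100 rows to the attributes table, and no other record gains empty columns. In real use the three outputs would be files opened with `newline=""` rather than `StringIO` buffers.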

Comments:

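I have to do this with MySQL. As part of this assignment I have to write a script that converts the JSON into any format that can be loaded into a MySQL database. I have to write a script. I know it makes no sense, but I guess that is part of the exercise.

@Koryx: OK, would you load it with something like this? dev.mysql.com/doc/refman/5.7/en/load-data.html If so, that's great... you can simply write a script that converts the input JSON to CSV. No need for pandas. Consider writing to multiple related tables.

Yes, that is how I would load it. But how do I flatten the JSON data with a simple script, without using pandas? And won't the lists (and the dicts inside them) in the CSV columns cause problems when loading into MySQL?

@Koryx: You first need to define your database schema. Don't worry about how to write the script until you have settled on the schema.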
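On the question of flattening without pandas, a rough sketch follows. It assumes each attribute string has the form `Name: value` and that nested values are written as Python-style dict literals, for example `Ambience: {'romantic': False, 'casual': True}`; the excerpt in the question does not show the exact text, so treat that format as an assumption. The function name `flatten_attribute` is hypothetical.

```python
import ast

def flatten_attribute(item):
    """Parse one 'Name: value' string into a list of flat (column, value) pairs."""
    name, raw = item.split(": ", 1)
    try:
        # Handles True, False, numbers, and dict literals
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Plain words such as "none" are kept as text
        value = raw
    if isinstance(value, dict):
        # Nested dicts become Name_key columns, e.g. Ambience_romantic
        return [(f"{name}_{key}", val) for key, val in value.items()]
    return [(name, value)]

print(flatten_attribute("Ambience: {'romantic': False, 'casual': True}"))
print(flatten_attribute("Alcohol: none"))
print(flatten_attribute("BikeParking: True"))
```

Because every pair that comes out is a scalar, no list or dict ever reaches a CSV cell, whichever table layout is chosen.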
