How to read nested JSON files into a pandas DataFrame?

【Title】: how to read nested json file in pandas dataframe? 【Posted】: 2019-07-28 19:59:18 【Question】:

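I have learned how to load and read a JSON file into a pandas DataFrame. However, I have multiple JSON files about news, and each one has a fairly complex nested structure that represents the news content and its metadata. I need to read them into a pandas DataFrame for downstream analysis. I worked out how to load and read JSON files in Python, but the solutions I found do not work for my files. Here is a sample JSON data snippet: example json file, and here is what I tried: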

import os, json
import pandas as pd

path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'  # multiple json files
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

with open('json_files[0]') as f:
    data = pd.DataFrame(json.loads(line) for line in f)

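But I did not get the expected pandas DataFrame. How can I read JSON files with a nested structure cleanly into a pandas DataFrame? Could anyone look at the sample JSON data snippet and suggest a way to make this work? Any ideas? Thanks.

JSON data source

I used JSON data from this GitHub repository: FakeNewsNet Dataset, so you can browse what the raw data looks like and build a tidy pandas DataFrame from it. Any ideas on how to do this easily? Thanks again.

Update 2

I tried the following solution, but it did not work for me: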

import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)

ValueError: arrays must all be same length

【Comments】:

Possible duplicate of Pandas read nested json
@MayurBhangale I tried that too, but it did not work. Any better ideas?

【Answer 1】:
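import glob
import json
import os

import pandas as pd

path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'
json_paths = glob.glob(os.path.join(path_to_json, "*.json"))

def load_json(path):
    # Read one JSON file and close the file handle afterwards.
    with open(path) as f:
        return json.load(f)

# pd.json_normalize (pandas >= 1.0) replaces pandas.io.json.json_normalize.
df = pd.concat((pd.json_normalize(load_json(p)) for p in json_paths), axis=0)
df = df.reset_index(drop=True)  # Optionally reset index.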

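This loads all of your JSON files into a single DataFrame. It also flattens the nested JSON hierarchy by joining nested keys with "." to form the column names.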
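
As a small illustration of that flattening, here is a minimal sketch on a hypothetical record. The field values are made up and only the key layout loosely mirrors the article files. It assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize:

```python
import pandas as pd

# Hypothetical record, loosely shaped like one article file.
record = {
    "title": "Example headline",
    "authors": ["Jane Doe"],
    "meta_data": {"og": {"site_name": "Example Site"}, "robots": "noindex"},
}

# Nested dicts become dotted column names; lists stay as single cell values.
flat = pd.json_normalize(record)
print(sorted(flat.columns))
```

Each level of nesting contributes one segment to the column name, so a key that exists in only some files shows up as a column that is NaN for the other files.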

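You may need to do further data cleaning, for example replacing NaN with appropriate values. This can be done with the DataFrame's fillna, or by applying a function to transform individual values.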
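
A minimal sketch of both cleaning options, using a hypothetical two-row frame (df_demo and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the concatenated frame: the second file had
# neither a "summary" nor an "authors" key, so both cells are missing.
df_demo = pd.DataFrame(
    {
        "summary": ["short text", None],
        "authors": [["Jane Doe"], float("nan")],
    }
)

# Scalar-valued column: fillna with a default value.
df_demo["summary"] = df_demo["summary"].fillna("")

# List-valued column: replace anything that is not a list with an empty list.
df_demo["authors"] = df_demo["authors"].apply(
    lambda value: value if isinstance(value, list) else []
)

print(df_demo)
```

As far as I know, fillna rejects a list as the fill value, which is why the list-valued column goes through apply instead.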

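Edit

As I mentioned in the comments, the data is actually messy, so a string such as "View All Posts" can show up as one of the values of "authors". See the JSON file "BuzzFeed_Fake_26-Webpage.json" for an example.

To remove these entries, and possibly others: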

import functools

# This is the set of entries you wish to remove.
# Here we only consider "View All Posts".
invalid_entries = {"View All Posts"}

def fix(x, invalid):
    # Drop invalid entries from list values; leave other values untouched.
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    else:
        # You can optionally choose to return [] here to fix the NaNs
        # and to standardize the types of the values in this column
        return x

fix_author = functools.partial(fix, invalid=invalid_entries)
df["authors"] = df.authors.apply(fix_author)
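
One detail worth checking is that the entries to remove are held in a collection such as a set rather than a bare string, because `in` on a string is a substring test. A minimal sketch on a hypothetical frame (the author name "Posts" is invented purely to show the difference):

```python
import functools

import pandas as pd

def fix(x, invalid):
    # Same logic as above: filter list values, pass other values through.
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    return x

toy = pd.DataFrame({"authors": [["View All Posts", "Leonora Cravotta"], ["Posts"]]})

# With a set, only the exact string "View All Posts" is removed.
as_set = toy["authors"].apply(functools.partial(fix, invalid={"View All Posts"}))

# With a bare string, "Posts" is a substring of it and is removed as well.
as_str = toy["authors"].apply(functools.partial(fix, invalid="View All Posts"))

print(as_set.tolist())
print(as_str.tolist())
```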

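【Comments】:

Fixed the bug. The glob function needs a path.
Can you give an example of a mapping that you think is wrong? Note that different JSON files have different numbers of keys and different nesting, so when they are flattened into the same DataFrame there are bound to be missing entries. Also, entries that correspond to nested dictionaries go into columns whose names are the keys joined by ".".
For example, if you look at a column such as df.authors, it wrongly contains some extra text, and there are wrong columns such as meta_data.DC.date.issued, meta_data.apple-mobile-web-app-capable, and meta_data.article.author, which do not match the original JSON files. Python seems to get confused when mapping the data from JSON to a pandas DataFrame. Is there a workable solution for this?
As I said in the answer and in the comment above, columns such as meta_data.DC.date.issued come from flattening the nested JSON structure. The JSON files cannot be merged into a single DataFrame without introducing these new columns, because they have different nesting levels. A column like meta_data.apple-mobile-web-app-capable means that the original JSON dict had the nesting {"meta_data": {"apple-mobile-web-app-capable": <value>}}.
Do you think we can fix the wrong entries in the df.authors column? I do not understand why the "View All Posts" text gets added. Can we avoid that?

【Answer 2】: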

You need to specify the orientation of your DataFrame. Try the following code, which updates your Update 2 approach:

x = {"top_img": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "text": "On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a \u201cpressure cooker\u201d device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.\n\nGiven that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. "
"However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn\u2019t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the \u201cbad guys.\u201d Unfortunately, our leadership \u2013 who ostensibly wants to protect us \u2013 finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.\n\nNew York City Mayor Bill de Blasio \u2013 who famously ended \u201cstop-and-frisk\u201d profiling in his city \u2013 was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. \u201cThere is no specific and credible threat to New York City from any terror organization,\u201d de Blasio said late Saturday at the news conference. \u201cWe believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and \u2026 agencies are at full alert\u201d, he said. Isn\u2019t \u201can intentional act\u201d terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O\u2019Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York\u2019s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word \u201cterrorism\u201d is defined. "
"\u201cA bomb exploding in New York is obviously an act of terrorism.\u201d Cuomo hit the nail on the head, but why did need to clarify and caveat before making his \u201cobvious\u201d assessment?\n\nThe two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that \u201cwe need to do everything we can to support our first responders \u2013 also to pray for the victims\u201d and that \u201cwe need to let this investigation unfold.\u201d Trump was more direct. \u201cI must tell you that just before I got off the plane a bomb went off in New York and nobody knows what\u2019s going on,\u201d he said. \u201cBut boy we are living in a time\u2014we better get very tough folks. We better get very, very tough. It\u2019s a terrible thing that\u2019s going on in our world, in our country and we are going to get tough and smart and vigilant.\u201d\n\nUnfortunately, an incident like the Chelsea explosion reminds us how vulnerable our country is particularly in venues defined as \u201csoft targets.\u201d Now more than ever, America needs strong leadership which is laser-focused on protecting her citizens from terrorist attacks of all genres and is not afraid of being politically incorrect.\n\nThe views expressed in this opinion article are solely those of their author and are not necessarily either shared or endorsed by EagleRising.com", "authors": ["View All Posts", "Leonora Cravotta"], "keywords": [], "meta_data": {"description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. Isn\u2019t \u201can intentional act\u201d terrorism?", "og": {"site_name": "Eagle Rising", "description": "\u201cWe believe at this point in this time this was an intentional act,\" de Blasio said. "
"Isn\u2019t \u201can intentional act\u201d terrorism?", "title": "Another Terrorist Attack in NYC...Why Are we STILL Being Politically Correct", "locale": "en_US", "image": "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "updated_time": "2016-09-22T10:49:05+00:00", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "type": "article"}, "robots": "noimageindex", "fb": {"app_id": 256195528075351, "pages": 135665053303678}, "article": {"section": "Political Correctness", "tag": "terrorism", "published_time": "2016-09-22T07:10:30+00:00", "modified_time": "2016-09-22T10:49:05+00:00"}, "viewport": "initial-scale=1,maximum-scale=1,user-scalable=no", "googlebot": "noimageindex"}, "canonical_link": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "images": ["http://constitution.com/wp-content/uploads/2017/08/confederatemonument_poll_pop.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46772-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/03/eagle-rising-logo3-1.png", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46729-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46764-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46731-featured-300x130.jpg", "http://pixel.quantserve.com/pixel/p-52ePUfP6_NxQ_.gif", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=112&d=mm&r=g", "http://b.scorecardresearch.com/p?c1=2&c2=22315475&cv=2.0&cj=1", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46784-featured-300x130.png", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/terrorism-2-800x300.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/coup-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2017/04/crtv_300x600_1.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46774-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/2016/09/superstar-375x195.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46763-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46612-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46761-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46642-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46735-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46750-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46755-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46752-featured-300x130.png", "http://eaglerising.com/wp-content/uploads/2016/09/terrorism-2.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46743-featured-300x130.jpg", "http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46712-featured-300x130.jpg", "http://0.gravatar.com/avatar/9b4601287436c60e1c7c5b65d725151f?s=100&d=mm&r=g", 
"http://2lv0hm3wvpix464wwy2zh7d1.wpengine.netdna-cdn.com/wp-content/uploads/wordpress-popular-posts/46757-featured-300x130.png"], "title": "Another Terrorist Attack in NYC\u2026Why Are we STILL Being Politically Correct \u2013 Eagle Rising", "url": "http://eaglerising.com/36942/another-terrorist-attack-in-nyc-why-are-we-still-being-politically-correct/", "summary": "", "movies": [], "publish_date": {"$date": 1474528230000}, "source": "http://eaglerising.com"}

import pandas as pd
df = pd.DataFrame.from_dict(x, orient='index')
print(df)

Reading from a JSON file:

import json
import pandas as pd
with open('FakeNewsContent/BuzzFeed_Fake_1-Webpage.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
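
To show why the orientation matters, here is a minimal sketch on a hypothetical miniature of one article file (the keys follow the real files, the values are invented). It reflects my understanding of recent pandas versions: the default constructor turns each key into a column and therefore needs equal-length lists, while orient='index' turns each key into a row:

```python
import pandas as pd

# Hypothetical miniature of one article file: list values of different lengths.
data = {
    "title": "Example headline",
    "authors": ["View All Posts", "Leonora Cravotta"],
    "images": ["a.jpg", "b.jpg", "c.jpg"],
}

# Default orientation: each key becomes a column, so the list lengths
# (2 and 3) conflict and pandas raises ValueError.
default_failed = False
try:
    pd.DataFrame(data)
except ValueError as exc:
    default_failed = True
    print("default orientation failed:", exc)

# orient='index': each key becomes a row label. Because the first value is a
# plain string, every value is stored whole in a single column named 0.
by_key = pd.DataFrame.from_dict(data, orient="index")

# Transposing gives one row per article with one column per key.
one_row = by_key.T
print(one_row)
```

The transposed form is the one that can be concatenated across files, although nested dicts such as meta_data stay as single cells here instead of being flattened.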

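【Comments】:

I copied the JSON from one of your JSON files into x.
Can we make this code more efficient, for example by merging a batch of JSON files with the same structure into one pandas DataFrame? Is there any way to make your code smarter and more efficient? Any other ideas? Thanks.
@beyond_inifinity I updated my solution and added a new approach.
@beyond_inifinity Merging two DataFrames calls for a separate question. You can check ***.com/questions/12850345/… for related merging solutions.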
