Data preprocessing with json data using python (Jupyter notebook)

【Title】: Data preprocessing with json data using python (Jupyter notebook) 【Posted】: 2021-04-26 22:58:52 【Question】:

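I am trying to apply some preprocessing steps to a JSON dataset. This is easy with a .csv file, but I can't figure out how to use preprocessing functions such as isnull(), fillna(), dropna() and the imputer class with JSON data.

Below are the commands I have run so far. I could not get to the operations above because I don't know how to work with a dataset stored in a JSON file.

Dataset link: https://drive.google.com/file/d/1puNNrRaV-Jt_kt709fuYGCvDW9-EuwoB/view?usp=sharing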

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

dataset = pd.read_json('moviereviews.json', orient='columns')
print(dataset)

movies = pd.read_json( ( dataset).to_json(), orient='index')
print(movies)
print(type(movies))

movie = pd.read_json( ( dataset['12 Strong']).to_json(), orient='index')
print(movie)

movie_name = [
    "12 Strong",
    "A Ciambra",
    "All The Money In The World",
    "Along With The Gods: The Two Worlds",
    "Bilal: A New Breed Of Hero",
    "Call Me By Your Name",
    "Condorito: La Película",
    "Darkest Hour",
    "Den Of Thieves",
    "Downsizing",
    "Father Figures",
    "Film Stars Don'T Die In Liverpool",
    "Forever My Girl",
    "Happy End",
    "Hostiles",
    "I, Tonya",
    "In The Fade (Aus Dem Nichts)",
    "Insidious: The Last Key",
    "Jumanji: Welcome To The Jungle",
    "Mary And The Witch'S Flower",
    "Maze Runner: The Death Cure",
    "Molly'S Game",
    "Paddington 2",
    "Padmaavat",
    "Phantom Thread",
    "Pitch Perfect 3",
    "Proud Mary",
    "Star Wars: Episode Viii - The Last Jedi",
    "Star Wars: The Last Jedi",
    "The Cage Fighter",
    "The Commuter",
    "The Final Year",
    "The Greatest Showman",
    "The Insult (L'Insulte)",
    "The Post",
    "The Shape Of Water",
    "Una Mujer Fantástica",
    "Winchester"
]
print(movie_name)

data = []
for moviename in movie_name:
    movie = pd.read_json( ( dataset[moviename]).to_json(), orient='index')
    data.append(movie)
   
print(data)

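【Comments】:

I think you mean pandas. If you have already imported the JSON dataset into a pandas DataFrame, these functions can be called on your dataset as described in the documentation.

【Answer 1】:

You can split the dictionary into its keys and values, read them separately, and fill every NaN with the string 'None' in one step.

If your JSON is loaded into a variable called data, then: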

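# Materialize the dict views as lists so the DataFrame constructor
# and the column assignment receive plain sequences.
df = pd.DataFrame(list(data[0].values())).fillna('None')
df['Movie Name'] = list(data[0].keys())
df.set_index('Movie Name', inplace=True)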

df.head()

                                         Genre       Gross IMDB Metascore Popcorn Score   Rating Tomato Score popcornscore rating tomatoscore
Movie Name
12 Strong                               Action  $1,465,000             54            72        R           54         None   None        None
A Ciambra                                Drama     unknown             70       unknown  unrated       unkown         None   None        None
All The Money In The World                None        None           None          None     None         None         72.0      R        76.0
Along With The Gods: The Two Worlds       None        None           None          None     None         None         90.0     NR        50.0
Bilal: A New Breed Of Hero           Animation     unknown             52       unknown  unrated       unkown         None   None        None
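
Once the records are in a DataFrame like the one above, the functions from the question can be applied directly. The following is a minimal sketch, not the answer author's code: it assumes a small made-up sample in place of moviereviews.json, and it assumes that the string "unknown" should count as missing. The variable names (missing_per_column, with_genre, cleaned) are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample that mimics the nested layout: a list whose first
# element maps each movie name to a dict of attributes.
data = [{
    "12 Strong": {"Genre": "Action", "Gross": "$1,465,000", "Tomato Score": 54},
    "A Ciambra": {"Genre": "Drama", "Gross": "unknown", "Tomato Score": "unknown"},
    "All The Money In The World": {"tomatoscore": 76.0},
}]

# orient="index" makes every movie name a row label.
df = pd.DataFrame.from_dict(data[0], orient="index")

# isnull(): count the missing values in each column.
missing_per_column = df.isnull().sum()

# dropna(): keep only the rows that have a Genre.
with_genre = df.dropna(subset=["Genre"])

# Assumption: the placeholder string "unknown" means "missing".
cleaned = df.replace("unknown", np.nan)

# Strip "$" and "," so that Gross can be parsed as a number.
cleaned["Gross"] = pd.to_numeric(
    cleaned["Gross"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# fillna(): impute the missing Gross values with the median of the known ones.
cleaned["Gross"] = cleaned["Gross"].fillna(cleaned["Gross"].median())

print(missing_per_column)
print(cleaned)
```

Converting the placeholders to NaN first matters because isnull(), dropna() and fillna() only recognize real missing values. After that conversion, scikit-learn's SimpleImputer can be used on the numeric columns in the same way as with data read from a .csv file.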

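【Comments】:

【Answer 2】:

One challenge with this dataset is that it uses different key names for the same data, for example 'Tomato Score' and 'tomatoscore'. The solution below is not the best one and could be optimized a lot, but I wrote it this way so that you can easily see each step taken to make the data consistent: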

import json
import pandas as pd

with open('moviereviews.json', "r") as read_file:
    dataset = json.load(read_file)

data = []

for index in range(len(dataset)):
    for key in dataset[index]:
        movie_name = key
        
        if 'Genre' in dataset[index][key]:
            genre = dataset[index][key]['Genre']
        else:
            genre = None
            
        if 'Gross' in dataset[index][key]:
            gross = dataset[index][key]['Gross']
        else:
            gross = None
            
        if 'IMDB Metascore' in  dataset[index][key]:
            imdb = dataset[index][key]['IMDB Metascore']            
        else:
            imdb = None
            
        if 'Popcorn Score' in dataset[index][key]:
            popcorn = dataset[index][key]['Popcorn Score']            
        elif 'popcornscore' in  dataset[index][key]:
            popcorn = dataset[index][key]['popcornscore']
        else:
            popcorn = None                                              
                                                      
        if 'Rating' in dataset[index][key]:
            rating = dataset[index][key]['Rating']                                     
        elif 'rating' in dataset[index][key]:
            rating = dataset[index][key]['rating']
        else:
            rating = None
            
        if 'Tomato Score' in dataset[index][key]:                                         
            tomato = dataset[index][key]['Tomato Score']                                       
        elif 'tomatoscore' in dataset[index][key]:
            tomato = dataset[index][key]['tomatoscore']                                              
        else:
            tomato = None
                
        # Collect one row per movie, using a single consistent set of keys.
        data.append({'Movie Name': movie_name,
                     'Genre': genre,
                     'Gross': gross,
                     'IMDB Metascore': imdb,
                     'Popcorn Score': popcorn,
                     'Rating': rating,
                     'Tomato Score': tomato})
    
df = pd.DataFrame(data)

df
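
The repeated if/elif branches above could be condensed with a lookup table of alternative key spellings. This is only a sketch under the same assumption about the file layout (a list of dicts that map movie names to attribute dicts). ALIASES and normalize_record are hypothetical names introduced for illustration, and the sample records are made up.

```python
import pandas as pd

# Hypothetical lookup: canonical column name -> key spellings seen in the data.
ALIASES = {
    "Genre": ["Genre"],
    "Gross": ["Gross"],
    "IMDB Metascore": ["IMDB Metascore"],
    "Popcorn Score": ["Popcorn Score", "popcornscore"],
    "Rating": ["Rating", "rating"],
    "Tomato Score": ["Tomato Score", "tomatoscore"],
}


def normalize_record(movie_name, attrs):
    """Return one row with canonical keys; absent fields become None."""
    row = {"Movie Name": movie_name}
    for canonical, spellings in ALIASES.items():
        # Take the value of the first spelling that is present, else None.
        row[canonical] = next((attrs[k] for k in spellings if k in attrs), None)
    return row


# Made-up sample standing in for the result of json.load().
dataset = [
    {"12 Strong": {"Genre": "Action", "Rating": "R", "Tomato Score": 54}},
    {"All The Money In The World": {"rating": "R", "tomatoscore": 76.0,
                                    "popcornscore": 72.0}},
]

rows = [
    normalize_record(name, attrs)
    for block in dataset
    for name, attrs in block.items()
]
df = pd.DataFrame(rows)
print(df)
```

Keeping the spellings in one dictionary means a newly discovered variant only needs one extra list entry instead of another elif branch.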
        

【Comments】:
