Extract data in a column from a csv, saved as a dictionary (Python, Pandas)

Posted: 2022-01-21 01:07:31

Question:

I am studying artificial intelligence and machine learning and I have run into a difficulty. My CSV dataset has two important columns whose values are themselves dictionaries. For example, one of them is category, which holds information in every row such as "id":252,"name":"Graphic Novels"..., and I would like to break that data out so it appears in separate columns such as cat_id, cat_name..., so that I can apply filters later.

I suspect there is an option for this in Python and Pandas, but I cannot see it right now. Any guidance is much appreciated.

Edit: I took the first ten rows in Excel, copied them into a new document, opened the new csv in Notepad, and copied and pasted the first ten lines from Notepad here; the document can also be found in my gdrive:

backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_exchange_rate,usd_pledged,usd_type
"260,""A new full-color graphic novel featuring attorneys Alanna Wolff and Jeff Byrd. This time a Grandfather ghost needs their help!"",""""""id"""":252,""""name"""":""""Graphic Novels"""",""""analytics_name"""":""""Graphic Novels"""",""""slug"""":""""comics/graphic novels"""",""""position"""":4,""""parent_id"""":3,""""parent_name"""":""""Comics"""",""""color"""":16776056,""""urls"""":""""web"""":""""discover"""":""""http://www.kickstarter.com/discover/categories/comics/graphic%20novels"""""",""14080"",""US"",""the United States"",""1519427735"",""""""id"""":1379875462,""""name"""":""""Batton Lash"""",""""is_registered"""":null,""""is_email_verified"""":null,""""chosen_currency"""":null,""""is_superbacker"""":null,""""avatar"""":""""thumb"""":""""https://ksr-ugc.imgix.net/assets/006/347/706/b3908a1a23f6b9e472edcf7c934e5b0e_original.jpg?ixlib=rb-4.0.2&w=40&h=40&fit=crop&v=1461382354&auto=format&frame=1&q=92&s=4d88bd2ed1e7098fcaf046321cc4be15"""",""""small"""":""""https://ksr-ugc.imgix.net/assets/006/347/706/b3908a1a23f6b9e472edcf7c934e5b0e_original.jpg?ixlib=rb-4.0.2&w=80&h=80&fit=crop&v=1461382354&auto=format&frame=1&q=92&s=664f586cef17d83dc408a6a10b0f3c4a"""",""""medium"""":""""https://ksr-ugc.imgix.net/assets/006/347/706/b3908a1a23f6b9e472edcf7c934e5b0e_original.jpg?ixlib=rb-4.0.2&w=160&h=160&fit=crop&v=1461382354&auto=format&frame=1&q=92&s=fe307263e32a2385e764e3923a13179e"""",""""urls"""":""""web"""":""""user"""":""""https://www.kickstarter.com/profile/1379875462"""",""""api"""":""""user"""":""""https://api.kickstarter.com/v1/users/1379875462?signature=1631849432.d50b79030e15111575554ecae171babad1f2925d"""""",""USD"",""$"",""true"",""USD"",""1522083186"",""false"",,""1"",""11000"",""640715060"",,""false"",,""1519494786"",""""""id"""":2487889,""""name"""":""""San Diego"""",""""slug"""":""""san-diego-ca"""",""""short_name"""":""""San Diego, CA"""",""""displayable_name"""":""""San Diego, CA"""",""""localized_name"""":""""San Diego"""",""""country"""":""""US"""",""""state"""":""""CA"""",""""type"""":""""Town"""",""""is_root"""":false,""""expanded_country"""":""""United States"""",""""urls"""":""""web"""":""""discover"""":""""https://www.kickstarter.com/discover/places/san-diego-ca"""",""""location"""":""""https://www.kickstarter.com/locations/san-diego-ca"""",""""api"""":""""nearby_projects"""":""""https://api.kickstarter.com/v1/discover?signature=1631819277.2ad6e6fd997457447d2b61e2381f4332bfad5f21&woe_id=2487889"""""",""Supernatural Law: Grandfathered 
In"",,""""""key"""":""""assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg"""",""""full"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=560&h=315&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=76bac41cb2f43b97e6b55fdc34b1f7ca"""",""""ed"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=352&h=198&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=96a344a45a9ef2042f27097c1dc7d04a"""",""""med"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=272&h=153&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=0b20f77c82158c3a35c8f140cbe07591"""",""""little"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=208&h=117&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=9217d3ad72da53831b3e6e2851cf2544"""",""""small"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=160&h=90&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=3f44497a673dab166c74a3e4954f2ca2"""",""""thumb"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=48&h=27&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=07d7f99494b563195c40686b7131ef43"""",""""1024x576"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=1024&h=576&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=ae396bd768773a2e57bb8ac0563f664f"""",""""1536x864"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=1552&h=873&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=1736aa59d62bb684684213dc5dd70241"""""",""14080"",""""""id"""":3318379,""""project_id"""":3318379,""""state"""":""""inactive"""",""""state_changed_at"""":1519427735,""""name"""":null,""""blurb"""":null,""""background_color"""":null,""""text_color"""":null,""""link_background_color"""":null,""""link_text_color"""":null,""""link_text"""":null,""""link_url"""":null,""""show_feature_image"""":false,""""background_image_opacity"""":0.8,""""should_show_feature_image_section"""":true,""""feature_image_attributes"""":""""image_urls"""":""""default"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=1552&h=873&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=1736aa59d62bb684684213dc5dd70241"""",""""baseball_card"""":""""https://ksr-ugc.imgix.net/assets/020/340/272/c66a24c64ead18a0d47fcc8a4367a127_original.jpg?ixlib=rb-4.0.2&crop=faces&w=560&h=315&fit=crop&v=1519428123&auto=format&frame=1&q=92&s=76bac41cb2f43b97e6b55fdc34b1f7ca"""""",""supernatural-law-grandfathered-in"",""https://www.kickstarter.com/discover/categories/comics/graphic%20novels"",""true"",""false"",""successful"",""1522083188"",""1"",""""""web"""":""""project"""":""""https://www.kickstarter.com/projects/1379875462/supernatural-law-grandfathered-in?ref=discovery_category_newest"""",""""rewards"""":""""https://www.kickstarter.com/projects/1379875462/supernatural-law-grandfathered-in/rewards"""""",""1"",""14080.0"",""international"""
"13,""My name is Scott Easley Jr or just Skoddii. Im an illustrator and writer for my graphic novel/comic series, ZOMBIE WORKSHOP"",""""""id"""":252,""""name"""":""""Graphic Novels"""",""""analytics_name"""":""""Graphic Novels"""",""""slug"""":""""comics/graphic novels"""",""""position"""":4,""""parent_id"""":3,""""parent_name"""":""""Comics"""",""""color"""":16776056,""""urls"""":""""web"""":""""discover"""":""""http://www.kickstarter.com/discover/categories/comics/graphic%20novels"""""",""306"",""US"",""the United States"",""1519337502"",""""""id"""":408247096,""""name"""":""""Scott(skoddii)"""",""""is_registered"""":null,""""is_email_verified"""":null,""""chosen_currency"""":null,""""is_superbacker"""":null,""""avatar"""":""""thumb"""":""""https://ksr-ugc.imgix.net/assets/020/330/517/383423c1c19dfbd99534c6185eb09a6f_original.png?ixlib=rb-4.0.2&w=40&h=40&fit=crop&v=1519354368&auto=format&frame=1&q=92&s=74f83e0070b20db01d5180ba214d1b5e"""",""""small"""":""""https://ksr-ugc.imgix.net/assets/020/330/517/383423c1c19dfbd99534c6185eb09a6f_original.png?ixlib=rb-4.0.2&w=80&h=80&fit=crop&v=1519354368&auto=format&frame=1&q=92&s=671b9100176dbfa63752a7a8e9cc63d0"""",""""medium"""":""""https://ksr-ugc.imgix.net/assets/020/330/517/383423c1c19dfbd99534c6185eb09a6f_original.png?ixlib=rb-4.0.2&w=160&h=160&fit=crop&v=1519354368&auto=format&frame=1&q=92&s=956c6f85ffbc3fb179c260611254a2be"""",""""urls"""":""""web"""":""""user"""":""""https://www.kickstarter.com/profile/408247096"""",""""api"""":""""user"""":""""https://api.kickstarter.com/v1/users/408247096?signature=1631849432.6cc0456d4795aea0b32f861b050212afef4387ce"""""",""USD"",""$"",""true"",""USD"",""1521946605"",""false"",,""1"",""200"",""1775748628"",,""false"",,""1519358205"",""""""id"""":2383552,""""name"""":""""Columbia"""",""""slug"""":""""columbia-sc"""",""""short_name"""":""""Columbia, SC"""",""""displayable_name"""":""""Columbia, SC"""",""""localized_name"""":""""Columbia"""",""""country"""":""""US"""",""""state"""":""""SC"""",""""type"""":""""Town"""",""""is_root"""":false,""""expanded_country"""":""""United States"""",""""urls"""":""""web"""":""""discover"""":""""https://www.kickstarter.com/discover/places/columbia-sc"""",""""location"""":""""https://www.kickstarter.com/locations/columbia-sc"""",""""api"""":""""nearby_projects"""":""""https://api.kickstarter.com/v1/discover?signature=1631819963.63e726eafa55ada9bb1408342a0002cf17e0998e&woe_id=2383552"""""",""zombie workshop graphic novel"",,""""""key"""":""""assets/020/330/390/59e26fe754103c70cc452239f7d1ce0c_original.png"""",""""full"""":""""https://ksr-ugc.imgix.net/assets/020/330/390/59e26fe754103c70cc452239f7d1ce0c_original.png?ixlib=rb-4.0.2&crop=faces&w=560&h=315&fit=crop&v=1519353020&auto=format&frame=1&q=92&s=b1d2849eb8ef16243a5193fd1b1ee640"""",""""ed"""":""""https://ksr-ugc.imgix.net/assets/020/330/390/59e26fe754103c70cc452239f7d1ce0c_original.png?ixlib=rb-4.0.2&crop=faces&w=352&h=198&fit=crop&v=1519353020&auto=format&frame=1&q=92&s=e4f6989e77fa03f59bda07d68d35cf54"""",""""med"""":"""

Comments:

Does this answer your question? Split / Explode a column of dictionaries into separate columns with pandas

I tried using Series, but I can't: prnt.sc/23kz4wp

Answer 1:

Consider using the DataFrame() constructor together with ast.literal_eval on every row of the category column to build an adjacent data frame. Concatenate all of the resulting data frames and join the result back to the original data frame.

import ast
import pandas as pd

kickstarter_df = pd.read_csv("Kickstarter004.csv")

# Parse each category string into a dict, expand it into its own small data frame,
# then stack the per-row frames and prefix the new columns with "cat_"
category_df = (
    pd.concat(
        [pd.DataFrame(ast.literal_eval(d))
         for d in kickstarter_df['category']]
    )
    .add_prefix("cat_")
    .reset_index(drop=True)
)

# Drop the original category column and join the expanded cat_* columns back on
final_df = (
    kickstarter_df
      .drop(["category"], axis="columns")
      .join(category_df)
)
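
Once the category fields are split out, the filters the question asks about can run directly on the new columns. A minimal sketch of such a filter (assuming the join above produced cat_id and cat_name columns; the "Graphic Novels" condition is only a hypothetical example):

# Hypothetical follow-up filter: keep only the projects in the "Graphic Novels" category
graphic_novels = final_df[final_df["cat_name"] == "Graphic Novels"]
print(graphic_novels[["cat_id", "cat_name", "name", "goal"]].head())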

To expand all of the dict columns, add the methods defined below. Because of the nature of the data, quite a few special cases and string replacements are needed.

def str_to_dict(d, col):
    # NaN compares unequal to itself, so map missing values to the string 'None'
    if d != d:
        clean_d = 'None'
    elif col == "category" or type(d) in [int, float]:
        clean_d = d
    else:
        # Patch the raw string so ast.literal_eval can parse it as a Python literal
        clean_d = (
         d.replace('null', '""')
          .replace('"name":"', '"name":"""')
          .replace('","is_registered":"', '""","is_registered":"')
          .replace('","slug":"', ' ""","slug":"""')
          .replace('","short_name":"', ' ""","short_name":"')
          .replace('","blurb":"', ' ""","blurb":"""')
          .replace('","background_color":"', ' ""","background_color":"')
          .replace("true", "True")
          .replace("false", "False")
          .replace("nan", "None")
       )
    
    dict_d = ast.literal_eval(clean_d)
    
    return dict_d

def dict_to_frame(col):
    print(col)
    
    df = (
        pd.concat(
            [(pd.DataFrame(str_to_dict(d, col), index=[0]) 
              if col == "photo"
              else pd.DataFrame(str_to_dict(d, col))
             ) for d in kickstarter_df[col]]
        )
        .add_prefix(f"col_")
        .reset_index(drop=True)
    )
    
    return df

Then run a list comprehension over the dict columns and merge the results horizontally before joining to the final data frame:

dict_cols = ["category", "creator", "location", "photo", "profile", "urls"]

dict_df = (
    pd.concat(
        [dict_to_frame(col) for col in dict_cols],
        axis = "rows"
    ).reset_index(drop=True)
)

final_df = (
    kickstarter_df
      .drop(dict_cols, axis="columns")
      .join(dict_df)
)

Comments:

Awesome! Thank you so much! It works great, but I tried to do the same thing with another column and it did not work; it gave me an error even though I only changed "category" to "creator". Can you tell me what went wrong? Thanks again. prnt.sc/23lr1dl

OK, I found out I have to remove every "null" value from the dataset, thanks.

I updated the solution to loop through all of the dict columns. It takes quite a bit of cleanup!

Wow, thank you so much for taking the time. You have some serious data engineering ninja skills. I am trying to understand everything you did there.

Answer 2:

You can parse the column on its own using the dictionary values. For example:

import pandas as pd 

data = [['one', {"id":1,"name":"Graphic Novels"}], 
        ['two', {"id":2,"name":"Algorithms"}], 
        ['three', {"id":3,"name":"C++"}]]

df = pd.DataFrame(data, columns = ['column_1', 'column_2'])

new_columns = ['column_1', 'id', 'name']
new_data_frame = pd.DataFrame(columns = new_columns)

for i in range(0, len(df)):

    column_1_value = df['column_1'].iloc[i]
    dict_column = list(df['column_2'].iloc[i].values())
    print(dict_column)
    column_2_value = dict_column[0]
    column_3_value = dict_column[1]

    new_row = [column_1_value, column_2_value, column_3_value]
    new_data_frame.loc[len(new_data_frame)] = new_row

print(new_data_frame.head())

The new data frame looks like this:

  column_1 id            name
0      one  1  Graphic Novels
1      two  2      Algorithms
2    three  3             C++
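
As a side note (not part of the original answer): when column_2 already holds Python dicts, as in the df above, the same expansion can be written without an explicit row loop by mapping each dict to a Series. A rough sketch under that assumption:

# Expand each dict in column_2 into its own columns, then join back to column_1
expanded = df.drop(columns='column_2').join(df['column_2'].apply(pd.Series))
print(expanded.head())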

Comments:

Answer 3:

Hi, try this:

# Assuming your dataset is called df.
# Create another df called subset with only 2 columns
subset = df[['country', 'continent']]
subset.head()
type(subset)

Comments:
