Converting a dataframe column into multiple rows, repeating the values of the other columns

Posted 2023-02-25

【Title】: Converting dataframe column into multiple rows, repeating values of other columns 【Posted】: 2018-09-19 04:15:28 【Question】:

Here is my CSV:

languages,    origin,     other_test1,       other_test2
"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF

I want to convert the languages column of the CSV into the following output:

Language_name ,Language_vowel_count, origin,    other.test1, other.test2
French,        3,                    Germanic,  ABC,         DEF
Dutch,         4,                    Germanic,  ABC,         DEF
English,       5,                    Germanic,  ABC,         DEF

The code I tried:

import pandas as pd
from itertools import chain

# findall("'(.*?)'") returns every single-quoted token, keys as well as values
a = df['languages'].str.findall("'(.*?)'").astype(object)
lens = a.str.len()

df = pd.DataFrame({
    'origin': df['origin'].repeat(lens),
    'other_test1': df['other_test1'].repeat(lens),
    'other_test2': df['other_test2'].repeat(lens),
    'name': list(chain.from_iterable(a.tolist())),
    'vowel_count': list(chain.from_iterable(a.tolist())),
})

df

But it does not give me the expected output.

【Question discussion】:

【Answer 1】:

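You can use a nested list comprehension to unpack the data, and ast.literal_eval to convert the string into a list of Python dictionaries. (The string is a Python literal rather than strict JSON, since it uses single quotes.)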

import ast

>>> pd.DataFrame(
    [[languages.get('name'), languages.get('vowel_count'), row['origin'], row['other_test1'], row['other_test2']]
     for idx, row in df.iterrows() 
     for languages in ast.literal_eval(row['languages'])],
    columns=['Language_name', 'Language_vowel_count', 'origin', 'other.test1', 'other.test2'])
  Language_name  Language_vowel_count    origin other.test1 other.test2
0        French                     3  Germanic         ABC         DEF
1         Dutch                     4  Germanic         ABC         DEF
2       English                     5  Germanic         ABC         DEF

An alternative that avoids iterrows concatenates the unpacked languages with the base data, repeating each base row once per language:

languages = df['languages'].apply(ast.literal_eval)

df_lang = pd.DataFrame(
    [(lang.get('name'), lang.get('vowel_count')) 
     for language in languages 
     for lang in language])

df_new = pd.concat([
    df_lang, 
    df.iloc[:, 1:].reindex(df.index.repeat([len(x) for x in languages])).reset_index(drop=True)], axis=1)

df_new.columns = ['Language_name', 'Language_vowel_count', 'origin', 'other.test1', 'other.test2']
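
On pandas 0.25 or later (an assumption about the environment, since the code above does not rely on it), the same reshaping can be sketched with DataFrame.explode, which emits one row per list element and repeats the other columns. This is a different technique from the two shown above, not a restatement of them. The sample frame is rebuilt from the question's CSV, and the names `exploded` and `lang_cols` are made up for illustration.

```python
import ast

import pandas as pd

# Sample frame rebuilt from the question's single CSV row.
df = pd.DataFrame({
    "languages": ["[{'name': 'French', 'vowel_count': 3}, "
                  "{'name': 'Dutch', 'vowel_count': 4}, "
                  "{'name': 'English', 'vowel_count': 5}]"],
    "origin": ["Germanic"],
    "other_test1": ["ABC"],
    "other_test2": ["DEF"],
})

# Parse each string into a list of dicts, then emit one row per dict.
# explode repeats the original index label, so reset it afterwards.
exploded = (
    df.assign(languages=df["languages"].apply(ast.literal_eval))
      .explode("languages")
      .reset_index(drop=True)
)

# Expand each dict into its own columns (hypothetical helper name) and prefix them.
lang_cols = pd.DataFrame(exploded["languages"].tolist()).add_prefix("Language_")

# Put the language columns next to the repeated base columns.
df_new = pd.concat([lang_cols, exploded.drop(columns="languages")], axis=1)
print(df_new)
```

This sketch assumes every row holds a non-empty list; an empty list would explode to NaN and would need to be filtered out before the dicts are expanded.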

【Discussion】:

【Answer 2】:

import re
import pandas as pd
import json
csv = """"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF"""
# split on commas that are not inside the bracketed list
csv = re.split(r'(?![^)(]*\([^)(]*?\)\)),(?![^\[]*\])', csv)
df = pd.DataFrame(json.loads(csv[0].replace("'",'"')[1:-1]))
df['Origin']=csv[1]
df['other.test1']=csv[2]
df['other.test2']=csv[3]
df
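
As a possible alternative to splitting the line with a regular expression, the quoted field can be left to pandas' CSV parser, which treats the double-quoted list as a single column. The following is only a sketch under stated assumptions: the header line is taken from the question, `raw`, `frames` and `result` are illustrative names, and the per-row loop is one of several ways to attach the repeated columns.

```python
import ast
import io

import pandas as pd

# Header and data row as given in the question (illustrative in-memory copy).
raw = '''languages,    origin,     other_test1,       other_test2
"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF
'''

# skipinitialspace=True drops the padding after each comma in the header,
# so the columns are named 'origin', 'other_test1', ... without leading spaces.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

frames = []
for _, row in df.iterrows():
    # One small frame per input row: one line per language dict.
    langs = pd.DataFrame(ast.literal_eval(row["languages"]))
    # Broadcast the scalar base values onto every language line.
    for col in ["origin", "other_test1", "other_test2"]:
        langs[col] = row[col]
    frames.append(langs)

result = pd.concat(frames, ignore_index=True)
print(result)
```

ast.literal_eval is used here instead of json.loads because the field uses single quotes, so no quote replacement is needed.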

【Discussion】:
