Assistance with splitting data frame to new columns
Posted: 2020-12-28 10:10:43
Question: I am having trouble splitting a data frame column on _ and creating new columns from the pieces.
Original string:
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt as section
My current code:
import pandas as pd

# myList holds the (section, text) pairs built earlier
df = pd.DataFrame(myList, columns=['section', 'text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
# Drop the file suffix, then the 10-digit and 8-digit numeric fields
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',', index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
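I would really like to split "section" into new columns based on "_". I have tried many regex variations to split "section". All of them either gave me headers with nothing filled in, or added the new columns after section and text, which is not useful. I should also add that there are about 100,000 observations.
Desired result: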
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be greatly appreciated.
Answer 1: If you always know the number of splits, you can do this:
import pandas as pd
df = pd.DataFrame({"a": ["test_a_b", "test2_c_d"]})
# Split column "a" on "_" into a Series of lists
items = df["a"].str.split("_")
# Pop the last item from each list and place it in "b"
# (list.pop mutates the lists in place, so each call takes the next item from the end)
df["b"] = items.apply(list.pop)
# Pop the next-to-last item and place it in "c"
df["c"] = items.apply(list.pop)
# Pop the remaining item and place it in "d"
df["d"] = items.apply(list.pop)
This way, the dataframe becomes:
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
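A side note on the snippet above: because list.pop empties the lists held in items, the three assignments only work once and in that order. A minimal sketch of an equivalent that leaves the split lists untouched, using the .str accessor to index into each list (the name parts is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": ["test_a_b", "test2_c_d"]})

# Series of lists, one list per row
parts = df["a"].str.split("_")

# .str[i] picks element i from each list without modifying the list
df["b"] = parts.str[-1]
df["c"] = parts.str[-2]
df["d"] = parts.str[-3]
```

This should give the same b, c and d columns, and the statements can be run in any order.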
Since you want the columns in a specific order, you can rearrange the dataframe's columns like this:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
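For the file names in the question, the same idea might be sketched with str.split(..., expand=True), which returns the pieces as separate columns in one vectorized step and so should cope with around 100,000 rows. This is only a sketch under assumptions: every file name is assumed to have exactly six "_"-separated fields in the order shown in the question, and the sample row, my_list and the section_names mapping are made up for illustration.

```python
import pandas as pd

# Hypothetical sample row in the shape described in the question
my_list = [
    ("AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt",
     "The COVID-19 pandemic and global measures taken in response"),
]
df = pd.DataFrame(my_list, columns=["section", "text"])

# One column per "_"-separated field:
# 0 = ticker, 1 = 10-digit number, 2 = filing type,
# 3 = 8-digit date, 4 = item code, 5 = "excerpt.txt"
parts = df["section"].str.split("_", expand=True)

# Assumed mapping from item codes to readable section names
section_names = {
    "Item1A": "Filing Section: Risk Factors",
    "Item2": "Filing Section: Management Discussion and Analysis",
}

# Build the result with the columns already in the desired order;
# item codes missing from the mapping are kept unchanged
result = pd.DataFrame({
    "Ticker": parts[0],
    "Filing type": parts[2],
    "Section": parts[4].map(section_names).fillna(parts[4]),
    "Text": df["text"],
})
```

Building a new frame from the selected pieces avoids the reordering step, since the columns are created in the order they are listed.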
Comments:
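Hey Marco, when I try this, the added columns show up after the section and text columns.
You can reorder the columns after inserting them. I will edit my answer to include that reordering.
I wish I could upvote your answer. Everything works now, much appreciated!
@xSuperAnnuated Glad it helped!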