Appending pandas dataframes after looping through layers
Asked: 2021-12-22 10:35:41
Question:
I need some help with the last step of formatting API results for import into a PostgreSQL database. The data is structured as:
[
  {
    "season": 0,
    "seasonType": "string",
    "week": 0,
    "polls": [
      {
        "poll": "string",
        "ranks": [
          {
            "rank": 0,
            "school": "string",
            "conference": "string",
            "firstPlaceVotes": 0,
            "points": 0
          }
        ]
      }
    ]
  }
]
Here is the code I use to unpack it (if there is a better, more efficient way to do this, I'm all ears):
import json
import pandas as pd
import requests
from tqdm import tqdm

year = list(range(2020,2021))
req = []
pbp = pd.DataFrame()
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer 1a2b3c4d"
}
for year in tqdm(year, desc = 'fetch record'):
    parameters = {"year":year, "seasonType":"regular"}
    req = requests.get("https://api.collegefootballdata.com/rankings", headers=headers, params = parameters)
    r = req.json()
    print(type(r))
    df1 = pd.DataFrame(r, columns = ['season', 'seasonType', 'week'], dtype = int)
    pbp = pbp.append(json.loads(req.text))
    for polls in pbp["polls"]:
        try:
            p1 = polls[1]
        except IndexError:
            continue
        df2 = pd.DataFrame.from_dict(p1)
        poll = df2["poll"]
        y = df1.append(poll)
        for rank in df2["ranks"]:
            df3 = pd.DataFrame(rank, index=[0])
            z = y.append(df3)
When I append, the data comes out like this:
| year | season | week | poll | rank | team |
|---|---|---|---|---|---|
| 2020 | regular | 1 | | | |
| 2020 | regular | 2 | | | |
| | | | AP | | |
| | | | | 1 | Alabama |
| 2020 | regular | 1 | | | |
| 2020 | regular | 2 | | | |
| | | | AP | | |
| | | | | 2 | Clemson |
And I would like it to look like this:
year | season | week | poll | rank | team |
---|---|---|---|---|---|
2020 | regular | 1 | AP | 1 | Alabama |
2020 | regular | 1 | AP | 2 | Clemson |
2020 | regular | 2 | AP | 1 | Alabama |
2020 | regular | 2 | AP | 2 | Clemson |
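Comments:
Perhaps start by using print() to see what is in the variables and which lines of code get executed. It looks like you have to write the code differently: first collect all the values for a row, and only then append to the dataframe. Right now you first append only year, season, week, then only poll, and then only rank. That is your problem.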
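Answer 1:
The problem is that you call append() too many times.
You should first build a list/dictionary holding all the values for one row, and only then append that complete row.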
import pandas as pd
from tqdm import tqdm
import requests

years = range(2020, 2021)
df = pd.DataFrame()
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer XXXXX"
}
for year in tqdm(years, desc='fetch record'):
    parameters = {
        "year": year,
        "seasonType": "regular"
    }
    url = "https://api.collegefootballdata.com/rankings"
    response = requests.get(url, params=parameters, headers=headers)
    data = response.json()
    #print(data[0])
    for item in data:
        # values shared by every row of this week
        row = {
            'year': item['season'],
            'season': item['seasonType'],
            'week': item['week'],
        }
        for poll in item["polls"]:
            row['poll'] = poll["poll"]
            for rank in poll["ranks"]:
                row['rank'] = rank["rank"]
                row['team'] = rank["school"]
                #print(row)
                # append only once the row is complete
                # (DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0)
                df = df.append(row, ignore_index=True)
print(df)
Result:
year season week poll rank team
0 2020.0 regular 1.0 AP Top 25 1.0 Clemson
1 2020.0 regular 1.0 AP Top 25 2.0 Ohio State
2 2020.0 regular 1.0 AP Top 25 3.0 Alabama
3 2020.0 regular 1.0 AP Top 25 4.0 Georgia
4 2020.0 regular 1.0 AP Top 25 5.0 Oklahoma
.. ... ... ... ... ... ...
845 2020.0 regular 16.0 Playoff Committee Rankings 21.0 Oklahoma State
846 2020.0 regular 16.0 Playoff Committee Rankings 22.0 NC State
847 2020.0 regular 16.0 Playoff Committee Rankings 23.0 Tulsa
848 2020.0 regular 16.0 Playoff Committee Rankings 24.0 San José State
849 2020.0 regular 16.0 Playoff Committee Rankings 25.0 Colorado
[850 rows x 6 columns]
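On pandas 2.0 and later, DataFrame.append no longer exists, so the same "build the full row first" idea can be sketched by collecting the rows in a plain list and constructing the dataframe once at the end. This is only a minimal sketch: the `sample` data and the helper name `flatten_rankings` are made up for illustration (placeholder vote and point values, shaped like the structure in the question), and no request is sent to the API.

```python
import pandas as pd

# Hypothetical sample shaped like the response structure in the question.
# firstPlaceVotes and points are placeholder values, not real data.
sample = [
    {
        "season": 2020, "seasonType": "regular", "week": week,
        "polls": [
            {
                "poll": "AP",
                "ranks": [
                    {"rank": 1, "school": "Alabama", "conference": "SEC",
                     "firstPlaceVotes": 0, "points": 0},
                    {"rank": 2, "school": "Clemson", "conference": "ACC",
                     "firstPlaceVotes": 0, "points": 0},
                ],
            }
        ],
    }
    for week in (1, 2)
]


def flatten_rankings(data):
    """Return one dataframe row per (week, poll, rank) combination."""
    rows = []
    for item in data:
        for poll in item["polls"]:
            for rank in poll["ranks"]:
                # Build the complete row before storing it.
                rows.append({
                    "year": item["season"],
                    "season": item["seasonType"],
                    "week": item["week"],
                    "poll": poll["poll"],
                    "rank": rank["rank"],
                    "team": rank["school"],
                })
    # Create the dataframe once, instead of appending row by row.
    return pd.DataFrame(rows, columns=["year", "season", "week", "poll", "rank", "team"])


df = flatten_rankings(sample)
print(df)
```

Building the dataframe once also avoids copying the whole frame on every row, and the integer columns keep their integer dtype.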
EDIT
The same can be done with special functions such as .read_json(), .explode() and .apply(pd.Series):
# ... code ...
response = requests.get(url, params=parameters, headers=headers)
df = pd.read_json(response.text)
df = df.explode(['polls'])
df['poll'] = df['polls'].str['poll']
df['ranks'] = df['polls'].str['ranks']
df = df.explode(['ranks'])
df = df.reset_index(drop=True)
df = df.join(df['ranks'].apply(pd.Series))
df.drop(columns=['polls', 'ranks'], inplace=True)
print(df)
Result:
season seasonType week ... conference firstPlaceVotes points
0 2020 regular 1 ... ACC 38.0 1520.0
1 2020 regular 1 ... Big Ten 21.0 1504.0
2 2020 regular 1 ... SEC 2.0 1422.0
3 2020 regular 1 ... SEC 0.0 1270.0
4 2020 regular 1 ... Big 12 0.0 1269.0
.. ... ... ... ... ... ... ...
845 2020 regular 16 ... Big 12 NaN NaN
846 2020 regular 16 ... ACC NaN NaN
847 2020 regular 16 ... American Athletic NaN NaN
848 2020 regular 16 ... Mountain West NaN NaN
849 2020 regular 16 ... Pac-12 NaN NaN
[850 rows x 9 columns]
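A different technique, not used in the answer above, is pd.json_normalize, which can walk the nested "polls" and "ranks" lists in a single call. The sketch below assumes the response has exactly the structure shown in the question; `sample` is a made-up stand-in for `response.json()` with placeholder vote and point values.

```python
import pandas as pd

# Hypothetical stand-in for response.json(); values are placeholders.
sample = [
    {
        "season": 2020, "seasonType": "regular", "week": week,
        "polls": [
            {
                "poll": "AP",
                "ranks": [
                    {"rank": 1, "school": "Alabama", "conference": "SEC",
                     "firstPlaceVotes": 0, "points": 0},
                    {"rank": 2, "school": "Clemson", "conference": "ACC",
                     "firstPlaceVotes": 0, "points": 0},
                ],
            }
        ],
    }
    for week in (1, 2)
]

# record_path descends polls -> ranks; meta copies the outer fields onto each rank row.
flat = pd.json_normalize(
    sample,
    record_path=["polls", "ranks"],
    meta=["season", "seasonType", "week", ["polls", "poll"]],
)
# The nested meta field comes back as "polls.poll"; give it a shorter name.
flat = flat.rename(columns={"polls.poll": "poll"})
print(flat)
```

Note that the meta columns come back with object dtype, so a cast such as `flat["week"].astype(int)` may be wanted before loading the table into a database.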
Comments:
This was very helpful and works great. I was doing similar work while playing with this dataset, and this solution also got me past a hurdle at work! Thanks!