Appending pandas dataframes after looping through layers

Posted 2023-03-11

Original title: Appending pandas dataframes after looping through layers. Asked: 2021-12-22 10:35:41. Question:

I need some help with the last step of formatting API results for import into a PostgreSQL database. The data is structured as:

[
  {
    "season": 0,
    "seasonType": "string",
    "week": 0,
    "polls": [
      {
        "poll": "string",
        "ranks": [
          {
            "rank": 0,
            "school": "string",
            "conference": "string",
            "firstPlaceVotes": 0,
            "points": 0
          }
        ]
      }
    ]
  }
]

Here is the code I am using to unpack it (of course, if there is a better, more efficient way to do this, please let me know):

import json

import pandas as pd
import requests
from tqdm import tqdm

year = list(range(2020,2021))
req = []
pbp = pd.DataFrame()
headers = {
    "Accept": "application/json",
    "Authorization": "Bearer 1a2b3c4d"
}
for year in tqdm(year, desc = 'fetch record'):
    parameters = {"year":year, "seasonType":"regular"}
    req = requests.get("https://api.collegefootballdata.com/rankings", headers=headers, params = parameters)
    r = req.json()
    print(type(r))
    df1 = pd.DataFrame(r, columns = ['season', 'seasonType', 'week'], dtype = int)
    pbp = pbp.append(json.loads(req.text))
    for polls in pbp["polls"]:
        try:
            p1 = polls[1]
        except IndexError:
            continue
        df2 = pd.DataFrame.from_dict(p1)
        poll = df2["poll"]
        y = df1.append(poll)
        for rank in df2["ranks"]:
            df3 = pd.DataFrame(rank, index=[0])
            z = y.append(df3)

When I append, the data comes out like this:

year	season	week	poll	rank	team
2020	regular	1
2020	regular	2
			AP
				1	Alabama
2020	regular	1
2020	regular	2
			AP
				2	Clemson

And I would like it to look like this:

year	season	week	poll	rank	team
2020	regular	1	AP	1	Alabama
2020	regular	1	AP	2	Clemson
2020	regular	2	AP	1	Alabama
2020	regular	2	AP	2	Clemson

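Comments:

Maybe first use print() to see what is in the variables and which lines of code are executed. It seems you have to write the code differently: first gather all the values for a row, and only then append to the dataframe. But you first append only year, season, week, then only poll, and then only rank, and that is your problem.

Answer 1: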

The problem is that you use too many append() calls.
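To see why piecewise appends produce the staircase output from the question, here is a minimal sketch. The three tiny frames are made up for illustration, and pd.concat stands in for the repeated append() calls. Each piece becomes its own row, and the columns it lacks are filled with NaN.

```python
import pandas as pd

# Hypothetical fragments, one per nesting level of the API response.
meta = pd.DataFrame([{"season": 2020, "week": 1}])
poll = pd.DataFrame([{"poll": "AP"}])
rank = pd.DataFrame([{"rank": 1, "team": "Alabama"}])

# Stacking the fragments yields three mostly empty rows.
stacked = pd.concat([meta, poll, rank], ignore_index=True)
print(stacked)

# Merging the values into one dict first yields a single complete row.
one_row = pd.DataFrame([{"season": 2020, "week": 1, "poll": "AP",
                         "rank": 1, "team": "Alabama"}])
print(one_row)
```

The second frame is the shape the question asks for: every value of a record sits on the same row.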

You should first build a list/dictionary with all the values for a row, and only then append that row.

import pandas as pd
from tqdm import tqdm
import requests

year = range(2020, 2021)
df = pd.DataFrame()

headers = {
    "Accept": "application/json",
    "Authorization": "Bearer XXXXX"
}

for year in tqdm(year, desc='fetch record'):
    parameters = {
        "year": year,
        "seasonType": "regular"
    }

    url = "https://api.collegefootballdata.com/rankings"
    response = requests.get(url, params=parameters, headers=headers)

    data = response.json()
    
    #print(data[0])
    
    for item in data:
        row = {
            'year':   item['season'],
            'season': item['seasonType'],
            'week':   item['week'],
        }
    
        for poll in item["polls"]:
            row['poll'] = poll["poll"]
            for rank in poll["ranks"]:
                row['rank'] = rank["rank"]
                row['team'] = rank["school"]
                #print(row)
                # Note: DataFrame.append() was removed in pandas 2.0, so this line needs pandas < 2.0
                df = df.append(row, ignore_index=True)
                
print(df)

Result:

       year   season  week                        poll  rank            team
0    2020.0  regular   1.0                   AP Top 25   1.0         Clemson
1    2020.0  regular   1.0                   AP Top 25   2.0      Ohio State
2    2020.0  regular   1.0                   AP Top 25   3.0         Alabama
3    2020.0  regular   1.0                   AP Top 25   4.0         Georgia
4    2020.0  regular   1.0                   AP Top 25   5.0        Oklahoma
..      ...      ...   ...                         ...   ...             ...
845  2020.0  regular  16.0  Playoff Committee Rankings  21.0  Oklahoma State
846  2020.0  regular  16.0  Playoff Committee Rankings  22.0        NC State
847  2020.0  regular  16.0  Playoff Committee Rankings  23.0           Tulsa
848  2020.0  regular  16.0  Playoff Committee Rankings  24.0  San José State
849  2020.0  regular  16.0  Playoff Committee Rankings  25.0        Colorado

[850 rows x 6 columns]
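DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the loop above needs a small change. A minimal sketch of the same idea collects complete rows in a plain list and builds the frame once at the end. The payload below is made up in the shape shown in the question and is not real API output.

```python
import pandas as pd

# Made-up payload mirroring the structure from the question.
data = [
    {
        "season": 2020, "seasonType": "regular", "week": 1,
        "polls": [
            {"poll": "AP Top 25", "ranks": [
                {"rank": 1, "school": "Clemson"},
                {"rank": 2, "school": "Ohio State"},
            ]},
        ],
    },
    {
        "season": 2020, "seasonType": "regular", "week": 2,
        "polls": [
            {"poll": "AP Top 25", "ranks": [
                {"rank": 1, "school": "Clemson"},
            ]},
        ],
    },
]

rows = []
for item in data:
    for poll in item["polls"]:
        for rank in poll["ranks"]:
            # Build one complete row per ranking entry.
            rows.append({
                "year": item["season"],
                "season": item["seasonType"],
                "week": item["week"],
                "poll": poll["poll"],
                "rank": rank["rank"],
                "team": rank["school"],
            })

# Construct the DataFrame once, after all rows are collected.
df = pd.DataFrame(rows)
print(df)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, and integer columns stay integers instead of turning into floats.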

Edit

The same can be done with special functions such as .read_json(), .explode(), and .apply(pd.Series):


# ... code ...

response = requests.get(url, params=parameters, headers=headers)

df = pd.read_json(response.text)

df = df.explode(['polls'])
df['poll'] = df['polls'].str['poll']
df['ranks'] = df['polls'].str['ranks']
df = df.explode(['ranks'])
df = df.reset_index(drop=True)
df = df.join(df['ranks'].apply(pd.Series))
df.drop(columns=['polls', 'ranks'], inplace=True)

print(df)

Result:

     season seasonType  week  ...         conference  firstPlaceVotes  points
0      2020    regular     1  ...                ACC             38.0  1520.0
1      2020    regular     1  ...            Big Ten             21.0  1504.0
2      2020    regular     1  ...                SEC              2.0  1422.0
3      2020    regular     1  ...                SEC              0.0  1270.0
4      2020    regular     1  ...             Big 12              0.0  1269.0
..      ...        ...   ...  ...                ...              ...     ...
845    2020    regular    16  ...             Big 12              NaN     NaN
846    2020    regular    16  ...                ACC              NaN     NaN
847    2020    regular    16  ...  American Athletic              NaN     NaN
848    2020    regular    16  ...      Mountain West              NaN     NaN
849    2020    regular    16  ...             Pac-12              NaN     NaN

[850 rows x 9 columns]
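As a possible alternative to the explode() chain, pd.json_normalize() can flatten the nested lists in one call. This is only a sketch under assumptions: the payload is made up in the shape shown in the question, and the rename of the generated "polls.poll" column is a cosmetic choice.

```python
import pandas as pd

# Made-up payload mirroring the structure from the question.
data = [
    {
        "season": 2020, "seasonType": "regular", "week": 1,
        "polls": [
            {"poll": "AP Top 25", "ranks": [
                {"rank": 1, "school": "Clemson", "conference": "ACC",
                 "firstPlaceVotes": 38, "points": 1520},
                {"rank": 2, "school": "Ohio State", "conference": "Big Ten",
                 "firstPlaceVotes": 21, "points": 1504},
            ]},
        ],
    },
    {
        "season": 2020, "seasonType": "regular", "week": 2,
        "polls": [
            {"poll": "AP Top 25", "ranks": [
                {"rank": 1, "school": "Clemson", "conference": "ACC",
                 "firstPlaceVotes": 40, "points": 1530},
            ]},
        ],
    },
]

# record_path walks down to the innermost list of records;
# meta pulls in fields from the enclosing levels.
df = pd.json_normalize(
    data,
    record_path=["polls", "ranks"],
    meta=["season", "seasonType", "week", ["polls", "poll"]],
)
df = df.rename(columns={"polls.poll": "poll"})
print(df)
```

The meta columns come back with object dtype, so a cast such as df.astype({"season": int, "week": int}) may be useful before loading the data into a database.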

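Comments:

This is very helpful and works great. I was doing similar work while playing with this dataset, and this solution also got me past a problem at work. Thanks!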
