Unable to merge 2 DataFrames


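I'm finally close to finishing this script, but I have two small issues that I think should be easy to clean up. The main one is that the CSV with the merged data contains everything, but the rows of the two DataFrames don't line up. The other is that the player ID shows up as ['5452'] when I'd prefer 5452. If anyone can help, I'd really appreciate it.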

import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlparse, parse_qs
from functools import reduce

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"

def random_headers():
    return {'User-Agent': choice(desktop_agents),'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

df3 = pd.DataFrame()
# get the url

page_request = requests.get(url,headers=random_headers())
soup = BeautifulSoup(page_request.text,"lxml")

table = soup.find_all('table')[11]
data = []
# pulls headings from the fangraphs table
column_headers = []
headingrows = table.find_all('th')
for row in headingrows[0:]:
    column_headers.append(row.text.strip())

data.append(column_headers)
table_body = table.find('tbody')
rows = table_body.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols[1:]])

ID = []

for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.reindex(df1.index.drop(0))
df2 = pd.DataFrame(ID)

df3 = pd.concat([df1, df2], axis=1)

df3.to_csv("1.csv")
Answer

Consider the following to resolve your two issues:

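  1. MISMATCHED INDEX: When you drop the first row from df1, its index runs from 1 to 380, while df2's index runs from 0 to 379. Because pd.concat(..., axis=1) aligns on the index, the records end up mismatched.

     To fix this, filter the row out with .loc and then call .reset_index(drop=True) so that the index runs from 0 to 379. Specifically, replace:

         df1 = df1.reindex(df1.index.drop(0))

     with:

         df1 = df1.loc[1:].reset_index(drop=True)

  2. EMBEDDED LIST: Assuming you are using urllib.parse.parse_qs(), its output is a dictionary of list values. Specifically, query = parse_qs(link) yields {'playerid': ['5452']}. A long-hand version of the df2 assignment, with a list of dictionaries passed to the DataFrame call, looks like this:

         df2 = pd.DataFrame([{'playerid': ['5452']}, {'playerid': ['1111']}, {'playerid': ['9999']}])

     To fix this, rebuild your list of dictionaries with a nested list/dictionary comprehension that takes the first item of each list value (i.e. index [0]):

         new_ID = [{k: v[0]} for i in ID for k, v in i.items()]
         df2 = pd.DataFrame(new_ID)
         print(df2)
         #   playerid
         # 0     5452
         # 1     1111
         # 2     9999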
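
The index mismatch from point 1 can be reproduced on a tiny frame. The sketch below is only an illustration: the rows in `data` and `ids` are made-up stand-ins for the scraped table and the player IDs, not FanGraphs output.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: row 0 holds the column headers.
data = [["Name", "HR"], ["Player A", "10"], ["Player B", "20"], ["Player C", "30"]]
# Hypothetical stand-in for the already-flattened player IDs.
ids = [{"playerid": "5452"}, {"playerid": "1111"}, {"playerid": "9999"}]

df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])  # promote row 0 to column names
df2 = pd.DataFrame(ids)

# Approach from the question: dropping row 0 leaves df1 indexed 1, 2, 3,
# while df2 is indexed 0, 1, 2, so concat produces 4 partly empty rows.
misaligned = pd.concat([df1.reindex(df1.index.drop(0)), df2], axis=1)

# Fix from the answer: reset the index so both frames are indexed 0, 1, 2.
aligned = pd.concat([df1.loc[1:].reset_index(drop=True), df2], axis=1)

print(misaligned)
print(aligned)
```

In `misaligned`, the row labelled 1 pairs "Player A" with the second ID, and the first ID sits on a row with no stats at all. In `aligned`, each player shares a row with the matching ID.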

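The flattening from point 2 can be sketched the same way. The hrefs below are assumptions about the link format, not values taken from the page. The sketch also differs from the answer in two ways. It calls `urlparse(link).query` first, because `parse_qs` applied to a whole relative link treats everything before the first `=` as the key. It also builds one dictionary per link, so a link with several parameters stays on a single row.

```python
from urllib.parse import urlparse, parse_qs

import pandas as pd

# Hypothetical hrefs in the shape matched by the selector in the question.
links = [
    "statss.aspx?playerid=5452&position=OF",
    "statss.aspx?playerid=1111&position=1B",
    "statss.aspx?playerid=9999&position=P",
]

# Parse only the query component so the key is 'playerid'
# rather than 'statss.aspx?playerid'.
ID = [parse_qs(urlparse(link).query) for link in links]

# parse_qs returns list values; keep the first item of each list.
new_ID = [{k: v[0] for k, v in d.items()} for d in ID]

df2 = pd.DataFrame(new_ID)
print(df2)
```

If only the ID is needed, selecting `df2[["playerid"]]` before the concat keeps the extra query parameters out of the CSV.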