How can I convert multiple XPaths to a dataframe using pandas?

Posted: 2019-02-27 00:31:28

Question:

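I am starting to scrape stats for 2018 MLB pitchers. I have various categories that I want to turn into a dataframe so that I can export them to Excel. I would like to use pandas. Here is my code so far: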

from urllib.request import urlopen
from lxml.html import fromstring

url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-pitching.shtml"

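# Decode the response bytes (str() on bytes would yield a "b'...'" literal with escaped newlines)
content = urlopen(url).read().decode("utf-8")
# Remove HTML comment markup so tables hidden inside comments get parsed too
comment = content.replace("-->", "").replace("<!--", "")
tree = fromstring(comment)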

for pitcher_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    age = pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0]
    w = pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0]
    l = pitcher_row.xpath('.//td[@data-stat="L"]/text()')[0]
    g = pitcher_row.xpath('.//td[@data-stat="G"]/text()')[0]
    gs = pitcher_row.xpath('.//td[@data-stat="GS"]/text()')[0]
    ip = pitcher_row.xpath('.//td[@data-stat="IP"]/text()')[0]
    hits = pitcher_row.xpath('.//td[@data-stat="H"]/text()')[0]
    runs = pitcher_row.xpath('.//td[@data-stat="R"]/text()')[0]
    bb = pitcher_row.xpath('.//td[@data-stat="BB"]/text()')[0]
    so = pitcher_row.xpath('.//td[@data-stat="SO"]/text()')[0]

    # print data
    print(names, age, w, l, g, gs, ip, hits, runs, bb, so)

I would like to create a dataframe from the scraped data. Does anyone know how to do this?

I saw instructions on how to create a dataframe at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html, but I don't know how to apply them to my situation.

Here is an example from that page:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df

However, I would like to use the data from above. I am not sure whether I need to append my data.

Thanks!

Answer 1:

How about instantiating an empty dataframe and appending your scraped data row by row:

import pandas as pd

columns = ("names", "age", "w", "l", "g", "gs", "ip", "hits", "runs", "bb", "so")
df = pd.DataFrame(columns=columns)

for idx, pitcher_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]')):
    tmp = []
    tmp.append(pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text)
    tmp.append(pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0])
    tmp.append(pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0])
    ...

    df.loc[idx] = tmp

If you want to stick with most of your code, it is even simpler:

columns = ("names", "age", "w", "l", "g", "gs", "ip", "hits", "runs", "bb", "so")
df = pd.DataFrame(columns=columns)

for idx, pitcher_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    age = pitcher_row.xpath('.//td[@data-stat="age"]/text()')[0]
    w = pitcher_row.xpath('.//td[@data-stat="W"]/text()')[0]
    l = pitcher_row.xpath('.//td[@data-stat="L"]/text()')[0]
    g = pitcher_row.xpath('.//td[@data-stat="G"]/text()')[0]
    gs = pitcher_row.xpath('.//td[@data-stat="GS"]/text()')[0]
    ip = pitcher_row.xpath('.//td[@data-stat="IP"]/text()')[0]
    hits = pitcher_row.xpath('.//td[@data-stat="H"]/text()')[0]
    runs = pitcher_row.xpath('.//td[@data-stat="R"]/text()')[0]
    bb = pitcher_row.xpath('.//td[@data-stat="BB"]/text()')[0]
    so = pitcher_row.xpath('.//td[@data-stat="SO"]/text()')[0]

    df.loc[idx] = (names, age, w, l, g, gs, ip, hits, runs, bb, so)
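
As a variation on the answer above (not the answerer's own method), the rows can be collected in a plain list first and the dataframe built once at the end, which usually avoids the cost of growing a dataframe one row at a time. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML`, its two made-up pitchers, the `STATS` mapping, and the `rows_to_frame` helper are all hypothetical stand-ins, and only four of the columns are shown so the example runs offline.

```python
import pandas as pd
from lxml.html import fromstring

# Hypothetical stand-in for the real page: made-up rows that reuse the
# same class names and data-stat attributes as the XPaths in the question.
SAMPLE_HTML = """<html><body>
<table class="stats_table">
  <tr class="full_table">
    <td data-stat="player"><a href="#">Pitcher A</a></td>
    <td data-stat="age">25</td>
    <td data-stat="W">10</td>
    <td data-stat="L">5</td>
  </tr>
  <tr class="full_table">
    <td data-stat="player"><a href="#">Pitcher B</a></td>
    <td data-stat="age">31</td>
    <td data-stat="W">3</td>
    <td data-stat="L">8</td>
  </tr>
</table>
</body></html>"""

# Dataframe column name -> value of the data-stat attribute on the <td>.
# Extend this mapping with the remaining stats (G, GS, IP, ...) as needed.
STATS = {"age": "age", "w": "W", "l": "L"}

ROW_XPATH = '//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'


def rows_to_frame(tree):
    """Collect one dict per table row, then build the dataframe in one step."""
    records = []
    for row in tree.xpath(ROW_XPATH):
        record = {"names": row.xpath('.//td[@data-stat="player"]/a')[0].text}
        for column, stat in STATS.items():
            cells = row.xpath(f'.//td[@data-stat="{stat}"]/text()')
            # An empty cell yields no text node; store None instead of failing.
            record[column] = cells[0] if cells else None
        records.append(record)

    df = pd.DataFrame(records)
    # Scraped values are strings; convert the stat columns to numbers.
    numeric = list(STATS)
    df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")
    return df


df = rows_to_frame(fromstring(SAMPLE_HTML))
print(df)
```

With the real page, `tree` from the question could be passed to `rows_to_frame` instead of the sample. The result could then be written out with `df.to_excel("pitchers.xlsx", index=False)`, where the file name is just an example and an Excel writer package such as openpyxl is assumed to be installed.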

Comments:

Your solution works flawlessly, @petezurich! Thank you very much for your time and effort, sir. Much appreciated =)
