Web scraping with Python: cannot access td elements


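I am trying to scrape this page: https://www.pro-football-reference.com/boxscores/

It is a page of American football game scores. I want to get the date, winner, and loser of each game. I have no trouble getting the date, but I cannot figure out how to isolate and extract the winning and losing team names. Here is what I have so far...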

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


#assigning url
my_url = 'https://www.pro-football-reference.com/boxscores/'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html,"html.parser")

games = page_soup.findAll("div",{"class":"game_summary expanded nohover"})


for game in games:
    date_block = game.findAll("tr",{"class":"date"})
    date_val = date_block[0].text
    winner_block = game.findAll("tr",{"class":"winner"})
    #here I need a line that returns the game winner, e.g. "Philadelphia Eagles"
    loser = game.findAll("tr",{"class":"loser"})

Here is the relevant HTML...

<div class="game_summary expanded nohover">
<table class="teams">
    <tbody>
        <tr class="date">
            <td colspan="3">Sep 6, 2018</td>
        </tr>
        <tr class="loser">
            <td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td>
            <td class="right">12</td>
            <td class="right gamelink">
                <a href="/boxscores/201809060phi.htm">Final</a>
            </td>
        </tr>
        <tr class="winner">
            <td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td>
            <td class="right">18</td>
            <td class="right">
            </td>
        </tr>
    </tbody>
</table>
<table class="stats">
    <tbody>
        <tr>
            <td><strong>PassYds</strong></td>
            <td><a href="/players/R/RyanMa00.htm" title="Matt Ryan">Ryan</a>-ATL</td>
            <td class="right">251</td>
        </tr>
        <tr>
            <td><strong>RushYds</strong></td>
            <td><a href="/players/A/AjayJa00.htm" title="Jay Ajayi">Ajayi</a>-PHI</td>
            <td class="right">62</td>
        </tr>
        <tr>
            <td><strong>RecYds</strong></td>
            <td><a href="/players/J/JoneJu02.htm" title="Julio Jones">Jones</a>-ATL</td>
            <td class="right">169</td>
        </tr>
    </tbody>
</table>
</div>

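I get an error saying that the ResultSet object has no attribute "td". Any help would be appreciated.

Answer

Watch out for tie games. I think that is what causes your error: in a tie there is no winner, so you will not find a row with the winner class. The following code prints the date and the winner.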

for game in games:
    date_block = game.find('tr',{'class':'date'})
    date_val = date_block.text
    winner_block = game.find('tr',{'class':'winner'})
    if winner_block:
        winner = winner_block.find('a').text
        print(date_val)
        print(winner)
    loser = game.findAll('tr',{'class':'loser'})

Output:

Sep 6, 2018
Philadelphia Eagles
Sep 9, 2018
New England Patriots
Sep 9, 2018
Tampa Bay Buccaneers
Sep 9, 2018
Minnesota Vikings
Sep 9, 2018
Miami Dolphins
Sep 9, 2018
Cincinnati Bengals
Sep 9, 2018
Baltimore Ravens
Sep 9, 2018
Jacksonville Jaguars
Sep 9, 2018
Kansas City Chiefs
Sep 9, 2018
Denver Broncos
Sep 9, 2018
Washington Redskins
Sep 9, 2018
Carolina Panthers
Sep 9, 2018
Green Bay Packers
Sep 10, 2018
New York Jets
Sep 10, 2018
Los Angeles Rams
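
The "ResultSet object has no attribute" message from the question can be reproduced offline. The sketch below is not part of the original answer; it assumes a trimmed, inlined copy of the question's HTML (the `SAMPLE_HTML` name is made up for illustration) and shows that `find_all` returns a list-like ResultSet, while `find` returns a single Tag (or `None`) on which `.td` works.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, inlined so no network access is needed.
# SAMPLE_HTML is a hypothetical name used only for this illustration.
SAMPLE_HTML = """
<div class="game_summary expanded nohover">
<table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td><td class="right">12</td></tr>
<tr class="winner"><td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td><td class="right">18</td></tr>
</tbody></table>
</div>
"""

game = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a ResultSet, i.e. a list of Tags; it has no .td attribute.
rows = game.find_all("tr", {"class": "winner"})
try:
    rows.td
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Indexing the ResultSet gives a Tag, on which .td and .a work.
first_winner = rows[0].td.a.text

# find returns the first matching Tag, or None when nothing matches (e.g. a tie).
winner_row = game.find("tr", {"class": "winner"})
winner_name = winner_row.td.a.text if winner_row is not None else None

print(error_message)
print(winner_name)
```

Either `rows[0]` or `find` avoids the error; `find` also makes the "no winner row" case easy to test with `is not None`.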
Another answer

Your code looks essentially correct.

import bs4

# html holds the HTML snippet from the question
html = ''' ... '''
soup = bs4.BeautifulSoup(html, 'lxml')  # or 'html.parser', either works
print([elem.text for elem in soup.find_all('tr', {'class': 'loser'})])
# ['\nAtlanta Falcons\n12\n\nFinal\n\n']

What exactly is going wrong?

Another answer

You can anchor your search at the "game_summaries" div:

import requests, json
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.pro-football-reference.com/boxscores/').text, 'html.parser')
def get_data(_soup_obj, _headers):
  _d = [(lambda x:[c.text for c in x.find_all('td')] if x is not None else [])(_soup_obj.find(a, {'class':b})) for a, b in _headers]
  if all(_d):
    [date], [t1, val, _], [t2, val2, _] = _d
    return {'date':date, 'winner':{'team':t1, 'score':int(val)}, 'loser':{'team':t2, 'score':int(val2)}}
  return {}

headers = [['tr', 'date'], ['tr', 'winner'], ['tr', 'loser']]
games = [get_data(i, headers) for i in d.find('div', {'class':'game_summaries'}).find_all('div', {'class':'game_summary'})]
print(json.dumps(games, indent=4))

Output:

[
  {
    "date": "Sep 6, 2018",
    "winner": {
        "team": "Philadelphia Eagles",
        "score": 18
    },
    "loser": {
        "team": "Atlanta Falcons",
        "score": 12
    }
 },
  {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "New England Patriots",
        "score": 27
    },
    "loser": {
        "team": "Houston Texans",
        "score": 20
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Tampa Bay Buccaneers",
        "score": 48
    },
    "loser": {
        "team": "New Orleans Saints",
        "score": 40
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Minnesota Vikings",
        "score": 24
    },
    "loser": {
        "team": "San Francisco 49ers",
        "score": 16
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Miami Dolphins",
        "score": 27
    },
    "loser": {
        "team": "Tennessee Titans",
        "score": 20
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Cincinnati Bengals",
        "score": 34
    },
    "loser": {
        "team": "Indianapolis Colts",
        "score": 23
    }
},
{},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Baltimore Ravens",
        "score": 47
    },
    "loser": {
        "team": "Buffalo Bills",
        "score": 3
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Jacksonville Jaguars",
        "score": 20
    },
    "loser": {
        "team": "New York Giants",
        "score": 15
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Kansas City Chiefs",
        "score": 38
    },
    "loser": {
        "team": "Los Angeles Chargers",
        "score": 28
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Denver Broncos",
        "score": 27
    },
    "loser": {
        "team": "Seattle Seahawks",
        "score": 24
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Washington Redskins",
        "score": 24
    },
    "loser": {
        "team": "Arizona Cardinals",
        "score": 6
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Carolina Panthers",
        "score": 16
    },
    "loser": {
        "team": "Dallas Cowboys",
        "score": 8
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Green Bay Packers",
        "score": 24
    },
    "loser": {
        "team": "Chicago Bears",
        "score": 23
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "New York Jets",
        "score": 48
    },
    "loser": {
        "team": "Detroit Lions",
        "score": 17
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "Los Angeles Rams",
        "score": 33
    },
    "loser": {
        "team": "Oakland Raiders",
        "score": 13
     }
  }
]
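Another answer

You may be running into the fact that there was a tie this week. The Pittsburgh/Cleveland game has no winner row, so there is no winner td to read. Running this should output all games, including ties: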

for game in games:
    date_block = game.findAll("tr", {"class": "date"})
    date_val = date_block[0].text
    print("Game Date: %s" % date_val)
    # Test if a winner is defined
    if game.find("tr", {"class": "winner"}) is not None:
        winner_block = game.findAll("tr", {"class": "winner"})
        # Get the winner from the first td and print text only
        winner = winner_block[0].findAll("td")
        print("Winner: %s" % winner[0].get_text())

        loser_block = game.findAll("tr", {"class": "loser"})
        # Get the loser from the first td and print text only
        loser = loser_block[0].findAll("td")
        print("Loser: %s" % loser[0].get_text())
    else:
        # If no winner is listed, it must be a tie. Get both teams and print them.
        print("It's a tie!")
        draw_block = game.findAll("tr", {"class": "draw"})
        for team in draw_block:
            print("Draw: %s" % team.findAll("td")[0].get_text())
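
The same tie handling can be sketched as a function that returns a dict per game instead of printing, which is easier to test. This is only an illustration: the `summarize` and `team_names` helpers and the `SAMPLE_HTML` markup are hypothetical. The tie block in particular is hand-written to mirror the question's markup, using the "draw" row class mentioned in the answer above, and was not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one decided game (from the question) and one tie.
# The tie markup is an assumption modeled on the "draw" class used above.
SAMPLE_HTML = """
<div class="game_summaries">
<div class="game_summary expanded nohover"><table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td><td class="right">12</td></tr>
<tr class="winner"><td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td><td class="right">18</td></tr>
</tbody></table></div>
<div class="game_summary expanded nohover"><table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 9, 2018</td></tr>
<tr class="draw"><td><a>Pittsburgh Steelers</a></td></tr>
<tr class="draw"><td><a>Cleveland Browns</a></td></tr>
</tbody></table></div>
</div>
"""


def team_names(game, css_class):
    """Return the first-cell text of every row in `game` with the given class."""
    rows = game.find_all("tr", class_=css_class)
    return [row.td.get_text(strip=True) for row in rows if row.td is not None]


def summarize(game):
    """Build a dict for one game_summary div, falling back to 'draw' rows."""
    date_row = game.find("tr", class_="date")
    date = date_row.get_text(strip=True) if date_row is not None else None
    winners = team_names(game, "winner")
    losers = team_names(game, "loser")
    if winners and losers:
        return {"date": date, "winner": winners[0], "loser": losers[0]}
    # No winner/loser rows: treat the game as a tie and list both teams.
    return {"date": date, "draw": team_names(game, "draw")}


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [summarize(g) for g in soup.find_all("div", class_="game_summary")]
for result in results:
    print(result)
```

Against the live page, `soup` would be built from the downloaded HTML instead of `SAMPLE_HTML`; the class names there may differ from this sample.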
