BeautifulSoup 4 - Web scraping soccer matches for 'today'
【Title】: BeautifulSoup 4 - Web scraping soccer matches for 'today'
【Posted】: 2022-01-21 01:39:44
【Question】: I am very new to Python and am trying to scrape the soccer matches for 'today' from the Fox Sports website https://www.foxsports.com/scores/soccer. Unfortunately, I keep running into the error
AttributeError: 'NoneType' object has no attribute 'find_all'
and cannot seem to find the teams for the day. Here is what I have so far:
import bs4
import requests
res = requests.get('foxsports.com/scores/soccer')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = soup.find("div", class_="scores-date")
games = results.find("div", class_="scores")
print(games)
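【Comments】:
You should check whether res.text actually contains the site's content. That may be the problem here.
【Answer 1】:
What happens?
The content is not static; the website loads it dynamically, so requests does not receive the information you can see in the browser's dev tools.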
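One way to confirm this before reaching for a browser is to check whether the class you are searching for appears in the raw HTML at all. The following is a minimal sketch that uses only the standard library; the two HTML strings are made-up stand-ins for res.text, and count_class is a hypothetical helper, not part of requests or bs4.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many tags in `html` carry `css_class`."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets standing in for res.text
static_html = '<div class="scores-date"><div class="scores">...</div></div>'
shell_html = '<div id="app"></div><script src="bundle.js"></script>'

print(count_class(static_html, "scores"))  # class present in the raw HTML
print(count_class(shell_html, "scores"))   # class absent: content is added later by JavaScript
```

If the count is zero for the HTML that requests returned, soup.find() will return None for that class, which is what leads to the AttributeError on the next call.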
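How to fix it?
Use the API the site provides, or use selenium, which handles the content like a browser and can give you the page_source you are looking for.
Because not all of the content is available right away, you have to use selenium waits to detect the presence of the <span> with the class "title-text".
Example
Note: The example uses selenium 4, so check your version and either update it or adapt the required dependencies to the lower version.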
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
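# Point selenium at your local chromedriver binary
service = ChromeService(executable_path='ENTER YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://www.foxsports.com/scores/soccer')
# Wait up to 10 seconds until the "Today" header has been rendered
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[contains(@class, "title-text") and text() = "Today"]')))
# Parse the rendered page (the 'lxml' parser has to be installed)
soup = BeautifulSoup(driver.page_source, 'lxml')
# Close the browser once the page source has been captured
driver.quit()
# Select the score chips in the block that directly follows a date header without nested divs
for g in soup.select('.scores-date:not(:has(div)) + div .score-chip-content'):
    print(list(g.stripped_strings))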
Output
['SERIE A', 'JUVENTUS', '9-4-5', 'JUV', '9-4-5', 'CAGLIARI', '1-7-10', 'CAG', '1-7-10', '8:45PM', 'Paramount+', 'JUV -455', 'CAG +1100']
['LG CUP', 'ARSENAL', '0-0-0', 'ARS', '0-0-0', 'SUNDERLAND', '0-0-0', 'SUN', '0-0-0', '8:45PM', 'ARS -454', 'SUN +1243']
['LA LIGA', 'SEVILLA', '11-4-2', 'SEV', '11-4-2', 'BARCELONA', '7-6-4', 'BAR', '7-6-4', '9:30PM', 'ESPN+', 'SEV +155', 'BAR +180']
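Each row above is a flat list, so a small helper can give the fields names. This is only a sketch: parse_chip is a hypothetical function, and the positional layout (league, then name, record, abbreviation, record for each team, then the kickoff time, then optional extras) is an assumption read off the three sample rows, so it may not hold for every match on the page.

```python
def parse_chip(strings):
    """Turn one list of stripped strings into a small dict.

    Assumed layout (taken from the sample output only):
    league, then for each team: name, record, abbreviation, record,
    then the kickoff time, then optional extras (broadcaster, odds).
    """
    if len(strings) < 10:
        raise ValueError(f"unexpected chip layout: {strings!r}")
    return {
        "league": strings[0],
        "teams": [
            {"name": strings[1], "record": strings[2], "abbr": strings[3]},
            {"name": strings[5], "record": strings[6], "abbr": strings[7]},
        ],
        "kickoff": strings[9],
        "extras": strings[10:],  # broadcaster and/or odds, when present
    }


# Second row of the output above
row = ['LG CUP', 'ARSENAL', '0-0-0', 'ARS', '0-0-0',
       'SUNDERLAND', '0-0-0', 'SUN', '0-0-0', '8:45PM',
       'ARS -454', 'SUN +1243']
match = parse_chip(row)
print(match["league"], match["teams"][0]["name"], "vs",
      match["teams"][1]["name"], match["kickoff"])
```

Raising on short lists is a deliberate choice here: if the site changes its markup, a loud error is easier to notice than silently mislabeled fields.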
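【Discussion】:
This is great. Would you mind walking through some of the steps you took? I am familiar with selenium, but not with the imports you used.
I added some additional links to the import dependencies for selenium 4. Happy to help. If this answer or any other answer solved your problem, please mark it as accepted. Thanks.
【Answer 2】: You have to include the protocol (https://) in the link. This code works: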
import bs4
import requests
res = requests.get('https://foxsports.com/scores/soccer')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = soup.find("div", class_="scores-date")
games = results.find("div", class_="scores")
print(results)
print(games)
However, games is None, because bs4 cannot find any div with the class scores inside results.
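The point about the protocol can be guarded against in code: requests refuses a URL that has no scheme and raises requests.exceptions.MissingSchema. Below is a minimal sketch using only the standard library; ensure_scheme is a hypothetical helper name, and defaulting to https is an assumption that suits this site.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default="https"):
    """Prefix `url` with a scheme if it does not already use http or https."""
    if urlsplit(url).scheme in ("http", "https"):
        return url
    # Drop leading slashes so '//host/path' does not become 'https:////host/path'
    return f"{default}://{url.lstrip('/')}"


print(ensure_scheme('foxsports.com/scores/soccer'))
print(ensure_scheme('https://foxsports.com/scores/soccer'))
```

Both calls print the same https URL, which can then be passed to requests.get().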
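【Discussion】:
There are elements inside those tags, but for some reason I cannot extract the text from them.
【Answer 3】: It is much more efficient to go through the API. All of the data is there, and more (I only pull out the scores and print them). You have to visit the site first to get the apikey, which is used as a parameter.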
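The code in this answer pulls the key out of the data-scoreboard attribute with split('apikey='). As a slightly more defensive alternative, the query string can be parsed instead, which still works if other parameters follow the key. This is a sketch only: extract_apikey is a hypothetical helper, and the URL and key below are made up, since the real attribute value is not shown in this thread.

```python
from urllib.parse import urlsplit, parse_qs


def extract_apikey(scoreboard_url):
    """Return the value of the apikey query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(scoreboard_url).query)
    values = query.get("apikey")
    return values[0] if values else None


# Made-up URL and key, for illustration only
example = ("https://api.foxsports.com/bifrost/v1/soccer/scoreboard/main"
           "?apikey=EXAMPLE_KEY_123&other=1")
print(extract_apikey(example))
```

With the plain split, the trailing &other=1 in this example would end up inside the key; parsing the query string keeps the two parameters apart.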
I also added an option to choose the group/league, so you need to pip install choice.
import requests
import datetime
from bs4 import BeautifulSoup
import re
#pip install choice
import choice
# Get the apikey
url = 'https://www.foxsports.com/scores/soccer'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
apikey = soup.find_all('div', {'data-scoreboard':re.compile("^https://")})[0]['data-scoreboard'].split('apikey=')[-1]
# Get the group Ids and corresponding titles
url = 'https://api.foxsports.com/bifrost/v1/soccer/scoreboard/main'
payload = {'apikey':apikey}
jsonData = requests.get(url, params=payload).json()
groupsTitle_list = [ x['title'] for x in jsonData['groupList']]
groupsId_list = [ x['id'] for x in jsonData['groupList']]
groups_dict = dict(zip(groupsTitle_list, groupsId_list))
user_input = choice.Menu(groups_dict.keys()).ask()
groupId = groups_dict[user_input]
# Get the date of the score you are after
date_param = input('Enter date in YYYYMMDD format\nEx: 20220109\n-> ')
# If you prefer to always just grab today's score, use the line below
#date_param = datetime.datetime.now().strftime("%Y%m%d")
# Pull the score for the date and group
url = f'https://api.foxsports.com/bifrost/v1/soccer/scoreboard/segment/c{groupId}d{date_param}'
payload = {
    'apikey':apikey,
    'groupId':groupId}
jsonData = requests.get(url, params=payload).json()
if len(jsonData['sectionList']) == 0:
    print(f'No score available on {date_param} for {user_input}')
else:
    returnDate = jsonData['sectionList'][0]['menuTitle']
    print(f'\n {returnDate} - {user_input}')
    events = jsonData['sectionList'][0]['events']
    for event in events:
        lowerTeamName = event['lowerTeam']['longName']
        lowerTeamScore = event['lowerTeam']['score']
        upperTeamName = event['upperTeam']['longName']
        upperTeamScore = event['upperTeam']['score']
        print(f'\t{upperTeamName} {upperTeamScore}')
        print(f'\t{lowerTeamName} {lowerTeamScore}\n')
Output:
Make a choice:
0: FEATURED MATCHES
1: ENGLISH PREMIER LEAGUE
2: MLS
3: LA LIGA
4: LIGUE 1
5: BUNDESLIGA
6: UEFA CHAMPIONS LEAGUE
7: LIGA MX
8: SERIE A
9: WCQ - CONCACAF
Enter number or name; return for next page
? 0
Enter date in YYYYMMDD format
Ex: 20220109
-> 20220109
SUN, JAN 9 - FEATURED MATCHES
LIVERPOOL 4
SHREWSBURY 1
TOTTENHAM 3
MORECAMBE 1
WOLVES 3
SHEFFIELD UTD 0
WEST HAM 2
LEEDS UNITED 0
NOTTINGHAM 1
ARSENAL 0
ROMA 3
JUVENTUS 4
LYON 1
PARIS SG 1
VILLARREAL 2
ATLÉTICO MADRID 2
GUADALAJARA 3
MAZATLÁN FC 0
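The scores above come in pairs, with the upper team printed first and the lower team second. If one line per match is easier to read, the same fields can be combined as in the sketch below. format_scores is a hypothetical helper, and the sample dictionary only mirrors the keys the answer's code reads (sectionList, menuTitle, events, upperTeam, lowerTeam, longName, score); the real response contains more fields, and whether score arrives as a number or a string is an assumption here.

```python
def format_scores(json_data):
    """Return one printable line per event in the first section of a response."""
    sections = json_data.get("sectionList") or []
    if not sections:
        return []  # nothing scheduled for that date and group
    lines = []
    for event in sections[0].get("events", []):
        upper, lower = event["upperTeam"], event["lowerTeam"]
        lines.append(
            f"{upper['longName']} {upper['score']} - "
            f"{lower['score']} {lower['longName']}"
        )
    return lines


# Sample shaped like the fields used above, filled with the first two results shown
sample = {
    "sectionList": [{
        "menuTitle": "SUN, JAN 9",
        "events": [
            {"upperTeam": {"longName": "LIVERPOOL", "score": 4},
             "lowerTeam": {"longName": "SHREWSBURY", "score": 1}},
            {"upperTeam": {"longName": "TOTTENHAM", "score": 3},
             "lowerTeam": {"longName": "MORECAMBE", "score": 1}},
        ],
    }]
}

for line in format_scores(sample):
    print(line)
```

Returning an empty list for an empty sectionList covers the same case as the "No score available" branch in the answer's code.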