Use Extruct to get node value items in JSON-LD format
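Posted: 2021-12-31 16:20:54
Question: The code below runs without errors, but it does not return the elements I want. When I loop over the list of data items, the items are there, but I don't understand why my SportsEvent loop leaves the away team, home team, stadium, and start date blank. The links here have no second page, so you can remove Selenium and the get_next_page function if you don't have them installed for testing.
The problem is in this line:
if "SportsEvent" in item:
Here is the whole script: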
import pandas as pd
import extruct as ex
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
urls = [
    'https://www.oddsshark.com/nfl/odds',
    'https://www.oddsshark.com/nba/odds'
]

def get_driver():
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver

def get_source(driver, url):
    driver.get(url)
    return driver.page_source

def get_json(source):
    return ex.extract(source, syntaxes=['json-ld'])

def get_next_page(driver, source):
    """In the event teams are on more than 1 page, parse the page source and
    return the URL for the next page of results.

    :param driver: Selenium webdriver
    :param source: Page source code from Selenium
    :return
    URL of next paginated page
    """
    elements = driver.find_elements_by_xpath('//link[@rel="next"]')
    if elements:
        return driver.find_element_by_xpath('//link[@rel="next"]').get_attribute('href')
    else:
        return ''

df = pd.DataFrame(columns = ['awayTeam', 'homeTeam','location','startDate'])

def save_teams(data, df):
    """Scrape the teams from a schema.org JSON-LD tag and save the contents in
    the df Pandas dataframe.

    :param data: JSON-LD source containing schema.org SportsEvent markup
    :param df: Name of Pandas dataframe to which to append SportsEvent
    :return
    df with teams appended
    """
    for item in data['json-ld']:
        print(item)
        if "SportsEvent" in item: #issue is here it does not see SportsEvent in item so it wont continue doing the inner loops
            for SportsEvent in item['SportsEvent']:
                #print(item['SportsEvent'])
                row = {
                    'awayTeam': SportsEvent.get('awayTeam', {}).get('name'),
                    'homeTeam': SportsEvent.get('homeTeam', {}).get('name'),
                    'location': SportsEvent.get('location', {}).get('name'),
                    'startDate': SportsEvent.get('startDate')
                }
                print(row)
                df = df.append(row, ignore_index=True)
    return df

for url in urls:
    print(url)
    # Save the teams from the first page
    driver = get_driver()
    source = get_source(driver, url)
    json = get_json(source)
    df = save_teams(json, df)
    # Get teams on each paginated page if other pages exist
    next_page = get_next_page(driver, source)
    paginated_urls = []
    paginated_urls.append(next_page)
    if paginated_urls:
        for url in paginated_urls:
            if url:
                #print(next_page)
                driver = get_driver()
                source = get_source(driver, url)
                json = get_json(source)
                df = save_teams(json, df)
                next_page = get_next_page(driver, source)
                paginated_urls.append(next_page)
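Comments:
Are you just looking for the away team, home team, date, and location?
Yes, chitown88 is correct. I chose extruct because the other sports on that site, basketball and baseball, use the same layout, so the code would also be useful for pulling the other sports.
Have a look at my solution below (particularly the last part, EXTRA:). It might be a better approach.
Thanks for the great answer. Selenium does work: if there are more pages, it gets the next page and returns the data. You're right that it isn't needed in this example; I left it in because I didn't want to modify the code and break something. May I ask what application you use to display the key, type, and value? It looks useful.
Well, in theory, the data should all be there (even if it is spread over multiple pages). Is there an example where the data is on multiple pages?

Answer 1:
That is because your item has no key "SportsEvent". It is a value stored under the key '@type'.
So you need to change the save_teams() function to: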
def save_teams(data, df):
    """Scrape the teams from a schema.org JSON-LD tag and save the contents in
    the df Pandas dataframe.

    :param data: JSON-LD source containing schema.org SportsEvent markup
    :param df: Name of Pandas dataframe to which to append SportsEvent
    :return
    df with teams appended
    """
    for item in data['json-ld']:
        print(item)
        # "SportsEvent" is a value (stored under '@type'), not a key, so test the values
        if "SportsEvent" in item.values():
            row = {
                'awayTeam': item.get('awayTeam', {}).get('name'),
                'homeTeam': item.get('homeTeam', {}).get('name'),
                'location': item.get('location', {}).get('name'),
                'startDate': item.get('startDate')
            }
            print(row)
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df
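
To see why the original check never matched, here is a minimal sketch using a hand-written item shaped like one of the JSON-LD entries (the item dictionary and the is_sports_event helper are illustrative assumptions, not part of the original script). The `in` operator on a dict looks at keys only, so the type has to be read from '@type'. The helper also allows for '@type' being a list, which JSON-LD permits.

```python
# Hypothetical item, shaped like one JSON-LD entry from the page.
item = {
    "@context": "http://schema.org",
    "@type": "SportsEvent",
    "name": "Detroit Lions vs Chicago Bears",
    "awayTeam": {"@type": "SportsTeam", "name": "Chicago Bears"},
    "homeTeam": {"@type": "SportsTeam", "name": "Detroit Lions"},
}


def is_sports_event(entry):
    """Return True if the entry's '@type' is, or contains, 'SportsEvent'."""
    entry_type = entry.get("@type")
    if isinstance(entry_type, list):
        return "SportsEvent" in entry_type
    return entry_type == "SportsEvent"


key_check = "SportsEvent" in item              # membership test on the keys
value_check = "SportsEvent" in item.values()   # membership test on the values
type_check = is_sports_event(item)             # explicit look at '@type'

print(key_check, value_check, type_check)
```

Reading '@type' directly is a little stricter than scanning all values, since it cannot be fooled by some other field that happens to hold the string "SportsEvent".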

But it looks like you may be overcomplicating this with Selenium. You can get that data by pulling it out with BeautifulSoup and then reading it in as JSON. Then let pandas flatten it:
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup
urls = [
    'https://www.oddsshark.com/nfl/odds',
    'https://www.oddsshark.com/nba/odds']

for url in urls:
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    # Grab the first <script type="application/ld+json"> tag and parse its contents
    jsonStr = soup.find('script', {'type': 'application/ld+json'}).text
    jsonData = json.loads(jsonStr)
    # Flatten the nested JSON into dotted column names
    df = pd.json_normalize(jsonData)
    print(df.to_string())
    # or to get just those columns
    #print(df[['awayTeam.name','homeTeam.name','location.name','startDate']])
Output:
@type @context inLanguage name url startDate location.@type location.name location.address.@type location.address.addressLocality awayTeam.@type awayTeam.name homeTeam.@type homeTeam.name
0 SportsEvent http://schema.org en-US Tampa Bay Buccaneers vs New York Giants https://www.oddsshark.com/nfl/new-york-tampa-bay-odds-november-22-2021-1411211 2021-11-22T20:15:00-05:00 Place Raymond James Stadium PostalAddress Raymond James Stadium SportsTeam New York Giants SportsTeam Tampa Bay Buccaneers
1 SportsEvent http://schema.org en-US Detroit Lions vs Chicago Bears https://www.oddsshark.com/nfl/chicago-detroit-odds-november-25-2021-1411216 2021-11-25T12:30:00-05:00 Place Ford Field PostalAddress Ford Field SportsTeam Chicago Bears SportsTeam Detroit Lions
2 SportsEvent http://schema.org en-US Dallas Cowboys vs Las Vegas Raiders https://www.oddsshark.com/nfl/las-vegas-dallas-odds-november-25-2021-1411221 2021-11-25T16:30:00-05:00 Place AT&T Stadium PostalAddress AT&T Stadium SportsTeam Las Vegas Raiders SportsTeam Dallas Cowboys
3 SportsEvent http://schema.org en-US New Orleans Saints vs Buffalo Bills https://www.oddsshark.com/nfl/buffalo-new-orleans-odds-november-25-2021-1411226 2021-11-25T20:20:00-05:00 Place Caesars Superdome PostalAddress Caesars Superdome SportsTeam Buffalo Bills SportsTeam New Orleans Saints
4 SportsEvent http://schema.org en-US Houston Texans vs New York Jets https://www.oddsshark.com/nfl/new-york-houston-odds-november-28-2021-1411231 2021-11-28T13:00:00-05:00 Place NRG Stadium PostalAddress NRG Stadium SportsTeam New York Jets SportsTeam Houston Texans
5 SportsEvent http://schema.org en-US Indianapolis Colts vs Tampa Bay Buccaneers https://www.oddsshark.com/nfl/tampa-bay-indianapolis-odds-november-28-2021-1411236 2021-11-28T13:00:00-05:00 Place Lucas Oil Stadium PostalAddress Lucas Oil Stadium SportsTeam Tampa Bay Buccaneers SportsTeam Indianapolis Colts
6 SportsEvent http://schema.org en-US New York Giants vs Philadelphia Eagles https://www.oddsshark.com/nfl/philadelphia-new-york-odds-november-28-2021-1411241 2021-11-28T13:00:00-05:00 Place MetLife Stadium PostalAddress MetLife Stadium SportsTeam Philadelphia Eagles SportsTeam New York Giants
7 SportsEvent http://schema.org en-US Miami Dolphins vs Carolina Panthers https://www.oddsshark.com/nfl/carolina-miami-odds-november-28-2021-1411246 2021-11-28T13:00:00-05:00 Place Hard Rock Stadium PostalAddress Hard Rock Stadium SportsTeam Carolina Panthers SportsTeam Miami Dolphins
8 SportsEvent http://schema.org en-US New England Patriots vs Tennessee Titans https://www.oddsshark.com/nfl/tennessee-new-england-odds-november-28-2021-1411251 2021-11-28T13:00:00-05:00 Place Gillette Stadium PostalAddress Gillette Stadium SportsTeam Tennessee Titans SportsTeam New England Patriots
9 SportsEvent http://schema.org en-US Cincinnati Bengals vs Pittsburgh Steelers https://www.oddsshark.com/nfl/pittsburgh-cincinnati-odds-november-28-2021-1411256 2021-11-28T13:00:00-05:00 Place Paul Brown Stadium PostalAddress Paul Brown Stadium SportsTeam Pittsburgh Steelers SportsTeam Cincinnati Bengals
10 SportsEvent http://schema.org en-US Jacksonville Jaguars vs Atlanta Falcons https://www.oddsshark.com/nfl/atlanta-jacksonville-odds-november-28-2021-1411261 2021-11-28T13:00:00-05:00 Place TIAA Bank Field PostalAddress TIAA Bank Field SportsTeam Atlanta Falcons SportsTeam Jacksonville Jaguars
11 SportsEvent http://schema.org en-US Denver Broncos vs Los Angeles Chargers https://www.oddsshark.com/nfl/los-angeles-denver-odds-november-28-2021-1411266 2021-11-28T16:05:00-05:00 Place Empower Field at Mile High PostalAddress Empower Field at Mile High SportsTeam Los Angeles Chargers SportsTeam Denver Broncos
12 SportsEvent http://schema.org en-US San Francisco 49ers vs Minnesota Vikings https://www.oddsshark.com/nfl/minnesota-san-francisco-odds-november-28-2021-1411271 2021-11-28T16:25:00-05:00 Place Levi's Stadium PostalAddress Levi's Stadium SportsTeam Minnesota Vikings SportsTeam San Francisco 49ers
13 SportsEvent http://schema.org en-US Green Bay Packers vs Los Angeles Rams https://www.oddsshark.com/nfl/los-angeles-green-bay-odds-november-28-2021-1411276 2021-11-28T16:25:00-05:00 Place Lambeau Field PostalAddress Lambeau Field SportsTeam Los Angeles Rams SportsTeam Green Bay Packers
14 SportsEvent http://schema.org en-US Baltimore Ravens vs Cleveland Browns https://www.oddsshark.com/nfl/cleveland-baltimore-odds-november-28-2021-1411281 2021-11-28T20:20:00-05:00 Place M&T Bank Stadium PostalAddress M&T Bank Stadium SportsTeam Cleveland Browns SportsTeam Baltimore Ravens
15 SportsEvent http://schema.org en-US Washington Football Team vs Seattle Seahawks https://www.oddsshark.com/nfl/seattle-washington-odds-november-29-2021-1411286 2021-11-29T20:15:00-05:00 Place FedEx Field PostalAddress FedEx Field SportsTeam Seattle Seahawks SportsTeam Washington Football Team
@type @context inLanguage name url startDate location.@type location.name location.address.@type location.address.addressLocality awayTeam.@type awayTeam.name homeTeam.@type homeTeam.name
0 SportsEvent http://schema.org en-US Washington Wizards vs Charlotte Hornets https://www.oddsshark.com/nba/charlotte-washington-odds-november-22-2021-1460581 2021-11-22T19:00:00-05:00 Place Capital One Arena PostalAddress Capital One Arena SportsTeam Charlotte Hornets SportsTeam Washington Wizards
1 SportsEvent http://schema.org en-US Cleveland Cavaliers vs Brooklyn Nets https://www.oddsshark.com/nba/brooklyn-cleveland-odds-november-22-2021-1460586 2021-11-22T19:00:00-05:00 Place Rocket Mortgage FieldHouse PostalAddress Rocket Mortgage FieldHouse SportsTeam Brooklyn Nets SportsTeam Cleveland Cavaliers
2 SportsEvent http://schema.org en-US Boston Celtics vs Houston Rockets https://www.oddsshark.com/nba/houston-boston-odds-november-22-2021-1460591 2021-11-22T19:30:00-05:00 Place TD Garden PostalAddress TD Garden SportsTeam Houston Rockets SportsTeam Boston Celtics
3 SportsEvent http://schema.org en-US Atlanta Hawks vs Oklahoma City Thunder https://www.oddsshark.com/nba/oklahoma-city-atlanta-odds-november-22-2021-1460596 2021-11-22T19:30:00-05:00 Place State Farm Arena PostalAddress State Farm Arena SportsTeam Oklahoma City Thunder SportsTeam Atlanta Hawks
4 SportsEvent http://schema.org en-US Chicago Bulls vs Indiana Pacers https://www.oddsshark.com/nba/indiana-chicago-odds-november-22-2021-1460601 2021-11-22T20:00:00-05:00 Place United Center PostalAddress United Center SportsTeam Indiana Pacers SportsTeam Chicago Bulls
5 SportsEvent http://schema.org en-US Milwaukee Bucks vs Orlando Magic https://www.oddsshark.com/nba/orlando-milwaukee-odds-november-22-2021-1460606 2021-11-22T20:00:00-05:00 Place Fiserv Forum PostalAddress Fiserv Forum SportsTeam Orlando Magic SportsTeam Milwaukee Bucks
6 SportsEvent http://schema.org en-US New Orleans Pelicans vs Minnesota Timberwolves https://www.oddsshark.com/nba/minnesota-new-orleans-odds-november-22-2021-1460611 2021-11-22T20:00:00-05:00 Place Smoothie King Center PostalAddress Smoothie King Center SportsTeam Minnesota Timberwolves SportsTeam New Orleans Pelicans
7 SportsEvent http://schema.org en-US San Antonio Spurs vs Phoenix Suns https://www.oddsshark.com/nba/phoenix-san-antonio-odds-november-22-2021-1460616 2021-11-22T20:30:00-05:00 Place AT&T Center PostalAddress AT&T Center SportsTeam Phoenix Suns SportsTeam San Antonio Spurs
8 SportsEvent http://schema.org en-US Utah Jazz vs Memphis Grizzlies https://www.oddsshark.com/nba/memphis-utah-odds-november-22-2021-1460621 2021-11-22T21:00:00-05:00 Place Vivint Arena PostalAddress Vivint Arena SportsTeam Memphis Grizzlies SportsTeam Utah Jazz
9 SportsEvent http://schema.org en-US Sacramento Kings vs Philadelphia 76ers https://www.oddsshark.com/nba/philadelphia-sacramento-odds-november-22-2021-1460626 2021-11-22T22:00:00-05:00 Place Golden 1 Center PostalAddress Golden 1 Center SportsTeam Philadelphia 76ers SportsTeam Sacramento Kings
10 SportsEvent http://schema.org en-US Detroit Pistons vs Miami Heat https://www.oddsshark.com/nba/miami-detroit-odds-november-23-2021-1460631 2021-11-23T19:00:00-05:00 Place Little Caesars Arena PostalAddress Little Caesars Arena SportsTeam Miami Heat SportsTeam Detroit Pistons
11 SportsEvent http://schema.org en-US New York Knicks vs Los Angeles Lakers https://www.oddsshark.com/nba/los-angeles-new-york-odds-november-23-2021-1460636 2021-11-23T19:30:00-05:00 Place Madison Square Garden PostalAddress Madison Square Garden SportsTeam Los Angeles Lakers SportsTeam New York Knicks
12 SportsEvent http://schema.org en-US Portland Trail Blazers vs Denver Nuggets https://www.oddsshark.com/nba/denver-portland-odds-november-23-2021-1460641 2021-11-23T22:00:00-05:00 Place Moda Center at the Rose Quarter PostalAddress Moda Center at the Rose Quarter SportsTeam Denver Nuggets SportsTeam Portland Trail Blazers
13 SportsEvent http://schema.org en-US Los Angeles Clippers vs Dallas Mavericks https://www.oddsshark.com/nba/dallas-los-angeles-odds-november-23-2021-1460646 2021-11-23T22:30:00-05:00 Place Staples Center PostalAddress Staples Center SportsTeam Dallas Mavericks SportsTeam Los Angeles Clippers
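
The dotted column names in the output above come from pd.json_normalize, which joins the keys of nested dictionaries with a "." separator. The sketch below shows this offline on a single hand-typed record; the record copies values from one row of the output and is an assumption about the shape of the page's JSON-LD, not a live fetch.

```python
import pandas as pd

# One record typed in by hand, shaped like the page's JSON-LD (assumed shape).
records = [
    {
        "@type": "SportsEvent",
        "name": "Detroit Lions vs Chicago Bears",
        "startDate": "2021-11-25T12:30:00-05:00",
        "location": {
            "@type": "Place",
            "name": "Ford Field",
            "address": {"@type": "PostalAddress", "addressLocality": "Ford Field"},
        },
        "awayTeam": {"@type": "SportsTeam", "name": "Chicago Bears"},
        "homeTeam": {"@type": "SportsTeam", "name": "Detroit Lions"},
    }
]

# Nested dictionaries become columns such as 'location.address.addressLocality'.
df = pd.json_normalize(records)

columns = ["awayTeam.name", "homeTeam.name", "location.name", "startDate"]
subset = df[columns]
print(subset.to_string())
```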

EXTRA:
I had never used extruct before. I like it! Thanks for introducing it to me. Here is a solution:
import pandas as pd
import extruct as ex
import requests
urls = [
'https://www.oddsshark.com/nfl/odds',
'https://www.oddsshark.com/nba/odds']

for url in urls:
    response = requests.get(url).text
    jsonData = ex.extract(response, syntaxes=['json-ld'])['json-ld']
    df = pd.json_normalize(jsonData)
    df = df[df['@type'] == 'SportsEvent']
    print(df.to_string())
    # or to get just those columns
    #print(df[['awayTeam.name','homeTeam.name','location.name','startDate']])
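
The '@type' filter matters once a page carries JSON-LD entries other than games. The following sketch runs offline on hand-written data: the WebSite entry and the event without a location are hypothetical, added only to show the filter and a column selection that tolerates missing fields. Using reindex instead of df[[...]] is a design choice for this sketch, since it fills absent columns with NaN rather than raising a KeyError.

```python
import pandas as pd

# Hand-written stand-in for ex.extract(...)['json-ld'] (hypothetical data).
json_ld = [
    {"@type": "WebSite", "name": "Example site", "url": "https://example.com"},
    {
        "@type": "SportsEvent",
        "startDate": "2021-11-22T19:30:00-05:00",
        "location": {"@type": "Place", "name": "TD Garden"},
        "awayTeam": {"@type": "SportsTeam", "name": "Houston Rockets"},
        "homeTeam": {"@type": "SportsTeam", "name": "Boston Celtics"},
    },
    {
        # Location deliberately left out to show how missing fields behave.
        "@type": "SportsEvent",
        "startDate": "2021-11-22T20:00:00-05:00",
        "awayTeam": {"@type": "SportsTeam", "name": "Indiana Pacers"},
        "homeTeam": {"@type": "SportsTeam", "name": "Chicago Bulls"},
    },
]

wanted = ["awayTeam.name", "homeTeam.name", "location.name", "startDate"]

df = pd.json_normalize(json_ld)
# Keep only the rows whose '@type' is SportsEvent.
events = df[df["@type"] == "SportsEvent"]
# Select the wanted columns; any that are absent come back filled with NaN.
result = events.reindex(columns=wanted).reset_index(drop=True)
print(result.to_string())
```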