Scraping data from a page with links using selenium or beautifulsoup on python, no class, no id

Posted: 2021-01-01 14:15:12

Question:

I would like to know how to scrape this website: https://1997-2001.state.gov/briefings/statements/2000/2000_index.html

It only contains 'a' tags with 'href' attributes, no classes or ids, and its structure is very simple. I want to write a script that scrapes the content behind every link on the page.

I have already tried this code with chromedriver, but it only prints the list of links (I'm a web-scraping amateur). Any help would be great.

    >>> elems = driver.find_elements_by_xpath("//a[@href]")
    >>> for elem in elems:
    ...     print(elem.get_attribute("href"))

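For readers without a browser driver handy, the href-collection step itself needs nothing beyond Python's standard library. The sketch below uses `html.parser.HTMLParser` on an invented markup sample that mimics the index page's bare-`<a>` structure (the sample HTML and the `LinkCollector` name are illustrative, not taken from the real site):

```python
from html.parser import HTMLParser

# Invented sample mimicking the index page: bare <a href> tags, no class, no id.
SAMPLE = """
<html><body>
<a href="ps001227.html">Statement, December 27</a>
<a href="ps001226.html">Media Note, December 26</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

parser = LinkCollector()
parser.feed(SAMPLE)
print(parser.links)  # → ['ps001227.html', 'ps001226.html']
```

The same idea scales to the real page: feed it the downloaded HTML instead of `SAMPLE`, then fetch each collected href in a second pass.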
Comments:

Solution 1:

I hope I understood your question correctly: this script goes through every link, opens it, and prints the document it contains:

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[]').get_text(strip=True, separator='\n')
    print( t.split('[end of document]')[0] )
    print('-' * 80)

Prints:

https://1997-2001.state.gov/briefings/statements/2000/ps001227.html
Statement by Philip T. Reeker, Deputy Spokesman
December 27, 2000
China - LUOYANG Fire
We were saddened to learn of the terrible fire that killed hundreds of people in the Chinese city of Luoyang.  The United States offers its sincerest condolences to the families of the victims of the tragic December 25 blaze.  We also offer our best wishes for a speedy recovery to the survivors.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001226.html
Media Note
December 26, 2000
Renewal of the Secretary of State's Advisory Committee
on
Private International Law
The Department of State has renewed the Charter of the Secretary of State's Advisory Committee on Private International Law (ACPIL), effective as of November 20, 2000.   The Under Secretary for Management has determined that ACPIL is necessary and in the public interest.
ACPIL enables the Department to obtain the expert and considered view of the private sector organizations and interests most knowledgeable of, as well as most affected by, international activities to unify private law.  The committee consists of members from private sector organizations, bar associations, national legal organizations, and federal and state government agency and judicial interests concerned with private international law.  ACPIL will follow the procedures prescribed by the Federal Advisory Committee Act (FACA) (Public Law 92-463).  Meetings will be open to the public unless a determination is made in accordance with Section 10(d) of the FACA, 5 U.S.C. 552b(c)(1) and (4), that a meeting or a portion of the meeting should be closed to the public.
Any questions concerning this committee should be referred to the Executive Director, Harold Burman, at 202-776-8420.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001225.html
Statement by Philip T. Reeker, Deputy Spokesman
December 25, 2000
Parliamentary Elections in Serbia
The United States congratulates the Democratic Opposition of Serbia on their victory in Saturday's election for the Serbia parliament. Official results indicate that the United Democratic Opposition (DOS) won with 64 percent of the vote to just 13 percent for the Socialist Party.
We also congratulate the Serbian people for their widespread participation in what international observers have stated was a free and fair election.  This is the first time the Serbian people have had a free and fair election in over a decade. As such, it is an important milestone in the ongoing democratic transition that began with Milosevic's defeat in September's federal presidential elections. The Democratic Opposition is now in a stronger position to carry out the reforms needed to fully integrate Serbia into the international community.
We look forward to working with the new Serbian government in the same amicable and cooperative spirit we now enjoy with the federal Yugoslav government.

--------------------------------------------------------------------------------

...and so on.
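As an aside, the script above builds each page URL by concatenating a hard-coded base string with the relative href. The standard library's `urllib.parse.urljoin` does the same job while also passing absolute links through unchanged, which is handy when an index mixes relative and absolute hrefs (a sketch, not part of the original answer):

```python
from urllib.parse import urljoin

# Resolve hrefs against the index page's own URL.
base = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'

# A relative href resolves into the index page's directory.
print(urljoin(base, 'ps001227.html'))
# → https://1997-2001.state.gov/briefings/statements/2000/ps001227.html

# An absolute href is returned as-is.
print(urljoin(base, 'https://example.org/other.html'))
# → https://example.org/other.html
```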

Edit: corrected code:

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[], td[], table[]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
    print( t.split('[end of document]')[0] )
    print('-' * 80)

Edit 2 (for the 1999 index):

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/1999/1999_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[] img + a'):
    if 'http' not in a['href']:
        u = 'https://1997-2001.state.gov/briefings/statements/1999/' + a['href']
    else:
        u = a['href']
    
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    tag = s.select_one('td[], td[], table[]:has(td[colspan="2"]), blockquote')
    if tag:
        t = tag.get_text(strip=True, separator='\n')
        print( t.split('[end of document]')[0] )
    print('-' * 80)
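Long sequential crawls like the scripts above can also stall on transient network errors rather than on parsing problems. A small retry wrapper is one way to harden them; the sketch below is not from the original answer, and the helper name, exception choice, and retry policy are all assumptions, demonstrated here with a fake fetch function instead of real network calls:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on OSError up to `attempts` times.

    Hypothetical helper: the signature and defaults are illustrative.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(delay)

# Demo with a fake fetcher that fails twice, then succeeds.
calls = {'n': 0}

def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError('temporary network error')
    return f'contents of {url}'

print(fetch_with_retries(flaky_fetch, 'ps001227.html', delay=0.01))
# → contents of ps001227.html
```

In the real scripts, `fetch` would be something like `lambda u: requests.get(u).content`, with a modest `delay` to stay polite to the server.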

Comments:

Thanks, this was extremely useful! One question: the code stops after the mid-November links instead of collecting from every link on the January-December page. Is there a way to include the content of all the links?

Sorry for my confusion, but did you modify the code to work with that specific page? I ask because I hope to apply the code to similarly structured pages, and when I use it to try other links (e.g. '1997-2001.state.gov/briefings/statements/1999/…'), the code doesn't work.

@Sarah "The code doesn't work" - why? What error?

"AttributeError: 'NoneType' object has no attribute 'get_text'" for the link in my comment; otherwise it simply won't run the commands at all for the 1998 or 1997 links.

Solution 2:

I'm not a web-scraping expert, but you can use BeautifulSoup:

import urllib.request

from bs4 import BeautifulSoup

url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
sauce = urllib.request.urlopen(url).read()
soup = BeautifulSoup(sauce, 'lxml')

for a in soup.find_all('a'):
    if a.has_attr('href'):
        print(a['href'])
    else:
        print("a element doesn't have href")

Comments:
