Web Scraping with Python BeautifulSoup

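I'm just a beginner in Python.

I'm trying to scrape data from a website and managed to write the code below.

However, I don't know how to move forward, because I can't get the href attributes that would let me visit each listing and fetch its data. I'm also not very familiar with HTML tags, so I suspect I haven't identified the tags correctly.

Here is my code: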

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org/?p={0}&category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'sabai-directory-title'})
    hrefs = [link['href'] for link in links]

The code above produces hrefs as an empty list. Any help would be greatly appreciated!

Thanks!

Answer

The code is fine; the class you are looking for does not exist on <a> tags on those pages. For example, after inspecting https://directory.singaporefintech.org/hello-world/?category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random, I replaced the sabai-directory-title class with comment-reply-link and got results once I added a print statement.
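One way to check which classes actually sit on which tags is to count them before writing the lookup. This is only a rough sketch: SAMPLE_HTML is made-up markup that imitates the structure described in this thread (not the site's real HTML), and classes_by_tag is a hypothetical helper name.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure discussed in this thread
SAMPLE_HTML = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/silent-eight">Silent Eight</a>
</div>
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/incomlend">Incomlend</a>
</div>
<a class="comment-reply-link" href="#respond">Reply</a>
"""


def classes_by_tag(html, tag_name):
    """Count how often each class name appears on tags with the given name."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(tag_name):
        # A tag without a class attribute contributes nothing
        counts.update(tag.get("class", []))
    return counts


print(classes_by_tag(SAMPLE_HTML, "a"))
print(classes_by_tag(SAMPLE_HTML, "div"))
```

On the sample markup, sabai-directory-title shows up under div and not under a, which would explain why a find_all on a tags with that class comes back empty.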

Another answer

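You can use a CSS selector to scrape the links. The selector div.sabai-directory-title a finds every <a> tag inside a <div> tag with the class sabai-directory-title (I updated the URL; yours gave me an error page):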

from bs4 import BeautifulSoup
import requests
from pprint import pprint

r = requests.get('https://directory.singaporefintech.org/')
soup = BeautifulSoup(r.text, 'lxml')

hrefs = [a['href'] for a in soup.select('div.sabai-directory-title a')]

pprint(hrefs)

This prints:

['https://directory.singaporefintech.org/directory/listing/silent-eight',
 'https://directory.singaporefintech.org/directory/listing/incomlend',
 'https://directory.singaporefintech.org/directory/listing/bizgrow',
 'https://directory.singaporefintech.org/directory/listing/makerscut',
 'https://directory.singaporefintech.org/directory/listing/soho-fintech',
 'https://directory.singaporefintech.org/directory/listing/dxmarkets',
 'https://directory.singaporefintech.org/directory/listing/fundrevo',
 'https://directory.singaporefintech.org/directory/listing/money4money',
 'https://directory.singaporefintech.org/directory/listing/onelyst',
 'https://directory.singaporefintech.org/directory/listing/hearti-lab',
 'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/ceo-1',
 'https://directory.singaporefintech.org/directory/listing/arcadier',
 'https://directory.singaporefintech.org/directory/listing/plmp-fintech-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/cash-in-asia',
 'https://directory.singaporefintech.org/directory/listing/grc-systems',
 'https://directory.singaporefintech.org/directory/listing/sendexpense',
 'https://directory.singaporefintech.org/directory/listing/jinjerjade',
 'https://directory.singaporefintech.org/directory/listing/hatcher',
 'https://directory.singaporefintech.org/directory/listing/fintech-consortium']
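To see the difference between the two lookups without hitting the network, here is a small offline comparison. SAMPLE_HTML is invented markup that assumes the class is on the wrapping <div> (as the selector above implies); the a[href] part is an extra guard I added so anchors without an href are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class is on the <div>, not on the <a>
SAMPLE_HTML = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/silent-eight">Silent Eight</a>
</div>
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/incomlend">Incomlend</a>
  <a name="no-href-here">anchor without href</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The question's lookup: <a> tags that themselves carry the class
on_anchor = soup.find_all("a", attrs={"class": "sabai-directory-title"})

# Descendant selector: <a> tags with an href inside a <div> with the class
in_div = [a["href"] for a in soup.select("div.sabai-directory-title a[href]")]

print(len(on_anchor))
print(in_div)
```

With this markup the first lookup matches nothing, while the descendant selector returns the two listing URLs.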

Another answer

Hi, I made a few changes to the code:

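import requests
from bs4 import BeautifulSoup
from pprint import pprint

# Note: the same front-page URL is appended four times, so every request
# returns the same listings and the results repeat in the output.
urls = []
for i in range(1, 5):
    pages = "https://directory.singaporefintech.org"
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    # The class is on the <div> that wraps each link, not on the <a> itself
    links = soup.find_all('div', attrs={'class': 'sabai-directory-title'})
    for link in links:
        # Keep only anchors that have an href and visible text.
        # (On Python 3 no .encode('ascii') is needed; it would yield bytes.)
        Data.extend([a['href'] for a in link.find_all('a', href=True) if a.text])
pprint(Data)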

Output:

     ['https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/director
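Since the list above contains the same links several times, you may want to drop the repeats before visiting each listing. A minimal sketch, assuming the order of first appearance should be kept; unique_in_order is a hypothetical helper and the pages list is a hand-written stand-in for the scraped results.

```python
def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Hand-written stand-in for the hrefs collected from three identical requests
pages = [
    [
        "https://directory.singaporefintech.org/directory/listing/silent-eight",
        "https://directory.singaporefintech.org/directory/listing/moolahsense",
    ],
] * 3

# Flatten the per-page lists into one list, then remove the duplicates
all_hrefs = [href for page in pages for href in page]
unique_hrefs = unique_in_order(all_hrefs)

print(len(all_hrefs), len(unique_hrefs))
print(unique_hrefs)
```

A set would also remove the duplicates, but it would not keep the links in page order.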
