Python: search Google for websites that end with a specific word

Posted 2023-02-23

【Title】: Python search website in google that end with specific word
【Posted】: 2022-01-22 01:17:47
【Question】:

I am trying to search Google for all websites whose address ends with "gencat.cat".

My code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'gencat.cat'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.a['href'] # or ('.yuRUbf a')['href']
    print(link)

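My output:

The problem is that it returns only a few websites, and the results include URLs that do not contain "gencat.cat" as well as repeated pages from the same site: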

https://web.gencat.cat/ca/inici
https://web.gencat.cat/es/inici/
https://web.gencat.cat/ca/tramits
https://web.gencat.cat/en/inici/index.html
https://govern.cat/
https://govern.cat/salapremsa/
http://www.gencat.es/
http://www.regencos.cat/promocio-variable/preguntes-mes-frequents-sobre-el-coronavirus/
https://tauler.seu.cat/inici.do?idens=1

The output I want:

https://web.gencat.cat
http://agricultura.gencat.cat
http://cultura.gencat.cat
https://dretssocials.gencat.cat
http://economia.gencat.cat

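【Comments】:

Submitting programmatic search queries violates Google's Webmaster Guidelines and terms of service. Running this code against Google may cause Google to show a CAPTCHA for searches coming from your IP address.

Hi @StephenOstermiller, I didn't know that, sorry. But I have a colleague who is stuck doing this as very manual work, and this could help reduce the workload. So... how can I do it while respecting the terms of service?

【Answer 1】: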

If you want the top-level domain, you can split the link in the link variable on every "/".

for result in soup.select('.tF2Cxc'):
    link = result.a['href']  # or ('.yuRUbf a')['href']
    print(link)

    # Splitting on "/" leaves the hostname at index 2
    string_splt = link.split("/")
    TLD = f"https://{string_splt[2]}"

    print(TLD)

I'm sure there is a better way to put the pieces back together, but this seems to work. You will also need to handle duplicates.
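A minimal sketch of the "better way" and the duplicate handling mentioned above, assuming the links have already been collected into a list. It uses the standard library's urllib.parse.urlsplit instead of splitting on "/", keeps the original scheme rather than forcing https, and drops hosts that do not end with the wanted domain. The helper name unique_sites and its suffix parameter are hypothetical and not part of the original answer; the sample links come from the question's output.

```python
from urllib.parse import urlsplit


def unique_sites(links, suffix="gencat.cat"):
    """Return scheme://host for each link whose host ends with `suffix`, without duplicates.

    `unique_sites` and `suffix` are illustrative names, not from the original answer.
    """
    seen = set()
    sites = []
    for link in links:
        parts = urlsplit(link)
        host = (parts.hostname or "").lower()
        # Accept the domain itself or any subdomain, but not e.g. "notgencat.cat"
        if host != suffix and not host.endswith("." + suffix):
            continue
        site = f"{parts.scheme}://{host}"
        if site not in seen:
            seen.add(site)
            sites.append(site)
    return sites


# A few links taken from the output shown in the question
links = [
    "https://web.gencat.cat/ca/inici",
    "https://web.gencat.cat/es/inici/",
    "https://govern.cat/",
    "http://www.gencat.es/",
]
print(unique_sites(links))  # ['https://web.gencat.cat']
```

Inside the scraping loop, each link could be appended to a list first and the list passed to this helper once at the end.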

【Comments】:

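Hi A.Patterson, it works, but I don't know why it only prints a few websites. Is there some limit on this kind of search?

There is also a start parameter, which I believe controls which page of the results you start from. It increases in steps of 10. That seems to be the way to get more results.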
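The start parameter described in the comment above could be sketched as follows. This only builds the query parameters and sends no requests; it assumes, as the commenter believes, that start is a zero-based result offset that advances by 10 per page. The helper name page_params and its arguments are hypothetical. Note the earlier comment about Google's terms of service before sending such queries.

```python
def page_params(query, pages=3, page_size=10):
    """Build one params dict per results page, stepping `start` by `page_size`.

    Assumes `start` is a zero-based result offset, as the comment suggests.
    """
    return [{"q": query, "start": page * page_size} for page in range(pages)]


for params in page_params("gencat.cat"):
    # Each dict could be passed as params= to requests.get(), as in the question's code
    print(params)
```

Each dictionary would replace the single params dictionary used in the question, with the result links from every page collected before deduplicating.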
