Scraping a News List with the requests and BeautifulSoup4 Libraries

Posted 2020-10-08 43张正杰



1. Use the requests and BeautifulSoup4 libraries to scrape the time, title, link, source, and full content of each item in the campus news list.

Requirements: (1) convert the time string to a datetime; (2) wrap the code that fetches the full content in a function.
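
As a quick illustration of requirement (1), here is a minimal sketch of the string-to-datetime step on its own. It assumes the timestamp appears somewhere in the text as `YYYY-MM-DD HH:MM:SS`; the helper name `parse_time` and the sample strings are made up for illustration and are not part of the original exercise. A regular expression is used so the result does not depend on how long the label in front of the timestamp is.

```python
import re
from datetime import datetime

# Matches timestamps such as 2018-04-01 11:57:00
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def parse_time(info_text):
    """Return the first timestamp found in info_text as a datetime, or None.

    Hypothetical helper: the name and behaviour are assumptions for this sketch.
    """
    match = TIMESTAMP.search(info_text)
    if match is None:
        return None
    return datetime.strptime(match.group(), '%Y-%m-%d %H:%M:%S')

# Made-up sample text; the real page's info line may look different
sample = 'Published: 2018-04-01 11:57:00 Author: someone'
print(parse_time(sample))
print(parse_time('no timestamp here'))
```

Returning `None` when nothing matches lets the caller decide how to handle a page with an unexpected layout, rather than failing inside `strptime`.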

import requests
from bs4 import BeautifulSoup
from datetime import datetime

webs = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
res = requests.get(webs)
res.encoding = 'utf-8'  # set the encoding explicitly so Chinese text is not garbled
soup = BeautifulSoup(res.text, "html.parser")  # "html.parser" selects the parser

# Return the full text of the news article at url
def getdetail(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    return soupd.select('.show-content')[0].text

# Return the publication time of the news article at url as a datetime
def gettime(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    tx1 = soupd.select('.show-info')[0].text
    tx2 = tx1[5:24]  # skip the first 5 characters and keep the 19-character timestamp
    time = datetime.strptime(tx2, '%Y-%m-%d %H:%M:%S')  # convert the string to a datetime
    return time

for news in soup.select('li'):
    # Only process list items that contain a news title
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text  # headline text
        url = news.select('a')[0]['href']  # article link, taken from the a tag's href
        time = gettime(url)  # publication time, read from the article page
        # Source: the second child node of the .news-list-info element
        source = news.select('.news-list-info')[0].contents[1].text
        detail = getdetail(url)  # full article text
        # Print the time, source, title, link, and content
        print(time, source, title, '\n', url, '\n', detail)
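
The loop above assumes that every matching `li` contains an `a` tag and that its `href` can be passed straight to `requests.get`. A hedged sketch of a more defensive version of the list-parsing step follows. It works on an HTML string so it can be tried without network access; the function name `parse_list`, the sample markup, and the sample article path are all made up for illustration, and only the `.news-list-title` class and the base URL come from the script above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "http://news.gzcc.cn/html/xiaoyuanxinwen/"

def parse_list(html, base=BASE):
    """Return a list of {'title', 'url'} dicts from a news-list page (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for li in soup.select('li'):
        title = li.select_one('.news-list-title')  # None if the li is not a news item
        link = li.select_one('a')
        if title is None or link is None or not link.get('href'):
            continue  # skip navigation entries and other non-news list items
        items.append({
            'title': title.get_text().strip(),
            'url': urljoin(base, link['href']),  # resolves relative links against the list URL
        })
    return items

# Made-up sample standing in for res.text; the real page layout may differ
sample = '''
<ul>
  <li><a href="/html/2018/xiaoyuanxinwen_0401/9167.html"><div class="news-list-title">Sample headline</div></a></li>
  <li>navigation entry without a title</li>
</ul>
'''
for item in parse_list(sample):
    print(item['title'], item['url'])
```

`select_one` returns `None` instead of raising when nothing matches, and `urljoin` leaves absolute links unchanged, so the same code should cope with either kind of `href`.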

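2. Pick a topic you are interested in and scrape it in a similar way, in preparation for the later "scrape web data and perform text analysis" exercise.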

import requests
from bs4 import BeautifulSoup

mt = "http://gz.meituan.com/shop/2380968"
res = requests.get(mt)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, "html.parser")

for news in soup.select('li'):
    # Only process list items that contain an element with class "title"
    if len(news.select('.title')) > 0:
        titles = news.select('.title')
        print(titles)  # prints the list of matching tags, markup included
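
The script above prints lists of tag objects, markup included. For later text analysis, plain strings are usually easier to work with. Below is a minimal sketch, assuming the page keeps its titles in elements with class `title` inside `li` items as in the script above; the function name `extract_titles` and the sample markup are invented for illustration and do not reflect the real page.

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    """Return the unique title strings found under li elements, in page order.

    Hypothetical helper written for this sketch.
    """
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for tag in soup.select('li .title'):  # every .title element nested inside an li
        text = tag.get_text(strip=True)   # text only, surrounding whitespace removed
        if text and text not in titles:
            titles.append(text)
    return titles

# Made-up sample standing in for res.text
sample = '''
<ul>
  <li><span class="title"> Set meal A </span></li>
  <li><span class="title">Set meal B</span></li>
  <li><span class="title">Set meal A</span></li>
  <li>no title here</li>
</ul>
'''
print(extract_titles(sample))
```

Selecting `li .title` directly avoids the inner `len(...) > 0` check, and dropping duplicates keeps repeated entries from skewing word counts later on.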
