Crawler Series: Scraping Stock Information

Posted 2020-11-09 谦谦君子，陌上其华

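1. Overview

The code from the MOOC course first collects all stock codes from Eastmoney (东方财富网), then inserts each code into a URL to open that stock's page on Baidu Stocks (百度股票) and scrape its detailed information.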

import requests
from bs4 import BeautifulSoup
import traceback
import re


def getHTMLText(url):
    # Return the page text, or "" if the request fails
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def getStockList(lst, stockURL):
    # Collect stock codes (sh/sz + six digits) from every link on the list page
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0])
        except:
            continue


def getStockInfo(lst, stockURL, fpath):
    # Fetch each stock's page and append its details to the output file
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name"
            infoDict.update({'股票名称': name.text.split()[0]})

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'http://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


main()
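
Because the listing writes each record with str(infoDict), the output file holds one Python dict literal per line rather than JSON. The following is a minimal sketch of reading such lines back, assuming every line is a well-formed dict literal. The function name load_records and the sample line are illustrative only and are not part of the course code.

```python
import ast


def load_records(lines):
    """Parse lines in the format produced by str(infoDict)."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            # literal_eval only evaluates Python literals, unlike eval
            records.append(ast.literal_eval(line))
    return records


# Hypothetical line in the same format as the output file
sample_lines = ["{'股票名称': 'ExampleCo', 'Open': '10.50'}\n"]
records = load_records(sample_lines)
print(records[0]['股票名称'])
```

If the data needs to be consumed by other tools, writing each record with json.dumps(infoDict, ensure_ascii=False) instead of str(infoDict) would be one possible alternative.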

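2. Detailed Analysis

2.1 Fetching the Page Source

This function uses the requests library to fetch the full source of a web page directly. If the request fails for any reason, it returns an empty string.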

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
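
Two details of this function are worth noting: requests.get is called without a timeout, so an unresponsive server can stall the crawl, and the bare except also hides errors unrelated to the network. A minimal sketch of a stricter variant follows. The name get_html_text and the 10-second default are assumptions made for illustration, not part of the original course code.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the page text, or "" if the request fails (hypothetical variant)."""
    try:
        # timeout is in seconds; 10 is an arbitrary illustrative default
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad URLs and HTTP error statuses
        return ""
```

Catching requests.RequestException keeps the "return an empty string on failure" behavior while letting programming errors and Ctrl+C surface normally.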

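2.2 Collecting the Stock Codes

The page source shows that every stock has a six-digit code, and the job of this step is to collect those codes. Each code appears inside the href of an a tag, so we first use BeautifulSoup to select all a tags, then read each tag's href attribute through attrs['href'], and finally use a regular expression to pick out the code, which is the prefix sh or sz followed by six digits.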

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # findall returns a list, so [0] appends the matched string itself
            # rather than a list; a tag without href or without a match raises
            # an exception, and the except clause skips it
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0])
        except:
            continue
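
To see what the regular expression keeps and what it drops, the matching step can be tried on its own. The sketch below uses re.search instead of findall(...)[0], so a link without a code yields None rather than an IndexError. The helper name extract_stock_code and the sample href values are made up for illustration.

```python
import re

# sh or sz followed by six digits, the same pattern as in the crawler
STOCK_CODE = re.compile(r'[s][hz]\d{6}')


def extract_stock_code(href):
    """Return the first stock code found in href, or None if there is none."""
    match = STOCK_CODE.search(href)
    return match.group(0) if match else None


# Hypothetical href values, not taken from the real page
samples = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/list.html',
]
codes = [extract_stock_code(h) for h in samples]
print(codes)
```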

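2.3 Fetching the Stock Information

The same approach applies here: locate the relevant tags with BeautifulSoup, then store the results in a dictionary and append it to the output file.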

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # .text here is the BeautifulSoup tag attribute that returns the
            # tag's text content; the key '股票名称' means "stock name"
            infoDict.update({'股票名称': name.text.split()[0]})

            # dt tags hold the field names, dd tags hold the matching values
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue
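
The parsing step can be tried offline on a small piece of markup. The sketch below assumes a page layout like the one the crawler expects (a stock-bets div containing a bets-name element and dt/dd pairs). The sample HTML, the function name parse_stock_info, and the English key 'name' are invented for illustration. It pairs the tags with zip instead of indexing, and strips surrounding whitespace from each value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the crawler looks for
SAMPLE_HTML = """
<div class="stock-bets">
  <a class="bets-name">ExampleCo (<span>600000</span>)</a>
  <dl><dt>Open</dt><dd>10.50</dd></dl>
  <dl><dt>Volume</dt><dd>1234</dd></dl>
</div>
"""


def parse_stock_info(html):
    """Build a dict from the name element and the dt/dd pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    stock_info = soup.find('div', attrs={'class': 'stock-bets'})
    if stock_info is None:  # the page does not have the expected layout
        return {}
    info = {}
    name = stock_info.find(attrs={'class': 'bets-name'})
    if name is not None:
        info['name'] = name.text.split()[0]
    # zip pairs each dt with the dd at the same position and stops at the
    # shorter list, so a missing dd cannot cause an IndexError
    for key, val in zip(stock_info.find_all('dt'), stock_info.find_all('dd')):
        info[key.text.strip()] = val.text.strip()
    return info


info = parse_stock_info(SAMPLE_HTML)
print(info)
```

Returning an empty dict when the div is missing is one way to handle pages that lack the expected layout without relying on a traceback.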

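3. Adding a Progress Display

To show progress, first initialize count to 0 at the top of getStockInfo, before the for loop, then add the statements below at the positions shown. The \r escape returns the cursor to the start of the current line without starting a new one, so each update overwrites the previous one.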

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
        except:
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue
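
The overwrite effect of \r can be tried in isolation. Below is a small sketch that assumes a terminal which honors carriage returns. The function name format_progress and the list of five dummy items are illustrative stand-ins, not part of the crawler.

```python
def format_progress(count, total):
    """Return a progress string that starts with a carriage return."""
    return '\rProgress: {:.2f}%'.format(count * 100 / total)


items = ['a', 'b', 'c', 'd', 'e']  # stand-ins for the stock code list
count = 0
for _ in items:
    count += 1
    # end='' suppresses the newline so the next print overwrites this one
    print(format_progress(count, len(items)), end='', flush=True)
print()  # finish the line once the loop is done
```

In the crawler itself, moving the two count updates into a single finally clause would remove the duplication and would also count pages that were skipped because the fetch returned an empty string.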
