Learning Python: Scraping the Maoyan Movies Top 100 Board
Contents
- 1 Goals
- 2 URL analysis
- 3 Fetching the page
- 4 Page analysis
- 5 Putting the code together
- 6 Optimization
1 Goals
- Scrape the Top 100 films from the Maoyan Movies overall ranking board
- Count how many times each actor appears on the board, based on the films' cast lists
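2 URL analysis
The target site is https://maoyan.com/board/4. Opening it shows the ranking board, as shown in the figure:
The page lists 10 films, each with its rank, title, cast and other details. Scrolling to the bottom and clicking page 2 changes the URL to https://maoyan.com/board/4?offset=10. Compared with the original URL there is an extra offset=10, and page 2 shows the films ranked 11 to 20, so offset is evidently the offset into the ranking: page 1 corresponds to offset=0, page 2 to offset=10, and so on.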
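The offset pattern can be sketched as a small helper that builds the ten page URLs. The name `page_urls` and its parameters are hypothetical, used only for illustration; the base URL and the page size of 10 come from the analysis above.

```python
BASE_URL = 'https://maoyan.com/board/4'


def page_urls(total=100, page_size=10):
    """Return the board URLs for offsets 0, page_size, ..., below total."""
    return [f'{BASE_URL}?offset={offset}'
            for offset in range(0, total, page_size)]
```

Calling `page_urls()` gives ten URLs, from `?offset=0` through `?offset=90`, one per page of the board.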
3 Fetching the page
With the URL pattern worked out, we can try fetching a page with the requests module.
    import requests

    # Fetch one page of film information
    def get_one_page(page_index):
        url = 'https://maoyan.com/board/4?offset=' + str(page_index)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'}
        response = requests.get(url=url, headers=headers)
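4 Page analysis
Once the page has been fetched, the information has to be parsed out of it. Open the browser's developer tools and view the page source in the Network panel (note: the Elements tab is not used here because the markup it shows may already have been changed by JavaScript rendering), as shown in the figure:
As the figure shows, all the information for one film sits inside a single <dd> tag, and each page has 10 of them.
The rank is located in:
<i class="board-index board-index-11">11</i>
The film title is located in:
<p class="name"> <a href="/films/9025" title="喜剧之王" data-act="boarditem-click" data-val="movieId:9025">喜剧之王</a> </p>
The cast is located in (主演 means "starring"; note the full-width colon after it):
<p class="star">主演：周星驰,莫文蔚,张柏芝</p>
Knowing where each piece of information lives, we can use the PyQuery module to locate and extract it. Continuing to fill in the function from before: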
    import requests
    from pyquery import PyQuery as pq

    # Fetch one page of film information
    def get_one_page(page_index):
        url = 'https://maoyan.com/board/4?offset=' + str(page_index)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'}
        response = requests.get(url=url, headers=headers)
        doc = pq(response.text)
        page_info = ''
        name_list = []
        # Iterate over the <dd> tags; each page holds 10 films
        for i in doc('dd').items():
            # Pad the title to 15 characters so the columns line up
            name_len = len(i('.name').children().text())
            space = ' ' * (15 - name_len)
            # Build one line of text in the form "rank title cast"
            page_info += i('.board-index').text() + ' ' + i('.name').children().text() + space + i('.star').text() + '\n'
            # Drop the "主演：" label (full-width colon), then split the names on commas
            name_list += i('.star').text().split('：')[1].split(',')
        # Return the page's film information and its actor names
        return page_info, name_list
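The cast-splitting step assumes every entry contains the full-width colon; an entry without it would raise an IndexError. A minimal sketch of a more defensive version follows. The helper name `parse_cast` is hypothetical and not part of the original code.

```python
def parse_cast(star_text):
    """Split a cast line such as '主演：周星驰,莫文蔚,张柏芝' into names.

    Returns an empty list when the full-width colon label is missing,
    instead of raising IndexError.
    """
    # partition() splits on the first full-width colon only
    _, sep, names = star_text.strip().partition('：')
    if not sep:
        return []
    # Drop surrounding whitespace and skip empty pieces
    return [name.strip() for name in names.split(',') if name.strip()]
```

Inside the loop, `name_list += parse_cast(i('.star').text())` would take the place of the chained `split` calls.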
5 Putting the code together
With one page fetched successfully, we can combine the code to fetch and process all of the information.
    def info_handle():
        # Film information for all pages
        movie_info = ''
        # Every actor name seen, duplicates included
        name_info_list = []
        for index in range(10):
            # Fetch each page once and unpack both results
            page_info, name_list = get_one_page(index * 10)
            movie_info += page_info
            name_info_list += name_list
        # Count how many times each name appears
        name_count_list = []
        for i in set(name_info_list):
            dict_name_count = (i, name_info_list.count(i))
            name_count_list.append(dict_name_count)
        # Sort by number of appearances, highest first
        name_count_list.sort(key=lambda k: k[1], reverse=True)
        # Write the film information to a text file
        with open('C:\\Users\\d\\Desktop\\xxx.txt', 'w', encoding='utf-8') as f:
            f.write(movie_info)
        # Print each actor's number of appearances
        for k, v in name_count_list:
            print(k, v)
The results are as follows.
Contents of the text file:
Ranking printed to the console:
From the results, Leslie Cheung (张国荣) alone appears in 7 of the 100 films, ranking first.
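The counting step in info_handle calls `list.count` once per distinct name, which rescans the whole list each time. As an alternative sketch (not the original author's code), the standard library's `collections.Counter` does the same tally in one pass. The function name `count_actors` and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def count_actors(name_info_list, top_n=None):
    """Return (name, count) pairs sorted by count, highest first.

    With top_n=None every name is returned; otherwise only the
    top_n most frequent names.
    """
    return Counter(name_info_list).most_common(top_n)
```

The result has the same shape as name_count_list (a list of `(name, count)` tuples already sorted in descending order), so it could be printed or plotted the same way.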
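6 Optimization
As the screenshot shows, the actor ranking is not very easy to read as plain text; a chart would be better. Python's matplotlib module is a powerful data visualization library. I am still learning the basic modules and have not studied matplotlib in depth, so I found a small demo online, worked out roughly how it is used, and adapted it. The details of its usage and internals are left for later study.
Create a file named draw.py: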
    import matplotlib.pyplot as plt
    import numpy as np

    class NameCount():
        # Draw a horizontal bar chart of names against counts
        def show_name_bard(self, name_list_sort, name_list_count):
            plt.rcdefaults()
            # Use a font with Chinese glyphs so the names render correctly
            plt.rcParams['font.sans-serif'] = ['SimHei']
            fig, ax = plt.subplots()
            y_pos = np.arange(len(name_list_sort))
            ax.barh(y_pos, name_list_count, align='center', color='green', ecolor='black')
            ax.set_yticks(y_pos)
            ax.set_yticklabels(name_list_sort)
            ax.invert_yaxis()  # labels read top-to-bottom
            # ax.set_xlabel('')
            # Title: "Number of Top 100 films per actor"
            ax.set_title('Top100电影演员占有部数统计')
            # Show the count next to each bar
            for x, y in enumerate(name_list_count):
                plt.text(y, x + 0.1, '%s' % y)
            plt.show()

        def getNameTimesSort(self, name_list):
            name_list.sort(key=lambda k: k[1], reverse=True)
            # Names, sorted by number of appearances
            name_list_sort = []
            # Matching counts; only the top 20 are kept
            name_list_count = []
            for k, v in name_list[0:20]:
                name_list_sort.append(k)
                name_list_count.append(v)
            # Draw the bar chart
            self.show_name_bard(name_list_sort, name_list_count)
Then, in the file that defines info_handle, import draw and call getNameTimesSort, passing in the sorted name_count_list that was previously only printed:
    import draw  # at the top of the file

    # Draw the horizontal bar chart (at the end of info_handle)
    statistics = draw.NameCount()
    statistics.getNameTimesSort(name_count_list)
This generates the chart, which makes the ranking clear at a glance:
Python learning journey, 2018/7/11