Crawling Juzimi with Python: Shakespeare Quotes


Tools used: Python 3.7 + requests + BeautifulSoup4 + threads

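First off, Juzimi (句子迷) does have some anti-crawler measures. I don't know exactly how they work, but the text itself is laid out neatly in the page, which is pretty friendly. Once I had analyzed the front-end page, I started my several crawl attempts.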

Second and third attempts

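I got up early and found that the Juzimi site was reachable again, which made me very happy. I had assumed that once I was banned I would never get back in. So I started my second attempt. This time I figured I couldn't be so blatant; I had to be a bit sneakier and trade time for a complete, successful result.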

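The method, of course, was to disguise the crawler with sleep. Since there are 229 pages in total, I made it rest for a random number of seconds, up to 5, after every 5 pages: time.sleep(random.randint(1,5)). That's not asking too much, right? And yet the result disappointed me again: my IP got banned once more. I won't give up, of course; the moment the ban is lifted is the moment I fight again. Still, this setting did pay off, since I crawled about twice as much data as last time. For the third attempt, to keep the pattern in the random sleep values from being detected, I deliberately added a decimal part to them, but it made no difference.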
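As a rough sketch of the pause logic described above (not the original code): the helper names pause_seconds and crawl_with_pauses, and the fetch_page callback, are made up for illustration. It uses random.uniform, so the pause is a non-integer number of seconds, in the spirit of the third attempt. Keep in mind that neither variant kept the IP from being banned.

```python
import random
import time


def pause_seconds(page_index, every=5, max_pause=5.0):
    """Return how long to sleep before fetching page_index.

    A pause is taken once every `every` pages; its length is a random
    float, so it is not always a whole number of seconds.
    """
    if page_index > 0 and page_index % every == 0:
        return random.uniform(1.0, max_pause)
    return 0.0


def crawl_with_pauses(page_count, fetch_page, sleep=time.sleep):
    """Call fetch_page(i) for each page, sleeping between batches.

    `fetch_page` is a hypothetical callback that downloads one page;
    `sleep` is injectable so the logic can be tried without waiting.
    """
    for page_index in range(page_count):
        delay = pause_seconds(page_index)
        if delay:
            sleep(delay)
        fetch_page(page_index)
```

Passing a list's append method as both fetch_page and sleep is an easy way to check which pages are visited and where the pauses fall, without touching the network.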

Fifth attempt

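This time I decided to use threads. I read the total page count from the a tag under the pager-last class in the HTML, create that many threads, and crawl every page's sentences at the same time, storing each page in a file named <page number>.txt, so each .txt file holds one page of sentences. This time I finally got all the data I wanted, which was pretty exciting. I was almost forced to resort to proxies.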

Part of the code is as follows:

# coding=utf-8
import os
import threading

import requests
from requests import codes
from bs4 import BeautifulSoup

webUrl = 'https://www.juzimi.com/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}


def downLoadPageExtra(url):
    """Download page number `url` and save its sentences to juzimi/<url>.txt."""
    try:
        realUrl = webUrl + 'writer/%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A' + "?page=" + str(url)
        content = requests.get(realUrl, headers=headers, timeout=10)
        content.encoding = "UTF-8"
        if codes.ok == content.status_code:
            soup = BeautifulSoup(content.text, 'html.parser')
            # Each sentence is an <a> tag with class "xlistju"
            downLoadTxt = soup.find_all('a', class_='xlistju')
            with open("juzimi" + "/" + str(url) + ".txt", 'a', encoding='utf-8') as f:
                for dlt in downLoadTxt:
                    f.write(dlt.text + "\n\n")
            print(str(url) + "OK")
        else:
            print("Bad status code")
    except Exception as e:
        print("getPageCodeError1", e)


def downLoadPage(url):
    """Download the first page, then start one thread per remaining page."""
    try:
        content = requests.get(url, headers=headers, timeout=10)
        content.encoding = "UTF-8"
        if codes.ok == content.status_code:
            soup = BeautifulSoup(content.text, 'html.parser')
            downLoadTxt = soup.find_all('a', class_='xlistju')
            # Use the page title as the file name for the first page.
            # Assumption: the strip/split separator was lost in the source; a space is used here.
            title = str(soup.title.string).strip(' ').split(' ')[0]
            # Read the total page count from the "last page" pager link
            pageCounts = soup.find('li', class_='pager-last').find_all('a')[0].text
            with open("juzimi" + "/" + title + ".txt", 'a', encoding='utf-8') as f:
                for dlt in downLoadTxt:
                    f.write(str(dlt.text) + "\n\n")
            print("First page done")
            threads = []
            for num in range(1, int(pageCounts)):
                t = threading.Thread(target=downLoadPageExtra, args=(num,))
                t.start()
                threads.append(t)
            # Wait for every page thread before reporting completion
            for t in threads:
                t.join()
            print('Crawl complete...')
    except requests.RequestException:
        print("getPageCodeError2")


if __name__ == '__main__':
    dirSave = 'juzimi/'
    if not os.path.exists(dirSave):
        os.makedirs(dirSave)
    downLoadPage(webUrl + 'writer/%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A')

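That's the code. I can't guarantee it will run as-is, because it was copied out of a project and partly modified, but the general idea is there. It uses multiple threads; since the work is IO-bound, multithreading works quite well here, and it's seriously fast: everything was done in a few dozen seconds. And in case we need to merge all the .txt files into one, I wrote a separate merge program.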
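On the threading point: starting one thread per page works for IO-bound downloads like this, but a bounded pool is a gentler variant that caps how many requests are in flight at once. The following is only a sketch under that assumption, using the standard library's ThreadPoolExecutor instead of the raw threading.Thread calls from the post; crawl_pages and the fetch_page callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(page_numbers, fetch_page, max_workers=8):
    """Run fetch_page(n) for every page number on a bounded thread pool.

    Returns a dict mapping page number -> result, or the exception
    raised for that page, so one failure does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so pages download concurrently
        futures = {n: pool.submit(fetch_page, n) for n in page_numbers}
        for n, future in futures.items():
            try:
                results[n] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[n] = exc
    return results
```

With the code from this post, fetch_page could be a function like downLoadPageExtra, and any page whose entry in the result is an exception could be retried later.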

The merge code is as follows:


# The txt file named after the HTML page title; it already holds the first page's sentences
mergeTxtName = '莎士比亚经典语录_名言_名句赏析_句子迷.txt'
# Number of the first page file to merge
start = 1
# Number of the last page file to merge
end = 228
with open(mergeTxtName, 'a', encoding='utf-8') as f:
    for n in range(start, end + 1):
        with open(str(n) + '.txt', 'r', encoding='utf-8') as g:
            f.write(g.read())
        print(str(n) + " done")
print("ok")

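This script should be placed in the download directory; otherwise, adjust the paths accordingly. If you are not fetching Shakespeare's sentences, you also need to change the mergeTxtName string. The script does not delete the .txt files it has merged; add that yourself if you need it. When writing your own program, watch out for character encoding issues in file reads and writes.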
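For anyone who does want the cleanup step, here is one possible sketch. It is an illustration, not part of the original project: the function name merge_pages and its parameters are made up. It takes the directory as an argument instead of relying on where the script sits, opens every file as UTF-8 explicitly, and deletes each page file only when delete_merged is set.

```python
from pathlib import Path


def merge_pages(directory, merged_name, start, end, delete_merged=False):
    """Append <start>.txt .. <end>.txt in `directory` to `merged_name`.

    Missing page files are skipped and their numbers returned. Files
    are read and written as UTF-8 explicitly, so the platform default
    encoding is never used.
    """
    directory = Path(directory)
    missing = []
    with open(directory / merged_name, "a", encoding="utf-8") as merged:
        for n in range(start, end + 1):
            page_file = directory / f"{n}.txt"
            if not page_file.exists():
                missing.append(n)
                continue
            merged.write(page_file.read_text(encoding="utf-8"))
            if delete_merged:
                # Remove the page file once its content has been appended
                page_file.unlink()
    return missing
```

A call such as merge_pages('juzimi', mergeTxtName, 1, 228, delete_merged=True) would mirror the script above; the returned list shows which page numbers had no file, which is a quick way to spot pages that failed to download.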

Summary

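Writing crawlers takes lots of practice: get fluent with the basics, then raise the difficulty in a targeted way and conquer more sites. This is my first time trying the Markdown style on cnblogs (博客园). Compared with writing in Youdao Cloud Notes, I'm not quite used to it yet. I'll give it a try this once, and if I like the result I'll write this way from now on. If only there were a live preview.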

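After crawling Shakespeare's sentences, what satisfied me wasn't any improvement in skill; frankly there wasn't much, since in the end this is still a static page. My original intent was simply to read through these interesting sentences. Some of them are written really well, and one has become my motto: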

I wasted time, and now doth time waste me.
