web crawling

Posted 2020-10-09 兔子的尾巴_Mini

General web crawling: source URL --- server --- URL --- database --- server --- ... --- read --- get info --- achieve goal

Focused (topic-specific) web crawling: source URL (specific) --- web crawling --- URL --- save --- filter --- database --- read --- get info --- achieve goal
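
The two flows above can be sketched as one loop with an optional filter step. This is only a rough illustration, not the author's crawler: the `PAGES` dictionary, the `crawl` function, and the `example.com` URLs are made up for the sketch, and an in-memory dictionary stands in for both the server and the database.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page text. A real crawler would
# download each page (for example with urllib.request.urlopen) instead.
PAGES = {
    "http://example.com/": 'python <a href="http://example.com/a">a</a> '
                           '<a href="http://example.com/b">b</a>',
    "http://example.com/a": 'regex notes <a href="http://example.com/">home</a>',
    "http://example.com/b": "python crawler tips",
}

LINK_PAT = re.compile(r'href="([^"]+)"')


def crawl(seed, keyword=None):
    """Breadth-first crawl from seed.

    keyword=None saves every page (general crawl); passing a keyword
    saves only pages containing it (focused crawl).
    """
    queue = deque([seed])
    seen = {seed}
    database = {}  # stands in for the database step
    while queue:
        url = queue.popleft()
        text = PAGES.get(url, "")  # server step: fetch the page
        if keyword is None or keyword in text:  # filter step
            database[url] = text  # save step
        for link in LINK_PAT.findall(text):  # collect new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


general = crawl("http://example.com/")
focused = crawl("http://example.com/", keyword="python")
print(sorted(general))
print(sorted(focused))
```

In this sketch the focused crawl still follows links found on pages it does not save; a stricter filter could also drop those links.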

**********************************

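\w: letter, digit, or "_"

\d: decimal digit (0-9)

\s: whitespace character (space, tab, newline)

\W: any character not matched by "\w"

(\D and \S are likewise the complements of \d and \s)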
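
A quick way to check these classes is to run each one over the same string with `re.findall`. The `sample` string below is invented for the illustration.

```python
import re

# Made-up sample containing letters, digits, "_", spaces, a tab and brackets.
sample = "user_01 paid 42 yuan\t(ok)"

words = re.findall(r"\w+", sample)      # runs of letters, digits and "_"
digits = re.findall(r"\d", sample)      # single decimal digits
spaces = re.findall(r"\s", sample)      # whitespace characters
non_words = re.findall(r"\W", sample)   # everything \w does not match

print(words)      # ['user_01', 'paid', '42', 'yuan', 'ok']
print(digits)     # ['0', '1', '4', '2']
print(spaces)     # [' ', ' ', ' ', '\t']
print(non_words)  # [' ', ' ', ' ', '\t', '(', ')']
```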

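*****************

".": any character except a newline

"^": matches at the start of the string

"$": matches at the end of the string

"*": zero or more times

"?": zero or one time, e.g. "s?" matches "" or "s"

"+": one or more times

t{7}: exactly 7 times: ttttttt

t{7,}: 7 or more times

t{4,7}: 4 <= times <= 7

t|s: t or s

(): group; captures the text matched inside the parentheses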
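
The quantifiers are easiest to verify with `re.fullmatch`, which succeeds only when the whole string fits the pattern. The helper name `full` and the test strings are assumptions made for this sketch.

```python
import re


def full(pattern, text):
    """Return True when pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


print(full(r"t{7}", "t" * 7))     # True: exactly seven
print(full(r"t{7,}", "t" * 6))    # False: fewer than seven
print(full(r"t{7,}", "t" * 9))    # True: seven or more
print(full(r"t{4,7}", "t" * 4))   # True: both bounds are inclusive
print(full(r"t{4,7}", "t" * 8))   # False: more than seven
print(full(r"s?", ""))            # True: zero times
print(full(r"s?", "ss"))          # False: "?" allows at most one
print(full(r"s+", ""))            # False: "+" needs at least one
print(full(r"s*", ""))            # True: "*" allows zero
print(full(r"t|s", "s"))          # True: either alternative

# "^" and "$" anchor the match to the start and end of the string.
starts = re.search(r"^py", "python") is not None
not_at_start = re.search(r"^py", "happy") is None
ends = re.search(r"py$", "happy") is not None
print(starts, not_at_start, ends)  # True True True

# Parentheses capture the text matched inside them.
pair = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()
print(pair)  # ('10', '20')
```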

************************

I: ignore case ("A" matches "a")

M: multi-line matching ("^" and "$" also match at each line)

L: locale-dependent matching

U: Unicode matching rules

S: "." also matches "\n"
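
A short check of the three flags used most often, passed as the third argument to the `re` functions. The sample text is made up for the illustration.

```python
import re

text = "Python\npython\nPYTHON"

# re.I: ignore case.
any_case = re.findall(r"python", text, re.I)

# Without re.M, "^" and "$" anchor to the whole string, so nothing matches.
whole_only = re.findall(r"^python$", text)

# re.M: "^" and "$" also match at the start and end of each line.
per_line = re.findall(r"^python$", text, re.M)

# Flags can be combined with "|".
per_line_any_case = re.findall(r"^python$", text, re.I | re.M)

# re.S: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.S)

print(any_case)           # ['Python', 'python', 'PYTHON']
print(whole_only)         # []
print(per_line)           # ['python']
print(per_line_any_case)  # ['Python', 'python', 'PYTHON']
print(dot_default is None, dot_all is not None)  # True True
```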

************************************

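eg:

import re
pat7 = r"p.*y"    # greedy: ".*" takes as many characters as possible
string7 = "pppppppsssspsyyyyyy"
pat7_1 = r"p.?y"  # ".?" allows at most one character between "p" and "y"
res8 = re.search(pat7, string7)
print(res8)
res8_1 = re.search(pat7_1, string7)
print(res8_1)

Output (Python 3.7 and later print re.Match instead of _sre.SRE_Match):

<_sre.SRE_Match object; span=(0, 19), match='pppppppsssspsyyyyyy'>
<_sre.SRE_Match object; span=(11, 14), match='psy'>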

***************************************************************

.match: returns one result; the match must begin at the first character of the string, otherwise the result is None

res8_3=re.compile(pat7_1).findall(string7)
print(res8_3)
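
The difference between `match`, `search` and `findall` shows up when all three run on the same input. A small sketch, reusing the pattern and string from the example above; the second string, `other`, is invented for the comparison.

```python
import re

pat = r"p.?y"
text = "pppppppsssspsyyyyyy"
other = "py pay psy"  # made-up string where the pattern fits at position 0

# match: only tries at the very start of the string.
m_text = re.match(pat, text)
m_other = re.match(pat, other)

# search: scans forward and returns the first match anywhere.
s_text = re.search(pat, text)

# findall: returns every non-overlapping match as a list of strings.
all_text = re.compile(pat).findall(text)
all_other = re.compile(pat).findall(other)

print(m_text)           # None
print(m_other.group())  # py
print(s_text.span())    # (11, 14)
print(all_text)         # ['psy']
print(all_other)        # ['py', 'pay', 'psy']
```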

*************************************************************

#!/usr/bin/env python
# Author: Mini
import re
from urllib.request import urlopen

# literal substring
pat = "hao"
string = "http://2345.hao3603.com/"
res1 = re.search(pat, string)
print(res1)

# newline character
pat1 = "\n"
string1 = """you
u"""
res2 = re.search(pat1, string1)
print(res2)

# \w and \d character classes (raw string keeps the backslashes intact)
pat2 = r"\w\dp\w"
string2 = "abd3p13spe3p3p4ap3"
res3 = re.search(pat2, string2)
print(res3)

# character set: one of j, s, z
pat3 = "pyth[jsz]n"
string3 = "pathpythsnpythznpythzn"
res4 = re.search(pat3, string3)
print(res4)

# "." matches any single character
pat4 = ".pat..."
string4 = "tpatttttt"
res5 = re.search(pat4, string4)
print(res5)

# alternation
pat5 = "abc|aaa"
string5 = "abdsdfabc"
res6 = re.search(pat5, string5)
print(res6)

# re.I ignores case
pat6 = "ppppp"
string6 = "PPPPPPP"
res7 = re.search(pat6, string6, re.I)
print(res7)

# greedy ".*" versus optional ".?"
pat7 = r"p.*y"
string7 = "pppppppsssspsyyyyyy"
pat7_1 = r"p.?y"
res8 = re.search(pat7, string7)
print(res8)
res8_1 = re.search(pat7_1, string7)
print(res8_1)
res8_2 = re.match(pat7_1, string7)  # None: no match at the first character
print(res8_2)
res8_3 = re.compile(pat7_1).findall(string7)
print(res8_3)

# URLs ending in .com or .cn; a group is used because [.com|.cn]
# is a character class that matches only a single character
pat8 = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"
string8 = '<a href="http://2345.hao3603.com">hasghj</a>'
res9 = re.compile(pat8).findall(string8)
print(res9)

# same pattern applied to a downloaded page (needs network access)
string8_1 = urlopen("https://www.baidu.com").read().decode("utf-8", errors="ignore")
res10 = re.compile(pat8).findall(string8_1)
print("you know", string8_1, "\n", res10)
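
For the URL pattern, it matters whether the ending is written as a character class or as a group. `[.com|.cn]` matches exactly one character out of `.`, `c`, `o`, `m`, `|`, `n`, while `(?:\.com|\.cn)` matches the whole suffix. A small offline comparison; the `text` string and the `example.org` URL are made up for the sketch.

```python
import re

text = "see http://example.org/demo and http://2345.hao3603.com now"

# Character class: the URL only has to end in one of the listed characters.
class_pat = re.compile(r"[a-zA-Z]+://[^\s]*[.com|.cn]")

# Non-capturing group: the URL must contain a literal ".com" or ".cn".
group_pat = re.compile(r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)")

class_hits = class_pat.findall(text)
group_hits = group_pat.findall(text)

print(class_hits)  # ['http://example.org/demo', 'http://2345.hao3603.com']
print(group_hits)  # ['http://2345.hao3603.com']
```

The group is non-capturing on purpose: with a capturing group, `findall` would return only the captured suffix instead of the whole URL.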
