web crawling
Posted by 兔子的尾巴_Mini
General web crawling: source URL --- server --- new URLs --- database --- server --- ... --- read --- get info --- achieve goal
Focused web crawling: source URL (specific) --- crawl --- new URLs --- save --- filter --- database --- read --- get info --- achieve goal
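The two flows above can be sketched as one breadth-first loop, where the only difference is the filter applied to newly found URLs. This is a minimal sketch under stated assumptions: `FAKE_WEB`, `crawl`, `keep` and the `example` URLs are hypothetical names introduced here, and an in-memory dict stands in for real HTTP requests and for the database.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page body (stands in for real HTTP fetches).
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/news">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/news": "no links here",
}

LINK_PAT = re.compile(r'href="([^"]+)"')

def crawl(seed, keep=lambda url: True):
    """Breadth-first crawl from seed; keep(url) decides which new URLs are followed."""
    seen, queue, store = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        page = FAKE_WEB.get(url, "")   # "server" step: fetch the page
        store[url] = page              # "database" step: save what was read
        for link in LINK_PAT.findall(page):
            if link not in seen and keep(link):   # "filter" step
                seen.add(link)
                queue.append(link)
    return store

general = crawl("http://a.example/")                              # follows every link
focused = crawl("http://a.example/", keep=lambda u: "news" in u)  # follows only "news" links
print(sorted(general))
print(sorted(focused))
```

The general crawl visits all three pages; the focused crawl visits only the seed and the page whose URL passes the filter.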
**********************************
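\w: a letter, a digit, or "_"
\d: a decimal digit (0-9)
\s: a whitespace character (space, tab, newline, ...)
\W: any character not matched by "\w"
(\D, \S: likewise, the complements of "\d" and "\s")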
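A small sketch of these classes with `re.findall`; the sample string is made up for illustration.

```python
import re

sample = "user_01 id=42\t!"

words = re.findall(r"\w+", sample)     # runs of letters, digits, "_"
digits = re.findall(r"\d+", sample)    # runs of decimal digits
spaces = re.findall(r"\s", sample)     # single whitespace characters
non_words = re.findall(r"\W", sample)  # everything "\w" does not match

print(words)      # ['user_01', 'id', '42']
print(digits)     # ['01', '42']
print(spaces)     # [' ', '\t']
print(non_words)  # [' ', '=', '\t', '!']
```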
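*****************
".": any character except "\n"
"^": match at the start of the string
"$": match at the end of the string
"*": zero or more times
"?": zero or one time, e.g. "ss?" matches "s" or "ss"
"+": one or more times
t{7}: exactly 7 times (ttttttt)
t{7,}: at least 7 times
t{4,7}: 4 <= times <= 7
t|s: t or s
(): group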
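The quantifiers above can be checked with `re.fullmatch`, which only succeeds when the whole string matches. A sketch; the helper name `full` and the test strings are introduced here for illustration.

```python
import re

def full(pat, s):
    """True if the whole string s matches pat."""
    return re.fullmatch(pat, s) is not None

# (pattern, input, expected result)
cases = [
    (r"ss?", "s", True),          # "?": the second s is optional
    (r"ss?", "ss", True),
    (r"ss?", "sss", False),
    (r"t*", "", True),            # "*": zero times is allowed
    (r"t+", "", False),           # "+": at least once
    (r"t{7}", "t" * 7, True),     # exactly 7
    (r"t{7,}", "t" * 6, False),   # at least 7
    (r"t{7,}", "t" * 9, True),
    (r"t{4,7}", "t" * 4, True),   # bounds are inclusive
    (r"t{4,7}", "t" * 8, False),
    (r"t|s", "s", True),          # alternation
]
results = [full(p, s) for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(p, repr(s), got)

# "()" captures the part of the match inside the parentheses.
groups = re.search(r"(\d+)-(\d+)", "10-20").groups()
print(groups)  # ('10', '20')
```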
************************
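I: ignore case ("A" matches "a")
M: multi-line matching ("^" and "$" apply to every line)
L: locale-dependent matching
U: match according to Unicode character rules
S: "." also matches "\n"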
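A short sketch of the I, M and S flags in use; the sample text is made up for illustration.

```python
import re

text = "Python\npython"

ignore_case = re.findall(r"python", text, re.I)       # both spellings
no_multiline = re.findall(r"^python", text)           # "^" only at the very start
multiline = re.findall(r"^python", text, re.M)        # "^" at the start of each line
combined = re.findall(r"^python", text, re.I | re.M)  # flags are combined with "|"
no_dotall = re.search(r"n.p", text)                   # "." does not cross "\n"
dotall = re.search(r"n.p", text, re.S)                # with S it does

print(ignore_case)         # ['Python', 'python']
print(no_multiline)        # []
print(multiline)           # ['python']
print(combined)            # ['Python', 'python']
print(no_dotall)           # None
print(repr(dotall.group()))  # 'n\np'
```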
eg:
pat7="p.*y"
string7="pppppppsssspsyyyyyy"
pat7_1="p.?y"
res8=re.search(pat7,string7)
print(res8)
res8_1=re.search(pat7_1,string7)
print(res8_1)
<_sre.SRE_Match object; span=(0, 19), match='pppppppsssspsyyyyyy'>
<_sre.SRE_Match object; span=(11, 14), match='psy'>
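For comparison, a sketch that puts the two patterns from this example next to a lazy variant, "p.*?y". The lazy pattern is not part of the original run and is added here only to show how it differs from "p.?y".

```python
import re

string7 = "pppppppsssspsyyyyyy"

greedy = re.search(r"p.*y", string7)    # ".*" takes as much as possible
lazy = re.search(r"p.*?y", string7)     # ".*?" stops at the first "y" it can reach
optional = re.search(r"p.?y", string7)  # ".?" allows at most one char between p and y

print(greedy.span(), greedy.group())      # (0, 19) pppppppsssspsyyyyyy
print(lazy.span(), lazy.group())          # (0, 14) pppppppsssspsy
print(optional.span(), optional.group())  # (11, 14) psy
```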
***************************************************************
re.match: returns one result; the pattern has to match starting at the first character of the string, or the result will be None
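A sketch contrasting `re.match`, `re.search` and `findall` on the same pattern and string used in the example above.

```python
import re

pat = r"p.?y"
s = "pppppppsssspsyyyyyy"

m = re.match(pat, s)               # must match at index 0, which it cannot here
found = re.search(pat, s)          # scans forward for the first match
everything = re.findall(pat, s)    # all non-overlapping matches, as strings
at_start = re.match(r"ppp", s)     # succeeds because the string starts with "ppp"

print(m)               # None
print(found.span())    # (11, 14)
print(everything)      # ['psy']
print(at_start.span()) # (0, 3)
```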
*************************************************************
#!/usr/bin/env python
# Author: Mini
import re
pat="hao"
string="http://2345.hao3603.com/"
res1=re.search(pat,string)
print(res1)
pat1="\n"
string1="""you
u"""
res2=re.search(pat1,string1)
print(res2)
pat2=r"\w\dp\w"
string2="abd3p13spe3p3p4ap3"
res3=re.search(pat2,string2)
print(res3)
pat3="pyth[jsz]n"
string3="pathpythsnpythznpythzn"
res4=re.search(pat3,string3)
print(res4)
pat4=".pat..."
string4="tpatttttt"
res5=re.search(pat4,string4)
print(res5)
pat5="abc|aaa"
string5="abdsdfabc"
res6=re.search(pat5,string5)
print(res6)
pat6="ppppp"
string6="PPPPPPP"
res7=re.search(pat6,string6,re.I)
print(res7)
pat7="p.*y"
string7="pppppppsssspsyyyyyy"
pat7_1="p.?y"
res8=re.search(pat7,string7)
print(res8)
res8_1=re.search(pat7_1,string7)
print(res8_1)
res8_2=re.match(pat7_1,string7)
print(res8_2)
res8_3=re.compile(pat7_1).findall(string7)
print(res8_3)
# (?:\.com|\.cn) matches the literal suffix ".com" or ".cn"
pat8=r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"
string8='<a href="http://2345.hao3603.com">hasghj</a>'
res9=re.compile(pat8).findall(string8)
print(res9)
from urllib.request import urlopen
# read() returns bytes; decode to text before applying a str pattern
string8_1=urlopen("https://www.baidu.com").read().decode("utf-8","ignore")
res10=re.compile(pat8).findall(string8_1)
print("you know",string8_1,"\n",res10)
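A note on the URL pattern: a bracket expression such as `[.com|.cn]` is a character class that matches one single character out of ".", "c", "o", "m", "|", "n". It is not a choice between ".com" and ".cn". The sketch below compares it with a non-capturing group; the names `class_pat` and `group_pat` and the `example.org` text are introduced here for illustration.

```python
import re

class_pat = r"[a-zA-Z]+://[^\s]*[.com|.cn]"      # last part: ONE char from the set
group_pat = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"  # last part: literal ".com" or ".cn"

html = '<a href="http://2345.hao3603.com">hasghj</a>'
other = "see http://example.org/docs now"

# On the anchor tag both patterns happen to give the same URL.
class_html = re.findall(class_pat, html)
group_html = re.findall(group_pat, html)

# On a URL without ".com"/".cn" the character class still "matches",
# cutting the URL at the last character that belongs to the set.
class_other = re.findall(class_pat, other)
group_other = re.findall(group_pat, other)

print(class_html, group_html)
print(class_other, group_other)
```

The group is written as `(?:...)` so that `findall` still returns the whole match rather than only the suffix.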