A few small Python crawler issues, and dynamic regular expressions in Python

Posted 2020-09-27


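1. First, plain urllib could not be used here; import urllib2 instead, along with re for regular expressions. (This applies to Python 2. urllib2 does not exist in Python 3, where its functionality lives in urllib.request.)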

#coding=utf-8
# import urllib
import urllib2
import re

def getHtml(url):
    page = urllib2.urlopen(url)
    html = page.read()
    return html



def getCountry(html):
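    reg = r'<td>(.*?)</td>'
    #imgre = re.compile(reg)  # compiling led to an error here (likely because re.findall rejects flags when given an already compiled pattern), so the pattern is left uncompiled.
    imglist = re.findall(reg, html, re.S|re.M)
    # re.S|re.M are flags. The inline form (?iLmsux) takes one or more of the letters 'i', 'L', 'm', 's', 'u', 'x'.
    # That group matches no characters itself; it only sets the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line mode), re.S (. matches every character, including newline), re.U (Unicode dependent), re.X (verbose mode).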
    return imglist

html = getHtml("https://en.wikipedia.org/wiki/List_of_countries_by_electricity_consumption")
print getCountry(html)

Pay attention to the comments in the code above.
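For readers on Python 3, where urllib2 is gone, the same fetch-and-extract idea can be sketched roughly as follows. This is only a sketch under stated assumptions: the names fetch_html, extract_cells, CELL_RE, and SAMPLE_HTML are made up for illustration, the page is assumed to be UTF-8, and the flags are given to re.compile rather than to re.findall, because passing flags together with an already compiled pattern raises ValueError, which may be the error the comment above refers to.

```python
import re
from urllib.request import urlopen  # Python 3 home of what urllib2.urlopen did

# Flags are given once, at compile time.
CELL_RE = re.compile(r"<td>(.*?)</td>", re.S)
# Same pattern, using the inline flag (?s) instead of the re.S argument.
CELL_RE_INLINE = re.compile(r"(?s)<td>(.*?)</td>")


def fetch_html(url):
    """Download a page and decode it (UTF-8 is an assumption)."""
    with urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def extract_cells(html):
    """Return the text between every bare <td> and </td> pair."""
    return CELL_RE.findall(html)


# Hypothetical sample so the sketch runs without network access.
SAMPLE_HTML = "<tr><td>China</td><td>line one\nline two</td></tr>"
cells = extract_cells(SAMPLE_HTML)
print(cells)

# Passing flags together with a compiled pattern is rejected by the re module.
try:
    re.findall(CELL_RE, SAMPLE_HTML, re.S)
    flags_rejected = False
except ValueError:
    flags_rejected = True
print(flags_rejected)
```

For the live page, the call would be extract_cells(fetch_html(url)) with the Wikipedia URL used above. Like the original pattern, this only matches <td> tags that carry no attributes.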

2. How to write a dynamic regular expression in Python:

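import re
f = open("b.txt")
ll = f.read(1000000)  # read at most 1,000,000 bytes/characters of the file
f.close()
print(ll)
for i in range(1,220):
    # Build the pattern from the loop variable: 'i'(.*?)'i+1'. This is what makes the match dynamic.
    reg = "'" + str(i) + "'" + '(.*?)' + "'" + str(i+1) + "'"
    reg2 = re.compile(reg)  # a different regular expression is compiled on every iteration
    matches = re.findall(reg2, ll)  # renamed from list to avoid shadowing the builtin
    # print(reg)
    print(matches)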

Note how the pattern string is assembled.
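The same dynamic-pattern idea can be wrapped in a small helper, shown here as a Python 3 sketch. The function name between_markers and the SAMPLE_TEXT string are hypothetical stand-ins (the real contents of b.txt are not shown in the post), and re.escape is an addition not present in the original: it keeps the generated pieces literal in case they ever contain regex metacharacters.

```python
import re


def between_markers(text, start, stop):
    """Return everything found between '<start>' and '<stop>' in text."""
    # re.escape keeps the markers literal even if they held regex metacharacters.
    left = re.escape("'%d'" % start)
    right = re.escape("'%d'" % stop)
    # A new pattern is built and compiled for each (start, stop) pair.
    pattern = re.compile(left + r"(.*?)" + right)
    return pattern.findall(text)


# Hypothetical stand-in for the contents of b.txt.
SAMPLE_TEXT = "'1'alpha'2'beta'3'gamma'4'"

results = {i: between_markers(SAMPLE_TEXT, i, i + 1) for i in range(1, 4)}
for i, found in results.items():
    print(i, found)
```

As in the original loop, no flags are set, so the text between two markers has to sit on a single line to be matched.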
