9 Apr 18: the shelve, xml, and re modules

Posted 2020-10-30 zhangyaqian


9 Apr 18

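Review of the previous lesson:

1. The shelve module

shelve (worth knowing about, rarely needed) is a higher-level wrapper built on pickle. When using it, refer only to the file name you passed to shelve.open; you can ignore the extra files that different platforms generate automatically.

JSON's intermediate format is a string, so it is written to a file opened in text mode ('w').

pickle's intermediate format is bytes, so it is written to a file opened in binary mode ('wb').

JSON is the more common choice for serialization.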
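The difference between the two intermediate formats can be checked directly. The following is a minimal sketch, not part of the original notes; the variable names are illustrative.

```python
import json
import pickle

info = {'age': 18, 'height': 180, 'weight': 80}

# json.dumps produces a str, which would be written with open(..., 'w')
json_text = json.dumps(info)
# pickle.dumps produces bytes, which would be written with open(..., 'wb')
pickle_bytes = pickle.dumps(info)

print(type(json_text).__name__)
print(type(pickle_bytes).__name__)

# Both formats restore this dictionary to an equal object
restored_from_json = json.loads(json_text)
restored_from_pickle = pickle.loads(pickle_bytes)
print(restored_from_json == info, restored_from_pickle == info)
```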

import shelve

info1={'age':18,'height':180,'weight':80}
info2={'age':73,'height':150,'weight':80}

# Store two records
d=shelve.open('db.shv')
d['egon']=info1
d['alex']=info2
d.close()

# Read them back
d=shelve.open('db.shv')
print(d['egon'])
print(d['alex'])
d.close()

# To modify a stored value in place, open with writeback=True
d=shelve.open('db.shv',writeback=True)
d['alex']['age']=10000
print(d['alex'])
d.close()

# Reopen to confirm the change was saved (writeback is not needed just to read)
d=shelve.open('db.shv')
print(d['alex'])
d.close()
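To see why writeback=True matters, the sketch below makes the same in-place change with and without it. It assumes a throwaway temporary directory and a hypothetical file name, demo.shv, so that it does not touch db.shv.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, 'demo.shv')  # hypothetical file name

    with shelve.open(db_path) as d:
        d['alex'] = {'age': 73}

    # Without writeback, d['alex'] returns a fresh copy, so the change is lost
    with shelve.open(db_path) as d:
        d['alex']['age'] = 10000
    with shelve.open(db_path) as d:
        age_without_writeback = d['alex']['age']

    # With writeback=True, accessed entries are cached and saved on close
    with shelve.open(db_path, writeback=True) as d:
        d['alex']['age'] = 10000
    with shelve.open(db_path) as d:
        age_with_writeback = d['alex']['age']

print(age_without_writeback, age_with_writeback)  # 73 10000
```

The cache that writeback=True keeps costs memory and makes close() slower, so it is off by default.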

2. The xml module

XML is a way of organizing data.

Every XML element has three properties: tag, attrib, and text.

#==========================================> Query
import xml.etree.ElementTree as ET

tree=ET.parse('a.xml')
root=tree.getroot()

# Three ways to find nodes

# 1. iter: searches the whole tree and returns every match
res=root.iter('rank')
for item in res:
    print('='*50)
    print(item.tag)    # tag name
    print(item.attrib) # attributes
    print(item.text)   # text content

# 2. find: searches only the direct children of the current element and stops at the first match
res=root.find('country')
print(res.tag)
print(res.attrib)
print(res.text)

nh=res.find('neighbor')
print(nh.attrib)

# 3. findall: searches only the direct children of the current element, but returns every match
cy=root.findall('country')
print([item.attrib for item in cy])
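The notes do not show the contents of a.xml, so the sketch below parses a small inline document instead. Its structure (country, rank, year, neighbor) is an assumption chosen to match the tag names used above.

```python
import xml.etree.ElementTree as ET

# Assumed sample data; the real a.xml is not shown in the notes
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>2</rank>
        <year>2008</year>
        <neighbor name="Austria"/>
    </country>
    <country name="Singapore">
        <rank>5</rank>
        <year>2011</year>
        <neighbor name="Malaysia"/>
    </country>
</data>"""
root = ET.fromstring(xml_text)

# iter walks the whole tree, so it reaches the rank grandchildren
all_ranks = [item.text for item in root.iter('rank')]
# find returns only the first matching direct child
first_country = root.find('country')
# findall returns every matching direct child
country_names = [c.attrib['name'] for c in root.findall('country')]
# rank is not a direct child of the root, so find returns None
direct_rank = root.find('rank')

print(all_ranks)             # ['2', '5']
print(first_country.attrib)  # {'name': 'Liechtenstein'}
print(country_names)         # ['Liechtenstein', 'Singapore']
print(direct_rank)           # None
```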

#==========================================> Modify
import xml.etree.ElementTree as ET

tree=ET.parse('a.xml')
root=tree.getroot()

for year in root.iter('year'):
    year.text=str(int(year.text) + 10)
    year.attrib={'updated':'yes'}   # replaces all attributes; the tag itself is rarely changed

tree.write('a.xml')
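Assigning a new dictionary to attrib discards any attributes the element already had. A minimal sketch of an alternative, using Element.set on an assumed inline document, adds a single attribute and keeps the rest:

```python
import xml.etree.ElementTree as ET

# Assumed one-country sample; the real a.xml is not shown in the notes
root = ET.fromstring('<data><country name="A"><year>2008</year></country></data>')

for year in root.iter('year'):
    year.text = str(int(year.text) + 10)
    year.set('updated', 'yes')  # adds one attribute, keeps existing ones

result = ET.tostring(root, encoding='unicode')
print(result)
# <data><country name="A"><year updated="yes">2018</year></country></data>
```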

#==========================================> Add (and remove)
import xml.etree.ElementTree as ET

tree=ET.parse('a.xml')
root=tree.getroot()

for country in root.iter('country'):
    year=country.find('year')
    if int(year.text) > 2020:
        print(country.attrib)
        ele=ET.Element('egon')      # create a new element
        ele.attrib={'nb':'yes'}
        ele.text='very handsome'
        country.append(ele)         # add it as a child of country
        country.remove(year)        # delete the year child

tree.write('b.xml')
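The same add-and-remove step can be sketched on an inline document so that the result is visible without a file. The sample data and the note element are assumptions for illustration; ET.SubElement creates the child and appends it in one call.

```python
import xml.etree.ElementTree as ET

# Assumed sample data with one country on each side of the 2020 threshold
root = ET.fromstring(
    '<data>'
    '<country name="A"><year>2018</year></country>'
    '<country name="B"><year>2025</year></country>'
    '</data>'
)

for country in root.findall('country'):
    year = country.find('year')
    # Guard against a country that has no year child
    if year is not None and int(year.text) > 2020:
        note = ET.SubElement(country, 'note', {'checked': 'yes'})  # hypothetical tag
        note.text = 'recent'
        country.remove(year)

result = ET.tostring(root, encoding='unicode')
print(result)
```

Only country B changes: its year child is removed and a note child is added.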

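3. The re module (regular expressions)

Regular expressions are used most often in web scraping. Scrapers can also import other modules to help clean data, and regular expressions are useful well beyond scraping.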

import re

# \w matches a word character (letter, digit, or underscore); \W matches anything else
print(re.findall(r'\w','ab 12\\+- *&_'))
print(re.findall(r'\W','ab 12\\+- *&_'))
# \s matches whitespace; \S matches non-whitespace
print(re.findall(r'\s','ab \r1\n2\t\\+- *&_'))
print(re.findall(r'\S','ab \r1\n2\t\\+- *&_'))
# \d matches a digit; \D matches a non-digit
print(re.findall(r'\d','ab \r1\n2\t\\+- *&_'))
print(re.findall(r'\D','ab \r1\n2\t\\+- *&_'))

print(re.findall(r'\w_sb','egon alex_sb123123wxx_sb,lxx_sb'))

# \A and ^ anchor the match to the start of the string
print(re.findall(r'\Aalex','abcalex is salexb'))
print(re.findall(r'\Aalex','alex is salexb'))
print(re.findall('^alex','alex is salexb'))
# \Z and $ anchor the match to the end of the string
print(re.findall(r'sb\Z','alexsb is sbalexbsb'))
print(re.findall('sb$','alexsb is sbalexbsb'))
print(re.findall('^ebn$','ebn1')) # ^ebn$ matches only the string ebn itself (starts with ebn and ends with ebn), so this prints []

print(re.findall('a\nc','a\nc a\tc a1c'))

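\t is the tab character; how many spaces wide it appears varies by platform.

\A <=> ^     # prefer ^ (the two differ only under re.MULTILINE)

\Z <=> $     # prefer $ ($ also matches just before a trailing newline)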
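The anchors in each pair are equivalent for the single-line strings used above. The sketch below shows the two cases where they diverge; the test strings are illustrative.

```python
import re

text = 'alex\nalex'

# Without flags, ^ matches only at the start of the string
default_caret = re.findall(r'^alex', text)
# With re.MULTILINE, ^ also matches at the start of each line
multiline_caret = re.findall(r'^alex', text, re.MULTILINE)
# \A always means the start of the string, whatever the flags
multiline_start = re.findall(r'\Aalex', text, re.MULTILINE)

# $ matches just before a trailing newline; \Z only at the very end
dollar_end = re.findall(r'sb$', 'alexsb\n')
strict_end = re.findall(r'sb\Z', 'alexsb\n')

print(default_caret)    # ['alex']
print(multiline_caret)  # ['alex', 'alex']
print(multiline_start)  # ['alex']
print(dollar_end)       # ['sb']
print(strict_end)       # []
```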

# Repetition:

# .   ?   *   +  {m,n}  .*  .*?

1. . : matches any single character except a newline. To make it match newlines as well, pass re.DOTALL.

print(re.findall('a.c','abc a1c aAc aaaaaca\nc'))
print(re.findall('a.c','abc a1c aAc aaaaaca\nc',re.DOTALL))

2. ? : the character to its left occurs 0 or 1 times. ? cannot be used on its own.

print(re.findall('ab?','a ab abb abbb abbbb abbbb'))

3. * : the character to its left occurs 0 or more times

print(re.findall('ab*','a ab abb abbb abbbb abbbb a1bbbbbbb'))

4. + : the character to its left occurs 1 or more times

print(re.findall('ab+','a ab abb abbb abbbb abbbb a1bbbbbbb'))

5. {m,n} : the character to its left occurs m to n times, so ? is {0,1}, * is {0,}, and + is {1,}

print(re.findall('ab?','a ab abb abbb abbbb abbbb'))
print(re.findall('ab{0,1}','a ab abb abbb abbbb abbbb'))

print(re.findall('ab*','a ab abb abbb abbbb abbbb a1bbbbbbb'))
print(re.findall('ab{0,}','a ab abb abbb abbbb abbbb a1bbbbbbb'))

print(re.findall('ab+','a ab abb abbb abbbb abbbb a1bbbbbbb'))
print(re.findall('ab{1,}','a ab abb abbb abbbb abbbb a1bbbbbbb'))

print(re.findall('ab{1,3}','a ab abb abbb abbbb abbbb a1bbbbbbb'))
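The equivalences between the shorthand quantifiers and the {m,n} form can be checked on a shorter input. This is a small sketch; the test string is illustrative.

```python
import re

text = 'a ab abb abbb abbbb'

zero_or_one = re.findall(r'ab?', text)
zero_or_more = re.findall(r'ab*', text)
one_or_more = re.findall(r'ab+', text)
one_to_three = re.findall(r'ab{1,3}', text)

print(zero_or_one)   # ['a', 'ab', 'ab', 'ab', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(one_or_more)   # ['ab', 'abb', 'abbb', 'abbbb']
print(one_to_three)  # ['ab', 'abb', 'abbb', 'abbb']

# Each shorthand gives the same result as its {m,n} spelling
print(zero_or_one == re.findall(r'ab{0,1}', text))
print(zero_or_more == re.findall(r'ab{0,}', text))
print(one_or_more == re.findall(r'ab{1,}', text))
```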

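6. .* : matches any characters, at any length => greedy matching (takes as much as possible)

print(re.findall('a.*c','ac a123c aaaac a *123)()c asdfasfdsadf'))

7. .*? : non-greedy matching (takes as little as possible)

print(re.findall('a.*?c','a123c456c'))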
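Running both forms on the same input makes the contrast clear. A minimal sketch:

```python
import re

text = 'a123c456c'

# Greedy: .* extends to the last c it can reach
greedy = re.findall(r'a.*c', text)
# Non-greedy: .*? stops at the first c
non_greedy = re.findall(r'a.*?c', text)

print(greedy)      # ['a123c456c']
print(non_greedy)  # ['a123c']
```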

(): grouping. When the pattern contains a group, findall returns only the text captured by the group.

print(re.findall('(alex)_sb','alex_sb asdfsafdafdaalex_sb'))
print(re.findall(
    'href="(.*?)"',
    '<li><a id="blog_nav_sitehome" class="menu" href="http://www.cnblogs.com/">博客园</a></li>'))
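With more than one group, findall returns a tuple per match. The sketch below uses a made-up HTML fragment; the second link and the link text are assumptions.

```python
import re

# Hypothetical fragment for illustration
html = '<a href="http://www.cnblogs.com/">home</a> <a href="/about">about</a>'

# One group: a list of the captured strings
links = re.findall(r'href="(.*?)"', html)
# Two groups: a list of (url, text) tuples
pairs = re.findall(r'href="(.*?)">(.*?)</a>', html)

print(links)  # ['http://www.cnblogs.com/', '/about']
print(pairs)  # [('http://www.cnblogs.com/', 'home'), ('/about', 'about')]
```

For anything beyond simple fragments like this, an HTML parser is usually more reliable than a regular expression.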

[]: matches one character from a specified set (that one character must be among those defined inside the brackets).

Inside [], most characters stand for themselves; ranges such as 0-9 and a-zA-Z are allowed.

print(re.findall('a[0-9][0-9]c','a1c a+c a2c a9c a11c a-c acc aAc'))

Because a-b means a range, a - that should match a literal hyphen must be placed at the far left or far right of the [], or escaped as \-.

print(re.findall('a[-+*]c','a1c a+c a2c a9c a*c a11c a-c acc aAc'))
print(re.findall('a[a-zA-Z]c','a1c a+c a2c a9c a*c a11c a-c acc aAc'))

A ^ at the start of [] negates the set.

print(re.findall('a[^a-zA-Z]c','a c a1c a+c a2c a9c a*c a11c a-c acc aAc'))
print(re.findall('a[^0-9]c','a c a1c a+c a2c a9c a*c a11c a-c acc aAc'))

print(re.findall('([a-z]+)_sb','egon alex_sb123123wxxxxxxxxxxxxx_sb,lxx_sb'))
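The placement rule for - can be checked with a short sketch. The test string is illustrative, and the failing pattern is included on purpose to show what happens when - sits between two characters that do not form a valid range.

```python
import re

text = 'a1c a+c a-c a*c acc aAc a c'

# A leading - is a literal hyphen
leading_hyphen = re.findall(r'a[-+*]c', text)
# An escaped - is a literal hyphen in any position
escaped_hyphen = re.findall(r'a[+\-*]c', text)
# ^ at the start negates the set: anything except a letter
not_a_letter = re.findall(r'a[^a-zA-Z]c', text)

# Here +-* is read as a range from + to *, which is reversed and rejected
try:
    re.compile(r'a[+-*]c')
    range_error = None
except re.error as exc:
    range_error = str(exc)

print(leading_hyphen)  # ['a+c', 'a-c', 'a*c']
print(escaped_hyphen)  # ['a+c', 'a-c', 'a*c']
print(not_a_letter)    # ['a1c', 'a+c', 'a-c', 'a*c', 'a c']
print(range_error)
```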

| : alternation (or)

print(re.findall('compan(ies|y)','Too many companies have gone bankrupt, and the next one is my company'))

(?:   ): a non-capturing group. findall returns the whole match, not just the part inside the parentheses.

print(re.findall('compan(?:ies|y)','Too many companies have gone bankrupt, and the next one is my company'))
print(re.findall('alex|sb','alex sb sadfsadfasdfegon alex sb egon'))

Other functions in the re module:

print(re.findall('alex|sb','123123 alex sb sadfsadfasdfegon alex sb egon'))
print(re.search('alex|sb','123213 alex sb sadfsadfasdfegon alex sb egon').group())
print(re.search('^alex','123213 alex sb sadfsadfasdfegon alex sb egon'))
print(re.search('^alex','alex sb sadfsadfasdfegon alex sb egon').group())

re.search returns a match object for the first match, or None if there is no match. Call .group() to get the matched text; calling .group() on None raises AttributeError.

print(re.match('alex','alex sb sadfsadfasdfegon alex sb egon').group())
print(re.match('alex','123213 alex sb sadfsadfasdfegon alex sb egon'))

re.match behaves like re.search with a ^ at the start of the pattern.
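Because both functions can return None, it is safer to test the result before calling .group(). The sketch below wraps that check in a hypothetical helper, first_match, which is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the first matched text, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

found = first_match(r'alex|sb', '123213 alex sb')
not_found = first_match(r'^alex', '123213 alex sb')

# re.match only tries at the start of the string
match_at_start = re.match(r'alex', 'alex sb')
match_elsewhere = re.match(r'alex', '123213 alex sb')

print(found)                   # alex
print(not_found)               # None
print(match_at_start.group())  # alex
print(match_elsewhere)         # None
```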

info='a:b:c:d'
print(info.split(':'))
print(re.split(':',info))

info=r'get :a.txt\3333/rwx'
print(re.split(r'[ :\\/]',info))   # split on a space, colon, backslash, or slash

Unlike str.split, re.split accepts a regular expression as the separator.
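One detail worth seeing is that adjacent separators produce empty strings in the result. A minimal sketch, reusing the same input, shows that adding + to the character class treats a run of separators as a single one:

```python
import re

info = r'get :a.txt\3333/rwx'

# Each separator character splits on its own, so ' :' leaves an empty string
single = re.split(r'[ :\\/]', info)
# With +, a run of separator characters counts as one split point
merged = re.split(r'[ :\\/]+', info)

print(single)  # ['get', '', 'a.txt', '3333', 'rwx']
print(merged)  # ['get', 'a.txt', '3333', 'rwx']
```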

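print('egon is beutifull egon'.replace('egon','EGON',1))   # str.replace can only target the first n occurrences

# Replace only the second egon by rebuilding the string from its groups
print(re.sub('(.*?)(egon)(.*?)(egon)(.*?)',r'\1\2\3EGON\5','123 egon is beutifull egon 123'))

# Swap the first and last words using backreferences
print(re.sub('(lqz)(.*?)(SB)',r'\3\2\1',r'lqz is SB'))
print(re.sub('([a-zA-Z]+)([^a-zA-Z]+)([a-zA-Z]+)([^a-zA-Z]+)([a-zA-Z]+)',r'\5\2\3\4\1',r'lqzzzz123+ is SB'))

Unlike str.replace, re.sub accepts a regular expression as the pattern, and the replacement can refer to captured groups with \1, \2, and so on.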
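re.sub also takes a count argument and accepts a function as the replacement. The sketch below is illustrative only; the input strings are made up.

```python
import re

# count limits how many matches are replaced, like the third argument of str.replace
first_only = re.sub(r'egon', 'EGON', 'egon is egon', count=1)

# Backreferences reorder the captured groups
swapped = re.sub(r'(\w+) is (\w+)', r'\2 is \1', 'lqz is SB')

# A function receives each match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')

print(first_only)  # EGON is egon
print(swapped)     # SB is lqz
print(doubled)     # a2 b44
```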

# re.compile builds a pattern object once so that it can be reused
pattern=re.compile('alex')
print(pattern.findall('alex is alex alex'))
print(pattern.findall('alexasdfsadfsadfasdfasdfasfd is alex alex'))
