Python: Extracting Chinese Text from HTML Tags with Regular Expressions

Posted 2020-10-05

Please credit the source when reposting: http://www.cnblogs.com/pengwang57/

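>>> # Python 2 session; str is a byte string, assumed UTF-8 encoded here
>>> import re
>>> # [\x80-\xff]* matches the bytes that make up non-ASCII (e.g. Chinese) UTF-8 characters
>>> p = re.compile(r'<div class="comment-content comment-content_new">([\x80-\xff]*)</div>')
>>> text = '<div class="comment-content comment-content_new">测试</div> <div class="comment-content comment-content_new">学习正则</div>'
>>> for m in p.finditer(text):
...     print m.group(1)
...
测试
学习正则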
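
For readers on Python 3, where str holds Unicode text rather than bytes, the same extraction might be sketched as below. The names DIV_PATTERN and extract_comments are illustrative assumptions and do not come from the original post. On a Unicode string, the character class [^\x00-\xff] matches any code point above U+00FF, which includes Chinese characters.

```python
import re

# Hypothetical names for illustration; the div markup is taken from the session above.
# On a Python 3 str, [^\x00-\xff] matches code points above U+00FF (CJK included),
# so ASCII and Latin-1 characters inside the div would prevent a match.
DIV_PATTERN = re.compile(
    r'<div class="comment-content comment-content_new">([^\x00-\xff]*)</div>'
)


def extract_comments(html):
    """Return the captured text of every matching div, in document order."""
    return [m.group(1) for m in DIV_PATTERN.finditer(html)]


text = (
    '<div class="comment-content comment-content_new">测试</div> '
    '<div class="comment-content comment-content_new">学习正则</div>'
)

for comment in extract_comments(text):
    print(comment)
```

Note that a div whose content mixes Chinese with ASCII letters, digits, or spaces would not be captured by this pattern; loosening the class (for example to [^<]*) is one option if that matters.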


If you call findall and print the returned list, the matches appear as escaped UTF-8 byte sequences rather than readable Chinese. In Python 2, printing a list shows the repr of each element:
>>> m = re.findall(r'<div class="comment-content comment-content_new">([\x80-\xff]*)</div>', '<div class="comment-content comment-content_new">测试</div> <div class="comment-content comment-content_new">学习正则</div>')
>>> print m
['\xe6\xb5\x8b\xe8\xaf\x95', '\xe5\xad\xa6\xe4\xb9\xa0\xe6\xad\xa3\xe5\x88\x99']
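
The escaped output comes from how a list of byte strings is displayed, not from findall itself. A minimal Python 3 sketch of the same effect, assuming the page arrives as UTF-8 encoded bytes (as a raw HTTP response body typically would), shows that decoding each match recovers the readable text. The variable names here are illustrative only.

```python
import re

# Assumption: the HTML is available as UTF-8 bytes rather than as decoded text.
raw = '<div class="comment-content comment-content_new">测试</div>'.encode('utf-8')

# A bytes pattern: [\x80-\xff] covers the bytes used by non-ASCII UTF-8 characters.
byte_pattern = rb'<div class="comment-content comment-content_new">([\x80-\xff]*)</div>'

byte_matches = re.findall(byte_pattern, raw)
print(byte_matches)  # the list repr shows escaped bytes: [b'\xe6\xb5\x8b\xe8\xaf\x95']

# Decoding each element turns the bytes back into readable text.
decoded = [b.decode('utf-8') for b in byte_matches]
for item in decoded:
    print(item)
```

In the Python 2 session above, printing each element individually (as the finditer loop does) has the same effect, because print writes the bytes straight to the terminal instead of showing their repr.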
